
Course 2: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

Week 1: Practical aspects of Deep Learning

Learning Objectives

Setting up your Machine Learning Application

Train / Dev / Test sets

Setting up the training, development (dev, also called the validation set), and test sets well has a huge impact on productivity. It is important to choose the dev and test sets from the same distribution, and they should be drawn randomly from all of the data.

Guideline:

Bias / Variance

| error type | high variance | high bias | high bias, high variance | low bias, low variance |
| --- | --- | --- | --- | --- |
| Train set error | 1% | 15% | 15% | 0.5% |
| Dev set error | 11% | 16% | 30% | 1% |

When we discuss prediction models, prediction errors can be decomposed into two main subcomponents we care about: error due to “bias” and error due to “variance”. There is a tradeoff between a model’s ability to minimize bias and variance. Understanding these two types of error can help us diagnose model results and avoid the mistake of over- or under-fitting.

To understand bias and variance better, read this essay: Understanding the Bias-Variance Tradeoff.

Basic Recipe for Machine Learning

bias-variance-tradeoff

Regularizing your neural network

Regularization

Regularization for Logistic Regression:

reg-cost

b is just one parameter over a very large number of parameters, so no need to include it in the regularization.

| regularization | formula | description |
| --- | --- | --- |
| L2 regularization | reg-cost | the most common type of regularization |
| L1 regularization | reg-cost | the w vector will have a lot of zeros, so L1 regularization makes the model sparse |
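As a rough illustration, here is a minimal NumPy sketch of the L2-regularized logistic regression cost; the names A (predictions), Y (labels), w (weight vector) and lambd are illustrative, not from the course code:

import numpy as np

def l2_regularized_cost(A, Y, w, lambd):
    """Cross-entropy cost plus the L2 penalty (lambda / (2m)) * ||w||_2^2."""
    m = Y.shape[1]                                                    # number of examples
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))             # b is left out of the penalty
    return cross_entropy + l2_penalty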

Regularization for a Neural Network:

reg-cost

For the matrix w, this norm is called the Frobenius norm. Its definition looks like the L2 norm, but by convention it is not called the L2 norm:

reg-cost

Regularization of gradient:

reg-nn-grad

With regularization, the coefficient multiplying w in the update is slightly less than 1, which is why L2 regularization is also called weight decay.

reg-nn-weight-decay
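To make the weight-decay interpretation concrete, here is a minimal sketch of the regularized update for one layer, assuming dW_backprop is the gradient computed by backprop before the regularization term is added (names are illustrative):

def update_with_weight_decay(W, dW_backprop, learning_rate, lambd, m):
    # Regularized gradient: dW = dW_backprop + (lambd / m) * W
    # so W := W - learning_rate * dW = (1 - learning_rate * lambd / m) * W - learning_rate * dW_backprop.
    # The (1 - learning_rate * lambd / m) factor < 1 is the "weight decay".
    return (1 - learning_rate * lambd / m) * W - learning_rate * dW_backprop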

Why regularization reduces overfitting

Implementation tips:

Without the regularization term, we expect to see the cost function decrease monotonically in the plot. With regularization, to debug gradient descent make sure to plot the new J that includes the regularization term; if we plot only the first term (the old J), we might not see a monotonic decrease.

Dropout Regularization

dropout

(image source: deepnotes)
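A minimal sketch of inverted dropout applied to one layer's activations, assuming A is the activation matrix and keep_prob the probability of keeping a unit (names are illustrative):

import numpy as np

def inverted_dropout(A, keep_prob):
    D = np.random.rand(*A.shape) < keep_prob   # random dropout mask
    A = A * D                                   # shut down the dropped neurons
    A = A / keep_prob                           # "inverted" scaling keeps the expected value of A unchanged
    return A, D                                 # D is reused to mask the gradients in backprop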

Understanding Dropout

Note:

Other regularization methods

Related to orthogonalization, explained later, stay tuned!

Setting up your optimization problem

Normalizing inputs

With normalization, the cost function will be more round and easier to optimize when the features are all on similar scales. This is a very common topic; see more on Stack Overflow.
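A minimal sketch of input normalization, assuming the course's convention that X has shape (n_features, m); note that the mean and variance computed on the training set must be reused for the dev/test sets:

import numpy as np

def normalize_inputs(X_train, X_test):
    mu = np.mean(X_train, axis=1, keepdims=True)              # per-feature mean from the training set
    sigma2 = np.var(X_train, axis=1, keepdims=True)           # per-feature variance from the training set
    X_train_norm = (X_train - mu) / np.sqrt(sigma2 + 1e-8)
    X_test_norm = (X_test - mu) / np.sqrt(sigma2 + 1e-8)      # use the *training* statistics here too
    return X_train_norm, X_test_norm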

Vanishing / Exploding gradients

Weight Initialization for Deep Networks

A partial solution to the problems of vanishing and exploding gradients is a better, more careful choice of the random initialization for the neural network.

For a single neuron, suppose the input has n features; we want Z = W1X1 + W2X2 + ... + WnXn to neither blow up nor become too small, so the larger n is, the smaller we want each Wi to be.

A well chosen initialization can:

Implementation tips:
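A minimal sketch of this idea using He-style initialization (variance 2/n of the previous layer), assuming ReLU hidden units; for tanh, the Xavier variant sqrt(1/n) is often used instead. Names are illustrative:

import numpy as np

def initialize_parameters_he(layer_dims):
    # layer_dims, e.g. [n_x, n_h1, n_h2, 1]
    parameters = {}
    for l in range(1, len(layer_dims)):
        # scale the random weights by sqrt(2 / n_prev) so Z neither explodes nor vanishes
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * np.sqrt(2.0 / layer_dims[l - 1])
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters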

Numerical approximation of gradients

Numerically verify that the implementation of the derivative of a function is correct, and hence check whether there is a bug in the backpropagation implementation.

The two-sided difference formula is much more accurate than the one-sided difference:
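A quick generic illustration (not course code): the one-sided error is on the order of 𝜀, while the two-sided error is on the order of 𝜀²:

def f(x):
    return x ** 3          # true derivative: 3 * x**2

x, eps = 2.0, 1e-4
one_sided = (f(x + eps) - f(x)) / eps              # error ~ O(eps)
two_sided = (f(x + eps) - f(x - eps)) / (2 * eps)  # error ~ O(eps**2), much closer to 12.0
print(one_sided, two_sided, 3 * x ** 2)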

Gradient checking

Implementation steps:

  1. Take W[1],b[1],...,W[L],b[L] and reshape into a big vector 𝜃: J(W[1],b[1],...,W[L],b[L])=J(𝜃).
  2. Take dW[1],db[1],...,dW[L],db[L] and reshape into a big vector d𝜃.
  3. For each i: d𝜃_approx[i] = (J(𝜃1,𝜃2,...,𝜃i+𝜀,...)-J(𝜃1,𝜃2,...,𝜃i-𝜀,...))/(2𝜀). (Should have d𝜃_approx[i] ≈ d𝜃[i])
  4. Check diff_ratio = norm_2(d𝜃_approx-d𝜃) / (norm_2(d𝜃_approx)+norm_2(d𝜃)), which should be very small:
    1. diff_ratio ≈ 10^-7, great, backprop is very likely correct.
    2. diff_ratio ≈ 10^-5, maybe OK, better check no component of this difference is particularly large.
    3. diff_ratio ≈ 10^-3, worry, check if there is a bug.
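A minimal sketch of the comparison in step 4, assuming dtheta_approx and dtheta are already flattened into vectors:

import numpy as np

def gradient_check_ratio(dtheta_approx, dtheta):
    numerator = np.linalg.norm(dtheta_approx - dtheta)
    denominator = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return numerator / denominator   # ~1e-7 great, ~1e-5 inspect the components, ~1e-3 probably a bug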

Gradient checking implementation notes

Week 2: Optimization algorithms

Learning Objectives

Optimization algorithms

Mini-batch gradient descent

Vectorization allows you to process all m examples relatively quickly even if m is very large, but it can still be slow. For example, with m = 5,000,000 (or m = 50,000,000 or even bigger), we have to process the entire training set of five million training examples before we take one little step of gradient descent.

We can use the mini-batch method to let gradient descent start to make some progress before we finish processing the entire, giant training set of 5 million examples, by splitting the training set into smaller, little baby training sets called mini-batches. In this case, we have 5,000 mini-batches with 1,000 examples each.

Notations:

In every step of the iteration loop, we need to loop for num_batches and do forward and backward computation for each batch.

  1. Forward propagation
  2. Compute cost function
  3. Backward propagation
  4. Update parameters (using parameters, and grads from backprop)

With mini-batch gradient descent, a single pass through the training set is one epoch, which in the above example with 5 million examples means 5,000 gradient descent steps.
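A minimal sketch of building the mini-batches (shuffle first, then partition), assuming the course's shapes X: (n_x, m) and Y: (1, m); names are illustrative:

import numpy as np

def random_mini_batches(X, Y, mini_batch_size=1000, seed=0):
    np.random.seed(seed)
    m = X.shape[1]
    permutation = np.random.permutation(m)                  # shuffle the examples
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    mini_batches = []
    for k in range(0, m, mini_batch_size):                  # partition; the last batch may be smaller
        mini_batches.append((X_shuffled[:, k:k + mini_batch_size],
                             Y_shuffled[:, k:k + mini_batch_size]))
    return mini_batches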

Understanding mini-batch gradient descent

| batch size | method | description | guidelines |
| --- | --- | --- | --- |
| = m | batch gradient descent | cost function decreases on every iteration, but each iteration takes too long | use for a small training set (< 2000 examples) |
| = 1 | stochastic gradient descent | cost function oscillates and can be extremely noisy; wanders around the minimum; loses the speedup from vectorization, so it is inefficient | use a smaller learning rate when it oscillates too much |
| between 1 and m | mini-batch gradient descent | somewhere in between: keeps the vectorization advantage and is faster; not guaranteed to always head toward the minimum, but heads there more consistently than stochastic gradient descent; may not exactly converge and can oscillate in a very small region, where slowly reducing the learning rate may help | mini-batch size is a hyperparameter; sizes of 64, 128, 256, 512 (a power of 2) usually work well; make sure one mini-batch fits in CPU/GPU memory |

Exponentially Weighted Averages

Moving averages are favored statistical tools of active traders to measure momentum. There are three MA methods:

| MA method | calculation |
| --- | --- |
| simple moving average (SMA) | calculated from the average closing prices for a specified period |
| weighted moving average (WMA) | calculated by multiplying each price by its associated weighting (assigning a heavier weighting to more recent data points) and totaling the values |
| exponential moving average (EWMA) | also weighted toward the most recent prices, but the rate of decrease is exponential |

For a list of daily temperatures:

london-temp-example

This data looks a little bit noisy (blue dots):

ewa1

ewa-on-temp

If we want to compute the trends by averaging over a larger window, the above exponentially weighted average formula adapts more slowly when the temperature changes, so there is just a bit more latency. (See the red curve above.)

ewa2

Understanding exponentially weighted averages

This topic is basically related to gradient descent optimizations.

ewa

The exponentially weighted average adds a fraction (1−β) of the current value to a leaky running sum of past values, which is itself scaled by β. Effectively, the contribution from the value n iterations ago is scaled by ewa-weight.

For example, here are the contributions to the current value after 5 iterations (iteration 5 is the current iteration)

| iteration | contribution |
| --- | --- |
| 1 | β^4(1−β) |
| 2 | β^3(1−β) |
| 3 | β^2(1−β) |
| 4 | β^1(1−β) |
| 5 | (1−β) |

Since β<1, the contribution decreases exponentially with the passage of time. Effectively, this acts as a smoother for a function.
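A minimal sketch of the running computation V_t = β·V_{t−1} + (1−β)·θ_t over a list of daily temperatures (the data below is illustrative):

def exp_weighted_average(temperatures, beta=0.9):
    v, averages = 0.0, []
    for theta in temperatures:
        v = beta * v + (1 - beta) * theta   # leaky running average, roughly 1/(1-beta) days of memory
        averages.append(v)
    return averages

print(exp_weighted_average([40, 49, 45, 44, 51], beta=0.9))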

e-folding:

Andrew Ng also mentioned an interesting concept related to e-folding. He said:

Here 10 or 50 days is called one lifetime (1 e-folding). Generally, for an exponentially decaying quantity, after one lifetime (1/(1−β) iterations) about 1/e ≈ 37% remains, and after two lifetimes only 1/e² ≈ 14% is left.

For more information, check the definition of e-folding.

Bias correction in exponentially weighted averages

There's one technical detail called bias correction that can make your computation of these averages more accurate. In the temperature example above, when we set β=0.98, we won't actually get the green curve; instead, we get the purple curve (see the graph below).

ewa3

Because we initialize the exponentially weighted moving average with V0=0, we get the following result at the beginning of the iteration:

As a result, V1 and V2 calculated this way are not very good estimates of the first two temperatures. So we need some modification to make the estimate more accurate, especially during the initial phase, to avoid an initial bias. This can be corrected by scaling with 1/(1−β^t), where t is the iteration number.

| original | correction |
| --- | --- |
| V1 | V1c |
| V2 | V2c |
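Continuing the sketch above, the bias-corrected estimate simply divides the running value by (1 − β^t):

def exp_weighted_average_corrected(temperatures, beta=0.98):
    v, corrected = 0.0, []
    for t, theta in enumerate(temperatures, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))   # removes the bias toward zero in early iterations
    return corrected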

Gradient descent with momentum

Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will “oscillate” toward convergence. Using momentum can reduce these oscillations.

momentum-algo
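A minimal sketch of the momentum update for one layer; V_dW and V_db start at zero and the names are illustrative:

def update_with_momentum(W, b, dW, db, V_dW, V_db, beta=0.9, alpha=0.01):
    # exponentially weighted average of the gradients
    V_dW = beta * V_dW + (1 - beta) * dW
    V_db = beta * V_db + (1 - beta) * db
    # step in the smoothed direction, which damps the oscillations
    W = W - alpha * V_dW
    b = b - alpha * V_db
    return W, b, V_dW, V_db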

Implementation tips:

RMSprop

RMSprop (root mean square prop), similar to momentum, has the effect of damping out the oscillations in gradient descent and mini-batch gradient descent and allows you to use a larger learning rate alpha.

The algorithm computes the exponentially weighted averages of the squared gradients and updates weights by the square root of the EWA.

S_dW = 0
S_db = 0

for iteration t:
  # compute dW, db on the current mini-batch

  S_dW = (beta * S_dW) + (1 - beta) * dW^2   # element-wise square of the gradient
  S_db = (beta * S_db) + (1 - beta) * db^2
  W = W - alpha * dW / sqrt(S_dW + 𝜀)        # 𝜀: small number (10^-8) to avoid dividing by zero
  b = b - alpha * db / sqrt(S_db + 𝜀)

Adam optimization algorithm

V_dW = 0
V_db = 0
S_dW = 0
S_db = 0

for iteration t:
  # compute dW, db using mini-batch                
  
  # momentum
  V_dW = (beta1 * V_dW) + (1 - beta1) * dW     
  V_db = (beta1 * V_db) + (1 - beta1) * db     
  
  # RMSprop
  S_dW = (beta2 * S_dW) + (1 - beta2) * dW^2   
  S_db = (beta2 * S_db) + (1 - beta2) * db^2   
  
  # bias correction
  V_dW_c = V_dW / (1 - beta1^t)      
  V_db_c = V_db / (1 - beta1^t)
  S_dW_c = S_dW / (1 - beta2^t)
  S_db_c = S_db / (1 - beta2^t)
          
  W = W - alpha * V_dW_c / (sqrt(S_dW_c) + 𝜀)
  b = b - alpha * V_db_c / (sqrt(S_db_c) + 𝜀)

Implementation tips:

  1. It calculates an exponentially weighted average of past gradients, and stores it in variables V_dW,V_db (before bias correction) and V_dW_c,V_db_c (with bias correction).
  2. It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables S_dW,S_db (before bias correction) and S_dW_c,S_db_c (with bias correction).
  3. It updates parameters in a direction based on combining information from “1” and “2”.

| hyperparameter | guideline |
| --- | --- |
| learning rate | tune |
| beta1 (parameter of the momentum term, for dW) | 0.9 |
| beta2 (parameter of the RMSprop term, for dW^2) | 0.999 |
| 𝜀 (to avoid dividing by zero) | 10^-8 |

Adam paper: Adam: A Method for Stochastic Optimization

Learning rate decay

The learning algorithm might just end up wandering around and never really converge, because you're using some fixed value for alpha. Learning rate decay methods can help by making the learning rate smaller as the optimum gets near. There are several decay methods:

| decay factor | description |
| --- | --- |
| 0.95^epoch_num | exponential decay |
| k/sqrt(epoch_num) or k/sqrt(t) | polynomial decay |
| discrete staircase | piecewise constant |
| manual decay | decrease alpha by hand while watching the model train |
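A quick illustration of the first two rows of the table; alpha0 is the initial learning rate, and these are sketches rather than the course's exact code:

import numpy as np

def exponential_decay(alpha0, epoch_num):
    return (0.95 ** epoch_num) * alpha0          # decay factor 0.95^epoch_num

def sqrt_decay(alpha0, epoch_num, k=1.0):
    return (k / np.sqrt(epoch_num)) * alpha0     # decay factor k / sqrt(epoch_num)

print(exponential_decay(0.2, 10), sqrt_decay(0.2, 10))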

The problem of local optima

This is what a saddle point looks like.

saddle-point

Quick notes for optimization algorithms

Recall from Course 1 that there are several steps in the neural network implementation:

  1. Initialize parameters / Define hyperparameters
  2. Loop for num_iterations:
    1. Forward propagation
    2. Compute cost function
    3. Backward propagation
    4. Update parameters (using parameters, and grads from backprop)
  3. Use trained parameters to predict labels

When we create momentum, RMSprop, or Adam optimization methods, what we do is implement the algorithm in the update-parameters step. A good practice is to wrap them up as options so we can compare them during our alchemy training:

if optimizer == "gd":
    parameters = update_parameters_with_gd(parameters, grads, learning_rate)
elif optimizer == "momentum":
    parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
elif optimizer == "adam":
    t = t + 1 # Adam counter
    parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t, learning_rate, beta1, beta2, epsilon)

Week 3: Hyperparameter tuning, Batch Normalization and Programming Frameworks

Learning Objectives

Hyperparameter tuning

Tuning process

Importance of hyperparameters (roughly):

| importance level | hyperparameters |
| --- | --- |
| first | learning rate alpha |
| second | momentum term beta; mini-batch size; number of hidden units |
| third | number of layers; learning rate decay; Adam beta1, beta2, epsilon |

Tuning tips:

Using an appropriate scale to pick hyperparameters

Search for hyperparameters on a log scale.

r = -4 * np.random.rand() # r in [-4,0]
alpha = 10**r             # alpha in [10^-4, 1]

It’s easy to extend to a more generalized case [a,b].

As for beta, use the same logarithmic scale method for 1-beta.
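For example, to sample beta in [0.9, 0.999], sample 1−beta in [0.001, 0.1] on a log scale (a sketch following the alpha example above):

import numpy as np

r = -3 + 2 * np.random.rand()   # r in [-3, -1]
beta = 1 - 10 ** r              # beta in [0.9, 0.999], with more samples close to 1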

Hyperparameters tuning in practice: Panda vs. Caviar

Batch Normalization

Normalizing activations in a network

batch-norm
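A minimal sketch of the batch norm transform applied to the pre-activations Z of one layer over a mini-batch; gamma and beta are learnable parameters, and the names are illustrative:

import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    mu = np.mean(Z, axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = np.var(Z, axis=1, keepdims=True)        # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)        # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta               # learnable scale and shift
    return Z_tilde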

Fitting Batch Norm into a neural network

batch-norm-nn

Why does Batch Norm work

Batch Norm at test time

Multi-class classification

Softmax Regression

Use softmax activation function.

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

z = [1, 0.5, -2, 1, 3]
print(softmax(z))
# array([0.09954831, 0.0603791 , 0.00495622, 0.09954831, 0.73556806])

Training a softmax classifier

Softmax regression is a generalization of logistic regression to more than two classes.
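A minimal sketch of the corresponding loss for one example, assuming y is a one-hot label vector and y_hat is the softmax output (names are illustrative):

import numpy as np

def softmax_cross_entropy_loss(y, y_hat):
    # only the log-probability of the true class contributes, since y is one-hot
    return -np.sum(y * np.log(y_hat))

y = np.array([0, 1, 0, 0])
y_hat = np.array([0.1, 0.7, 0.1, 0.1])
print(softmax_cross_entropy_loss(y, y_hat))   # -log(0.7) ≈ 0.357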

Introduction to programming frameworks

Deep learning frameworks

Choosing deep learning frameworks:

Tensorflow

import numpy as np 
import tensorflow as tf

coefficients = np.array([[1.], [-10.], [25.]])
w = tf.Variable([0],dtype=tf.float32)
x = tf.placeholder(tf.float32, [3,1])
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]    # (w-5)**2
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()
session = tf.Session()
session.run(init) 
print(session.run(w))

for i in range(1000):
  session.run(train, feed_dict={x:coefficients})
print(session.run(w))

Notes by Aaron © 2020