
Course 4: Convolutional Neural Networks

Week 1: Foundations of Convolutional Neural Networks

Learning Objectives

Convolutional Neural Networks

Computer Vision

Deep learning computer vision can now:

Deep learning for computer vision is exciting because:

For computer vision applications, you don’t want to be stuck using only tiny little images. You want to use large images. To do that, you need to better implement the convolution operation, which is one of the fundamental building blocks of convolutional neural networks.

Edge Detection Example

The convolution operation gives you a convenient way to specify how to find these vertical edges in an image.

A 3 by 3 filter or 3 by 3 matrix may look like below, and this is called a vertical edge detector or a vertical edge detection filter. In this matrix, pixels are relatively bright on the left part and relatively dark on the right part.

1, 0, -1
1, 0, -1
1, 0, -1

Convolving it with the vertical edge detection filter results in detecting the vertical edge down the middle of the image.

edge-detection
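To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the course): a 6 x 6 image that is bright (value 10) on its left half and dark (value 0) on its right half, convolved with the 3 x 3 vertical edge detection filter.

```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" convolution as used in deep learning (cross-correlation, no kernel flip).
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)  # bright left half, dark right half

vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)

print(conv2d(image, vertical_edge))
# The 4 x 4 output has large values (30) in its two middle columns:
# the vertical edge down the middle of the image has been detected.
```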

More Edge Detection

In the horizontal filter matrix below, pixels are relatively bright on the top part and relatively dark on the bottom part.

 1,  1,  1
 0,  0,  0
-1, -1, -1

Different filters allow you to find vertical and horizontal edges. The following filter is called a Sobel filter. Its advantage is that it puts a little more weight on the central row (the central pixel), which makes it a little more robust. More about Sobel filter.

1, 0, -1
2, 0, -2
1, 0, -1

Here is another filter called Scharr filter:

 3, 0, -3
10, 0, -10
 3, 0, -3

More about Scharr filter.

w1, w2, w3
w4, w5, w6
w7, w8, w9

By just letting all of these numbers be parameters and learning them automatically from data, we find that neural networks can actually learn low level features, can learn features such as edges, even more robustly than computer vision researchers are generally able to code up these things by hand.

Padding

Padding is usually applied in the convolution operation to fix two problems: (1) every time you apply a convolution the image shrinks, from n x n to (n-f+1) x (n-f+1); (2) pixels at the corners and edges of the image are used in far fewer outputs, so information near the border is thrown away.

Notations: the image is n x n, the filter is f x f, and p is the padding (the number of pixels added around the border).

Output size after convolution: (n+2p-f+1) x (n+2p-f+1)

Convention: a "valid" convolution uses no padding (p = 0); a "same" convolution pads so that the output size equals the input size, which requires p = (f-1)/2. By convention, f is usually odd.

Strided Convolutions

Notation: s denotes the stride, the number of positions the filter moves at each step.

Output size after convolution: floor((n+2p-f)/s+1) x floor((n+2p-f)/s+1)

Conventions: the filter must lie entirely within the image (plus padding), and if (n+2p-f)/s is not an integer the result is rounded down (floor).
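As a quick sanity check, here is a small helper (my own sketch) that computes the output size along one dimension from the formula above.

```python
import math

def conv_output_size(n, f, p=0, s=1):
    # floor((n + 2p - f)/s + 1): output size along one dimension
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(6, 3))             # 4 -- valid convolution (p = 0)
print(conv_output_size(6, 3, p=1))        # 6 -- same convolution (p = (f-1)/2)
print(conv_output_size(7, 3, p=0, s=2))   # 3 -- strided convolution
```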

Convolutions Over Volume

For a RGB image, the filter itself has three layers corresponding to the red, green, and blue channels.

height x width x channels

n x n x nc * f x f x nc --> (n-f+1) x (n-f+1) x nc', where nc' is the number of filters
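A small NumPy sketch (illustrative, not from the course) of a convolution over volume: a 6 x 6 x 3 RGB volume convolved with two 3 x 3 x 3 filters gives a 4 x 4 x 2 output.

```python
import numpy as np

def conv_over_volume(volume, filters):
    # volume: (n, n, nc); filters: (f, f, nc, nc_out) -> output: (n-f+1, n-f+1, nc_out)
    n, _, nc = volume.shape
    f, _, _, nc_out = filters.shape
    out = np.zeros((n - f + 1, n - f + 1, nc_out))
    for k in range(nc_out):
        for i in range(n - f + 1):
            for j in range(n - f + 1):
                out[i, j, k] = np.sum(volume[i:i + f, j:j + f, :] * filters[:, :, :, k])
    return out

rgb = np.random.rand(6, 6, 3)        # n x n x nc
filt = np.random.rand(3, 3, 3, 2)    # f x f x nc, with nc' = 2 filters
print(conv_over_volume(rgb, filt).shape)  # (4, 4, 2)
```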

One Layer of a Convolutional Network

Notations:

size                notation
filter size         f[l]
padding size        p[l]
stride size         s[l]
number of filters   nc[l]
filter shape        f[l] x f[l] x nc[l-1]
input shape         nh[l-1] x nw[l-1] x nc[l-1]
output shape        nh[l] x nw[l] x nc[l]
output height       nh[l] = floor((nh[l-1] + 2p[l] - f[l]) / s[l] + 1)
output width        nw[l] = floor((nw[l-1] + 2p[l] - f[l]) / s[l] + 1)
activations a[l]    nh[l] x nw[l] x nc[l]
activations A[l]    m x nh[l] x nw[l] x nc[l]   (for a mini-batch of m examples)
weights             f[l] x f[l] x nc[l-1] x nc[l]
bias                nc[l]   (stored as 1 x 1 x 1 x nc[l])

Simple Convolutional Network

Types of layer in a convolutional network: convolution (CONV), pooling (POOL), and fully connected (FC).

Pooling Layers

Pooling layers (usually max pooling, occasionally average pooling) shrink the height and width of the representation while keeping the detected features. Their hyperparameters are the filter size f and stride s (commonly f = 2, s = 2, which halves the height and width); pooling layers have no parameters to learn.

CNN Example

nn-example

Layer shapes of the network:

layer              activation shape   activation size   # parameters
Input              (32, 32, 3)        3072              0
CONV1 (f=5, s=1)   (28, 28, 8)        6272              608   = (5*5*3 + 1) * 8
POOL1              (14, 14, 8)        1568              0
CONV2 (f=5, s=1)   (10, 10, 16)       1600              3216  = (5*5*8 + 1) * 16
POOL2              (5, 5, 16)         400               0
FC3                (120, 1)           120               48120 = 400*120 + 120
FC4                (84, 1)            84                10164 = 120*84 + 84
softmax            (10, 1)            10                850   = 84*10 + 10
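A hedged Keras sketch of this network (layer sizes chosen to reproduce the table above; the activation functions and the use of max pooling are my assumptions):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(8, 5, strides=1, activation='relu', input_shape=(32, 32, 3)),  # CONV1 -> (28, 28, 8)
    layers.MaxPooling2D(2),                                                       # POOL1 -> (14, 14, 8)
    layers.Conv2D(16, 5, strides=1, activation='relu'),                           # CONV2 -> (10, 10, 16)
    layers.MaxPooling2D(2),                                                       # POOL2 -> (5, 5, 16)
    layers.Flatten(),                                                             # 400
    layers.Dense(120, activation='relu'),                                         # FC3
    layers.Dense(84, activation='relu'),                                          # FC4
    layers.Dense(10, activation='softmax'),                                       # softmax
])
model.summary()  # parameter counts match the table: 608, 3216, 48120, 10164, 850
```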

Why Convolutions

There are two main advantages of convolutional layers over just using fully connected layers: parameter sharing (a feature detector, such as an edge detector, that is useful in one part of the image is probably useful in another part of the image) and sparsity of connections (each output value depends only on a small number of inputs).

Through these two mechanisms, a neural network has far fewer parameters, which allows it to be trained with smaller training sets and makes it less prone to overfitting.

Week 2: Classic Networks

Learning Objectives

Case Studies

Why look at case studies

It is helpful in taking someone else’s neural network architecture and applying that to another problem.

Classic Networks

LeNet-5

LeNet-5

Some difficult points about reading the LeNet-5 paper:

AlexNet

AlexNet

VGG-16

VGG-16

ResNets

Paper: Deep Residual Learning for Image Recognition

resnet-network

resnet

Formally, denoting the desired underlying mapping as H(x), they let the stacked nonlinear layers fit another mapping of F(x):=H(x)-x. The original mapping H(x) is recast into F(x)+x. If the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart.

resnet-block
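A minimal Keras sketch of a residual (identity) block, assuming the input already has `filters` channels so the shortcut and the main path have matching shapes (the batch-normalization placement follows a common implementation, not necessarily the lecture slides):

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x
    x = layers.Conv2D(filters, 3, padding='same')(x)   # 3x3 "same" convolution
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Conv2D(filters, 3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, shortcut])                    # F(x) + x: the skip connection
    return layers.Activation('relu')(x)
```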

Why ResNets

About dimensions:

An example from the paper:

A plain network in which you input an image and then have a number of CONV layers until eventually you have a softmax output at the end.

resnet-plain-34

To turn this into a ResNet, you add the extra skip connections. Most of the convolutions are 3x3 same convolutions, which is why you are adding feature vectors of equal dimension. There are occasionally pooling layers, and in those cases you need to adjust the dimension of the shortcut, for example with the matrix W_s.

resnet-resnet-34

Practical advice on ResNets:

Networks in Networks and 1x1 Convolutions

Paper: Network in Network

conv-1x1

(image from here)

The 1×1 convolutional layer is equivalent to the fully-connected layer, when applied on a per pixel basis.

The 1x1 convolutional layer is actually doing something non-trivial: it adds non-linearity to your neural network and lets you decrease, keep the same, or increase the number of channels in your volumes.
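A tiny NumPy sketch of this equivalence: a 1x1 convolution with 16 filters over a 6 x 6 x 32 volume is just a per-pixel fully connected layer from 32 to 16 channels (the ReLU supplies the added non-linearity).

```python
import numpy as np

volume = np.random.rand(6, 6, 32)   # n x n x nc
W = np.random.rand(32, 16)          # 16 one-by-one filters, each with 32 weights

out = np.maximum(0, volume @ W)     # the same matrix multiply applied at every pixel, then ReLU
print(out.shape)                    # (6, 6, 16): channels reduced from 32 to 16
```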

Inception Network Motivation

Paper: Going Deeper with Convolutions

When designing a layer for a ConvNet, you might have to pick: do you want a 1 by 1 filter, or 3 by 3, or 5 by 5, or a pooling layer? What the inception network says is: why not do them all? This makes the network architecture more complicated, but it also works remarkably well.

inception-motivation

The basic idea is that instead of needing to pick one of these filter sizes or pooling layers and committing to it, you can do them all, concatenate all the outputs, and let the network learn whatever parameters and combinations of these filter sizes it wants to use. It turns out, however, that there is a problem with the inception layer as described here: computational cost.

The analysis of computational cost:

inception-computational-cost
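The lecture's example (numbers reproduced here from memory, so treat them as approximate): going from a 28 x 28 x 192 volume to 28 x 28 x 32 with a 5 x 5 same convolution directly, versus going through a 1 x 1 "bottleneck" layer of 16 channels first.

```python
# Multiplications for the direct 5x5 convolution: output positions x filter size x input channels
direct = 28 * 28 * 32 * 5 * 5 * 192                 # ~120 million

# With a 1x1 bottleneck layer of 16 channels in between:
bottleneck = (28 * 28 * 16 * 1 * 1 * 192            # 1x1 conv: ~2.4 million
              + 28 * 28 * 32 * 5 * 5 * 16)          # 5x5 conv: ~10 million

print(direct, bottleneck)                           # ~120M vs ~12.4M -- roughly a 10x saving
```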

Inception modules:

inception

Inception Network

inception-module

inception-network

The last few layers of the network are a fully connected layer followed by a softmax layer to make a prediction. The side branches take some hidden layer and try to use it to make a prediction. You can think of this as just another detail of the inception network, but it helps to ensure that the features computed in the hidden units, even at intermediate layers, are not too bad for predicting the output class of an image. This appears to have a regularizing effect on the inception network and helps prevent it from overfitting.

Practical advice for using ConvNets

Using Open-Source Implementation

Transfer Learning

The computer vision research community has been pretty good at posting lots of data sets on the Internet. If you hear of things like ImageNet, MS COCO, or the PASCAL datasets, these are different data sets that people have posted online and that many computer vision researchers have trained their algorithms on.

Sometimes this training takes several weeks on many GPUs. The fact that someone else has done this and gone through the painful hyperparameter search process means that you can often download open-source weights that took someone else weeks or months to obtain, and use them as a very good initialization for your own neural network.

Data Augmentation

Having more data will help all computer vision tasks.

Some common data augmentation methods in computer vision: mirroring, random cropping (and, less commonly, rotation, shearing, and local warping), and color shifting.

Color shifting: distort the R, G and B channels by adding different offsets to each. In practice, the offsets are drawn from some probability distribution. This makes your learning algorithm more robust to changes in the colors of your images.

Implementation tips:

A pretty common way of implementing data augmentation is to have one or more CPU threads responsible for loading the data and applying the distortions, and then passing the distorted mini-batches to another thread or process that does the training.

data-augmentation-implementation
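A sketch of such a pipeline using tf.image (the particular distortions and their parameters are illustrative choices, not the course's):

```python
import tensorflow as tf

def augment(image):
    image = tf.image.random_flip_left_right(image)             # mirroring
    image = tf.image.random_crop(image, size=(224, 224, 3))    # random cropping (assumes a larger input)
    image = tf.image.random_brightness(image, max_delta=0.2)   # color shifting
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    return image

# Typically applied on the fly in the input pipeline (e.g. dataset.map(augment)) on CPU threads,
# while a separate thread or process consumes the augmented batches for training.
```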

State of Computer Vision

Data vs. hand-engineering:

Two sources of knowledge: labeled data, and hand-engineering (features, network architectures, and other components).

data vs. hand-engineering

Even though data sets are getting bigger and bigger, we often just don't have as much data as we need. This is why computer vision has, historically and even today, relied more on hand-engineering, and it is also why the field has developed rather complex network architectures: in the absence of more data, the way to get good performance is to spend more time architecting, or fiddling with, the network architecture.

Tips for doing well on benchmarks/winning competitions: ensembling (train several networks independently and average their outputs) and multi-crop at test time (run the classifier on multiple versions of the test image, e.g. 10-crop, and average the results).

multi-crop

Use open source code: use architectures of networks published in the literature, use open-source implementations if possible, and use pre-trained models and fine-tune them on your dataset.

Tips for Keras

For full guidance, read the latest tutorials in the Keras documentation.

Implementations of VGG16, ResNet and Inception by Keras can be found in Francois Chollet’s GitHub repository.

Week 3: Object detection

Learning Objectives

Detection algorithms

Object Localization

object-classification-detection

object-classification-localization

Given the bounding box, you can use supervised learning to make your algorithm output not just a class label but also the four parameters that tell you where the bounding box of the detected object is.

object-classification-localization-y

The squared error is used here just to simplify the description. In practice you would probably use a log-likelihood loss for c1, c2, c3 with the softmax output.
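For concreteness, here is a hypothetical target vector with three classes (1 = pedestrian, 2 = car, 3 = motorcycle); the ordering y = [pc, bx, by, bh, bw, c1, c2, c3] follows the lecture, while the numbers are made up:

```python
import numpy as np

y_car = np.array([1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0])   # object present: a car centered at (0.5, 0.7)
y_background = np.array([0] + [np.nan] * 7)           # no object: pc = 0, the other components are "don't care"
```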

Landmark Detection

In more general cases, you can have a neural network just output x and y coordinates of important points in image, sometimes called landmarks.

landmark-detection

If you are interested in people pose detection, you could also define a few key positions like the midpoint of the chest, the left shoulder, left elbow, the wrist, and so on.

The identity of landmark one must be consistent across different images like maybe landmark one is always this corner of the eye, landmark two is always this corner of the eye, landmark three, landmark four, and so on.

Object Detection

sliding windows detection

The disadvantage of sliding windows detection is its computational cost: using a coarse granularity or a large stride reduces the number of windows but means you may not localize the objects accurately, while a very fine granularity or very small stride is computationally expensive.

Convolutional Implementation of Sliding Windows

To build up towards the convolutional implementation of sliding windows, let's first see how to turn fully connected layers in a neural network into convolutional layers.

Turn FC into CONV layers

What the convolutional implementation of sliding windows does is allow the separate window passes of the ConvNet to share a lot of computation. Instead of running the ConvNet sequentially on each window, you run it once on the entire image (say 28 by 28) and make all the predictions at the same time.

convolutional implementation of sliding windows
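A Keras sketch of the idea, with layer sizes following the lecture's toy example (14 x 14 x 3 training windows, 4 output classes); the point is that the former fully connected layers become 5 x 5 and 1 x 1 convolutions, so the network can be applied to a larger image in a single pass:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(16, 5, activation='relu', input_shape=(None, None, 3)),  # 14x14x3 -> 10x10x16
    layers.MaxPooling2D(2),                                                # -> 5x5x16
    layers.Conv2D(400, 5, activation='relu'),   # was FC(400): a 5x5 conv turns 5x5x16 into 1x1x400
    layers.Conv2D(400, 1, activation='relu'),   # was FC(400): now a 1x1 convolution
    layers.Conv2D(4, 1, activation='softmax'),  # was the softmax over 4 classes: 1x1x4 per window
])
# On a 14x14 input the output is 1x1x4; on a 16x16 input it is 2x2x4 --
# one prediction per sliding-window position, all computed in one forward pass.
```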

Bounding Box Predictions (YOLO)

The convolutional implementation of sliding windows is more computationally efficient, but it still doesn't output the most accurate bounding boxes: the perfect bounding box isn't even quite square; it may be a slightly wider rectangle with a horizontal aspect ratio.

YOLO

YOLO algorithm:

The basic idea is you’re going to take the image classification and localization algorithm and apply that to each of the nine grid cells of the image. If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.

The advantage of this algorithm is that the neural network outputs precise bounding boxes: bx and by are the coordinates of the object's midpoint relative to the grid cell (between 0 and 1), and bh and bw are the height and width of the box relative to the grid cell size (and can be greater than 1).

Intersection Over Union

IoU is a measure of the overlap between two bounding boxes. When assessing the output, the higher the IoU the more accurate the bounding box (by convention, a prediction is often counted as correct if IoU ≥ 0.5). IoU is also a useful tool for the YOLO algorithm to discard redundant bounding boxes.

IoU
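A straightforward implementation sketch, with boxes given by their upper-left and lower-right corners:

```python
def iou(box1, box2):
    # boxes are (x1, y1, x2, y2); returns intersection area / union area
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.143
```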

Non-max Suppression

One of the problems of Object Detection as you’ve learned about this so far, is that your algorithm may find multiple detections of the same objects. Rather than detecting an object just once, it might detect it multiple times. Non-max suppression is a way for you to make sure that your algorithm detects each object only once.

Non-max

If you are trying to detect three classes of objects, say pedestrians, cars, and motorcycles, then the output vector will have three additional components. It turns out the right thing to do is to independently carry out non-max suppression three times, once for each of the output classes. A sketch of the procedure follows.
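A greedy implementation sketch for a single class, reusing the iou helper above; the threshold is a typical value, not one prescribed by the course:

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # boxes: list of (x1, y1, x2, y2); scores: their detection probabilities (pc).
    # Keep the highest-scoring box, discard boxes that overlap it too much, repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep  # indices of the surviving boxes; run once per class
```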

Anchor Boxes

One of the problems with object detection as you have seen it so far is that each of the grid cells can detect only one object. What if a grid cell wants to detect multiple objects? This is what the idea of anchor boxes does.

Anchor box algorithm:

Previously: each object in the training image is assigned to the grid cell that contains that object's midpoint. Output y: 3 x 3 x 8.

With two anchor boxes: each object in the training image is assigned to the grid cell that contains the object's midpoint and to the anchor box for that grid cell with the highest IoU. Output y: 3 x 3 x 16 (or 3 x 3 x 2 x 8).

anchor box

YOLO Algorithm

YOLO algorithm steps: for each grid cell, get the predicted bounding boxes (two per cell with two anchor boxes); get rid of low-probability predictions; for each class, use non-max suppression to generate the final predictions.

yolo-algorithm

(Optional) Region Proposals

algorithm      description
R-CNN          Propose regions (using a segmentation algorithm). Classify the proposed regions one at a time. Output a label + bounding box. One downside of R-CNN is that it is quite slow.
Fast R-CNN     Propose regions. Use a convolutional implementation of sliding windows to classify all the proposed regions. One remaining problem is that the clustering step that proposes the regions is still quite slow.
Faster R-CNN   Use a convolutional network to propose regions. (Most implementations are usually still quite a bit slower than the YOLO algorithm.)

Week 4: Special applications: Face recognition & Neural style transfer

Discover how CNNs can be applied to multiple fields, including art generation and face recognition. Implement your own algorithm to generate art and recognize faces.

Face Recognition

What is face recognition

Face verification: given an input image and a name or ID, output whether the input image is that of the claimed person (a 1:1 problem). Face recognition: given a database of K persons and an input image, output the ID if the image is of any of the K persons (a 1:K problem, and harder).

One Shot Learning

One-shot learning problem: to recognize a person given just one single image.

Siamese network

A good way to implement a similarity function d(img1, img2) is to use a Siamese network.

siamese-network

In a Siamese network, instead of making a classification by a softmax unit, we focus on the vector computed by a fully connected layer as an encoding of the input image x1.

Goal of learning: learn parameters so that if x(i) and x(j) are pictures of the same person, ‖f(x(i)) − f(x(j))‖^2 is small, and if they are pictures of different persons, it is large.

Triplet Loss

One way to learn the parameters of the neural network, so that it gives you a good encoding for your pictures of faces, is to define and apply gradient descent on the triplet loss function.

In the terminology of the triplet loss, you always look at one anchor image, and you want the distance between the anchor and the positive image (a positive example, meaning the same person) to be small, whereas you want the distance between the anchor and the negative example (a different person) to be much larger. You'll always be looking at three images at a time: an anchor image (A), a positive image (P), and a negative image (N).

As before we have d(A,P)=‖f(A)−f(P)‖^2 and d(A,N)=‖f(A)−f(N)‖^2, the learning objective is to have d(A,P) ≤ d(A,N). But if f always equals zero or f always outputs the same, i.e., the encoding for every image is identical, the objective is easily achieved, which is not what we want. So we need to add an 𝛼 to the left, a margin, which is a terminology you can see on support vector machines.

The learning objective:

d(A,P) + 𝛼 ≤ d(A,N) or d(A,P) - d(A,N) + 𝛼 ≤ 0

Loss function:

Given 3 images A,P,N:
L(A,P,N) = max(d(A,P) - d(A,N) + 𝛼, 0)
J = sum(L(A[i],P[i],N[i]))
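A NumPy sketch of the loss for a single triplet of encodings (α = 0.2 is just a typical margin, not a prescribed value):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # f_a, f_p, f_n: encodings of the anchor, positive and negative images (e.g. 128-d vectors)
    d_ap = np.sum((f_a - f_p) ** 2)   # d(A, P)
    d_an = np.sum((f_a - f_n) ** 2)   # d(A, N)
    return max(d_ap - d_an + alpha, 0.0)
```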

You do need a dataset where you have multiple pictures of the same person. If you had just one picture of each person, then you can’t actually train this system.

Face Verification and Binary Classification

The triplet loss is one good way to learn the parameters of a ConvNet for face recognition. Face recognition can also be posed as a straight binary classification problem: take a pair of neural networks (a Siamese network), have both compute embeddings (maybe 128-dimensional or even higher-dimensional), and feed the pair of embeddings into a logistic regression unit to make a prediction. The output is one if both images are of the same person and zero if they are of different people.

face-recognition
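A hedged Keras sketch: `encoder` stands for a hypothetical pre-trained embedding network (such as the Siamese ConvNet above), shared by both branches; the logistic unit is trained on the element-wise absolute difference of the two embeddings (other variants, such as a chi-squared similarity, also work).

```python
import tensorflow as tf
from tensorflow.keras import layers, Input, Model

def build_verifier(encoder, image_shape=(96, 96, 3)):
    img1, img2 = Input(image_shape), Input(image_shape)
    emb1, emb2 = encoder(img1), encoder(img2)                   # shared weights: the same encoder twice
    diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([emb1, emb2])
    same = layers.Dense(1, activation='sigmoid')(diff)          # 1 = same person, 0 = different
    return Model([img1, img2], same)
```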

Implementation tips:

Instead of having to compute the encoding of the stored database images every single time, you can pre-compute them, which saves a significant amount of computation.

Summary of Face Recognition

Key points to remember:

More references:

Neural Style Transfer

What is neural style transfer

Paper: A Neural Algorithm of Artistic Style

neural style transfer

In order to implement Neural Style Transfer, you need to look at the features extracted by ConvNet at various layers, the shallow and the deeper layers of a ConvNet.

What are deep ConvNets learning

Paper: Visualizing and Understanding Convolutional Networks

visualizing network

Cost Function

Neural style transfer cost function:

J(G) = alpha * J_content(C, G) + beta * J_style(S, G)

Find the generated image G:

  1. Initialize G randomly, G: 100 x 100 x 3
  2. Use gradient descent to minimize J(G)

Content Cost Function

J_content(C, G) = 1/2 * ‖a[l](C) − a[l](G)‖^2

where a[l](C) and a[l](G) are the activations of a chosen hidden layer l (typically somewhere in the middle of the network) on the content image C and the generated image G.

Style Cost Function

Style is defined as correlation between activations across channels.

style-cost1

style-cost2

style-cost3
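A NumPy sketch of the style ("Gram") matrix and the style cost for one layer; the normalization constant follows the lecture's formula, as best I recall it:

```python
import numpy as np

def gram_matrix(a):
    # a: activations of one layer, shape (n_H, n_W, n_C).
    # G[k, k'] sums a[..., k] * a[..., k'] over all positions: channel-to-channel correlations.
    n_h, n_w, n_c = a.shape
    a_unrolled = a.reshape(n_h * n_w, n_c)
    return a_unrolled.T @ a_unrolled              # shape (n_C, n_C)

def layer_style_cost(a_S, a_G):
    # Squared Frobenius distance between the style matrices of the style image S
    # and the generated image G, normalized by (2 * n_H * n_W * n_C)^2.
    n_h, n_w, n_c = a_S.shape
    return np.sum((gram_matrix(a_S) - gram_matrix(a_G)) ** 2) / (2 * n_h * n_w * n_c) ** 2
```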

1D and 3D Generalizations

ConvNets can apply not just to 2D images but also to 1D data as well as to 3D data.

For 1D data, like an ECG (electrocardiogram) signal, the data is a time series showing the voltage at each instant in time; suppose we have a 14 x 1 input. For many 1D applications people use recurrent neural networks (covered in the next course), but a 1D ConvNet works as well:

14 x 1 * 5 x 1 --> 10 x 16 (16 filters)

For 3D data, we can think the data has some height, some width, and then also some depth. For example, we want to apply a ConvNet to detect features in a 3D CT scan, for simplifying purpose, we have 14 x 14 x 14 input here.

14 x 14 x 14 x 1 * 5 x 5 x 5 x 1 --> 10 x 10 x 10 x 16 (16 filters)

Other 3D data can be movie data where the different slices could be different slices in time through a movie. We could use ConvNets to detect motion or people taking actions in movies.


Notes by Aaron © 2020