
Course 3: Structuring Machine Learning Projects

Week 1: ML Strategy (1)

Learning Objectives

Introduction to ML Strategy

Why ML Strategy

Ideas to improve a machine learning system include: collecting more data, training the algorithm longer, trying a bigger or smaller network, trying dropout or L2 regularization, and changing the network architecture.

To have quick and effective ways of figuring out which of these ideas (and perhaps others) are worth pursuing and which can safely be discarded, we need an ML strategy.

Orthogonalization

In the example of TV tuning knobs, orthogonalization refers to the designers building the knobs so that each one controls only one property of the picture.

In a car, the steering wheel controls the angle and the accelerator and brake control the speed. If instead there were two controllers that each affected both angle and speed simultaneously, it would be much harder to set the car to the speed and angle we want:

controller 1: 0.3 * angle - 0.8 * speed
controller 2: 2 * angle + 0.9 * speed

Orthogonal means at 90 degrees to each other. Having orthogonal controls, ideally aligned with the things we actually want to control, makes it much easier to tune each knob: the steering wheel for the angle, and the accelerator and brake for the speed.
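The contrast can be made concrete as a small linear system. In this NumPy sketch the 0.3 / -0.8 / 2 / 0.9 coefficients come from the example above, while the target angle and speed are made up for illustration:

```python
import numpy as np

# Coupled controls: each controller affects both angle and speed.
# [delta_angle]   [0.3  -0.8] [control_1]
# [delta_speed] = [2.0   0.9] [control_2]
M_coupled = np.array([[0.3, -0.8],
                      [2.0,  0.9]])

# Orthogonal controls: the steering wheel only changes the angle,
# the accelerator/brake only change the speed.
M_orthogonal = np.eye(2)

target = np.array([10.0, 5.0])  # desired (angle, speed), made up

# With orthogonal controls, each knob setting is just the target itself...
settings_orth = np.linalg.solve(M_orthogonal, target)
print(settings_orth)  # -> [10.  5.]

# ...while coupled controls force us to solve a joint system.
settings_coupled = np.linalg.solve(M_coupled, target)
```

Both systems can reach the target, but only the orthogonal one lets each knob be tuned independently.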

| chain of assumptions in ML | knobs to tune |
| --- | --- |
| Fit training set well on cost function | bigger network; better optimization algorithm (e.g. Adam) |
| Fit dev set well on cost function | regularization; bigger training set |
| Fit test set well on cost function | bigger dev set |
| Performs well in real world | change dev set or cost function (dev/test set distribution is not right, or the cost function is not measuring the right thing) |

Early stopping, though not a bad technique, is a knob that simultaneously affects training set and dev set performance and is therefore less orthogonalized, so Andrew tends not to use it.

Setting up your goal

Single number evaluation metric

A single number evaluation metric allows you to quickly tell whether classifier A or classifier B is better; having a dev set plus a single number evaluation metric therefore tends to speed up iteration.

| metric | calculation | definition |
| --- | --- | --- |
| Precision | P = TP / (TP + FP) | percentage of predicted positives that are true positives |
| Recall | R = TP / (TP + FN) | percentage of all real positives that are predicted positive |
| F1 score | F1 = 2PR / (P + R), i.e. 1/F1 = (1/P + 1/R) / 2 | harmonic mean of precision and recall |
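A minimal sketch of these formulas, using hypothetical confusion-matrix counts (80 true positives, 20 false positives, 20 false negatives):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # fraction of predicted positives that are real
    recall = tp / (tp + fn)      # fraction of real positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts: p = r = 0.8, and F1, their harmonic mean, is also 0.8.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```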

Satisficing and optimizing metric

If we care about the classification accuracy of our cat classifier and also about the running time or some other aspect of performance, then instead of combining them into one overall metric via an artificial linear weighted sum, we can treat one of them as the optimizing metric and the others as satisficing metrics (constraints that merely need to be met).
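The selection rule can be sketched in a few lines; the classifier names, accuracies, running times, and the 100 ms threshold below are all made up for illustration:

```python
# Hypothetical classifiers: (name, accuracy, running time in ms).
classifiers = [
    ("A", 0.90, 80),
    ("B", 0.92, 95),
    ("C", 0.95, 1500),  # most accurate, but far too slow
]

# Satisficing metric: running time must be <= 100 ms.
feasible = [c for c in classifiers if c[2] <= 100]

# Optimizing metric: among the feasible ones, maximize accuracy.
best = max(feasible, key=lambda c: c[1])
print(best[0])  # -> B
```

Classifier C would win on accuracy alone, but the satisficing constraint rules it out.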

Train/dev/test distributions

Guideline:

Size of the dev and test sets

When to change dev/test sets and metrics

In an example of a cat classification system, classification error might not be a reasonable metric if two algorithms have the following performance:

| algorithm | classification error | issue | review |
| --- | --- | --- | --- |
| Algorithm A | 3% | lets through lots of pornographic images | showing pornographic images to users is intolerable |
| Algorithm B | 5% | no pornographic images | classifies fewer images correctly, but acceptable |

In this case, the metric should be modified. One way to change the evaluation metric is to add weight terms:

| metric | calculation | notation |
| --- | --- | --- |
| classification error | $\text{Error} = \dfrac{1}{m_{dev}} \sum_{i=1}^{m_{dev}} \mathcal{I}\{\hat{y}^{(i)} \neq y^{(i)}\}$ | $\mathcal{I}$ is the indicator function, counting mislabeled examples |
| weighted classification error | $\text{Error} = \dfrac{1}{\sum_i w^{(i)}} \sum_{i=1}^{m_{dev}} w^{(i)} \mathcal{I}\{\hat{y}^{(i)} \neq y^{(i)}\}$ | e.g. $w^{(i)} = 1$ if $x^{(i)}$ is non-pornographic, $w^{(i)} = 10$ if $x^{(i)}$ is pornographic |

So if you find that your evaluation metric is not giving the correct rank-order preference for which algorithm is actually better, that is the time to define a new evaluation metric.
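A minimal sketch of the weighted error, using made-up labels (the weight of 10 for pornographic images follows the example above):

```python
def weighted_error(y_pred, y_true, is_porn, porn_weight=10.0):
    """Weighted classification error: a mistake on a pornographic image
    counts porn_weight times as much as an ordinary mistake."""
    total_w, err_w = 0.0, 0.0
    for yp, yt, porn in zip(y_pred, y_true, is_porn):
        w = porn_weight if porn else 1.0
        total_w += w
        if yp != yt:
            err_w += w
    return err_w / total_w

# Hypothetical dev set of 4 images; the one mistake is on a porn image,
# so the weighted error (10/13) is far worse than the unweighted one (1/4).
err = weighted_error(y_pred=[1, 1, 0, 0],
                     y_true=[1, 0, 0, 0],
                     is_porn=[False, True, False, False])
```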

This is actually an example of orthogonalization: take the machine learning problem and break it into distinct steps — first define the metric (place the target), then separately figure out how to do well on that metric (aim and shoot at it).

The overall guideline is if your current metric and data you are evaluating on doesn’t correspond to doing well on what you actually care about, then change your metrics and/or your dev/test set to better capture what you need your algorithm to actually do well on.

Comparing to human-level performance

Why human-level performance

A lot more machine learning teams have been talking about comparing the machine learning systems to human-level performance.

The graph below shows the performance of humans and machine learning over time.

human-performance

Machine learning progresses slowly once it surpasses human-level performance. One of the reasons is that human-level performance can be close to the Bayes optimal error, especially for natural perception problems.

Bayes optimal error is defined as the best possible error: no function mapping from x to y can surpass that level of accuracy.

Also, as long as machine learning performs worse than humans, we can improve it with certain tools; these tools become harder to use once it surpasses human-level performance.

These tools are:

Avoidable bias

By knowing what the human-level performance is, it is possible to tell whether a model is performing well on the training set or not.

| performance | Scenario A | Scenario B |
| --- | --- | --- |
| humans | 1% | 7.5% |
| training error | 8% | 8% |
| development error | 10% | 10% |

Here we use the human-level error as a proxy for the Bayes error, since humans are good at identifying images. You cannot drive the training error below the Bayes error without overfitting. In Scenario A the avoidable bias (training error minus human-level error) is 7% while the variance (dev error minus training error) is 2%, so bias-reduction tactics should come first; in Scenario B the avoidable bias is only 0.5% and the variance is 2%, so variance reduction is the priority. Knowing the Bayes error therefore makes it easier to decide whether bias-avoidance or variance-avoidance tactics will improve the model.
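The decision rule for the two scenarios in the table (errors in percent) can be sketched as:

```python
def diagnose(human_error, train_error, dev_error):
    """Use human-level error as a proxy for Bayes error to decide whether
    bias reduction or variance reduction should be the next focus."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    return "bias" if avoidable_bias > variance else "variance"

print(diagnose(1.0, 8.0, 10.0))  # Scenario A -> bias (7% bias vs 2% variance)
print(diagnose(7.5, 8.0, 10.0))  # Scenario B -> variance (0.5% vs 2%)
```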

Understanding human-level performance

Summary of bias/variance with human-level performance: human-level error serves as a proxy for Bayes error; the gap between training error and human-level error is the avoidable bias; the gap between dev error and training error is the variance.

Surpassing human-level performance

Classification task performance (classification error):

| performance | Scenario A | Scenario B |
| --- | --- | --- |
| Team of humans | 0.5% | 0.5% |
| One human | 1.0% | 1.0% |
| Training error | 0.6% | 0.3% |
| Development error | 0.8% | 0.4% |
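Once training error dips below the best human-level proxy, the avoidable-bias estimate turns negative and it becomes unclear how much further improvement is even possible. A minimal sketch with the numbers from the table above (in percent):

```python
def avoidable_bias(best_human_error, train_error):
    """Estimate avoidable bias, using the best available human-level error
    (here, a team of humans) as the proxy for Bayes error."""
    return train_error - best_human_error

bias_a = avoidable_bias(0.5, 0.6)  # Scenario A: ~0.1%, still room to cut bias
bias_b = avoidable_bias(0.5, 0.3)  # Scenario B: negative, human-level surpassed
```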

There are many problems where machine learning significantly surpasses human-level performance, especially with structured data:

| problem | structured data |
| --- | --- |
| Online advertising | database of what users have clicked on |
| Product recommendations | database of products users have bought before |
| Logistics (predicting transit time) | database of how long it takes to get from A to B |
| Loan approvals | database of previous loan applications and their outcomes |

These are not natural perception problems — not computer vision, speech recognition, or natural language processing tasks. Humans tend to be very good at natural perception tasks, so it is possible, but harder, for computers to surpass human-level performance on them.

Improving your model performance

There are two fundamental assumptions of supervised learning: (1) you can fit the training set pretty well, i.e. achieve low avoidable bias; (2) training set performance generalizes pretty well to the dev/test set, i.e. the variance is not too bad.

improve-model-performance

Week 2: ML Strategy (2)

Learning Objectives

Error Analysis

Carrying out error analysis

To carry out error analysis, you should:

Cleaning up incorrectly labeled data

Some facts:

Correcting incorrect dev/test set examples:

Build your first system quickly, then iterate

Depending on the area of application, the guideline below will help you prioritize when you build your system.

Guideline:

  1. Set up development/test set and metrics
    1. Set up a target
  2. Build an initial system quickly
    1. Training set: Fit the parameters (train quickly)
    2. Development set: Tune the parameters
    3. Test set: Assess the performance
  3. Use bias/variance analysis & error analysis to prioritize next steps

Mismatched training and dev/test set

Training and testing on different distributions

In the Cat vs Non-cat example, there are two sources of data used to develop the mobile app.

The guideline is to choose a dev set and test set that reflect the data you expect to get in the future and on which you consider it important to do well.

data-on-diff-dist

Bias and Variance with mismatched data distributions

Instead of just having bias and variance as two potential problems, you now have a third potential problem, data mismatch.

bias-variance-mismatched

bias-variance-mismatched-1

Addressing data mismatch

This is a general guideline to address data mismatch:

Learning from multiple tasks

Transfer learning

Transfer learning refers to reusing the knowledge a neural network has learned on one task for another application.

When to use transfer learning:

Example 1: Cat recognition - radiology diagnosis

The following neural network is trained for cat recognition, but we want to adapt it for radiology diagnosis. The neural network will learn about the structure and the nature of images. This initial phase of training on image recognition is called pre-training, since it will pre-initialize the weights of the neural network. Updating all the weights afterwards is called fine-tuning.
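A minimal NumPy sketch of the idea: a frozen, "pre-trained" layer acts as a fixed feature extractor while only a re-initialized output layer is fine-tuned. The pre-trained weights and the tiny "radiology" dataset here are synthetic stand-ins, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these weights come from a network pre-trained on cat images
# (random stand-ins here; in practice they are learned on the big dataset).
W_pretrained = rng.normal(size=(4, 3))  # hidden layer: 3 inputs -> 4 features

def features(X):
    # Frozen pre-trained layer used as a fixed feature extractor (ReLU).
    return np.maximum(0, X @ W_pretrained.T)

# Tiny synthetic "radiology" dataset (made up for the sketch).
X = rng.normal(size=(50, 3))
y = (X[:, 0] > 0).astype(float)

# Fine-tuning: re-initialize and train ONLY the new output layer
# (logistic regression on top of the frozen features).
w_out = np.zeros(4)
b_out = 0.0
for _ in range(500):
    z = features(X) @ w_out + b_out
    p = 1.0 / (1.0 + np.exp(-z))  # sigmoid output
    grad = p - y                  # gradient of the logistic loss w.r.t. z
    w_out -= 0.1 * features(X).T @ grad / len(X)
    b_out -= 0.1 * grad.mean()

# Final training loss should be well below the 0.693 of an untrained layer.
z = features(X) @ w_out + b_out
p = 1.0 / (1.0 + np.exp(-z))
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()
```

With lots of target-task data you could instead unfreeze `W_pretrained` and update all the weights (full fine-tuning).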

Guideline:

transfer-learning

Multi-task learning

Multi-task learning refers to having one neural network perform several tasks simultaneously.

When to use multi-task learning:

multi-task
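A sketch of the multi-task loss, assuming a four-task detection setting (pedestrian, car, stop sign, traffic light); the label matrix and network outputs below are made up. Entries marked NaN were not annotated by the labeler and are simply skipped, which is what lets multi-task learning use partially labeled data:

```python
import numpy as np

# Labels for 3 images x 4 tasks; NaN marks entries the labeler skipped.
Y = np.array([[1.0,    0.0, np.nan, 1.0],
              [0.0,    1.0, 1.0,    np.nan],
              [np.nan, 0.0, 0.0,    1.0]])

# Hypothetical network outputs (one sigmoid probability per task).
P = np.array([[0.9, 0.2, 0.5, 0.8],
              [0.1, 0.7, 0.6, 0.5],
              [0.5, 0.3, 0.2, 0.9]])

# Multi-task logistic loss, averaged over labeled entries only.
mask = ~np.isnan(Y)
losses = -(Y[mask] * np.log(P[mask]) + (1 - Y[mask]) * np.log(1 - P[mask]))
loss = losses.mean()
print(round(loss, 3))
```

Unlike softmax classification, each image can carry several positive labels at once; the tasks share the earlier layers of the network.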

End-to-end deep learning

What is end-to-end deep learning

end-to-end

Whether to use end-to-end deep learning

Before applying end-to-end deep learning, you need to ask yourself the following question: do you have enough data to learn a function of the complexity needed to map x to y?

Pros:

Cons:


Notes by Aaron © 2020