
Course 5: Sequence Models

Week 1: Recurrent Neural Networks

Learn about recurrent neural networks. This type of model has been proven to perform extremely well on temporal data. It has several variants including LSTMs, GRUs and Bidirectional RNNs, which you are going to learn about in this section.

Recurrent Neural Networks

Why sequence models

Examples of sequence data: speech recognition, music generation, sentiment classification, DNA sequence analysis, machine translation, video activity recognition, and named entity recognition.

Notation

As a motivating example, consider the problem of Named Entity Recognition (NER), where we use the following notation: x<t> denotes the t-th word of the input sequence, y<t> the t-th output label, T_x and T_y the lengths of the input and output sequences, and x^(i)<t> the t-th word of the i-th training example.

The word representation introduced here is the one-hot representation.

For a word not in your vocabulary, we create a new token (a fake word) called the unknown word, denoted <UNK>.

Recurrent Neural Network Model

If we build a standard neural network to learn the mapping from x to y using the one-hot representation of each word as input, it does not work well. There are two main problems: the inputs and outputs can have different lengths in different examples, and a standard network does not share features learned across different positions in the text.

Recurrent Neural Networks:

rnn-forward

Instead of carrying around two parameter matrices Waa and Wax, we can simplify the notation by compressing them into just one parameter matrix Wa.

rnn-notation
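As a concrete illustration of the forward pass, here is a minimal NumPy sketch of a single RNN step; variable and dimension names are mine, not from the notes:

```python
import numpy as np

def rnn_step(a_prev, x_t, Waa, Wax, Wya, ba, by):
    """One forward step: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba), y<t> = softmax(Wya a<t> + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    z = Wya @ a_t + by
    y_t = np.exp(z - z.max()) / np.sum(np.exp(z - z.max()))   # softmax over the vocabulary
    return a_t, y_t

# Equivalent "compressed" notation: stack Waa and Wax side by side into one matrix Wa
# and concatenate a<t-1> with x<t>:  a<t> = tanh(Wa [a<t-1>; x<t>] + ba).
```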

Backpropagation through time

In the backpropagation procedure, the most significant recursive calculation runs from right to left through the sequence, which is why it is called backpropagation through time.

Different types of RNNs

There are different types of RNN architectures: one-to-one, one-to-many (e.g. music generation), many-to-one (e.g. sentiment classification), and many-to-many with equal lengths (e.g. named entity recognition) or different lengths (e.g. machine translation):

rnn-type

See Andrej Karpathy's blog post on RNNs, The Unreasonable Effectiveness of Recurrent Neural Networks, for more details.

Language model and sequence generation

A language model tells you the probability of a particular sentence.

For example, we have two sentences from speech recognition application:

| sentence | probability |
| --- | --- |
| The apple and pair salad. | P(The apple and pair salad) = 3.2 × 10⁻¹³ |
| The apple and pear salad. | P(The apple and pear salad) = 5.7 × 10⁻¹⁰ |

For a language model it is useful to represent a sentence as outputs y rather than inputs x. What the language model does is estimate the probability of a particular sequence of words, P(y<1>, y<2>, ..., y<T_y>).

How to build a language model?

Example: "Cats average 15 hours of sleep a day. <EOS>" has 9 tokens in total, counting <EOS>.

language model

If you train this RNN on a large training set, you can: (1) compute the probability of any given sentence, and (2) sample novel sentences from the learned distribution.

Sampling novel sequences

After you train a sequence model, one way you can informally get a sense of what is learned is to have it sample novel sequences.

How to generate a randomly chosen sentence from your RNN language model:
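Sampling works like this: start with x<1> = 0 and a<0> = 0, sample y<1> from the softmax output, feed the sampled word back in as the next input, and repeat until <EOS> (or a maximum length) is reached. A minimal NumPy sketch, assuming a trained step function rnn_step(a_prev, x) that returns the new activation and the softmax distribution over the vocabulary (the function and variable names are illustrative):

```python
import numpy as np

def sample_sentence(rnn_step, a0, vocab, eos_index, max_len=50):
    """Sample words one at a time, feeding each sampled word back in as the next input."""
    a_prev = a0
    x = np.zeros((len(vocab), 1))          # first input x<1> is the zero vector
    indices = []
    for _ in range(max_len):
        a_prev, y = rnn_step(a_prev, x)    # y: softmax probabilities over the vocabulary
        idx = np.random.choice(len(vocab), p=y.ravel())  # sample according to the distribution
        indices.append(idx)
        if idx == eos_index:               # stop once <EOS> is sampled
            break
        x = np.zeros((len(vocab), 1))      # the sampled word becomes the next input (one-hot)
        x[idx] = 1
    return [vocab[i] for i in indices]
```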

Character level language model:

If you build a character-level language model rather than a word-level language model, then your sequence y<1>, y<2>, y<3>, ... consists of the individual characters in your training data rather than the individual words. A character-level model has some pros and cons: you never need an <UNK> token, but the sequences become much longer, so long-range dependencies are harder to capture and training is more computationally expensive. As computers get faster, there are more and more applications where people, at least in some specialized cases, are starting to look at character-level models.

Vanishing gradients with RNNs

A basic RNN is not very good at capturing long-range dependencies: over many time steps, gradients vanish and an output ends up being influenced mainly by inputs close to it in the sequence. Exploding gradients are easier to spot (the parameters blow up, often to NaN) and can be handled with gradient clipping, but vanishing gradients call for architectural changes such as the GRU and LSTM below.

Gated Recurrent Unit (GRU)

The Gated Recurrent Unit (GRU) is one of the ideas that has enabled RNNs to become much better at capturing very long-range dependencies and has made RNNs much more effective.

A visualization of the RNN unit of the hidden layer of the RNN in terms of a picture:

rnn-unit

GRU
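A minimal NumPy sketch of one full GRU step, following the lecture's equations with an update gate Γ_u and a relevance gate Γ_r (variable names are mine, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, Wu, Wr, Wc, bu, br, bc):
    concat = np.vstack([c_prev, x_t])                    # [c<t-1>; x<t>]
    gamma_u = sigmoid(Wu @ concat + bu)                  # update gate: overwrite the memory or not
    gamma_r = sigmoid(Wr @ concat + br)                  # relevance gate
    c_tilde = np.tanh(Wc @ np.vstack([gamma_r * c_prev, x_t]) + bc)  # candidate memory value
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev     # keep or replace the memory cell
    return c_t                                           # for a GRU, a<t> = c<t>
```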

Implementation tips:

Long Short Term Memory (LSTM)

A nicely illustrated explanation: Understanding LSTM Networks (Christopher Olah's blog post).

LSTM-units

One cool thing to notice is the line running along the top of the diagram: as long as you set the forget and update gates appropriately, it is relatively easy for the LSTM to take some value c<0> and pass it all the way to the right, so that, say, c<3> equals c<0>. This is why the LSTM, like the GRU, is very good at memorizing certain values in the memory cell over many, many time steps.

LSTM
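For reference, the LSTM equations in the same notation, with separate update, forget, and output gates (⊙ is element-wise multiplication):

```latex
\begin{aligned}
\tilde{c}^{\langle t \rangle} &= \tanh\left(W_c\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c\right) \\
\Gamma_u &= \sigma\left(W_u\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u\right) \\
\Gamma_f &= \sigma\left(W_f\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f\right) \\
\Gamma_o &= \sigma\left(W_o\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o\right) \\
c^{\langle t \rangle} &= \Gamma_u \odot \tilde{c}^{\langle t \rangle} + \Gamma_f \odot c^{\langle t-1 \rangle} \\
a^{\langle t \rangle} &= \Gamma_o \odot \tanh\left(c^{\langle t \rangle}\right)
\end{aligned}
```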

One common variation of the LSTM is to add peephole connections, where the gate values depend not only on a<t-1> and x<t> but also on the previous memory cell value c<t-1>.

GRU vs. LSTM: the GRU is simpler (two gates instead of three), so it is computationally cheaper and easier to scale to bigger models, while the LSTM is more powerful and flexible and remains the historically proven default choice.

Implementation tips:

Bidirectional RNN

RNN-ner

BRNN

Disadvantage:

The disadvantage of the bidirectional RNN is that you need the entire sequence of data before you can make predictions anywhere. For example, if you are building a speech recognition system, a BRNN lets you take into account the entire speech utterance, but with this straightforward implementation you have to wait for the person to stop talking and collect the whole utterance before you can process it and make a prediction. Real-time speech recognition applications therefore use somewhat more complex modules rather than just the standard bidirectional RNN shown here.
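As a sketch of how a bidirectional RNN might be wired up in practice for a per-word tagging task like the NER example above, here is a tf.keras version (tf.keras is not referenced in the notes, and the layer sizes are arbitrary assumptions):

```python
import tensorflow as tf

# A hypothetical BRNN tagger: one output per time step, so return_sequences=True.
brnn_tagger = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 300)),                                   # (T_x, embedding_dim)
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1, activation="sigmoid")),  # is-a-name? per word
])
```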

Deep RNNs

DRNN
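A minimal sketch of a deep (stacked) RNN in tf.keras, again with arbitrary sizes: each recurrent layer passes its full sequence of activations up to the next layer, and only the top of the stack feeds the output.

```python
import tensorflow as tf

deep_rnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 300)),
    tf.keras.layers.GRU(128, return_sequences=True),    # layer 1: full sequence of activations
    tf.keras.layers.GRU(128, return_sequences=True),    # layer 2
    tf.keras.layers.GRU(128),                            # layer 3: only the last activation
    tf.keras.layers.Dense(10, activation="softmax"),     # task-specific output head
])
```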

Week 2: Natural Language Processing & Word Embeddings

Natural language processing with deep learning is an important combination. Using word vector representations and embedding layers, you can train recurrent neural networks with outstanding performance in a wide variety of industries. Examples of applications are sentiment analysis, named entity recognition, and machine translation.

Introduction to Word Embeddings

Word Representation

Using word embeddings

Word embeddings tend to make the biggest difference when the task you’re trying to carry out has a relatively smaller training set.

Word embedding vs. face recognition encoding:

Properties of word embeddings
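One key property is that analogies correspond to vector arithmetic (e_man - e_woman ≈ e_king - e_queen) and that similarity is usually measured with cosine similarity. A minimal NumPy sketch; the embedding dictionary E and the function names are illustrative, not from the notes:

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(a, b, c, E):
    """Find the word w maximizing cos(e_b - e_a, e_w - e_c), i.e. 'a is to b as c is to ?'."""
    target = E[b] - E[a]
    best, best_sim = None, -np.inf
    for w, e_w in E.items():
        if w in (a, b, c):
            continue
        sim = cosine_similarity(target, e_w - E[c])
        if sim > best_sim:
            best, best_sim = w, sim
    return best
```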

Embedding matrix
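If E is the embedding matrix (say 300 × 10,000) and o_j is the one-hot vector for word j, then e_j = E o_j. In practice you never do this matrix multiply; you just look up the column. A tiny sketch:

```python
import numpy as np

vocab_size, emb_dim = 10000, 300
E = np.random.randn(emb_dim, vocab_size)    # embedding matrix (learned in practice)

j = 1234                                     # index of some word in the vocabulary
o_j = np.zeros((vocab_size, 1)); o_j[j] = 1

e_j_matmul = E @ o_j                         # mathematical definition: e_j = E o_j
e_j_lookup = E[:, [j]]                       # what you actually do: a column lookup
assert np.allclose(e_j_matmul, e_j_lookup)
```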

Learning Word Embeddings: Word2vec & GloVe

Learning word embeddings

Word2Vec

Paper: Efficient Estimation of Word Representations in Vector Space by Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean.

The Skip-Gram model:

Model details:

Model problem:

Hierarchical softmax classifier:

How to sample context c:

CBOW:

The other version of the Word2Vec model is CBOW, the continuous bag-of-words model, which takes the surrounding context words and uses them to predict the middle word. This algorithm also works and has its own advantages and disadvantages.
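For the Skip-Gram side, a minimal NumPy sketch of the softmax objective: sample a context word c and a target word t from a window around it, and compute the negative log-likelihood of t under p(t|c) = softmax(θ_t^T e_c). Variable names here are illustrative; the sum over the whole vocabulary in the softmax is exactly what makes this expensive and motivates hierarchical softmax and negative sampling.

```python
import numpy as np

def skipgram_softmax_loss(c_idx, t_idx, E, Theta):
    """E: (vocab, dim) input embeddings; Theta: (vocab, dim) output weights."""
    e_c = E[c_idx]                          # embedding of the context word
    logits = Theta @ e_c                    # one score per word in the vocabulary
    logits -= logits.max()                  # numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -log_probs[t_idx]                # negative log-likelihood of the true target word
```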

Negative Sampling

Paper: Distributed Representations of Words and Phrases and their Compositionality by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean.

Negative sampling is a modified learning problem to do something similar to the Skip-Gram model with a much more efficient learning algorithm.
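A sketch of that modified learning problem: for each (context, target) pair labeled 1, draw k random "negative" words labeled 0, and train k + 1 small logistic classifiers instead of one huge softmax. Names are illustrative; the paper also recommends sampling negatives in proportion to word frequency raised to the 3/4 power.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(c_idx, pos_idx, neg_indices, E, Theta):
    e_c = E[c_idx]
    loss = -np.log(sigmoid(Theta[pos_idx] @ e_c))          # the true (context, target) pair: label 1
    for n in neg_indices:                                   # k randomly sampled words: label 0
        loss += -np.log(sigmoid(-(Theta[n] @ e_c)))
    return loss
```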

GloVe word vectors

Paper: GloVe: Global Vectors for Word Representation
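The GloVe model minimizes a weighted least-squares objective over word co-occurrence counts X_ij (how often word j appears in the context of word i), with a weighting function f such that f(0) = 0, so pairs that never co-occur contribute nothing:

```latex
\min \sum_{i=1}^{V} \sum_{j=1}^{V} f(X_{ij}) \left( \theta_i^{\top} e_j + b_i + b'_j - \log X_{ij} \right)^2
```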

Conclusion:

Applications using Word Embeddings

Sentiment Classification

| comments | stars |
| --- | --- |
| The dessert is excellent. | 4 |
| Service was quite slow. | 2 |
| Good for a quick meal, but nothing special. | 3 |
| Completely lacking in good taste, good service, and good ambience. | 1 |

A simple sentiment classification model:

sentiment-model
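A minimal sketch of this simple model, assuming an embedding lookup E: average the word embeddings of the review and feed the average into a softmax over the star ratings. Because it ignores word order, a review like "Completely lacking in good taste, good service, and good ambience" can fool it with its many occurrences of "good".

```python
import numpy as np

def simple_sentiment_probs(words, E, W, b):
    """E: dict word -> embedding; W: (5, emb_dim) and b: (5,) for a 1-5 star softmax."""
    avg = np.mean([E[w] for w in words], axis=0)   # average the word embeddings
    z = W @ avg + b
    z -= z.max()
    return np.exp(z) / np.sum(np.exp(z))           # probability of each star rating
```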

A more sophisticated model:

sentiment-model-rnn
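And a sketch of the more sophisticated version as a many-to-one RNN, written here in tf.keras (not part of the notes; layer sizes are arbitrary): run the embeddings through an LSTM and predict the rating from the last activation only.

```python
import tensorflow as tf

rnn_sentiment = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 300)),         # sequence of word embeddings
    tf.keras.layers.LSTM(128),                          # many-to-one: keep only the last a<T_x>
    tf.keras.layers.Dense(5, activation="softmax"),     # 1-5 stars
])
```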

Debiasing word embeddings

Paper: Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings

Word embeddings can pick up bias problems such as gender bias, ethnicity bias, and so on, because word embeddings learn analogies like man is to woman as king is to queen. The paper shows that a learned word embedding might output:

Man: Computer_Programmer as Woman: Homemaker

Learning algorithms are making very important decisions, so it is important that we try to change them to diminish as much as possible, or ideally eliminate, these types of undesirable bias.
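As a sketch of one step the paper describes (neutralize), assuming a bias direction g has already been identified (for example from differences such as e_he - e_she averaged over several pairs): project a non-definitional word's embedding onto g and remove that component.

```python
import numpy as np

def neutralize(e_word, g):
    """Remove the component of e_word along the bias direction g."""
    g = g / np.linalg.norm(g)
    e_bias_component = np.dot(e_word, g) * g    # projection of e_word onto the bias direction
    return e_word - e_bias_component             # result is orthogonal to g
```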

Week 3: Sequence models & Attention mechanism

Sequence models can be augmented using an attention mechanism. This algorithm will help your model understand where it should focus its attention given a sequence of inputs. This week, you will also learn about speech recognition and how to deal with audio data.

Various sequence to sequence architectures

Basic Models

This week you will hear about sequence-to-sequence models, which are useful for everything from machine translation to speech recognition.

Picking the most likely sentence

There are some similarities between the sequence-to-sequence machine translation model and the language models that you worked with in the first week of this course, but there are some significant differences as well.

Consider the French sentence "Jane visite l'Afrique en septembre". Rather than sampling an output at random from the distribution P(y<1>, ..., y<T_y> | x), machine translation should find the English sentence y that maximizes this conditional probability; a greedy word-by-word search does not do this well, which is what motivates beam search (see the sketch below).
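A simplified beam-search sketch, assuming a decoder function step_log_probs(prefix, x) that returns log-probabilities for the next word given the translation so far (that function and all names here are assumptions, not from the notes). The final score is length-normalized, a simplified version of the 1/T_y^α normalization, so shorter sentences are not unfairly favored.

```python
import numpy as np

def beam_search(step_log_probs, x, eos, beam_width=3, max_len=30):
    """Keep the beam_width most likely partial sentences, scored by summed log-probabilities."""
    beams = [([], 0.0)]                                    # (prefix, log P(prefix | x))
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = step_log_probs(prefix, x)              # shape: (vocab_size,)
            top = np.argsort(log_p)[-beam_width:]          # expand only the best continuations
            for w in top:
                candidates.append((prefix + [int(w)], score + float(log_p[w])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            (completed if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:                                      # every kept beam has ended with <EOS>
            break
    completed.extend(beams)
    return max(completed, key=lambda c: c[1] / len(c[0]))[0]   # length-normalized best prefix
```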

Bleu Score (optional)

Paper: BLEU: a Method for Automatic Evaluation of Machine Translation by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu.

BLEU stands for bilingual evaluation understudy.
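Concretely, the combined BLEU score uses the modified n-gram precisions p_n (each n-gram count in the output is clipped at the maximum number of times it appears in any reference) together with a brevity penalty BP for outputs shorter than the reference:

```latex
\text{BLEU} = BP \cdot \exp\left( \frac{1}{4} \sum_{n=1}^{4} \log p_n \right), \qquad
BP = \begin{cases} 1 & \text{if output length} > \text{reference length} \\
\exp\left(1 - \dfrac{\text{reference length}}{\text{output length}}\right) & \text{otherwise} \end{cases}
```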

Attention Model Intuition

Paper: Neural Machine Translation by Jointly Learning to Align and Translate by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio.

You have been using an encoder-decoder architecture for machine translation, where one RNN reads in a sentence and a different one outputs a sentence. There is a modification to this called the attention model that makes all of this work much better.

The French sentence:

Jane s’est rendue en Afrique en septembre dernier, a apprécié la culture et a rencontré beaucoup de gens merveilleux; elle est revenue en parlant comment son voyage était merveilleux, et elle me tente d’y aller aussi.

The English translation:

Jane went to Africa last September, and enjoyed the culture and met many wonderful people; she came back raving about how wonderful her trip was, and is tempting me to go too.

Attention Model

Implementation tips:
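A minimal sketch of one decoder step of the attention computation: score each encoder activation a<t'> against the previous decoder state s<t-1>, softmax the scores so the attention weights α<t,t'> sum to 1, and form the context as the weighted sum. The scoring function here stands in for the small one-hidden-layer network from the lecture; all names are illustrative.

```python
import numpy as np

def attention_context(s_prev, a_enc, score_fn):
    """
    s_prev: previous decoder hidden state
    a_enc:  encoder activations a<1..T_x> (e.g. from a bidirectional encoder)
    score_fn(s_prev, a_t): scalar alignment energy e<t,t'>
    """
    energies = np.array([score_fn(s_prev, a_t) for a_t in a_enc])
    energies -= energies.max()                                   # numerical stability
    alphas = np.exp(energies) / np.sum(np.exp(energies))         # attention weights, sum to 1
    context = sum(alpha * a_t for alpha, a_t in zip(alphas, a_enc))  # weighted sum of activations
    return context, alphas
```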



Neural machine translation with attention

Speech recognition - Audio data

Speech recognition

How do you build a speech recognition system?

Trigger Word Detection

Implementation tips:
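One detail worth sketching, based on how labeling is described in the lecture: the target is 0 almost everywhere and is set to 1 for a short stretch of time steps right after the trigger word ends (the 50-step width below is an arbitrary assumption):

```python
import numpy as np

def make_labels(T_y, trigger_end_steps, width=50):
    """y<t> = 1 for `width` time steps right after each trigger word ends, else 0."""
    y = np.zeros(T_y)
    for end in trigger_end_steps:
        y[end + 1 : min(end + 1 + width, T_y)] = 1
    return y
```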


Notes by Aaron © 2020