Sequence Models have been motivated by the analysis of sequential data such text sentences, timeseries and other discrete sequences data. These models are especially designed to handle sequential information while Convolutional Neural Network are more adapted for process spatial information.
The key point for sequence models is that the data we are processing are not anymore independently and identically distributed (i.i.d.) samples and the data carry some dependency due to the sequential order of the data.
Recurrent Neural network
RNN is especially designed to deal with sequential data which is not i.i.d.
RNN can tackle the following challenges from sequence data:
to deal with variablelength sequences
to maintain sequence order
to keep track of longterm dependencies
to share parameters across the sequence
Graphical Representation of RNN
The following figure exhibits the difference between the classical classic feed forward neural network architecture (left figure) and the Recurrent Neural Network:
Feed forward neural network also called vanilla neural network presents one input and provides one output without taking into account the sequence of the data.
Recurrent Neural Network introduce an internal loop wich allows information to be passed from one step of the network to the next. It apply a recurrence relation at every time step to process a sequence of data.
Unfold Representation
The unfold representation of the previous RNN is presented in the following figure which make more clear how the RNN can handle sequence data using the recurrence relation
\[\Large{\boxed{\underbrace{h^{<t>}}_{\textrm{cell state}} = f_W(\underbrace{h^{<t1>}}_{\textrm{old state}}, \underbrace{x^{<t>}}_{\textrm{input vector}})}}\]
where \(f_W\) is a function parameterized by the weight \(W\). It is important to note that at every time step \(t\) the same function \(f_W\) is used and same set of weight parameters.
Figure 8.8: Multilayer Layer RNN source link
Weight Matrices parameters in RNN Sequence Modeling Tasks
Let introduce the weight parameters in the previous figure and precise some notation.
Below are the details associated with each parameter in the network.
\(x^{<1>}, \ldots, x^{<t1>}, x^{<t>}, x^{<t+1>}, ... x^{<m>}\): the input data. In natural langage processing (NLP), for example, we can consider that the sequence input is a sentence of \(m\) words \(x^{<1>}, \ldots, x^{<t1>}, x^{<t>}, x^{<t+1>}, ... x^{<m>}\).
\(x^{<t>} \in \mathbb{R}^{V}\): input vector at time \(t\). For example each word \(x^{<t>}\) in the sentence will be input as 1hot vector of size \(\mathcal{V}\) (where \(\mathcal V\) is the vocabulary and \(V\) denotes the size).
\(h^{<t>} = \sigma(W_{hh} h^{<t1>} + W_{hx} x^{<t>})\): the relationship to compute the hidden layer output features at each timestep \(t\).
\(\sigma(\cdot)\): the nonlinearity function (sigmoid here)
\(W_{hx} \in \mathbb{R}^{D_h \times \mathcal{V} }\): weight matrix used to condition the input vector, \(x^{<t>}\). The number of neurons in each layer is \(D_h\)
\(W_{hh} \in \mathbb{R}^{D_h \times D_h}\): weights matrix used to condition the output of the previous timestep, \(h^{<t1>}\)
\(h^{<t1>} \in \mathbb{R}^{D_h}\): output of the nonlinear function at the previous timestep, \(t1\). \(h^{<0>} \in \mathbb{R}^{D_h}\) is an initialization vector for the hidden layer at timestep \(t = 0\).
Each desired output \(y^{<t>}\) is a 1hot vector of size \(\mathcal{V}\) with 1 at index of \(x^{<t+1>}\) in \(\mathcal V\) and 0 elsewhere.
\(W_{yh} \in \in \mathbb{R}^{\mathcal{V} \times D_h}\): weights matrix used to compute the prediction at each time \(t\). For example, \(\hat{y}^{<t>} = softmax (W_{yh}h_t)\) and \(\hat{y} \in \mathbb{R}^{\mathcal{V}}\).
Different types of Sequence Modeling Tasks
Different tasks can be achieved using RNN:
manytoone: Sentiment Classification, action prediction (sequence of video > action class)
onetomany: Image captioning (image > sequence of words)
manytomany: Video Captioning (Sequance of video frames > Caption)
manytomany: Video Classification on frame level
How to train a RNN ?
RNN a clearly some nice features:
can process any length input
In theory for each time \(t\) can use information from many steps back
Same weights applied on every timestep
However RNN could be very slow to train and in practice it is difficult to access information from many steps back.
Let consider training a RNN for language model example:
We first have to get access to a big corpus of text which is a sequence of words \(x^{<1>}, \ldots, x^{<t1>}, x^{<t>}, x^{<t+1>}, ... x^{<T>}\)
Forward pass into the RNN model and compute the output distribution \(\hat{y}^{<t>}\)
for every time \(t\) which represent the predict probability distribution of every word , given words so far
Compute the Loss fuction on each step \(t\). In this case the crossentropy between predicted probability distribution \(\hat{y}^{<t>}\) and the true next word \(y^{<t>}\) (onehot for \(\hat{x}^{<t+1>}\)):
\[\Large{\boxed{
L^{<t>}(\theta) = CE(y^{<t>},\hat{y}^{<t>})= \sum_{j=1}^{V} y^{<t>}_{j} \times log (\hat{y}^{<t>}_{j})}}\]
 Average this to get overall loss for entire training set:
\[\Large{\boxed{
L = \dfrac{1}{T} \sum_{t=1}^{T} L^{<t>}(\theta) =  \dfrac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{V} y^{<t>}_{j} \times log (\hat{y}^{<t>}_{j})}}
\]
Computiong the loss and the gradients accross entire corpus is too expensive. Then , in practice we use a batch of sentences to compute the loss and Stochastic Gradient Descent is exploited to compute the gradients for samll chunk of data, and then update.
Backpropagation through time
Let explicit the derivative of \(L\) w.r.t. the repeated weight matrix \(W_{hh}\).
The backpropagation over time as two summations:
 First summation over \(L\)
\[\Large{\boxed{
\frac{\partial L}{\partial W_{hh}}=\sum_{j=1}^{T} \frac{\partial L^{<j>}}{\partial W_{hh}}}}
\]
 Second summations showing that each \(L^{<t>}\) depends on the weight matrices before it:
\[\Large{\boxed{
\frac{\partial L^{<t>}}{\partial W_{hh}}=\sum_{i=1}^{t} \frac{\partial L^{<t>}}{\partial W_{hh}}}}
\]
Using the multivariate chainrule:
\[\Large{\boxed{
\dfrac{\partial L^{<t>}}{\partial W_{hh}} = \sum_{k=1}^{t} \dfrac{\partial L^{<t>}}{\partial \hat{y}^{<t>}} \dfrac{\partial \hat{y}^{<t>}}{\partial h^{<t>}}\dfrac{\partial h^{<t>}}{\partial h^{<k>}}\dfrac{\partial h^{<k>}}{\partial W_{hh}}}}\]
Now using the chain rule differentiation over all hidden layers whitin the time interval \([k,t]\):
\[\Large{\boxed{ \dfrac{\partial h^{<t>}}{\partial h^{<k>}} = \prod_{j=k+1}^{t}\dfrac{\partial h^{<j>}}{\partial h^{<j1>}} = \prod_{j=k+1}^{t}W_{hh}^T \times diag [f'(W_{hh}h^{<j1>}+W_{hx}x^{<j>})]}}\]
where each \(\dfrac{\partial h^{<j>}}{\partial h^{<j1>}}\) is the Jacobian matrix for \(h\):
\[
\dfrac{\partial h^{<j>}}{\partial h^{<j1>}} = \left[\dfrac{\partial h^{<j>}}{\partial h^{<j1>}_{1}},\ldots,\dfrac{\partial h^{<j>}}{\partial h^{<j1>}_{D_h}}\right] =
\begin{bmatrix}
\dfrac{\partial h^{<j>}_{1}}{\partial h^{<j1>}_{1}} & . & . & . & \dfrac{\partial h^{<j>}_{1}}{\partial h^{<j1>}_{D_h}} \\
. & . & & & . \\
. & & . & & . \\
. & & & . & . \\
\dfrac{\partial h_{j,D_h}}{\partial h_{j1,1}} & . & . & . & \dfrac{\partial h_{j,D_h}}{\partial h_{j1,D_n}} \\
\end{bmatrix}
\]
Repeated matrix multiplication is required to get \(\dfrac{\partial h^{<t>}}{\partial h^{<k>}}\). It leads to a vanishing or exploding gradients issues.
Vanishing and Exploding gradient
Remind us that the RNN is based on this recursive relation
\[h^{<t>} = \sigma(W_{hh} h^{<t1>} + W_{hx} x^{<t>}).\]
Let consider the identidy function \(\sigma(x)=x\), to help to understand why we have an issue with the computation of the gradient.
\[\begin{eqnarray*}
\dfrac{\partial h^{<t>}}{\partial h^{<t1>}}&=&diag(\sigma^{'}(W_{hh}h^{<t1>}+W_{hx}x^{<t>}))W_{hh}\\
&=&IW_{hh}=W_{hh}
\end{eqnarray*}\]
Let consider the gradient of the loss \(L^{<i>}\) on time \(i\), with respect to the hidden state \(h^{<j>}\) on some previous time \(j\) and denote \(\ell=ij\). Then,
\[\begin{eqnarray*}
\dfrac{\partial L^{<i>}}{\partial h^{<j>}}&=&\dfrac{\partial L^{<i>}}{\partial h^{<i>}} \prod_{j<t\leq i}\dfrac{\partial h^{<t>}}{\partial h^{<t1>}}\\
&=&\dfrac{\partial L^{<i>}}{\partial h^{<i>}} \prod_{j<t\leq i} W_{hh}=\dfrac{\partial L^{<i>}}{\partial h^{<i>}} \prod_{j<t\leq i}W_{hh}^{\ell}
\end{eqnarray*}\]
Then if \(W_{hh}\) is “small”, then this term gets exponentially problematic as \(\ell\) becomes large.
Indeed, if the weight matrix \(W_{hh}\) is diagonalizable:
\[W_{hh}=Q^{1}\times \Delta \times Q,\]
where \(Q\) is composed of the eigenvectors and \(\Delta\) is a diagonal matrix with the eigenvalues on the diagonal. Computing the power of \(W_{hh}\) is then given by:
\[W^{\ell}_{hh}=Q^{1}\times \Delta^{\ell} \times Q.\]
Thus eigenvalues lower than 1 will lead to vanishing gradient while eigenvalues greater than 1 will lead to exploding gradient.
Long Short Term Memory (LSTM)
LSTM is used to take into account longterm dependencies and was introduced by Hochreiter and Schmidhuber in 1997 to offer a solution to the vanishing gradients problem.
LSTM models make each node a more complex unit with gates controlling what information is passed through rather each node being just a simple RNN cell.
Main ideas
At each time \(t\), LSTM provides a hidden state (\(h^{(t)}\)) and a cell state (\(c^{(t)}\)) which are both vectors of length \(n\). The cell has the ability to stores longtern information. Further, the LSTM model can erase, write and read information from the cell.
Gates are defined to get the ability to selection which information to either erased, written or read. The gates are also vectors of length \(n\). At each step \(t\) the gates can be open (1), closed (0) or somewhere inbetween. Note that the gates are dynamic.
Graphical Representation of LSTM
Figure 8.12: Understand LSTM respresentation source link
From a sequence of inputs \(x^{(t)}\), LSTM computes a sequence of hidden states
\(h^{(t)}\), and cell state \(c^{(t)}\):

Forget gate: controls what is kept versus forgotten, from previous cell state
\[f^{(t)}=\sigma\left(W_fh^{(t1)}+U_fx^{(t)}+b_f\right)\]

Input gate: controls what parts of the new cell content are written to cell
\[i^{(t)}=\sigma\left(W_ih^{(t1)}+U_ix^{(t)}+b_i\right)\]

Output gate: controls what parts of cell are output to hidden state
\[o^{(t)}=\sigma\left(W_o(r^{t}\circ h^{(t1)})+U_ox^{(t)}+b_o\right)\]

New cell content: this is the new content to be written to the cell
\[\tilde{c}^{(t)}=\textrm{tanh}\left(W_ch^{(t1)}+U_c(x^{t}+b_c\right)\]

cell state: erase (“forget”) some content from last cell state, and write (“input”) some new cell content
\[c^{(t)}=f^{(t)}\circ c^{(t1)}+i^{(t)}\circ\tilde{c}^{(t)} \]

Hidden state: read (“output”) some content from the cell
\[h^{(t)}=o^{(t)}\circ \textrm{tanh}\ c^{(t)} \]
Gated Recurrent Units (GRU)
GRU has been proposed by Cho et al. in 2014 as an alternative to the LSTM.
For GRU model, there is no cell state and at each times step \(t\) there are an input \(x^{(t)}\) and hidden state \(h^{(t)}\).
Figure 8.13: Difference between LSTM and GRU source link

Update gate: controls what parts of hidden state are updated vs preserved
\[u^{(t)}=\sigma\left(W_uh^{(t1)}+U_ux^{(t)}+b_u\right)\]

Reset gate: controls what parts of previous hidden state are used to compute new content
\[r^{(t)}=\sigma\left(W_rh^{(t1)}+U_rx^{(t)}+b_u\right)\]

New hidden state content: reset gate selects useful parts of previous hidden state. Use this and current input to compute new hidden content.
\[\tilde{h}^{(t)}=\textrm{tanh}\left(W_r(r^{t}\circ h^{(t1)})+U_hx^{(t)}+b_h\right)\]

Hidden state: update gate simultaneously controls what is kept from previous hidden state, and what is updated to new hidden state content
\[h^{(t)}=(1u^{(t)})\circ h^{(t1)}+u^{(t)}\circ\tilde{h}^{(t)} \]
Autoencoders for language translation
Machine translation (MT) is one of the main active research area in Natural Language
Processing (NLP).
The goal is to provide a fast and reliable computer program that translates a text in one language (source) into another language (the target)
Using neural network model, the main architecture used for MT is the encoder–decoder model:
encoder part which summarizes the information in the source sentence
decoder part based on the encoding, generate the targetlanguage
output in a stepbystep fashion
The encoder–decoder architecture has been introduced by Cho et al. (2014). A variant called sequencetosequence model has been introduced by Sutskever et al., (2014)
Cho’s model
Principle. In this model the encoder and decoder are both GRUs. The final state of the encoder is used as
the summary \(c\) and then this summary is accessed by all steps in the decoder
Sutskever’s model
Principle. In this model the encoder and decoder are multilayered LSTMs. The final state of the encoder becomes the initial state of the decoder. So the source sentence has to be reversed.
Improvement with Attention layer
Limitation of these models are exposed in the paper “On the Properties of Neural Machine Translation: EncoderDecoder Approaches”. They showed that the performance of the encoderdecoder network degrades rapidly as the length of the input sentence increases.
The main drawback of the previous models is that the encoded vector need to capture the entire phrase (sentence) and might skipped many important details. Moreover the information needs to “flow” through many RNN steps which is quite difficult for long sentence.
Bahdanau et al. (2015) have proposed to include attention layer which consist to include attention mechanisms to give more importance to some of the input words compared to others while translating the sentence.
A survey of the different implementations of attention model is presented by Galassi et al. (2019)