Mathematical Engineering
of Deep Learning
Benoit Liquet, Sarat Moka and Yoni Nazarathy
March 3, 2026
7 Sequence Models
Many forms of data, such as text data in the context of natural language processing, appear
sequentially. In such cases we require deep learning models that can operate on sequences
of arbitrary length and are well adapted to model temporal relationships in the data. The
simplest first model of this form is the recurrent neural network (RNN), which can be presented
as a variation of the feedforward neural network of Chapter ??. In this chapter we explore
such models together with many more advanced variants, including long
short term memory (LSTM) models, gated recurrent unit (GRU) models, models based on
the attention mechanism, and in particular transformer models. An archetypical application
is end-to-end natural language translation, and we see how encoder-decoder architectures
with sequence models can be used for this purpose. The various forms of models, including
RNNs, LSTMs, GRUs, and transformers, can be used in such an application among others.
These models also form the basis for large language models (LLMs), which have proven to be
extremely powerful for general tasks.
In Section 7.1 we consider various forms and application domains of sequence data. As a
prime example we consider textual data and ways of encoding textual data as a sequence.
In Section 7.2 we introduce and explore basic recurrent neural networks, which are naturally
suited to deal with sequence data. We present the basic auto-regressive structure of such
models and discuss aspects of training. In Section 7.3 we explore generalizations of recurrent
neural networks including stacking and reversing approaches, and importantly long short
term memory (LSTM) models and gated recurrent unit (GRU) models. Prior to the
appearance of transformers, LSTMs and GRUs marked the state of the art for sequence
modelling. We continue in Section 7.4 where we focus on machine translation applications,
and explore how encoder-decoder architectures can be used for end-to-end translation. In the
process we introduce the attention mechanism which has become a central pillar of modern
sequence models. An encoder-decoder architecture based on attention is also presented. In
Section 7.5 we dive into the powerful workhorse of contemporary sequence models, the
transformer architecture. Transformers, relying heavily on attention, are presented in detail,
culminating with a transformer encoder-decoder architecture.
7.1 Overview of Models and Activities for Sequence Data
Sequence models have been motivated by the analysis of sequential data including text
sentences, time-series, and other discrete sequence data such as DNA. These models are
especially designed to handle sequential information, while the convolutional neural networks of
Chapter ?? are specialized for processing spatial information. Naturally, most interesting
input samples carry some statistical dependence between elements due to the sequential
nature of the data. Classical statistical models for time-series, such as auto-regressive models,
are naturally tailored for such data when the sample at each datapoint is a scalar or a low
dimensional vector. In contrast, the deep learning models that we cover here allow one to
work with high-dimensional samples as appearing in textual data and similar domains.
Forms of Sequence Data
We denote a data sequence via $x = x^{\langle 1 \rangle}, \ldots, x^{\langle T \rangle}$, where the superscripts $\langle t \rangle$ indicate time
or position, and capture the order in the sequence. Each $x^{\langle t \rangle}$ is a $p$-dimensional numerical
data point (or vector). The number of elements in the sequence, $T$, is sometimes fixed, but
is also often not fixed and can be essentially unbounded. A classical example is a numerical
univariate data sequence ($p = 1$) arising in time-series of economic, natural, or weather
data. Similarly, multivariate time-series data ($p > 1$ but typically not huge) also arise in
similar settings.
Most of the motivational examples in this chapter are from the context of textual data. In
this case, $t$ is typically not the time of the text but rather the index of the word or token¹
within a text sequence. One way to encode text is that each $x^{\langle t \rangle}$ represents a single word
using an embedding vector in a manner that we discuss below. If for example $x$ is the text
associated with the Bible then $T$ is large,² whereas if $x$ is the text associated with a movie
review as per the IMDB movie dataset (see Figure ?? (d) in Chapter ??), then $T$ is on
average 231 words. In data formats similar to the latter case, the data involves a collection
of data sequences $\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}$ where each $x^{(i)}$ is an individual movie review and $n$
denotes the total number of movie reviews. While in practice such data formats often arise,
for simplicity, our discussion in this chapter mostly assumes a single (typically long) text
sequence $x$.
To help make the discussion concrete, assume momentarily that we encode input text in the
simplest possible manner, where the embedding vector just uses a technique called one-hot
encoding. With this approach we consider the number of words in the dictionary, vocabulary,
or lexicon as $d_V$ (e.g., $d_V \approx 40{,}000$) and set $p = d_V$. We then associate with each possible
word a unit vector $e_1, \ldots, e_p$ which uniquely identifies the word. At this point, an input data
sequence (text) is converted into a sequence of vectors, where $x^{\langle t \rangle} = e_i$ whenever the $t$-th
word in the sequence is the $i$-th word in lexicographic order in the dictionary. This approach
is very simplistic and may appear inefficient. Yet it illustrates that textual data may be easily
represented as numerical input. With more advanced word embedding methods discussed
below, the dimension of each $x^{\langle t \rangle}$ can be significantly reduced.
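A minimal sketch of the one-hot scheme just described (the four-word lexicon below is a hypothetical stand-in for a real vocabulary of size $d_V$):

```python
# One-hot encoding sketch: each word in a fixed lexicon is mapped to a
# unit vector of dimension d_V (the lexicon size), with words indexed
# in lexicographic order as in the text.
def one_hot_encode(words, lexicon):
    """Return a list of d_V-dimensional unit vectors, one per word."""
    index = {w: i for i, w in enumerate(sorted(lexicon))}
    d_V = len(lexicon)
    vectors = []
    for w in words:
        e = [0.0] * d_V
        e[index[w]] = 1.0
        vectors.append(e)
    return vectors

lexicon = ["deep", "learning", "is", "fun"]   # tiny hypothetical vocabulary
x = one_hot_encode(["deep", "learning"], lexicon)
# each x[t] has a single 1 at the word's position in lexicographic order
```

Sorted lexicographically the lexicon reads deep, fun, is, learning, so the first encoded word activates coordinate 1 and the second activates coordinate 4.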
Tasks Involving Sequence Data
There are plenty of tasks and applications involving sequence data. In the context of deep
learning such tasks are handled by neural network models. The more classical forms of neural
networks for sequence data are generally called recurrent neural networks (RNNs), while
more modern forms are called transformers. We focus on the more classical RNN forms in
the first sections of this chapter and later visit transformers. The basic forms of RNNs are
introduced in Section 7.2. At this point assume that each of these models processes an input
sequence $x$ to create some output $\hat{y}$, where the creation of the output is sequential in nature.
For our discussion, let us focus on text based applications and highlight a few of the tasks
and applications in this context. The ideas may then be adapted to domains such as time-
¹ In a complete treatment of textual data analysis or natural language processing (NLP) one is required to
define and analyze tokenizers, which break up text into natural “words” or parts of words known as tokens.
These details are not our focus and we use “word” and “token” synonymously.
² By some counts, there are about half a million sequential words in the Old Testament of the Bible, and
more when one considers the New Testament and its many variants.
Figure 7.1: Use of recurrent neural network models (or generalizations) for various sequence data
and language tasks. The basic building block, called a unit, is recursively used in the computation.
(a) Lookahead prediction of the sequence. (b) Classification of a sequence or sentiment analysis. (c)
Machine translation. (d) Image captioning.
series, signal processing, and others. Figure 7.1 illustrates schematically how RNNs (or their
generalizations) can be used. The building blocks of these types of models are called units,
and they are recursively used in the computation of input to output. A basic task presented
in (a) is look-ahead prediction which in the application context of text, implies predicting
the next word (or collection of words) in a sequence. Another type of task presented in (b)
is sequence regression or classification which can be used for applications such as sentiment
analysis. An additional major task illustrated in (c) is machine translation where we translate
the input sequence from one language to another (e.g., Hebrew to Arabic). Another type
of task illustrated in (d) involves decoding an input into a sequence. One such example
application is image captioning where text is generated to describe the input image. Let us
now focus on (a)–(d) in more detail.
Consider Figure 7.1 (a) illustrating lookahead prediction. The simple application in the
context of text is to predict the next word (or next few words) based on the sequence
of input words. Thus, the output is the sequence of inputs shifted by one and the model
attempts to predict the next word at any time $t$. In the context of time-series this is often
referred to as an auto-regressive analysis. After a warmup phase, the model predicts the
next value in the time series, which is also used for predicting the subsequent values until a
desired horizon. Hence for an input sequence $x = x^{\langle 1 \rangle}, \ldots, x^{\langle t \rangle}$ we have a future prediction
$\hat{y} = (\hat{y}^{\langle t+1 \rangle}, \ldots, \hat{y}^{\langle t+\tau \rangle})$ for some time horizon $\tau$. Note that the typical use of large language
models follows this task as well, since an input text is given and a response is returned.
Consider now Figure 7.1 (b) illustrating an input sequence processed to produce a single
scalar or vector output. An archetypical application in the context of text is sentiment
analysis, where the sentiment or “general vibe” of a sentence is determined. This output may
be a vector of probabilities over possible classes, e.g., $\{$happy, sad, indifferent$\}$, and in
such a case the output is amenable to classification. Hence $\hat{y}$ is a vector of probabilities and
it can also be converted to a categorical output $\widehat{Y}$ as in (??) of Chapter ??.
Moving on to Figure 7.1 (c), consider the application of machine translation, where the
input sentence is from one language and the output sentence is from another language.
The architecture of such a model can be composed of two RNNs (or two other types of
sequence models) in an encoder-decoder architecture. Here the encoder model encodes the
input sentence from one language into a context vector in a latent space and the decoder
model decodes from the latent space into a sentence in another language. We describe
architectures for such tasks in Section 7.4 and specific transformer models of this form in
Section 7.5. Observe that with this task, the input $x$ is a sequence of a certain length while
the output $\hat{y}$ is a sequence of a potentially different length. Note that the notion of a latent
space was first introduced in the different context of autoencoders in Section ??.
Figure 7.1 (d) illustrates the task of image captioning. Here, for an input image, we wish to
output a sentence describing the image. A common way to achieve this is with a convolutional
neural network as in Chapter ?? creating a context vector in a latent space. This context
vector is then fed into an RNN (or similar sequence model) which acts as a decoder, somewhat
similarly to the decoder in the machine translation case. In this application $x$ is an image,
and $\hat{y}$ represents an output sentence.
Figure 7.2: Input-output paradigms of sequence models. (a) One-to-many. (b) Many-to-one.
(c) Many-to-many with partial inputs and outputs. (d) Many-to-many with complete inputs and
outputs.
With the tasks and applications highlighted, we see various forms of input $x$ and output $\hat{y}$.
Sometimes $x$ and $\hat{y}$ are sequences and at other times they are not. It is common to
describe tasks and models as one to many, many to one, or many to many; see Figure 7.2.
In the one to many case, $x$ is simply $x^{\langle 1 \rangle}$ while $\hat{y}$ is a sequence, $\hat{y}^{\langle 1 \rangle}, \hat{y}^{\langle 2 \rangle}, \ldots$. In the many
to one case, $x = x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, \ldots$ is a sequence while $\hat{y}$ is a single output (scalar or vector).
Finally, in the many to many case, both $x$ and $\hat{y}$ are sequences. Returning to Figure 7.1,
observe that the lookahead prediction task (a) falls in either the many to one or the many
to many case, depending on whether the time horizon $\tau$ is 1 or greater, respectively. The sentiment
analysis task (b) is a many to one case. The machine translation task (c) is a many to many
case, while the image captioning task (d) is a one to many case.
Word Embedding
One-hot encoding, which is the simplest way to encode a word, results in a very sparse
vector of high dimensionality, with the dimension being the size of the lexicon, $d_V$. A
popular alternative that has become standard in any application involving text is to use
word embeddings, where we represent each word (or token) by a vector of real numbers of
dimension $p$, with $p$ much smaller than $d_V$.
The essence of word embedding techniques is that words from similar contexts have corresponding
vectors which are relatively close. Such closeness is often measured via the cosine
of the angle³ between the two vectors in Euclidean space. As a hypothetical example
with $p = 4$, the word king could be represented by the vector $(0.41, 1.2, 3.4, 1.3)$ and the
word queen can be represented by a relatively similar vector such as $(0.39, 1.1, 3.5, 1.6)$.
Then a completely different word such as mean might be represented by a vector such as
$(-0.2, 3.2, 1.3, -0.8)$. One can now verify in this example that the cosine of the angle between
king and queen is about $0.729$ while the cosine of the angle between mean and the other
two words is lower, at about $0.04$ for king and $0.156$ for queen, respectively.
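Closeness of embedding vectors can be checked directly. A minimal sketch of the cosine computation follows; the four-dimensional vectors below are illustrative placeholders rather than the chapter's exact embedding values:

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v in Euclidean space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative p = 4 embeddings (placeholder values): two similar-context
# words and one unrelated word.
king = [0.5, 1.0, 3.0, 1.0]
queen = [0.4, 1.1, 3.2, 1.2]
other = [-3.0, 0.2, -0.5, 2.0]

similar = cosine(king, queen)      # close to 1 for similar-context words
dissimilar = cosine(king, other)   # noticeably lower for an unrelated word
```

Words from similar contexts yield cosines near 1, while unrelated words score much lower (possibly negative).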
Hence with such an embedding, beyond the value of reducing the dimension of each $x^{\langle t \rangle}$
from $d_V$ (in the order of tens of thousands) to $p$ (in the order of hundreds), we also get
the benefit of similarity and context groupings. Having such a contextual representation of
words plays a positive role in deep learning models since it allows the models to use context
more efficiently.
Simple word embedding techniques map individual words into vectors, while more advanced
techniques are context aware and yield a representation of the words based on the context
within the rest of the text, enabling models to better deal with homonyms. For example, the
word mean inside the phrase mean value is very different from the same word inside the
phrase mean person. Hence an advanced word embedding technique will encode each of the
occurrences of mean differently.
A popular early word embedding technique is word2vec. The creation of this embedding relies
on a neural network trained on very large corpora to build the embedding vectors. The basic
idea is to train the neural network for a task, and then use an internal layer of the network
as the word embedding. Such an approach of a derived feature vector is common throughout
deep learning. There are two common variations of the word2vec training algorithm: one
approach, called the bag of words model, seeks to predict a word from its neighboring
words, while the other approach, the skip-gram model, seeks to predict the words of the
context from a central word. In practice, both with word2vec and with more advanced
algorithms,⁴ one may choose whether to use a fixed pre-trained version of the word embedding, or
whether to fine tune and adjust the word embedding when used as part of a larger model.
³ See (??) in Appendix ??.
⁴ See references to other word embedding approaches in the notes and references at the end of the chapter.
7.2 Basic Recurrent Neural Networks
Recurrent neural networks are specifically designed for sequences of data and have the
ability to: (i) deal with variable-length sequences, (ii) maintain sequence order, (iii) keep
track of long-term dependencies, and (iv) share parameters across an input sequence. In
order to achieve all of these goals, the recurrent neural network (RNN) model introduces an
internal loop which allows information to be passed from one step of the network to the
next. The RNN maintains a hidden state, also termed cell state, which allows the model to
keep memory as an input sequence is processed. This state evolves over time, as the input
sequence is fed into the model. See Figure 7.3 for a schematic illustration of both a recursive
graph representation and an unfolded graph representation of the model. In the figure we
schematically see how an RNN transforms an input sequence to an output sequence, with
the blue nodes representing units of the model; details follow.
Figure 7.3: A recurrent neural network (RNN) can be represented (left) via a recursive graph and
(right) via an unfolded representation of that graph. An input sequence $x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, \ldots$ is transformed
to an output sequence $\hat{y}^{\langle 1 \rangle}, \hat{y}^{\langle 2 \rangle}, \ldots$, where in each step $t$, the unit, with cell state represented via
$h^{\langle t \rangle}$, performs the computation.
Recurrent neural networks apply a recurrence relation where at every time step the current
input is combined with the previous state to produce a new state and a new output.
This internal loop is the key difference between the traditional feedforward neural networks of
Chapter ?? and RNNs. In the traditional models, the flow of information from input to
output via hidden layers is only in the forward direction, whereas in RNNs information is
also carried across time steps via the recurrent state. More specifically, in the traditional
models there is no cyclic connection between neurons; contrast Figure ?? of Chapter ??
with Figure 7.3. Moreover, traditional feedforward neural networks work with fixed length
input and fixed length output, while the RNN input sequence and output sequence are each
allowed to be of variable (essentially unbounded) length.
The neurons inside RNNs implement the cell state and are typically denoted via $h^{\langle t \rangle}$ for the
state at time $t$. Mathematically, the state evolution can be represented via the recurrence
relation,
\[
\underbrace{h^{\langle t \rangle}}_{\text{current state}} = f_{\theta_{hx},\theta_{hh}}\big(\underbrace{h^{\langle t-1 \rangle}}_{\text{old state}}, \underbrace{x^{\langle t \rangle}}_{\text{input vector}}\big),
\]
acting on the sequence of input data $x = x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, \ldots$, to create a sequence of cell states
$h^{\langle 1 \rangle}, h^{\langle 2 \rangle}, \ldots$, where the initial state $h^{\langle 0 \rangle}$ is typically taken as a zero vector. It is important
to note that at every time step $t$ the same function $f_{\theta_{hx},\theta_{hh}}(\cdot)$ is used with the same fixed
(over time) sets of parameters $\theta_{hx}$ and $\theta_{hh}$.
The output sequence $\hat{y} = \hat{y}^{\langle 1 \rangle}, \hat{y}^{\langle 2 \rangle}, \ldots$ is defined at each time step by another function
$g_{\theta_{yh}}(\cdot)$ with,
\[
\hat{y}^{\langle t \rangle} = g_{\theta_{yh}}(h^{\langle t \rangle}),
\]
where again, the parameters $\theta_{yh}$ are fixed over time.
The recursive loop enables us to express the cell state at time $t$, $h^{\langle t \rangle}$, in terms of the $t$ first
inputs. Namely, omitting the parameter subscripts from $f_{\theta_{hx},\theta_{hh}}(\cdot)$, we have,
\[
h^{\langle t \rangle} = f\Big(\ldots \underbrace{f\big(\underbrace{f(\underbrace{f(h^{\langle 0 \rangle}, x^{\langle 1 \rangle})}_{h^{\langle 1 \rangle}}, x^{\langle 2 \rangle})}_{h^{\langle 2 \rangle}}, x^{\langle 3 \rangle}\big)}_{h^{\langle 3 \rangle}} \ldots, \, x^{\langle t \rangle}\Big).
\tag{7.1}
\]
Thus, since at time $t$, the output $\hat{y}^{\langle t \rangle}$ is a function of $h^{\langle t \rangle}$, we can also express the output as
a function of the inputs up until time step $t$. Namely,
\[
\hat{y}^{\langle t \rangle} = g^{(t)}_{\theta}(x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, x^{\langle 3 \rangle}, \ldots, x^{\langle t \rangle}),
\]
where the function $g^{(t)}_{\theta}(\cdot)$ is specific to time $t$ and captures the unrolling of the state as in
(7.1) or Figure 7.3 (right). This highlights the ability of RNNs to deal with variable length input
and output sequences. Here $\theta = (\theta_{hx}, \theta_{hh}, \theta_{yh})$ is the collection of all learnable parameters
of the RNN.
The functions $f_{\theta_{hx},\theta_{hh}}(\cdot)$ and $g_{\theta_{yh}}(\cdot)$ are concretely defined via affine transformations and
non-linear activations, similarly to other common neural network models. Specifically,
\[
\begin{cases}
h^{\langle t \rangle} = S_h\big(W_{hh} h^{\langle t-1 \rangle} + W_{hx} x^{\langle t \rangle} + b_h\big) \\
\hat{y}^{\langle t \rangle} = S_y\big(W_{yh} h^{\langle t \rangle} + b_y\big).
\end{cases}
\tag{7.2}
\]
The parameters $\theta = (\theta_{hx}, \theta_{hh}, \theta_{yh})$ are captured via weight matrices and bias vectors.⁵
Further, $S_h(\cdot)$ and $S_y(\cdot)$ are vector activation functions typically composed of element-wise
scalar activations $\sigma(\cdot)$, similarly to Chapter ??. We denote the dimension of $x^{\langle t \rangle}$ as $p$, the
dimension of $y^{\langle t \rangle}$ as $q$, and the dimension of the cell state $h^{\langle t \rangle}$ as $m$. Hence $W_{hx} \in \mathbb{R}^{m \times p}$,
$W_{hh} \in \mathbb{R}^{m \times m}$, $W_{yh} \in \mathbb{R}^{q \times m}$, $b_h \in \mathbb{R}^{m}$, and $b_y \in \mathbb{R}^{q}$.
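A minimal sketch of one application of (7.2) follows. Here $S_h$ is chosen as element-wise tanh and $S_y$ as the identity purely for illustration, and the tiny weight matrices are arbitrary toy values, not trained parameters:

```python
import math

def matvec(W, v):
    """Matrix-vector product with nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def vadd(u, v):
    return [a + b for a, b in zip(u, v)]

def rnn_step(W_hh, W_hx, b_h, W_yh, b_y, h_prev, x_t):
    """One unit computation of (7.2): (h_prev, x_t) -> (h_t, y_hat_t).

    S_h is element-wise tanh and S_y is the identity here; these
    activation choices are illustrative, not mandated by (7.2)."""
    a = vadd(vadd(matvec(W_hh, h_prev), matvec(W_hx, x_t)), b_h)
    h_t = [math.tanh(z) for z in a]
    y_t = vadd(matvec(W_yh, h_t), b_y)
    return h_t, y_t

# Tiny dimensions: p = 2 (input), m = 2 (state), q = 1 (output).
W_hh = [[0.0, 0.0], [0.0, 0.0]]
W_hx = [[1.0, 0.0], [0.0, 1.0]]
b_h = [0.0, 0.0]
W_yh = [[1.0, 1.0]]
b_y = [0.0]

h0 = [0.0, 0.0]
h1, y1 = rnn_step(W_hh, W_hx, b_h, W_yh, b_y, h0, [1.0, -1.0])
```

Iterating `rnn_step` over a sequence of inputs, feeding each returned `h_t` back in as `h_prev`, reproduces the unfolded computation of Figure 7.3.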
One variant is to feed the output of the previous time step, $\hat{y}^{\langle t-1 \rangle}$, into the input so that the
input at every time is not $x^{\langle t \rangle}$ but rather some merging of $x^{\langle t \rangle}$ and $\hat{y}^{\langle t-1 \rangle}$. We can denote
this variant via,
\[
\begin{cases}
h^{\langle t \rangle} = S_h\big(W_{hh} h^{\langle t-1 \rangle} + W_{hx} (x^{\langle t \rangle} + T(\hat{y}^{\langle t-1 \rangle})) + b_h\big) \\
\hat{y}^{\langle t \rangle} = S_y\big(W_{yh} h^{\langle t \rangle} + b_y\big),
\end{cases}
\tag{7.3}
\]
⁵ The actual mapping of the weight and bias vectors to each of $(\theta_{hx}, \theta_{hh}, \theta_{yh})$ is not important. Specifically,
$b_h$ can be viewed as either part of $\theta_{hx}$ or $\theta_{hh}$.
where $T(\cdot)$ is an abstraction⁶ of some transformation which results in a vector of dimension
$p$ (like $x^{\langle t \rangle}$).
Note that the forms in (7.2) and (7.3) are suitable for many-to-many mappings, as in
Figure 7.2 (d), since each input $x^{\langle t \rangle}$ has an associated $h^{\langle t \rangle}$ and $\hat{y}^{\langle t \rangle}$. If we use these recursions
for one-to-many tasks, then we only use a first initial $x^{\langle 1 \rangle}$ and then continue the recursion
with 0 values for $x^{\langle t \rangle}$ on $t = 2, 3, \ldots$ until some stopping criterion is met. A typical criterion
is to have a special <stop> token appear within $\hat{y}^{\langle t \rangle}$. A similar adaptation can be done for
many-to-one tasks, where we simply ignore all $\hat{y}^{\langle t \rangle}$ for $t < T$.
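The one-to-many recipe just described can be sketched as a plain control-flow loop. Here `step` stands in for one unit computation of (7.2); the toy stand-in below is a stub used only to make the loop runnable, not an actual trained unit:

```python
# One-to-many generation sketch: after an initial input x1, the recursion
# continues with zero inputs until a <stop> token appears in the output.
def generate(step, h0, x1, zero_input, stop_token, max_steps=50):
    outputs = []
    h, x = h0, x1
    for _ in range(max_steps):
        h, y = step(h, x)          # one unit computation: (h_prev, x) -> (h, y_hat)
        outputs.append(y)
        if y == stop_token:        # stopping criterion: special <stop> token
            break
        x = zero_input             # all inputs after the first are zeros
    return outputs

# Toy stand-in unit: counts steps in the state and "emits" <stop> at step 3.
toy_step = lambda h, x: (h + 1, "<stop>" if h + 1 >= 3 else f"word{h + 1}")
sentence = generate(toy_step, 0, "x1", 0, "<stop>")
```

With the toy unit the loop emits two placeholder words and then the stop token, after which generation halts.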
One often refers to the mechanism of computation described in (7.2) as a gate, a simple gate,
an RNN cell, or an RNN unit. Figure 7.4 depicts this computation where $+$ and $\times$ are
the usual vector/matrix addition and multiplication operations, respectively. In the sequel
we see that the gate structure as in Figure 7.4 can be modified to more complicated forms
such as the LSTM and GRU gates appearing in Figure 7.8. In terms of applications, the simple
structure of RNN gates has already proven useful for many basic tasks, such as dealing
with short sentences for next word prediction, as well as for sentiment analysis.
Figure 7.4: An RNN unit, also known as a gate, operating on input $x^{\langle t \rangle}$ and previous cell state
$h^{\langle t-1 \rangle}$. The output vector of the unit is $\hat{y}^{\langle t \rangle}$. The unit also determines the cell state $h^{\langle t \rangle}$.
A Simple Concrete Toy Example
To illustrate the application of RNNs let us consider a simple concrete toy example of
lookahead text prediction. For simplicity we resort to one-hot encoding (in contrast to more
advanced word embedding methods). In our toy example assume a lexicon with $d_V = 8$
words, appearing here in lowercase alphabetical order as,

⁶ In practice, this variant is often useful in decoders, described in Sections 7.4 and 7.5, where $x^{\langle t \rangle}$ is often
set to 0 except for an initial <start> token, and the transformation $T(\cdot)$ typically transforms an output
embedding into the desired token (e.g., via argmax) and then transforms the token back into an input word
embedding.
$\langle$ deep, engineering, learning, machine, mathematical, of, statistics, the $\rangle$,

where each word is represented by a unit vector in $\mathbb{R}^8$. For example, the text,

the mathematical engineering of deep learning,

is represented via a sequence $x = x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, x^{\langle 3 \rangle}, x^{\langle 4 \rangle}, x^{\langle 5 \rangle}, x^{\langle 6 \rangle}$. Here for example $x^{\langle 1 \rangle} =
(0, 0, 0, 0, 0, 0, 0, 1)$ because the first word in the sequence, the, is the 8-th word in the
lexicon, and similarly $x^{\langle 2 \rangle} = (0, 0, 0, 0, 1, 0, 0, 0)$ because the second word in the sequence,
mathematical, is the 5-th word in the lexicon, etc. Observe that here with one-hot encoding,
$p = d_V$.
For a lookahead prediction application we set the network output to be of dimension $q = 8$
since each output is of the size of the lexicon. Here when the network is fed a partial input
$x^{\langle 1 \rangle}, \ldots, x^{\langle t \rangle}$, the output at time (step) $t$, denoted via $\hat{y}^{\langle t \rangle}$, should ideally be equal to or
be close to the one-hot encoded target $y^{\langle t \rangle} := x^{\langle t+1 \rangle}$ (the next word). Similarly to the
classification examples arising in multinomial regression in Section ??, our RNN will output
$\hat{y}^{\langle t \rangle}$ vectors that are probability vectors over the lexicon. In this case, using maximum a
posteriori probability decisions as in (??) of Chapter ??, the coordinate of $\hat{y}^{\langle t \rangle}$ with the
highest probability can be taken as a prediction $\widehat{Y}^{\langle t \rangle}$ which is an index into the lexicon,
determining the predicted word.
Now following the RNN evolution equations defined in (7.2), we present concrete dimensions
for this illustrative example. A design choice is the size of the hidden state $m$, which in
this case we arbitrarily take as $m = 20$. Hence, the weight matrix $W_{hh}$ dealing with state
evolution is $20 \times 20$, the weight matrix $W_{hx}$ is $20 \times 8$ dimensional as it converts the inputs to
the state, and the bias vector $b_h$ is a 20 dimensional vector. The vector activation function
$S_h(\cdot)$ is composed of scalar activations, which can be of any of the standard types (e.g.,
sigmoid); see Section ??. Further, the matrix translating state to output, $W_{yh}$, is $8 \times 20$ since
$q = 8$, and finally $b_y$ is an 8 dimensional vector. Importantly, in this case, we take the output
activation function $S_y(\cdot)$ as a softmax function since it converts the affine transformation
of the state into a probability vector. Hence in summary, such a toy network would have
$20 \times 20 + 20 \times 8 + 20 + 8 \times 20 + 8 = 748$ parameters to learn.
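The parameter count above follows mechanically from the shapes in (7.2) and can be checked in a line of code:

```python
# Parameter count for the toy RNN: p = 8 (one-hot input), m = 20 (hidden
# state), q = 8 (output over the lexicon), following the shapes
# W_hh (m x m), W_hx (m x p), b_h (m), W_yh (q x m), b_y (q) from (7.2).
p, m, q = 8, 20, 8
n_params = m * m + m * p + m + q * m + q
# 400 + 160 + 20 + 160 + 8 = 748
```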
Figure 7.5: A schematic of RNN cells unrolled for language modeling. In this illustration, the
input sentence the mathematical engineering of deep learning yields the lookahead prediction
mathematical engineering of engineering learning, with a single error.
Figure 7.5 presents a schematic illustration of the unrolling of this toy network. When the
model is used in training, the shifted sequence $y = x^{\langle 2 \rangle}, x^{\langle 3 \rangle}, \ldots$ serves as the sequence of
target labels for comparison; these are one-hot encoded vectors. Then for given weight and
bias parameters, we predict $\hat{y} = \hat{y}^{\langle 1 \rangle}, \hat{y}^{\langle 2 \rangle}, \ldots$ as the predicted labels, where each $\hat{y}^{\langle t \rangle}$ is a
vector of probabilities over the lexicon for the output word at step $t$. We then use categorical
cross entropy (see equation (??) in Chapter ??) to compute the loss. We train using gradient
based learning similarly to all other deep learning models, yet for evaluation of the gradient,
we use a variant of backpropagation called backpropagation through time which is described
below.
In production, the model is used by selecting the word at time $t$ with the
highest probability in $\hat{y}^{\langle t \rangle}$. We denote the index of this selection via $\widehat{Y}^{\langle t \rangle}$. In the illustration
of Figure 7.5, most of the words are properly predicted with the exception of the fourth
word at $t = 4$, which is predicted as engineering while the target is deep.
Training an RNN with Backpropagation Through Time
In general when training an RNN, the loss function is accumulated over all time steps $t$.
In particular, during the execution of gradient descent or some generalization of gradient
descent, such as ADAM described in Chapter ??, we compute the loss and its derivatives
with respect to the weight and bias parameters, i.e., $\theta$. A general expression for the loss is
\[
C(\theta) = \frac{1}{T} \sum_{t=1}^{T} C^{\langle t \rangle}(\theta).
\tag{7.4}
\]
Here $C^{\langle t \rangle}(\theta)$ denotes the individual loss associated with time $t$. For example, continuing
with the text language model from above, we may set,
\[
C^{\langle t \rangle}(\theta) = -\sum_{k=1}^{d_V} y_k^{\langle t \rangle} \log \hat{y}_k^{\langle t \rangle},
\tag{7.5}
\]
similarly to the categorical cross entropy in (??) of Chapter ??. Here, keep in mind that
$y^{\langle t \rangle}$ is one-hot encoded of dimension $q = d_V$, and hence the summation in (7.5) has a
single non-zero summand at the index $k$ for which $y_k^{\langle t \rangle} = 1$. Further, note that even if word
embeddings are used for the input $x^{\langle t \rangle}$, then the output $\hat{y}^{\langle t \rangle}$ still represents a probability
vector over the lexicon of size $d_V$ so that it is comparable to the target output $y^{\langle t \rangle}$.
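A minimal sketch of the loss (7.5) follows; the 4-word lexicon and the predicted probability vector below are hypothetical values chosen for illustration:

```python
import math

def categorical_cross_entropy(y, y_hat):
    """Loss (7.5): -sum_k y_k * log(y_hat_k); for a one-hot target y only
    the coordinate with y_k = 1 contributes to the sum."""
    return -sum(yk * math.log(phat) for yk, phat in zip(y, y_hat) if yk > 0)

# Hypothetical d_V = 4 lexicon: the true next word is the third word,
# and y_hat is the model's predicted probability vector over the lexicon.
y = [0.0, 0.0, 1.0, 0.0]
y_hat = [0.1, 0.2, 0.6, 0.1]
loss = categorical_cross_entropy(y, y_hat)   # equals -log(0.6)
```

The loss shrinks toward 0 as the probability assigned to the true word approaches 1, and grows without bound as that probability approaches 0.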
Gradient computation of $C(\theta)$ with respect to the various weight matrices and bias vectors
in $\theta$ is somewhat similar to the backpropagation algorithm described in Section ??; see also
Section ?? for automatic differentiation basics. However, a key difference lies in the fact that
the same weight and bias parameters are used for all times $t$; see the unfolded representation
of the RNN in Figure 7.3. This difference, as well as the fact that inputs to an RNN are of
arbitrary size $T$, imposes some hardships on gradient computation. The basic algorithm is
called backpropagation through time.
One way to view the algorithm is to momentarily return to the feedforward networks of
Chapter ?? and assume that the weight matrices and bias vectors of layers are all constrained
to be the same with a single set of parameters, $W$ and $b$, for all layers. Further, momentarily
assume that the network depth $L$ is fixed at the sequence input length $T$. This form of a
feedforward network is essentially an unfolded RNN if we consider every recursive step of
the RNN as a layer in the feedforward network, and if we ignore inputs to the RNN beyond
the first input $x^{\langle 1 \rangle}$, and impose loss on the RNN only for the last output.
One can also modify feedforward networks to have additional external inputs at each of the
hidden layers. In our feedforward analogy, assume now that an external input to the $\ell$-th
layer is $x^{\langle \ell \rangle}$, where the layer $\ell$ and the time $t$ play the same role. Further, one may impose
loss functions on feedforward networks that not only take the neurons at the last layer as
arguments, but rather use all layers, similarly to (7.4). If we also employ such a loss function
on the feedforward network analogy, then we see that we can treat the unfolded recurrent
neural network as a feedforward network, where for simplicity we treat the transformation
from $h^{\langle t \rangle}$ to $y^{\langle t \rangle}$ in (7.2) as the identity transformation. With this, let us return to the details
of the backpropagation algorithm in Section ?? and see how it can be adapted for recurrent
neural networks.
At first a forward pass is carried out to populate the neuron values. In feedforward networks
these were denoted $a^{[\ell]}$ whereas in the unfolded recurrent neural network they are denoted via
$h^{\langle t \rangle}$. Then a backward pass is used to compute the adjoint elements, denoted $\zeta^{\langle t \rangle}$, similarly
Figure 7.6: The variables and flow of information in the backpropagation through time algorithm.
The shared parameter $W$ influences the recursive forward pass computation of all cell states
$h^{\langle 1 \rangle}, \ldots, h^{\langle T \rangle}$. Once the backwards pass computation is carried out for all adjoints $\zeta^{\langle T \rangle}, \ldots, \zeta^{\langle 1 \rangle}$,
they are all used to compute the gradient of the loss, $g_W$.
to the adjoints defined in (??) of Chapter ??. In the RNN context these are,
\[
\zeta^{\langle t \rangle} = \frac{\partial C(\theta)}{\partial h^{\langle t \rangle}}.
\]
The essence of backpropagation is computing ζ^⟨t-1⟩ based on ζ^⟨t⟩. This computation follows similar lines to (??) of Chapter ??, adapted here to be,

    ζ^⟨t⟩ = (1/T) Σ_{τ=1}^{T} Ċ_τ(h^⟨τ⟩),              t = T,
    ζ^⟨t⟩ = (∂h^⟨t+1⟩/∂h^⟨t⟩) ζ^⟨t+1⟩,                 t = T-1, …, 1,        (7.6)
where Ċ_τ(h^⟨τ⟩) is the derivative of (7.5) with respect to the prediction ŷ^⟨τ⟩ for which we are given computable expressions.
Once the backpropagation through time recursion (7.6) is carried out, we use the computed adjoint sequence ζ^⟨T⟩, …, ζ^⟨1⟩ to evaluate the gradient of the loss with respect to components of θ. For simplicity let us focus only on W_hh as appearing in (7.2), denoted here as W for
brevity. Specifically, we are interested in evaluating the m × m derivative matrix,

    g_W = ∂C/∂W = (1/T) Σ_{t=1}^{T} ∂C_t(h^⟨t⟩, y^⟨t⟩; W)/∂W,        (7.7)
similarly to the notation in the feedforward case as in (??) of Chapter ??. A noticeable difference between (7.7) and (??) is that due to the loss function structure in (7.4), g_W is a direct function of the cell state at all times t (all internal layers of the unfolded graph). However, a more important difference is due to the fact that all time steps (unfolded layers) share the same parameter W, and thus the computational graph connecting W and the loss dictates that all adjoints affect g_W. See Figure 7.6 and contrast it with Figure ?? of Chapter ??.
While the feedforward case in Chapter ?? with individual parameters per layer has an easy translation of an adjoint into a gradient, as in the right hand side of (??), here the translation of adjoints to a gradient is more complicated and more computationally costly. Specifically, using the multivariate chain rule, we can informally⁷ represent the gradient as,

    g_W = (1/T) Σ_{t=1}^{T} Σ_{τ=1}^{t} (∂h^⟨τ⟩/∂W) ∂C/∂h^⟨τ⟩,        (7.8)

where the factor ∂C/∂h^⟨τ⟩ is the adjoint ζ^⟨τ⟩.
To understand the internal summation in (7.8), recall that the output ŷ^⟨t⟩, used in the individual loss C_t, depends on all cell states h^⟨1⟩, …, h^⟨t⟩, where each cell state is parameterized by a common W. Hence the computational graph for this loss component needs to be taken into account when applying the chain rule. This is also illustrated in the top part of Figure 7.6.
Note that formally the expression ∂h^⟨τ⟩/∂W in (7.8) is a derivative of a vector with respect to a matrix, and we do not handle such objects in this book. An alternative is to represent each scalar component of h^⟨t⟩ separately. Using (7.2) and assuming the vector activation function S_h(·) is composed of scalar activation functions σ(·), we have,

    h^⟨τ⟩_j = σ([W_hh h^⟨τ-1⟩ + W_hx x^⟨τ⟩ + b_h]_j).
Now using (??) from Appendix ??, we have that the derivative of the scalar h^⟨τ⟩_j with respect to the weight matrix W_hh (abbreviated as W) is given by the matrix,

    ∂h^⟨τ⟩_j/∂W = σ̇([W_hh h^⟨τ-1⟩ + W_hx x^⟨τ⟩ + b_h]_j) e_j (h^⟨τ-1⟩)^⊤,        (7.9)
where e_j is the m-dimensional unit vector with 1 at the j-th coordinate, and σ̇(·) is the derivative of the scalar activation function; see also Section ??.
Continuing with the approach of treating individual neurons h^⟨τ⟩_j, let us now present a more precise version of (7.8). For this consider the fact that in computing each individual loss, C_τ, we rely on the neurons with the cell states h^⟨1⟩_j, …, h^⟨τ⟩_j, for all j = 1, …, m. In turn, each of these neurons is influenced by W (shorthand for W_hh), as in (7.9). Now (also

⁷ The representation in (7.8) is informal because the vector-matrix derivative ∂h^⟨τ⟩/∂W is not a matrix.
summing up over all individual losses for t = 1, …, T), we use the multivariate chain rule to arrive at,

    g_W = (1/T) Σ_{t=1}^{T} Σ_{τ=1}^{t} Σ_{j=1}^{m} (∂h^⟨τ⟩_j/∂W) ζ^⟨τ⟩_j,        (7.10)
which is fully computable using the backpropagated adjoints from (7.6) and (7.9).
To summarize backpropagation through time, we first carry out a forward pass to populate h^⟨1⟩, …, h^⟨T⟩ using (7.2) or the (7.3) variant. We then carry out a backward pass to populate the adjoints ζ^⟨T⟩, …, ζ^⟨1⟩ using (7.6). We then compute the gradient g_W via (7.10). This summary is for our simplified case focusing only on W = W_hh and ignoring the fact that y^⟨t⟩ is generally not h^⟨t⟩, but rather constructed via the second equation in (7.2). Hence in our simplified presentation we focused on the essence and ignored the remaining details for the complete set of θ parameters.
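To make the summary concrete, here is a minimal numpy sketch (ours, not the book's code) of backpropagation through time in the simplified setting above: tanh activation, identity output ŷ^⟨t⟩ = h^⟨t⟩, squared-error per-step losses C_t, and the gradient taken only with respect to W = W_hh. Function names such as `rnn_forward` and `bptt_grad_Whh` are illustrative choices.

```python
import numpy as np

def rnn_forward(W_hh, W_hx, b_h, xs):
    # Forward pass of (7.2) with tanh activation and h^<0> = 0.
    h = np.zeros(b_h.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_hx @ x + b_h)
        hs.append(h)
    return hs

def rnn_loss(W_hh, W_hx, b_h, xs, ys):
    # Loss of the form (7.4), with each C_t a squared error on h^<t> itself.
    hs = rnn_forward(W_hh, W_hx, b_h, xs)
    return 0.5 * sum(np.sum((h - y) ** 2) for h, y in zip(hs, ys)) / len(xs)

def bptt_grad_Whh(W_hh, W_hx, b_h, xs, ys):
    # Backward pass: adjoints accumulated from t = T down to t = 1, with
    # rank-one gradient contributions in the spirit of (7.9)-(7.10).
    hs = rnn_forward(W_hh, W_hx, b_h, xs)
    T, m = len(xs), b_h.shape[0]
    g_W = np.zeros_like(W_hh)
    back = np.zeros(m)                       # adjoint arriving from step t+1
    for t in reversed(range(T)):
        zeta = back + (hs[t] - ys[t]) / T    # add the direct loss term at step t
        delta = (1 - hs[t] ** 2) * zeta      # multiply by the tanh derivative
        h_prev = hs[t - 1] if t > 0 else np.zeros(m)
        g_W += np.outer(delta, h_prev)       # rank-one term as in (7.9)
        back = W_hh.T @ delta                # push the adjoint to step t-1
    return g_W
```

A finite-difference check of a single entry of g_W against `rnn_loss` is a quick way to validate such an implementation.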
Let us also consider the Jacobian ∂h^⟨t+1⟩/∂h^⟨t⟩ appearing in (7.6). Again, assuming that the vector activation function S_h(·) of (7.2) is composed of element wise scalar activation functions σ(·), this Jacobian can be represented as,

    ∂h^⟨t+1⟩/∂h^⟨t⟩ = W_hh diag(σ̇(W_hh h^⟨t⟩ + W_hx x^⟨t+1⟩ + b_h)),        (7.11)

where the derivative of the activation function is denoted via σ̇(·) and is applied element wise to the components of its input.
Computational Challenges
We discussed vanishing and exploding gradient phenomena in Section
??
, where in equations
(??)
and
(??)
we saw how both the forward pass and the backwards pass involve actions
of repeated matrix multiplication. More specifically, the backpropagation based equation
(??)
is based on a simplification of a feedforward neural network that ignores the effect of
activation functions, ignores the bias, and assumes that each layer of the network has the
same weight matrix. In such a case, it is evident that for deep networks (
L
large), vanishing
or exploding gradient phenomena are likely to occur.
In recurrent neural networks, such phenomena are even more problematic than typical deep
feedforward networks because the input size
T
(paralleling the depth of the feedforward
network L), can be large. Unrolling (7.6) we get for t = 1, . . . , T 1,
    ζ^⟨t⟩ = (∂h^⟨t+1⟩/∂h^⟨t⟩) (∂h^⟨t+2⟩/∂h^⟨t+1⟩) ⋯ (∂h^⟨T-1⟩/∂h^⟨T-2⟩) (∂h^⟨T⟩/∂h^⟨T-1⟩) ζ^⟨T⟩.
Now using (7.11) and for simplicity ignoring the action of the activation function (treating it as an identity function), ignoring the input x^⟨t⟩, and ignoring the bias term, we obtain,

    ζ^⟨t⟩ = (W_hh)^{T-t} ζ^⟨T⟩.        (7.12)
This representation is similar to (??) of Chapter ??, and is even more realistic since in recurrent neural networks the weight matrices of all unrolled layers are the same, whereas in the Chapter ?? analysis of feedforward networks fixing the weight matrix was a simplification.
Hence, in recurrent neural networks trained on inputs with large sizes T, it is very likely that during backpropagation, the adjoint values ζ^⟨t⟩ vanish or explode. This follows from the matrix power in (7.12), since in most situations, the maximal eigenvalue of W_hh is likely to not be at or near unity (see also the discussion on the effect of eigenvalues on vanishing and exploding phenomena in Section ??).
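The matrix-power effect in (7.12) is easy to see numerically. The following illustration (ours, not from the book) takes W_hh diagonal for transparency, so each adjoint component evolves as λ_i^{T-t} and the largest eigenvalue magnitude decides between vanishing and exploding:

```python
import numpy as np

def adjoint_norms(eigenvalues, T):
    # Track ||zeta^<t>|| as (7.12) is applied one factor at a time.
    W_hh = np.diag(eigenvalues)
    zeta = np.ones(len(eigenvalues))      # a stand-in for zeta^<T>
    norms = [np.linalg.norm(zeta)]
    for _ in range(T):
        zeta = W_hh @ zeta                # one factor of the product in (7.12)
        norms.append(np.linalg.norm(zeta))
    return norms

vanishing = adjoint_norms(np.linspace(0.9, 0.1, 8), T=80)   # max eigenvalue 0.9
exploding = adjoint_norms(np.linspace(1.1, 0.1, 8), T=80)   # max eigenvalue 1.1
```

With 80 steps, the maximal eigenvalue of 0.9 shrinks the adjoint norm by a factor of roughly 0.9⁸⁰ ≈ 2·10⁻⁴, while 1.1 amplifies it by roughly 1.1⁸⁰ ≈ 2·10³.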
Considering (7.12) and assuming W_hh has a maximal eigenvalue less than unity in absolute value, then if T is large, for small t, ζ^⟨t⟩ ≈ 0. One way to express this is to consider some T₀ < T, such that for example if T = 300, we set T₀ = 250, and then for t < T₀ assume ζ^⟨t⟩ = 0. In this case the gradient computation (7.10) can be roughly represented as,

    g_W ≈ (1/T) Σ_{t=T₀}^{T} Σ_{τ=T₀}^{t} Σ_{j=1}^{m} (∂h^⟨τ⟩_j/∂W) ζ^⟨τ⟩_j.        (7.13)
Now considering the influence of the input via (7.13) and (7.9), we see that the gradient is only updated based on “near effects”, and not based on “long-term effects”, since inputs to the sequence x^⟨t⟩ for t < T₀ do not play a role. For example in language modelling, the contribution of faraway words to predicting the next word at a given time step diminishes when the gradient vanishes early on. As an example consider the text

    Slava grew up in Ukraine before he moved around the world, first to the
    United States, and then to Australia. He loves teaching languages and is an
    avid teacher of his own mother tongue _.
In this case, completion of the end of the text, marked via _, requires information from the start of the text. Models presented in the sections below were also designed to overcome such difficulties.
Further, with recurrent neural networks, computation of the loss and of the gradients across an entire corpus is generally infeasible or too expensive. In practice, a batch of sentences is used to compute the loss so as to limit the sequence size T. Note also that in cases where W_hh has eigenvalues greater than unity in absolute value, an exploding gradient phenomenon is likely to occur. For this, gradient clipping may be employed as described at the end of Section ??. Another technique is to use truncated backpropagation through time (TBPTT), which limits the number of time steps the signal can backpropagate in each forward pass.
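As a concrete illustration of norm-based gradient clipping, the sketch below (our own; the function name and threshold are illustrative choices) rescales a gradient whenever its Euclidean norm exceeds a threshold:

```python
import numpy as np

def clip_gradient(g, max_norm):
    # Rescale g so that its Euclidean norm never exceeds max_norm;
    # this bounds the size of a gradient step when gradients explode.
    norm = np.linalg.norm(g)
    if norm > max_norm:
        return g * (max_norm / norm)
    return g
```

The clipped gradient keeps its direction; only its magnitude is capped.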
Other Aspects of Training
Some practices of training recurrent neural networks are very similar to training feedforward or convolutional networks. For example, one uses similar weight initialization techniques to those introduced in Section ?? in the context of feedforward networks. However, there are some differences as well. An important aspect to keep in mind is that unlike the supervised setting that prevailed with the models of Chapter ?? and Chapter ??, with recurrent models we are often able to train with self-supervision. Specifically, as already discussed in the example of Figure 7.5, we may use a shifted sequence y = x^⟨2⟩, x^⟨3⟩, … as the desired output for the loss, and simply train the model for one step lookahead prediction.

Note however, that not all training is of the self-supervised form. In some cases, often arising in machine translation applications described in the sequel, we are naturally presented with an input sequence x^⟨1⟩, x^⟨2⟩, … which may result from word embedding of one natural language
(e.g., English) and a corresponding output sequence y^⟨1⟩, y^⟨2⟩, …, of one-hot encoded vectors, associated with another natural language (e.g., Arabic). Hence recurrent neural networks can be trained in a supervised setting as well.
In both the self-supervised and supervised settings, in cases where we use the formulation (7.3), where the output ŷ^⟨t-1⟩ is fed into the input, we sometimes use a training technique called teacher forcing. The idea of teacher forcing is to use the actual (correct) one-hot encoded label y^⟨t-1⟩ in place of the model generated (predicted) probability vector ŷ^⟨t-1⟩ during training. That is, the recursion (7.3) now has inputs that are based on the actual labels instead of the predictions. Note that in this case, T(·) in (7.3) can be viewed as also converting the probability vector into a word embedding, if needed. This technique accelerates training by removing prediction errors from the fed-back inputs. We revisit the teacher forcing technique both at the end of Section 7.4 in the context of encoder-decoder models, and at the end of Section 7.5 in the context of transformers, where it is extremely powerful due to parallelization.
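The contrast between free-running decoding, which feeds the model's own previous prediction back in as in (7.3), and teacher forcing, which feeds the ground-truth label, can be sketched with a toy numpy model (ours; names such as `unroll` are illustrative, and for simplicity the label dimension doubles as the input dimension so labels can be fed back directly):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

def unroll(params, first_x, targets, teacher_forcing):
    # targets: one-hot label vectors y^<1>, ..., y^<T>, each of dimension q,
    # which is also taken as the input dimension.
    W_hh, W_hx, W_yh, b_h, b_y = params
    h = np.zeros(b_h.shape[0])
    x = first_x
    outputs = []
    for y in targets:
        h = np.tanh(W_hh @ h + W_hx @ x + b_h)
        y_hat = softmax(W_yh @ h + b_y)       # predicted probability vector
        outputs.append(y_hat)
        # Teacher forcing feeds the true label; free-running feeds y_hat.
        x = y if teacher_forcing else y_hat
    return outputs
```

During training one would compare the outputs against the targets with, e.g., cross-entropy; the only difference between the two regimes is which vector is fed back at each step.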
7.3 Generalizations and Modifications to RNNs
The basic recurrent networks of Section 7.2, while powerful, still suffer from some drawbacks
in terms of training, vanishing and exploding gradient, and expressivity. In this section we
highlight a few generalizations and modifications to RNNs that enable more powerful models
for sequence data. An underlying concept is the connection of gates in various creative ways
that enable more expressive and robust models. The notion of a gate was already illustrated
in Figure 7.4. In this section we see how such gates can be connected in diverse ways, as
well as how the internals of the gate can be extended to yield more powerful models.
Stacking and Reversing Gates
Basic extensions to recurrent neural networks are possible by interconnecting gates in
more complicated forms than just a forward direction of data flow. In particular, common
approaches are to either stack the gates to form deeper networks, reverse the gates, or combine
the two approaches. See Figure 7.7 for a schematic representation of such interconnections
of gates.
Let us first consider reversing of gates as in Figure 7.7 (a) to create a bidirectional recurrent neural network. For such a modification we extend the RNN evolution equation (7.2) to,
    h^⟨t⟩_f = S_h(W^f_hh h^⟨t-1⟩_f + W^f_hx x^⟨t⟩ + b^f_h)
    h^⟨t⟩_r = S_h(W^r_hh h^⟨t+1⟩_r + W^r_hx x^⟨t⟩ + b^r_h)
    ŷ^⟨t⟩ = S_y(W^f_yh h^⟨t⟩_f + W^r_yh h^⟨t⟩_r + b_y),        (7.14)
where now for every time t there are two cell states, h^⟨t⟩_f and h^⟨t⟩_r, representing the forward direction and reverse direction respectively. Observe in (7.14) that h^⟨t⟩_f evolves based on the input x^⟨t⟩ and h^⟨t-1⟩_f, while h^⟨t⟩_r evolves based on the input x^⟨t⟩ and h^⟨t+1⟩_r. Naturally with such an extension there are more trained parameters, superscripted via f and r respectively in (7.14).
Figure 7.7: Alternative configurations and extensions of recurrent neural networks. (a) Bidirectional RNN. (b) Stacked RNN.
As is evident from (7.14), the forward sequence of cell states h^⟨1⟩_f, h^⟨2⟩_f, …, h^⟨T⟩_f and the reverse sequence of cell states h^⟨T⟩_r, h^⟨T-1⟩_r, …, h^⟨1⟩_r evolve without interaction. Once computed, these sequences are then combined to obtain the output sequence. Such bidirectional data flow enables the model to be more versatile, especially for cases where the entire input sequence is available. This is the setup in applications such as handwritten text recognition, machine translation, speech recognition, and part-of-speech tagging, among others.
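The two non-interacting sweeps of (7.14) can be sketched in a few lines of numpy (our illustration, with tanh cell activation, a linear output map, and zero boundary states h^⟨0⟩_f and h^⟨T+1⟩_r):

```python
import numpy as np

def bidirectional_rnn(fwd, rev, out, xs):
    # fwd and rev hold (W_hh, W_hx, b_h) for each direction;
    # out holds (W_f_yh, W_r_yh, b_y) for combining the two sweeps.
    (Wf_hh, Wf_hx, bf), (Wr_hh, Wr_hx, br), (Wf_yh, Wr_yh, b_y) = fwd, rev, out
    m, T = bf.shape[0], len(xs)
    h_f, h_r = [None] * T, [None] * T
    h = np.zeros(m)
    for t in range(T):                     # forward sweep
        h = np.tanh(Wf_hh @ h + Wf_hx @ xs[t] + bf)
        h_f[t] = h
    h = np.zeros(m)
    for t in reversed(range(T)):           # reverse sweep, no interaction
        h = np.tanh(Wr_hh @ h + Wr_hx @ xs[t] + br)
        h_r[t] = h
    # Combine both directions at each time step, as in the third line of (7.14).
    return [Wf_yh @ h_f[t] + Wr_yh @ h_r[t] + b_y for t in range(T)]
```

Note that the full input sequence must be available before the reverse sweep can start, which is why this configuration suits whole-sequence tasks.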
Let us now consider stacking of gates as in Figure 7.7 (b) to create a deeper model, also
sometimes known as a stacked recurrent neural network. With this paradigm we extend the
evolution equations (7.2) to,
    h^⟨t⟩[1] = S^[1]_h(W^[1]_hh h^⟨t-1⟩[1] + W^[1]_hx x^⟨t⟩ + b^[1]_h)
    h^⟨t⟩[2] = S^[2]_h(W^[2]_hh h^⟨t-1⟩[2] + W^[2]_hx h^⟨t⟩[1] + b^[2]_h)
        ⋮
    h^⟨t⟩[L] = S^[L]_h(W^[L]_hh h^⟨t-1⟩[L] + W^[L]_hx h^⟨t⟩[L-1] + b^[L]_h)
    ŷ^⟨t⟩ = S_y(W_yh h^⟨t⟩[L] + b_y),        (7.15)
where we now use notation such as [1], …, [L] to signify the depth of individual components and L is the number of stacked layers, similarly to the notation of Chapter ??. Observe that the cell state at time t and depth ℓ, denoted via h^⟨t⟩[ℓ], is computed based on the cell state at depth ℓ-1 and the same time t using the matrix W^[ℓ]_hx (where the notation x here in the subscript implies the previous level). It is also computed using the cell state at the same depth, ℓ, and the previous time, t-1, using the matrix W^[ℓ]_hh.
Such stacked RNN models are clearly more expressive and thus they generally outperform single-layer recurrent neural networks when trained with enough data. However, they are harder to train as the number of parameters grows proportionally to the number of layers. We also mention that combinations of stacking and reversing are possible.
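A forward pass of the stacked evolution (7.15) can be sketched as follows (our illustration, with tanh activations and a linear output map; `stacked_rnn_forward` is an illustrative name):

```python
import numpy as np

def stacked_rnn_forward(layers, W_yh, b_y, xs):
    # layers is a list of (W_hh, W_hx, b_h) tuples, one per depth 1..L.
    # Each layer receives the cell state of the layer below as its input,
    # as in (7.15); the bottom layer receives x^<t>.
    hs = [np.zeros(b_h.shape[0]) for (_, _, b_h) in layers]
    outputs = []
    for x in xs:
        inp = x
        for l, (W_hh, W_hx, b_h) in enumerate(layers):
            hs[l] = np.tanh(W_hh @ hs[l] + W_hx @ inp + b_h)
            inp = hs[l]                    # feeds the layer above via its W_hx
        outputs.append(W_yh @ hs[-1] + b_y)  # output from the top layer only
    return outputs
```

Note the parameter growth: each added layer contributes its own W_hh, W_hx, and b_h.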
Long Short Term Memory Models
Long short term memory (LSTM) models are generalizations of basic recurrent neural
networks that are designed to preserve information over many time steps. To understand
the idea behind LSTM, it is constructive to think in terms of logical operations that
are approximated via multiplication of vectors. In particular, as we see below, different
components of LSTM interplay in a way that can heuristically be described as computation
of a logical circuit. More specifically, some of the neurons inside LSTM can be called internal
gates and are represented as values in the range [0
,
1] and these are then multiplied by other
neurons with arbitrary real values. In particular, when a vector of neurons in such an internal
gate, say
g
, has elements in the range [0
,
1], and another vector of neurons, say
c
has general
real values, then the element wise multiplication
g c
can be viewed as a restriction which
approximately zeros out (forgets) entries of
c
when the corresponding entry of
g
is near 0.
We informally say that the entry of the internal gate is “open” when it is approximately
at 1 and similarly “closed” when it is approximately 0. Using internal gates for this type
of “approximate logical masking” is common in these models as well as the gated recurrent
units described in the sequel.
A key concept in LSTM is to extend the hidden units of RNNs by separating the information flow between units into two groups where one group is called the cell state and denoted c^⟨t⟩, while the other group is called the hidden state and denoted h^⟨t⟩. The model is designed so that long term dependencies are generally retained through c^⟨t⟩ while short term dependencies are carried by h^⟨t⟩. The interaction between these groups of neurons is enabled via additional groups of neurons, namely the internal gates, which are generally vectors with entries in the range [0, 1], denoted via g^⟨t⟩_f, g^⟨t⟩_i, and g^⟨t⟩_o. An additional internal group of neurons, denoted via c̃^⟨t⟩, is sometimes called the internal cell state.
For the basic recurrent neural network models of Section 7.2, we used m for the number of neurons and this is also the dimension of information flow between successive units. However, for LSTM, only some of the neurons are used for information flow between units, namely c^⟨t⟩ and h^⟨t⟩. In terms of dimension, we retain m as the number of neurons, and assume that m = 6m̃ where the dimensions of all the vectors c^⟨t⟩, h^⟨t⟩, c̃^⟨t⟩, g^⟨t⟩_f, g^⟨t⟩_i, and g^⟨t⟩_o are m̃.
Figure 7.8: Representation of the LSTM and the GRU units. Internal gates are represented in yellow and internal states are in gray. The output ŷ^⟨t⟩ is not presented. (a) In LSTM there are three internal gates and the internal state is called the internal cell state. (b) In GRU there are two internal gates and the internal state is called the internal hidden state.
A basic LSTM unit is illustrated in Figure 7.8 (a) which summarizes the evolution associated
with this unit. Like the simpler RNN counterpart in Figure 7.4, and equations
(7.2)
, the
evolution equations of LSTM describe how the pair (c^⟨t⟩, h^⟨t⟩) evolves as a function of the previous pair (c^⟨t-1⟩, h^⟨t-1⟩) and the input x^⟨t⟩. Further, the output ŷ^⟨t⟩ evolves based on (c^⟨t⟩, h^⟨t⟩), directly via h^⟨t⟩ and indirectly based on c^⟨t⟩. Unlike the RNN (7.2), the LSTM evolution is more complex since it also involves the internal gates and neurons.
The LSTM evolution equations are,

    c^⟨t⟩ = g_f ⊙ c^⟨t-1⟩ + g_i ⊙ c̃^⟨t⟩,   with   c̃^⟨t⟩ = S_Tanh(W_c̃h h^⟨t-1⟩ + W_c̃x x^⟨t⟩ + b_c̃),        (cell state)
    h^⟨t⟩ = g_o ⊙ S_Tanh(c^⟨t⟩),        (hidden state)
    ŷ^⟨t⟩ = S_y(W_yh h^⟨t⟩ + b_y),        (7.16)
where for clarity we omit the time superscripts from the internal gates and denote them via g_f, g_i, and g_o. Importantly, at every time t these internal gates are computed as,

    g_f = S_Sig(W_fh h^⟨t-1⟩ + W_fx x^⟨t⟩ + b_f)        (forget gate)
    g_i = S_Sig(W_ih h^⟨t-1⟩ + W_ix x^⟨t⟩ + b_i)        (input gate)
    g_o = S_Sig(W_oh h^⟨t-1⟩ + W_ox x^⟨t⟩ + b_o).        (output gate)        (7.17)
Note that to restrict the value of internal gates to the range [0, 1], sigmoid activation functions are typically used and we denote the associated vector activation function via S_Sig(·). The hidden state, the cell state, and the internal cell state information is not generally restricted to [0, 1] and a typical activation function is tanh, where we denote the associated vector activation function via S_Tanh(·).
As evident from (7.16) and (7.17), for an LSTM with input of dimension p and output of dimension q, the trained LSTM parameters include the following. First there are the four m̃ × m̃ weight matrices W_c̃h, W_fh, W_ih, and W_oh. Further there are the four m̃ × p weight matrices W_c̃x, W_fx, W_ix, and W_ox. In addition there is the q × m̃ weight matrix W_yh, as well as the five associated bias vectors.
The specific structure of an LSTM unit interconnects the internal gates in a way that enables using both long term and short term memory, captured in c^⟨t⟩ and h^⟨t⟩ respectively. Specifically, the internal gates help select which information is “forgotten”, “used as input”, or “used as output”. At each time step t the entries in the internal gate vectors g^⟨t⟩_f, g^⟨t⟩_i, and g^⟨t⟩_o can be “open”, “closed”, or somewhere in-between, where entries that are near 1 are considered open and entries that are near 0 are considered closed. The forget gate g^⟨t⟩_f is multiplied element wise with the previous cell state c^⟨t-1⟩ to “forget” information from the previous cell state or not, depending on being closed or open respectively. Similarly, the input gate g^⟨t⟩_i controls what parts of the new cell content are written to the cell, and this is applied to the internal cell state c̃^⟨t⟩ which models the “selected information” based on the current input and the previous short term memory. Finally the output gate, g^⟨t⟩_o, controls what parts of the cell are written to the hidden state h^⟨t⟩, which is then used both for the output ŷ^⟨t⟩ and the short term memory passed onto the next unit.
It is interesting to consider the magnitudes of the LSTM elements, specifically in the first equation of (7.16). At time t, the previous cell state c^⟨t-1⟩ may have entries with general
values (not limited to [-1, 1]). These values may then be “forgotten” if multiplied by g^⟨t⟩_f in cases where it is approximately at 0. Further, new long term memory is accumulated when g^⟨t⟩_i is approximately at 1. Observe that since the tanh activation function’s range is [-1, 1], the accumulation of this new memory is limited at every time step. Specifically, based on the internal cell state, c̃^⟨t⟩, the memory may increase or decrease by at most 1 per time step.
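The full unit (7.16)-(7.17) fits in a few lines of numpy. The sketch below (ours, not the book's code) makes the per-step memory bound just discussed visible in the g_i ⊙ c̃^⟨t⟩ term:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(params, c_prev, h_prev, x):
    # One application of (7.16)-(7.17); parameter naming follows the text.
    (W_ch, W_cx, b_c, W_fh, W_fx, b_f,
     W_ih, W_ix, b_i, W_oh, W_ox, b_o) = params
    g_f = sigmoid(W_fh @ h_prev + W_fx @ x + b_f)      # forget gate, in (0, 1)
    g_i = sigmoid(W_ih @ h_prev + W_ix @ x + b_i)      # input gate, in (0, 1)
    g_o = sigmoid(W_oh @ h_prev + W_ox @ x + b_o)      # output gate, in (0, 1)
    c_tilde = np.tanh(W_ch @ h_prev + W_cx @ x + b_c)  # internal cell state, in (-1, 1)
    c = g_f * c_prev + g_i * c_tilde                   # cell state: long term memory
    h = g_o * np.tanh(c)                               # hidden state: short term memory
    return c, h
```

Since |g_i ⊙ c̃^⟨t⟩| < 1 element wise, each entry of c^⟨t⟩ can drift at most 1 away from g_f ⊙ c^⟨t-1⟩ per step, while h^⟨t⟩ = g_o ⊙ tanh(c^⟨t⟩) always lies in (-1, 1).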
The interconnection of LSTM units follows the same principles as the interconnection of recurrent neural network units outlined above. Specifically, one may view an unrolled representation of LSTM in the same manner as an unrolled representation of basic recurrent neural networks, presented in Figure 7.3. The difference is that both c^⟨t⟩ and h^⟨t⟩ are passed between time t and time t+1, and not just h^⟨t⟩ as in basic RNN. With this, the same extensions that one may consider for basic RNNs can be applied to LSTM. Specifically, LSTM can be reversed as in Figure 7.7 (a) or stacked into a deeper architecture as in Figure 7.7 (b). In reversing LSTM, the reverse direction LSTM passes (c^⟨t+1⟩, h^⟨t+1⟩) into the unit computing (c^⟨t⟩, h^⟨t⟩). In stacking LSTMs, the hidden state h^⟨t⟩[ℓ] of layer ℓ is passed as an input (similar to x^⟨t⟩) to the unit above at layer ℓ+1, but not the cell state. Note that in stacked LSTM we may view c^⟨t⟩[1], …, c^⟨t⟩[L] as a representation of long term memory across all layers at step t. This long term memory is passed to the next step, t+1.
Gated Recurrent Unit Models
An alternative to the LSTM architecture is the gated recurrent unit (GRU) architecture, with
a unit illustrated in Figure 7.8 (b). While LSTMs make an explicit separation of neurons to
be long term or short term, with GRUs we return to a somewhat simpler architecture with
only one key set of neurons
h
t
, again called the hidden state. Gated recurrent units store
both long-term dependencies and short-term memory in the single hidden state. Like LSTMs,
gated recurrent units use internal gates with values in the range [0
,
1], this time called the
reset gate
g
t
r
and the update gate
g
t
u
. Similarly to LSTMs that maintain an internal cell
state, GRUs maintain an internal hidden state
˜
h
t
. Setting
˜m
as the number of neurons
in each of these groups, the total number of neurons in a GRU is
m
= 4
˜m
. Hence with 4
components instead of 6 components, gated recurrent units provide a simpler architecture in
comparison to LSTM as there are only two internal gates (in comparison to three) and a
single group of states passed between time units (in comparison to two).
The basic evolution equations for gated recurrent units are,

    h^⟨t⟩ = (1 - g_u) ⊙ h^⟨t-1⟩ + g_u ⊙ h̃^⟨t⟩,   with   h̃^⟨t⟩ = S_Tanh(W_h̃h (g_r ⊙ h^⟨t-1⟩) + W_h̃x x^⟨t⟩ + b_h̃),
    ŷ^⟨t⟩ = S_y(W_yh h^⟨t⟩ + b_y),        (7.18)
(
g
r
= S
Sig
W
rh
h
t1
+ W
rx
x
t
+ b
r
(reset gate)
g
u
= S
Sig
W
uh
h
t1
+ W
ux
x
t
+ b
u
. (update gate)
(7.19)
A key attribute of the first equation of (7.18) is that new entries of the hidden state h^⟨t⟩ are computed as a convex combination of the entries of the previous hidden state h^⟨t-1⟩ and the
internal hidden state h̃^⟨t⟩. This convex combination is determined by the entries of g^⟨t⟩_u, where an entry near 1 implies “update” of the hidden state based on the internal hidden state, and an entry near 0 implies retaining the previous value (not updating).
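One step of (7.18)-(7.19) can be sketched as follows (our numpy illustration; parameter naming follows the text):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(params, h_prev, x):
    # One application of (7.18)-(7.19).
    W_rh, W_rx, b_r, W_uh, W_ux, b_u, W_hh, W_hx, b_h = params
    g_r = sigmoid(W_rh @ h_prev + W_rx @ x + b_r)              # reset gate
    g_u = sigmoid(W_uh @ h_prev + W_ux @ x + b_u)              # update gate
    h_tilde = np.tanh(W_hh @ (g_r * h_prev) + W_hx @ x + b_h)  # internal hidden state
    return (1 - g_u) * h_prev + g_u * h_tilde                  # convex combination
```

One consequence of the convex combination: if h^⟨0⟩ starts in (-1, 1), the hidden state stays in (-1, 1) forever, since h̃^⟨t⟩ is a tanh output.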
As evident from (7.18) and (7.19), for a GRU with input of dimension p and output of dimension q, the trained parameters include the following. First there are the three m̃ × m̃ weight matrices W_h̃h, W_rh, and W_uh. Further there are the three m̃ × p weight matrices W_h̃x, W_rx, and W_ux. In addition there is the q × m̃ weight matrix W_yh, as well as the four associated bias vectors. Again as evident, the number of parameter groups is smaller than that of LSTM.
To gain some intuition about the GRU architecture we may observe that the update gate g_u plays a role similar to both the forget gate, g_f, and the input gate, g_i, in LSTM. Specifically, compare the first equation in (7.18) with the first equation in (7.16). The simplification offered by GRU is to use a convex combination (1 - g_u, g_u) instead of a general linear combination (g_f, g_i) as in LSTM. In both architectures this operation controls what parts of the long term memory information are updated versus preserved. One may also observe that GRU’s internal hidden state h̃^⟨t⟩ is updated via a slightly more complex mechanism than LSTM’s internal cell state c̃^⟨t⟩. The innovation in GRUs is to use the reset gate, g_r. Practice has shown that with such an architecture, GRUs are able to maintain both long term and short term memory inside the hidden state sequence, h^⟨1⟩, h^⟨2⟩, ….
Note that the interconnection of GRUs can follow the exact same lines as other recurrent neu-
ral network architectures. Again, bi-directional connections as well as stacked configurations
are possible; see Figure 7.7.
7.4 Encoders Decoders and the Attention Mechanism
One of the great application successes of sequence models is in the domain of machine translation tasks, namely the translation of one human language to another. For this, a general paradigm involving an encoder neural network and a decoder neural network is common. Other applications of encoders and decoders include image to text models and text to image models. Yet, the main motivation we consider here is machine translation, since this application was the main driver in the development of encoder-decoder architectures within sequence models.

An important machine learning concept that has advanced machine translation and other tasks is the attention mechanism. This idea is incorporated in transformer models that currently drive state of the art large language models. Transformers are the topic of the next section and in this current section, we first introduce general ideas of encoder-decoder architectures with the motivation of machine translation. We then formally define the attention mechanism. Finally, we see an encoder-decoder architecture that incorporates the attention mechanism at the interface of the encoder and the decoder.
Encoder-Decoder Architectures for Machine Translation
Recall from Section 7.1 that in general, when considering natural language, the input text is converted into a sequence of word embeddings denoted x^⟨1⟩, x^⟨2⟩, …. With such a sequence, at some point, an embedding of a word or token such as <stop> appears and marks the end
of the text. In a machine translation application, our goal is to convert this input sequence to an output sequence ŷ^⟨1⟩, ŷ^⟨2⟩, …, also containing a <stop> token representation at its end. Clearly the input is in one natural language, e.g., French, and the output is in another natural language, e.g., Telugu.
Machine translation handled via an encoder-decoder architecture uses a setup similar to Figure 7.9 (a). First, an encoder model, which is a recurrent neural network or a variant such as LSTM or GRU, is used to convert the input sequence into the latent space by creating a context vector, also known as the code, denoted via z. Ideally this code encompasses the meaning and style of the input text. Then, a second sequence model, known as the decoder, takes the code z as input and converts it to the output sentence. Clearly the encoder model in this setup is configured as a many to one model, while the decoder model is configured as a one to many model. In the decoder, the output at each time is fed into the input for the next time as in (7.3). Note that the code z is a vector of fixed dimension. Further, the dimension p of each x^⟨t⟩ and the dimension q of each ŷ^⟨t⟩ typically differ and each typically has its own encoding. The input and output sequences are of arbitrary length, where the length of the input sequence and the length of the output sequence may differ.
[Figure 7.9 schematic: in both panels the encoder units consume the input “We enjoyed writing this book”, producing the context vector z, and the decoder units, starting from a <start> token, emit “Nous avons aimé écrire ce livre <end>”.]
Figure 7.9: Unrolling of basic encoder-decoder architectures for machine translation. (a) A basic architecture where the encoder output context vector z is computed and fed as the initial state to the decoder. (b) A more advanced architecture where z is also presented at the input and output at each time step of the decoder.
This basic type of encoder-decoder architecture, as in Figure 7.9 (a), has already proven quite useful for early attempts of machine translation using deep sequence models. The choice between basic RNN, LSTM, GRU, or stacked combinations of one of these types of units is a modelling choice that one can make. No matter what type of unit is used, a key weakness is that the impact of the code z on the output ŷ^⟨1⟩, ŷ^⟨2⟩, … decreases as t grows within the predicted output. Nevertheless, this architecture is a starting point for more advanced architectures.
A natural improvement is to make the context vector accessible for all steps in the decoder. With such a setup, at each time the decoder is fed the concatenation of the previous output and the code vector z. A second improvement is to also present the code vector z for the computation of the output,⁸ where at this point the output computation is based on a concatenation of the cell state, h^⟨t⟩, and the code z. This architecture is depicted in Figure 7.9 (b).
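A skeletal numpy version of this setup can be sketched as follows (our illustration of a Figure 7.9 (b) style architecture, with simple RNN units, a linear decoder output, and z concatenated to the decoder input at every step; names such as `encode` and `decode` are illustrative):

```python
import numpy as np

def encode(enc_params, xs):
    # Many-to-one encoder: consume the whole input, return the code z.
    W_hh, W_hx, b_h = enc_params
    h = np.zeros(b_h.shape[0])
    for x in xs:
        h = np.tanh(W_hh @ h + W_hx @ x + b_h)
    return h

def decode(dec_params, z, y0, n_steps):
    # One-to-many decoder: at each step the previous output is concatenated
    # with z, so W_hx here has shape m x (q + m).
    W_hh, W_hx, b_h, W_yh, b_y = dec_params
    h = z                      # the code fed as the initial state
    y = y0
    outputs = []
    for _ in range(n_steps):
        h = np.tanh(W_hh @ h + W_hx @ np.concatenate([y, z]) + b_h)
        y = W_yh @ h + b_y     # previous output fed back at the next step
        outputs.append(y)
    return outputs
```

In a real translation model the decoder would emit probability vectors and stop upon producing <stop>; here a fixed number of steps keeps the sketch minimal.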
In the context of machine translation it is often useful to modify the encoder-decoder pair
such that the encoder accepts the text in reverse order. As an example, assume that the
input text is,
I am going to read another chapter.
Then with the text in reverse order paradigm, this input is fed to the encoder as,
chapter another read to going am I.
The training process then uses reverse order inputs as above, yet outputs are expected in the
normal order. Clearly when the model is used operationally, the input text is also reversed.
The benefit of this approach is in keeping inputs and their respective outputs closer on
average. For example, assume that we are translating from English to French where the
output should be (the non-reversed French text),
je vais lire un autre chapitre.
Note that with this approach the (reversed) input phrase going am I is near the output phrase je vais (which means "I am going"), and similarly with other pairs. Whereas if the text was not reversed, then generally (assuming inputs and outputs of the same length) the distance between an input and the respective output is on the order of the length of the text.
Even when employing techniques such as reversal of the text, a key drawback of all of the
above encoder-decoder architectures is that performance degrades rapidly as the length of
the input sentence increases. This is because the encoded vector needs to capture the entire
input text, and in doing so, it might skip many important details. The attention mechanism
that we describe below, and its application in machine translation architectures, overcome
many of these difficulties.
⁸ Having the code available at the output is in a sense a residual connection, similar to the ResNets discussed in Section ??. It is particularly useful if a stacked architecture is used in the decoder.
7 Sequence Models
The Attention Mechanism
As we saw above, one of the key considerations in encoder-decoder architectures has to do
with the way in which the encoder output enters as input to the decoder. Towards this,
we now introduce a general paradigm called the attention mechanism. One may view the
attention mechanism as a method for “annotating” elements of the input or intermediate
sequences, which require more focus, or attention, than others. We first define attention
mathematically, and later we see how it can interplay within an encoder-decoder architecture.
The attention mechanism defined here is also central to transformer models, which are the focus of Section 7.5 below, and which underlie most contemporary large language models.
In general, an attention mechanism can be viewed as a transformation of a sequence of vectors to a sequence of vectors. The input sequence has $T_{\text{in}}$ vectors and the output sequence has $T_{\text{out}}$ vectors. We assume the vectors are $m$ dimensional and denote the input sequence via $v_1, v_2, \ldots, v_{T_{\text{in}}}$ and the output sequence via $u_1, u_2, \ldots, u_{T_{\text{out}}}$.
As an aid to the attention mechanism, we also have two sequences of vectors, which we call the proxy vectors. These are denoted via $z_q^1, \ldots, z_q^{T_{\text{out}}}$, and $z_k^1, \ldots, z_k^{T_{\text{in}}}$, where the dimension of each vector in the first sequence is $m_q$, and similarly $m_k$ for the second sequence. The notation using subscripts $q$ and $k$ stems from query and key respectively. These terms, query and key, are more common in the application of transformer models in the next section.
One of the components of the attention mechanism is a score function, also known as an alignment function, $s : \mathbb{R}^{m_q} \times \mathbb{R}^{m_k} \to \mathbb{R}$, which when applied to a pair of proxy vectors, $z_q$ and $z_k$, and denoted via $s(z_q, z_k)$, measures the similarity between the two proxy vectors. A typical simple score function, suitable when $m_q = m_k$, is the inner product. Yet other possibilities, also potentially with learned parameters, can be employed, and in some instances this component is known as an alignment model. It is typical to have normalization as part of the score function with a factor such as $\sqrt{\max(m_q, m_k)}$. This normalization keeps score values in a reasonable range for numerical stability.
At the heart of the attention mechanism, for any time $t = 1, \ldots, T_{\text{out}}$, we apply the score function on a fixed $z_q^t$ against all $z_k^1, \ldots, z_k^{T_{\text{in}}}$ and then using softmax we obtain a vector of attention weights, also known as alignment scores. This vector, denoted $\alpha^t$, is of length $T_{\text{in}}$, and is computed via
$$
\alpha^t = S_{\text{softmax}}\left( \begin{bmatrix} s(z_q^t, z_k^1) \\ s(z_q^t, z_k^2) \\ \vdots \\ s(z_q^t, z_k^{T_{\text{in}}}) \end{bmatrix} \right), \qquad \text{for } t = 1, \ldots, T_{\text{out}}. \tag{7.20}
$$
Here the vector to vector softmax function, $S_{\text{softmax}}(\cdot)$, is as defined in (??). The attention weights associated with time $t$ can also be written as $\alpha^t = (\alpha_1^t, \ldots, \alpha_{T_{\text{in}}}^t)$. Thus in general, for any $t \in \{1, \ldots, T_{\text{out}}\}$, the attention weight vector $\alpha^t$ captures similarity between the proxy vector $z_q^t$ and all of the proxy vectors $z_k^1, \ldots, z_k^{T_{\text{in}}}$.
With attention weights available, the attention mechanism operates on the input sequence $v_1, \ldots, v_{T_{\text{in}}}$. The mechanism produces an output sequence $u_1, \ldots, u_{T_{\text{out}}}$ where each
$u_t$ is computed via the linear combination,
$$
u_t = \sum_{\tau=1}^{T_{\text{in}}} \alpha_\tau^t v_\tau \quad \text{(Non-causal attention)} \qquad \text{or} \qquad u_t = \sum_{\tau=1}^{t} \alpha_\tau^t v_\tau. \quad \text{(Causal attention)} \tag{7.21}
$$
Note that in the causal form the output at time $t$, $u_t$, only depends on the inputs up to time $t$, while in the non-causal form $u_t$ depends on inputs at all times $\tau \in \{1, \ldots, T_{\text{in}}\}$. Also note that the causal form is only possible when $T_{\text{out}} \le T_{\text{in}}$ (this is the case in the next section where in particular we use $T_{\text{out}} = T_{\text{in}}$).
As we see below, this general mechanism is applied in various sequence model architectures where in each case, the proxy vector sequences $z_q^1, \ldots, z_q^{T_{\text{out}}}$ and $z_k^1, \ldots, z_k^{T_{\text{in}}}$, and the score function, can be defined differently. Note that the attention mechanism can also be employed for graph neural networks (GNN); see Section ??.
Encoder-Decoder with an Attention Mechanism
As a first application of the attention mechanism let us see an encoder-decoder framework.
This architecture is described in Figure 7.10 where an attention mechanism is used to
tie the encoder output with the decoder. A key attribute of the attention mechanism is
to provide more importance to some of the input words, in comparison to others, during
machine translation. There are two main ideas in this architecture. The first idea is to use
a bi-directional encoder, and importantly the second idea is to incorporate an attention
mechanism.
$$
u^{\langle t\rangle} = \text{Attention}\big[\, z_q^{\langle t\rangle} = s^{\langle t-1\rangle},\; z_k = (v^{\langle 1\rangle}, \ldots, v^{\langle T_{\text{in}}\rangle}),\; v = (v^{\langle 1\rangle}, \ldots, v^{\langle T_{\text{in}}\rangle}) \,\big]
$$
Figure 7.10: An encoder-decoder model with a bi-directional encoder and tying of the encoder
and the decoder via an attention mechanism. The output sequence of the encoder is used as input
to the attention mechanism. In the attention mechanism the previous decoder state is used as a
query and the encoder outputs are used as keys.
The encoder is constructed via a bidirectional recurrent neural network, similar to Figure 7.7 (a) and the first two recursions in (7.14). Specifically, for an input $x_1, \ldots, x_{T_{\text{in}}}$, we obtain the sequences of encoder hidden states $h_f^1, \ldots, h_f^{T_{\text{in}}}$ and $h_r^1, \ldots, h_r^{T_{\text{in}}}$, resulting from the forward directional and reverse directional recursions, respectively. Note that LSTM or GRU alternatives can be used in place of these forward and reverse recursions as well. Now the output of the encoder is taken as a concatenation of the forward and reverse directions. Specifically, we denote this encoder output via $v_1, \ldots, v_{T_{\text{in}}}$ where $v_t$ is a concatenation of $h_f^t$ and $h_r^t$. Note that the encoder output is a sequence of vectors of length $T_{\text{in}}$, namely the same length as the input sequence to the whole architecture.
The decoder is a variant of a recurrent neural network with hidden state $s_t$, following the recursion,
$$
\begin{cases}
s_t = f_{\text{decoder}}\big(s_{t-1},\, \hat{y}_{t-1},\, u_t\big) \\
\hat{y}_t = f_{\text{decoder-out}}\big(s_t\big),
\end{cases} \tag{7.22}
$$
where $u_t$ marks the input to the decoder and is computed via an attention mechanism as we describe below. The other two inputs are $s_{t-1}$, the hidden decoder state at the previous time step, and $\hat{y}_{t-1}$, the output of the decoder at the previous time step.
To see how the attention mechanism is used for computing $u_t$, observe our notation where the encoder output is $v_t$ and the decoder input at time $t$ is $u_t$. This notation agrees with the attention mechanism described above: in this architecture, an attention mechanism tying the encoder and the decoder converts $v_t$ to $u_t$. Specifically, non-causal attention as in (7.21) is used. For the attention computation, the proxy vectors determining the attention weights via (7.20) are set as $z_q^t = s_{t-1}$ and $z_k^t = v_t$.
Observe that when the next decoder output token, $\hat{y}_t$, is created using (7.22), it is based on the decoder state $s_t$, which in turn is based on the previous decoder state, on the previous decoder output, and importantly, on the attention output $u_t$. This attention output is a linear combination of all of the encoder outputs, weighted by the attention weights $\alpha^t$ of time $t$.
The strength of this attention based architecture is that any input token can receive attention during the construction of the output sequence, even if the construction is at a location in the sequence far away from the input token. The application of a bi-directional architecture for the encoder enables the model to capture earlier and later information, which helps to disambiguate the embedded input word. The application of an attention mechanism introduces a form of dynamic context vector between the encoder and decoder, in place of the static context vector $z$ used in the simpler architectures above. Namely, architectures as depicted in Figure 7.9 have a fixed $z$ which does not change during the operation of the decoder. The attention based approach replaces this fixed $z$ with the sequence $u_1, u_2, \ldots$, which itself depends on the decoder state (through the proxy $z_q^t = s_{t-1}$).
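As an illustration of how the decoder recursion (7.22) interleaves with the attention computation, here is a hedged sketch of a greedy decoding loop. The callables `f_decoder` and `f_decoder_out` stand in for the trained decoder networks (they are placeholders, not part of the text's notation), and an inner-product score is assumed.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decode_with_attention(V, f_decoder, f_decoder_out, s0, y0, T_out):
    """Greedy decoding per (7.22). V is the (T_in, d) encoder output,
    s0 the initial decoder state, y0 the <start> embedding."""
    s, y_prev, outputs = s0, y0, []
    for _ in range(T_out):
        # attention weights (7.20): query is the previous state, keys are rows of V
        alpha = softmax(V @ s)
        u = alpha @ V                  # dynamic context u_t as in (7.21)
        s = f_decoder(s, y_prev, u)    # state update, first line of (7.22)
        y_prev = f_decoder_out(s)      # next output, second line of (7.22)
        outputs.append(y_prev)
    return np.array(outputs)
```

In a real model one would also stop upon emitting a `<stop>` token rather than running for a fixed `T_out`.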
An Illustration of Attention Weights
Let us see parts of an illustrative toy example of English to French translation using the
encoder-decoder with attention architecture as in Figure 7.10. Assume the following:
Input: we love deep learning <stop>
Output: nous aimons l’ apprentissage en profondeur <stop>
Observe that in this case, we explicitly use the <stop> token, and assume that with our tokenization and word embedding setup, each word and the <stop> token is a single vector. Also note that l’ is considered a word. In this case we have $T_{\text{in}} = 5$ and, as resulting from the model, $T_{\text{out}} = 7$.
When the input is processed via the architecture of Figure 7.10, first the encoder creates the sequences $h_f^1, \ldots, h_f^{T_{\text{in}}}$ and $h_r^1, \ldots, h_r^{T_{\text{in}}}$. Then, a sequence where each element is a concatenation of $h_f^t$ and $h_r^t$ is fed into the attention mechanism. The attention and the decoder operation run together, where with each additional time step of the output, (7.22) is applied, and the attention computation of (7.21) takes place. Importantly, with each $t$, the attention weights $\alpha^t$ are computed via (7.20) where the proxy vector $z_q^t$ is taken as the previous decoder state, and the proxy sequence $z_k^1, \ldots, z_k^{T_{\text{in}}}$ is taken as the concatenated output of the encoder, summarizing all of the input text.
The attention matrix $[\alpha_\tau^t]$, with rows indexed by the decoder time $t$ and columns by the encoder time $\tau$, is:

                          We      love    deep    learning   <stop>
                          τ=1     τ=2     τ=3     τ=4        τ=5
    nous          t=1     0.93    0.02    0.01    0.02       0.02
    aimons        t=2     0.44    0.46    0.04    0.05       0.01
    l'            t=3     0.23    0.26    0.23    0.25       0.03
    apprentissage t=4     0.01    0.04    0.45    0.47       0.03
    en            t=5     0.01    0.02    0.39    0.55       0.03
    profondeur    t=6     0.02    0.01    0.67    0.28       0.02
    <stop>        t=7     0.01    0.01    0.01    0.01       0.96
Figure 7.11: An attention matrix where each row is the attention vector $\alpha^t$. When used in an encoder-decoder, we may consider the attention in a row associated with time $t$ as based on the previous output. So for example for time $t = 4$, the available output from the decoder so far is an encoding of nous aimons l’ and the attention vector $\alpha^4$ dictates which encoded English words to focus on (in this case deep at $\alpha_3^4$ and learning at $\alpha_4^4$).
Figure 7.11 illustrates possible values of the attention weights, as they are computed via the machine translation process. The input is in English and the output is in French. Specifically, each row of this matrix is a hypothetical vector of attention weights, $\alpha^t$, computed at time $t$ of the decoder; namely $t = 1, \ldots, T_{\text{out}}$. Observe that each row sums to 1 and entries with higher probabilities are emphasized. This sequence of attention weight vectors shows how attention weights can adjust according to context. For example at time $t = 4$ in the decoder, an encoding of the partial sequence nous aimons l’ is already available via $z_q = s_3$. For creation of the next word, apprentissage (which directly means “learning” in English), most of the attention is put on the encoder hidden states associated both with deep and with learning.
Variants of the Score Function
If we use the inner product score function, then we are constrained in that the dimension of the decoder hidden state, $s_t$, must be twice the dimension of each of the directional encoder hidden states (this is the dimension of $h_f^t$ and $h_r^t$). However, as already stated, other score functions are also possible. In each alternative, the score function operates on $z_q \in \mathbb{R}^{m_q}$ and $z_k \in \mathbb{R}^{m_k}$. These are common alternatives,
$$
s(z_q, z_k) = \begin{cases}
z_q^\top W_s\, z_k, & \text{(General)} \\[4pt]
w_s^\top \tanh\big(\widetilde{W}_s\, [z_q, z_k]\big), & \text{(Concatenation)} \\[4pt]
w_s^\top \tanh\big(W_{sa}\, z_q + W_{sb}\, z_k\big). & \text{(Additive)}
\end{cases}
$$
Each of these score function alternatives has parameters that are learned during training. In the first case the parameter matrix is $W_s \in \mathbb{R}^{m_q \times m_k}$. In the second case, the parameter vector $w_s$ is $\tilde{m}$ dimensional, and the parameter matrix is $\widetilde{W}_s \in \mathbb{R}^{\tilde{m} \times (m_q + m_k)}$. Note that in this case $[z_q, z_k]$ denotes a concatenation of the two vectors. In the third case, $W_{sa} \in \mathbb{R}^{\tilde{m} \times m_q}$ and $W_{sb} \in \mathbb{R}^{\tilde{m} \times m_k}$, and again $w_s$ is an $\tilde{m}$ dimensional vector. In each of these cases $\tanh$ is applied element wise.
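The three alternatives above can be sketched in NumPy as follows. The function names are ours, the shapes follow the notation of the text, and all parameters are assumed to be learned elsewhere.

```python
import numpy as np

def score_general(zq, zk, Ws):
    # s(zq, zk) = zq^T Ws zk, with Ws of shape (mq, mk)
    return zq @ Ws @ zk

def score_concat(zq, zk, ws, Ws_tilde):
    # s(zq, zk) = ws^T tanh(Ws_tilde [zq, zk]),
    # with Ws_tilde of shape (m_tilde, mq + mk) and ws of length m_tilde
    return ws @ np.tanh(Ws_tilde @ np.concatenate([zq, zk]))

def score_additive(zq, zk, ws, Wsa, Wsb):
    # s(zq, zk) = ws^T tanh(Wsa zq + Wsb zk),
    # with Wsa: (m_tilde, mq), Wsb: (m_tilde, mk)
    return ws @ np.tanh(Wsa @ zq + Wsb @ zk)
```

Each function returns a scalar score, which then enters the softmax of (7.20).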
Training Encoder-Decoder Models
Continuing with the application of machine translation we now consider various approaches
for training encoder-decoder models. Assume we are training models as in the architectures
of Figure 7.9 (a) and (b) as well as Figure 7.10. Our discussion here is also relevant for
training transformer encoder-decoder models presented in the next section (Figure 7.16).
As input data we have $n$ sequences in the source language (e.g., English), where each sequence, denoted via $x^{(i)}$, is already encoded into vectors $x^{i,1}, \ldots, x^{i,T_x^i}$, using some word embedding technique (e.g., word2vec or some more advanced variant). Similarly we have $n$ sequences in the target language (e.g., French), where in this case, each sequence $y^{(i)}$ is one-hot encoded into vectors $y^{i,1}, \ldots, y^{i,T_y^i}$, according to the lexicon of the target language. Clearly we assume that for any $i$, the pair of sequences $x^{(i)}$ and $y^{(i)}$ have the same semantic meaning.
For a given set of parameters, when feeding an input sequence $x^{(i)}$ into the encoder-decoder architecture, we are presented with an output sequence $\hat{y}^{(i)}$ where each element $\hat{y}^{i,t}$ is a probability vector over the lexicon of the target language. We then use cross-entropy loss,⁹ comparing $\hat{y}^{(i)}$ to $y^{(i)}$, summing over all elements of the sequence, similarly to (7.4) and (7.5); see also the discussion around Figure 7.5 in Section 7.2. One may also use mini-batches over multiple sequences, where for each mini-batch backpropagation (or backpropagation through time) is applied, and then a variant of gradient descent is used, similarly to any other deep learning model.
Teacher forcing, as discussed at the end of Section 7.2, is very commonly used when training encoder-decoder models. For example in the encoder-decoder with attention architecture of Figure 7.10, during training we replace $\hat{y}_{t-1}$ by $y_{t-1}$ in (7.22), and similarly for the other encoder-decoder models.
Note also, that once an encoder-decoder model is trained, we may sometimes fine tune either
the encoder, or decoder, by freezing the layers of one of the components while training the
other component. Also, it is common to freeze both the encoder and decoder, and only fine
tune an output layer on top of the decoder.
7.5 Transformers
We now introduce a family of models called transformers. The transformer architecture
was originally introduced to handle machine translation and has since found applications
in many other paradigms including large language models, but also non sequence data
⁹ A minor technical issue is that often the lengths of $\hat{y}^{(i)}$ and $y^{(i)}$ may differ. In such a case, the shorter sequence is padded with <empty> tokens and no loss is incurred at time $t$ if both the predicted and training sequences have an <empty> token at time $t$.
domains such as images. The approach we present here continues to focus on the machine
translation application, yet the reader should keep in mind that transformers have much
wider applicability.
Transformers mark a paradigm shift in dealing with sequence data, as the architecture is no
longer of a recurrent nature, but rather works using parallel computation. That is, while
recurrent neural networks, LSTMs, GRUs, and other variants discussed in earlier sections
may seem natural for sequence data, with transformers we return to fixed length input-output
schemes. Similarly to the feedforward networks of Chapter
??
or the convolutional networks
of Chapter
??
, transformers operate on inputs of a fixed length, and yield outputs of a fixed
length. Nevertheless, note that transformer decoders are used in an auto-regressive manner
with a variable number of iterations.
In the context of sequence data, when using transformers, sequences are converted to a fixed length by padding with representations of <empty> tokens at the end of the sequence, when needed. Similarly, if the input or output exceeds the fixed dimensions, mechanisms external to the transformer are used to raise an error, break up the computation, truncate the input, or carry out similar workarounds.
As a simple illustration, return to the English to French translation example from the
previous section and as a toy example, assume that the transformer input and output
dimensions are both 9. In this case, we can expect the padded input and output to be,
Input: we love deep learning <stop> <empty> <empty> <empty> <empty>
Output: nous aimons l’ apprentissage en profondeur <stop> <empty> <empty>
While the abandonment of variable length inputs and outputs may seem like a step back from recurrent neural networks, transformers have shown great benefits in performance. In addition to yielding state of the art performance on many language benchmarks, these architectures enable parallel computation which is not possible with recurrent neural networks.
The transformer architecture that we introduce here inherits the encoder-decoder pattern
used in the previous section, and is well suited for machine translation. Note however that
for other tasks, one may sometimes only use the encoder part of the transformer, the decoder
part, or slight variants of the architecture that we present here. The key mechanism used in
transformers is attention, with various forms of the attention mechanism interconnected in a
novel way.
In our overview of transformers, we first describe the notion of self attention. We then
describe multi-head self attention, often called multi-head attention in short. We then
describe positional embeddings and then move onto introducing the transformer block which
is the basic building block of the transformer architecture both for the encoder and the
decoder. We close the section with an outline of the transformer encoder-decoder architecture
followed by a discussion of how transformers are used in production and training.
Self Attention
We have already seen the general attention mechanism in equation (7.21), which transforms a sequence $v_1, \ldots, v_{T_{\text{in}}}$ to an output sequence $u_1, \ldots, u_{T_{\text{out}}}$, using linear combinations with attention weights $\alpha_\tau^t$. The attention weights are defined in equation (7.20) and are
based on the score function applied to the proxy vectors, denoted $z_q^1, \ldots, z_q^{T_{\text{out}}}$, and $z_k^1, \ldots, z_k^{T_{\text{in}}}$.
In the context of the transformer architecture, we refer to the proxy vectors $z_q^t$ as queries, we refer to the proxy vectors $z_k^t$ as keys, and we refer to the input vectors $v_t$ as values. This terminology is rooted in information retrieval systems and captures the fact that when we compute an attention vector $\alpha^t$, we are “searching” via (7.20) for a query represented via $z_q^t$ against all keys $z_k^1, \ldots, z_k^{T_{\text{in}}}$. Then the attention weights are used to combine the “search results” via (7.21).
The mechanism of self attention, illustrated in Figure 7.12, is a form of attention where we convert an input sequence $x_1, \ldots, x_{T_{\text{in}}}$ to an output $u_1, \ldots, u_{T_{\text{out}}}$, with $T = T_{\text{in}} = T_{\text{out}}$. In the simplest form (ignore blue in Figure 7.12) we set the queries, the keys, and the values directly as elements of the input. Namely,
$$
z_q^t = x_t, \qquad z_k^t = x_t, \qquad v_t = x_t, \qquad \text{(Simple self attention)}
$$
and this implies that in the causal form (ignore red in Figure 7.12), with score function $s(\cdot, \cdot)$, the self attention mechanism yields output for any time $t \in \{1, \ldots, T\}$, via,
$$
u_t = \sum_{\tau \le t} \alpha_\tau^t x_\tau, \quad \text{with} \quad \alpha_\tau^t = \frac{e^{s(x_t,\, x_\tau)}}{\sum_{t'=1}^{t} e^{s(x_t,\, x_{t'})}}. \tag{7.23}
$$
Figure 7.12: The flow of information in self attention. Ignoring the blue and the red, this is causal simple self attention where the output at time $t$, $u_t$, is a linear combination of $x_1, \ldots, x_t$ with attention weights $\alpha_1^t, \ldots, \alpha_t^t$. Considering the blue, this is more versatile self attention where each attention weight $\alpha_\tau^t$ is computed using weighting matrices $W_k$ and $W_q$ applied to the input, and where the linear combination is of inputs weighted by $W_v$. Considering also the red, it is non-causal.
A more versatile form of self attention, this time involving learned parameters, has queries, keys, and values that are not directly taken as the input, but are rather linear transformations of the input. Namely,
$$
z_q^t = W_q x_t, \qquad z_k^t = W_k x_t, \qquad v_t = W_v x_t, \qquad \text{(More versatile self attention)}
$$
where the learnable parameter matrices $W_q$, $W_k$, and $W_v$ are each $p \times p$ dimensional,¹⁰ with $p$ the dimension of each $x_t$. Hence in this case, the attention mechanism has output for any time $t \in \{1, \ldots, T\}$, via,
$$
u_t = \sum_{\tau \le t} \alpha_\tau^t W_v x_\tau, \quad \text{with} \quad \alpha_\tau^t = \frac{e^{s(W_q x_t,\, W_k x_\tau)}}{\sum_{t'=1}^{t} e^{s(W_q x_t,\, W_k x_{t'})}}. \tag{7.24}
$$
Observe that in (7.23) and (7.24) we use the causal form from (7.21). An alternative non-causal form is also applicable (consider also the red part of Figure 7.12). In such a case, the summations are not over $\tau \le t$ but rather over $\tau \in \{1, \ldots, T\}$, similarly to the non-causal form appearing in (7.21).
¹⁰ Note that here we use the same dimension for the input, the output, and the proxy vectors. More generally one may set different dimensions for these entities, as we do in the case of multi-head attention below.
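A minimal NumPy sketch of causal self attention with learned matrices, as in (7.24), assuming a scaled inner-product score; the function name is ours, for illustration only.

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """X: (T, p) with rows x_t; Wq, Wk, Wv: (p, p). Returns U: (T, p), per (7.24)."""
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T  # queries, keys, values
    T, p = X.shape
    U = np.zeros((T, p))
    for t in range(T):
        # scaled inner-product scores s(W_q x_t, W_k x_tau) for tau <= t only
        scores = K[: t + 1] @ Q[t] / np.sqrt(p)
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()              # attention weights alpha^t_tau
        U[t] = alpha @ V[: t + 1]         # causal linear combination of values
    return U
```

With $W_q = W_k = W_v = I$ this reduces to the simple self attention of (7.23).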
Multi-Head Self Attention
A generalization of self attention is to use multiple self attention mechanisms in parallel and
then combine the outputs of these mechanisms. With this parallelism, we can treat each
individual attention mechanism as searching for a different set of features in the input, and
then have information content of the output as a combination of the derived features.
Figure 7.13: Multi-head self attention is the parallel application of $H$ attention heads, where head $h$ has parameter matrices $W_q^h$, $W_k^h$, and $W_v^h$. Each head $h$ operates on the full input, $x_1, \ldots, x_T$, and results in output for the head, $u^{h,1}, \ldots, u^{h,T}$. When determining the output of the whole multi-head self attention mechanism, each output at time $t$, denoted $u_t$, combines $u^{1,t}, \ldots, u^{H,t}$ weighted by $W_c^1, \ldots, W_c^H$.
Figure 7.13 illustrates multi-head self attention. Specifically, assume we have $H$ self attention mechanisms, where mechanism $h \in \{1, \ldots, H\}$ has its own set of parameter matrices $W_q^h$, $W_k^h$, and $W_v^h$. Here $W_q^h \in \mathbb{R}^{m \times p}$, $W_k^h \in \mathbb{R}^{m \times p}$, and $W_v^h \in \mathbb{R}^{m_v \times p}$, where $m$ is the dimension of the query and the key ($m = m_q = m_k$), $p$ is the dimension of the input $x_t$ as previously, and $m_v$ is the dimension of the value (and the output of the individual attention head).
At first, each attention head $h$ operates independently, similarly to (7.24), yielding an output sequence $u^{h,1}, \ldots, u^{h,T}$ of $m_v$ dimensional vectors. This operation is via,
$$
u^{h,t} = \sum_{\tau=1}^{T} \alpha_\tau^{h,t} W_v^h x_\tau, \quad \text{with} \quad \alpha_\tau^{h,t} = \frac{e^{s(W_q^h x_t,\, W_k^h x_\tau)}}{\sum_{t'=1}^{T} e^{s(W_q^h x_t,\, W_k^h x_{t'})}}. \tag{7.25}
$$
Note that in
(7.25)
we use a non-causal form of attention, in contrast to the causal form
appearing in
(7.24)
. Below we comment on how one may practically convert such a non-causal
form to a causal form via a mechanism called masked self attention.
Now for any time index $t$, we have $H$ vectors $u^{1,t}, \ldots, u^{H,t}$ which we use to produce the output of the multi-head attention for the specific time index $t$. For this we apply an additional linear transformation to each vector, converting it from dimension $m_v$ back to dimension $p$. We then sum over $h = 1, \ldots, H$. Thus, the output of the multi-head self
attention mechanism is
$$
u_t = \sum_{h=1}^{H} W_c^h\, u^{h,t}, \tag{7.26}
$$
where for each $h$, the matrix $W_c^h$ is $p \times m_v$ dimensional. Each $W_c^h$ captures the transformation from the single attention output $u^{h,t}$ of dimension $m_v$ to dimension $p$. The combination of these $H$ matrices can also capture weightings between the attention heads.
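The head-by-head computation of (7.25) and the combination (7.26) can be sketched as follows, assuming a scaled inner-product score. For clarity we loop over the heads rather than batching the matrix operations; the function name is ours.

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wc):
    """Non-causal multi-head self attention, per (7.25)-(7.26).
    X: (T, p); Wq, Wk: (H, m, p); Wv: (H, mv, p); Wc: (H, p, mv)."""
    T, p = X.shape
    H, m, _ = Wq.shape
    U = np.zeros((T, p))
    for h in range(H):
        Q, K, V = X @ Wq[h].T, X @ Wk[h].T, X @ Wv[h].T  # (T,m), (T,m), (T,mv)
        scores = Q @ K.T / np.sqrt(m)                    # (T, T) score matrix
        alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
        alpha /= alpha.sum(axis=1, keepdims=True)        # rows are alpha^{h,t}
        U += (alpha @ V) @ Wc[h].T                       # head output mapped back to p, (7.26)
    return U
```

Each head's `alpha @ V` realizes (7.25); the per-head projection with `Wc[h]` and the accumulation over `h` realize (7.26).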
Multi-head self attention plays a central role in transformer blocks, both in the encoder and the decoder. In certain cases, such as the encoder, we use the non-causal form, as in (7.25). However, in other cases, such as the decoder, we use a causal form. Practically we may enforce a causal form via masked self attention, also known as masking for short. With this approach, for a computation of attention weights at time $t$, we set the scores associated with (key) proxy vectors of times after $t$ to negative infinity. That is,
$$
s(z_q^t, z_k^{t+1}) = -\infty, \quad s(z_q^t, z_k^{t+2}) = -\infty, \quad \ldots, \quad s(z_q^t, z_k^{T}) = -\infty. \quad \text{(Masking)} \tag{7.27}
$$
Then with this masking, through the softmax computation, the $-\infty$ values yield zeros for each of $\alpha_{t+1}^t, \alpha_{t+2}^t, \ldots, \alpha_T^t$, and thus, when computing the output at time $t$, no attention is given to future times.
Implementing causality via masking is especially important when one considers a matrix
representation of the multi-head self attention mechanism. Specifically, one may treat the
input as a matrix
X
of dimension
p × T
and then represent all of the attention operations as
matrix on matrix operations. While we omit the details of such a representation, the reader
should keep in mind that unlike recurrent neural networks where time
t
implies a step in
the computation, for transformers time
t
is simply a dimension of the input matrix
X
and
operations can be parallelized based on the equations above.
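In the matrix representation, the masking of (7.27) amounts to overwriting the strictly upper triangular part of the $T \times T$ score matrix with $-\infty$ before a row-wise softmax. A minimal sketch, with function names of our own choosing:

```python
import numpy as np

def masked_scores(scores):
    """Apply the causal mask of (7.27). scores is a (T, T) matrix with
    entry (t, tau) = s(z_q^t, z_k^tau); entries with tau > t are set to -inf."""
    T = scores.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True strictly above the diagonal
    out = scores.copy()
    out[mask] = -np.inf
    return out

def row_softmax(scores):
    # softmax over each row; -inf entries receive exactly zero weight
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Applying `row_softmax(masked_scores(S))` to a score matrix `S` yields attention weight rows $\alpha^t$ whose entries for $\tau > t$ are zero, exactly the causal form.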
Positional Embeddings
Sequence models have a natural time index, $t$, where $x_t$ followed by $x_{t+1}$ embodies some relationship between the two vectors in the sequence. As a simple example consider some input text, deep learning, and the reversed text, learning deep. These two short sequences have different semantic meanings, since the order matters. However, as is evident from the multi-head self attention mechanism (7.25), the order in the input sequence $x_1, \ldots, x_T$ is not captured by the mechanism at all. This stands in stark contrast to previous sequence models such as recurrent neural networks and their generalizations, where the recurrent nature of the model makes use of sequence order.
Hence, for using non-causal multi-head self attention effectively, we require a mechanism for encoding the order of the sequence in the input data. Such mechanisms are generally called positional embeddings. A basic and primitive form of positional embedding is to extend each input vector $x_t$ with an additional one-hot encoded vector that captures its position in the sequence. For example for a sequence of length $T = 4$, we extend $x_1$ with $e_1 = (1, 0, 0, 0)$, extend $x_2$ with $e_2 = (0, 1, 0, 0)$, and so forth. Then the input sequence is no longer taken as having vectors of length $p$, but rather as having vectors of length $\tilde{p} = p + T$.
To further illustrate the point using the toy one-hot encoding positional embedding example with $p = 2$ and $T = 4$, assume the first vector is $x_1 = (0.2, 1.3)$ and assume that as a
matter of coincidence the last vector $x_4$ has the same values. Then after applying such positional embedding, the vectors are transformed to
$$
\tilde{x}_1 = (0.2,\, 1.3,\, \underbrace{1, 0, 0, 0}_{e_1}) \quad \text{and} \quad \tilde{x}_4 = (0.2,\, 1.3,\, \underbrace{0, 0, 0, 1}_{e_4}),
$$
and then when these positionally embedded vectors are processed by non-causal multi-head self attention, the model can distinguish between $\tilde{x}_1$ and $\tilde{x}_4$. Even in the (more common) case where different vectors do not have repeated values, the order in the sequence is still encoded and this enhances performance.
However, this type of encoding is clearly wasteful and inefficient, yielding an excessively large $\tilde{p}$. A more advanced form is to encode the vectors and the embeddings jointly. For example, one might use a transformation such as
$$
\tilde{x}_t = W_1\, S\big(W_2 x_t + W_3 e_t + b\big) \in \mathbb{R}^{\tilde{p}}, \tag{7.28}
$$
where $W_1 \in \mathbb{R}^{\tilde{p} \times \tilde{p}}$, $W_2 \in \mathbb{R}^{\tilde{p} \times p}$, and $W_3 \in \mathbb{R}^{\tilde{p} \times T}$ are learnable weight matrices, $b \in \mathbb{R}^{\tilde{p}}$ is a learnable bias vector, and $S(\cdot)$ is some vector activation function, such as for example ReLU applied element wise. In this case, we may just use $\tilde{p} = p$, yet larger $\tilde{p}$ are also possible. With this type of encoding, after training the parameters, positional embeddings are ideally encoded within the word vectors. This is similar to how vector representations encode words from a dictionary into a lower dimensional space when using word embeddings.
With the introduction of transformers, a different type of positional encoding, motivated by Fourier analysis, was popularized. With this approach we set some $\tilde{p} > p$ and denote $p_e = \tilde{p} - p$. That is, $p_e$ is the increase of dimension from the original encoding in $\mathbb{R}^p$ to the new encoding in $\mathbb{R}^{\tilde{p}}$. Assume also that $p_e$ is even. Unlike the trained positional encoding in (7.28), here we just use sines and cosines without any trainable parameters. Specifically for $i \in \{0, \ldots, \frac{p_e}{2} - 1\}$ and $t \in \{1, \ldots, T\}$ we set,
$$
r_{2i}^t = \sin\!\Big(\frac{t}{M^{2i/p_e}}\Big), \qquad r_{2i+1}^t = \cos\!\Big(\frac{t}{M^{2i/p_e}}\Big), \tag{7.29}
$$
where a common value for $M$ is $10{,}000$. The vector with positional embedding is then,
$$
\tilde{x}_t = \big(x_1^t, \ldots, x_p^t,\, \underbrace{\overbrace{r_0^t}^{\sin},\, \overbrace{r_1^t}^{\cos},\, \overbrace{r_2^t}^{\sin},\, \ldots,\, \overbrace{r_{p_e-1}^t}^{\cos}}_{\text{positional embedding}}\big).
$$
Figure 7.14: The sine and cosine based positional embedding as in (7.29). (a) A heat map of $r_j^t$ with $p_e = 768$ and $T = 4000$, plotted for $t$ only over the first 60 time indexes and for $j$ only over the first 200 positions in the embedding vector. Positive values are yellow and negative values are blue. (b) A comparison of the positional embeddings at the time indexes $t = 50$ and $t = 51$ via a plot of the difference. As is evident, there is a significant difference within around the first hundred vector positions.
One may then also reduce the dimension back from $\tilde{p}$ to a lower dimension in a similar vein to (7.28). Specifically, one common simplistic approach which has empirically worked well is,
$$
\tilde{x}_t = \big(x_1^t + r_0^t,\; x_2^t + r_1^t,\; \ldots,\; x_p^t + r_{p-1}^t\big) \in \mathbb{R}^p, \tag{7.30}
$$
where here $p_e = p$.
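A direct NumPy transcription of (7.29), with $p_e = p$ so that the embedding can be added to the input as in (7.30); we assume $p$ is even, and the function name is ours.

```python
import numpy as np

def positional_embedding(T, p, M=10_000):
    """Sinusoidal embedding of (7.29) with p_e = p (assumed even).
    Returns R of shape (T, p) with R[t-1] the embedding for time t."""
    R = np.zeros((T, p))
    for t in range(1, T + 1):
        for i in range(p // 2):
            angle = t / M ** (2 * i / p)
            R[t - 1, 2 * i] = np.sin(angle)      # r^t_{2i}
            R[t - 1, 2 * i + 1] = np.cos(angle)  # r^t_{2i+1}
    return R

# Adding the embedding to an input sequence X of shape (T, p), per (7.30):
# X_tilde = X + positional_embedding(T, p)
```

Since all entries lie in $[-1, 1]$, adding the embedding perturbs the inputs only mildly, which is part of the appeal of this scheme.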
The benefit of using positional embedding such as (7.30) first arises from the fact that
sines and cosines are bounded within $[-1, 1]$, and is further aided by the idea of a Fourier
representation of the position. In particular, consider Figure 7.14 where (a) illustrates
embedding values from (7.29). By adding a vertical slice (for fixed $t$) of the embedding values
(7.29) to the original $x^t$ vector, we enable the model to distinguish the time value from
typical other values. This is particularly evident in (b), where we compare two
neighbouring time steps by plotting their difference.
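As an illustrative sketch, the sinusoidal encoding (7.29) can be computed with a few lines of NumPy; the function name is ours, and the default $M = 10{,}000$ follows the text above.

```python
import numpy as np

def sinusoidal_encoding(T, p_e, M=10_000.0):
    """Sine/cosine positional encoding as in (7.29).

    Returns an array R of shape (T, p_e) where R[t-1, 2i] = sin(t / M^(2i/p_e))
    and R[t-1, 2i+1] = cos(t / M^(2i/p_e)).
    """
    assert p_e % 2 == 0, "p_e is assumed even"
    t = np.arange(1, T + 1)[:, None]    # time indexes 1, ..., T
    i = np.arange(p_e // 2)[None, :]    # 0, ..., p_e/2 - 1
    angles = t / M ** (2 * i / p_e)     # shape (T, p_e/2)
    R = np.empty((T, p_e))
    R[:, 0::2] = np.sin(angles)         # even embedding positions: sines
    R[:, 1::2] = np.cos(angles)         # odd embedding positions: cosines
    return R

R = sinusoidal_encoding(T=4000, p_e=768)
# Values are bounded in [-1, 1], and neighbouring time steps differ mainly
# in the first coordinates, as in Figure 7.14.
```

Row $t-1$ of `R` holds the vector $(r^t_0, \ldots, r^t_{p_e-1})$ used in (7.30).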
The Transformer Block
The basic building block of the transformer architecture is a unit called a transformer block
which is used multiple times within a transformer, interconnected in series, both in the
encoder and the decoder. There are several variations of transformer blocks and here we
focus on the basic block used in the encoder. Later when we describe the decoder architecture
we highlight differences.
Figure 7.15: Architecture of transformer blocks. (a) A single transformer block. The input
$a_{\text{in}}$ passes through a multi-head self attention layer, followed by a feedforward layer. Layer
normalization and residual connections are applied in these steps as well. (b) A transformer
decoder block. In addition to the components of a transformer block, it also has a multi-head
cross attention layer that is fed the context vector $z$.
We denote the input of a transformer block as $a_{\text{in}}$ and the output as $a_{\text{out}}$, where both
$a_{\text{in}}$ and $a_{\text{out}}$ are $p \times T$ matrices. Thus in general, we can view the block as a function
$f_\theta : \mathbb{R}^{p \times T} \to \mathbb{R}^{p \times T}$, where $\theta$ represents trained parameters of the block and $a_{\text{out}} = f_\theta(a_{\text{in}})$.
For example, the encoder of the transformer architecture has a first transformer block that
operates on input,
$$a_{\text{in}} = \big[\tilde{x}^1, \ldots, \tilde{x}^T\big],$$
where each column is a positional encoding vector as in (7.30). Then the output of this
block, $a_{\text{out}}$, is fed into a second transformer block in the encoder, and so forth. Common
architectures have an encoder composed of a sequence of half a dozen or more transformer
blocks. Thus, for example, the input of the second transformer block has $a_{\text{in}}$ set as the $a_{\text{out}}$
of the first block, etc.
A transformer block has several internal layers with the two main layers being a multi-head
self attention layer, and downstream to it, a feedforward layer. Each of these also utilizes a
residual connection and a normalization layer. A schematic of a typical transformer block
is in Figure 7.15 (a). The main idea of the transformer block is to enhance multi-head self
attention with further connections using the feedforward layer. The residual connections
enhance training performance, similar to ResNets discussed in Section
??
. Normalization, in
the form of layer normalization, stabilizes training and production performance by ensuring
that values remain in a sensible dynamic range. We now present the details of a block.
Let us denote the columns of the input $a_{\text{in}}$ as $a^1_{\text{in}}, \ldots, a^T_{\text{in}}$. The first step of the block with
multi-head self attention is to employ (7.25) for every head $h = 1, \ldots, H$, and then (7.26),
where in these equations $x^t$ is replaced by $a^t_{\text{in}}$. The output of this multi-head self attention
layer is then denoted $u^1_{[1]}, \ldots, u^T_{[1]}$, where we use the subscript $[1]$ to indicate it is an output
of the first layer.
We then apply residual connections and layer normalization to each of $u^1_{[1]}, \ldots, u^T_{[1]}$, yielding
$u^1_{[2]}, \ldots, u^T_{[2]}$. This step can be summarized as,
$$u^t_{[2]} = \text{LayerNorm}\Big(\underbrace{a^t_{\text{in}} + u^t_{[1]}}_{\text{Residual connection}} \,;\, \gamma, \beta\Big). \qquad (7.31)$$
Here, the LayerNorm(·) operator is defined for $z \in \mathbb{R}^p$ with parameters $\gamma, \beta \in \mathbb{R}^p$, via,
$$\text{LayerNorm}(z \,;\, \gamma, \beta) = \gamma \odot \frac{z - \mu_z}{\sqrt{\sigma^2_z + \varepsilon}} + \beta,$$
where
$$\mu_z = \frac{1}{p}\sum_{i=1}^{p} z_i, \qquad \text{and} \qquad \sigma_z = \sqrt{\frac{1}{p}\sum_{i=1}^{p} (z_i - \mu_z)^2},$$
$\varepsilon > 0$ is a small fixed quantity that ensures that we do not divide by zero, and the addition,
division, and square root operations are all element-wise operations.
Layer normalization is somewhat similar to batch normalization, outlined in Section
??
, and
group normalization outlined in Section
??
. A major difference between layer normalization
and batch normalization is that for batch normalization we obtain the mean and standard
deviation per feature over a mini batch, whereas with layer normalization we use a single
sample, yet compute statistics over all features (of a single feature vector). In both cases,
the normalization forces feature values to remain at normalized values, further aided by the
learnable parameter vectors γ and β.
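As a minimal NumPy sketch of the LayerNorm operator defined above, operating on a single feature vector exactly as in the text:

```python
import numpy as np

def layer_norm(z, gamma, beta, eps=1e-5):
    """LayerNorm(z; gamma, beta): normalize a single vector z over its p
    features, then rescale with gamma and shift with beta (element-wise)."""
    mu = z.mean()                        # mean over the p features
    sigma2 = ((z - mu) ** 2).mean()      # variance over the p features
    return gamma * (z - mu) / np.sqrt(sigma2 + eps) + beta

p = 4
z = np.array([1.0, 2.0, 3.0, 4.0])
out = layer_norm(z, gamma=np.ones(p), beta=np.zeros(p))
# With gamma = 1 and beta = 0, the output has (approximately) zero mean
# and unit standard deviation over the p features.
```

Note that, in contrast to batch normalization, no statistics across the mini-batch are involved: each vector is normalized on its own.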
The next step in the transformer block is the application of a fully connected neural network
on each $u^t_{[2]}$ to yield $u^t_{[3]}$. Note that the same learnable network parameters are used for each
$t \in \{1, \ldots, T\}$. Commonly this network has a single hidden layer with non-linear activation,
followed by a layer with linear (identity) activation,$^{11}$ yet there are other possibilities as
well. Sticking with the commonly used architecture we have,
$$u^t_{[3]} = W_{[2]}\, S\big(W_{[1]} u^t_{[2]} + b_{[1]}\big) + b_{[2]},$$
where $S(\cdot)$ is commonly an element-wise application of ReLU. Here we denote the dimension
of the inner layer as $N_1$ and the learnable parameters are $b_{[1]} \in \mathbb{R}^{N_1}$, $b_{[2]} \in \mathbb{R}^p$, $W_{[1]} \in \mathbb{R}^{N_1 \times p}$,
and $W_{[2]} \in \mathbb{R}^{p \times N_1}$.
Finally, to yield the output of the transformer block, we apply residual connections and
layer normalization in the same manner as (7.31). Specifically we use,
$$a^t_{\text{out}} = \text{LayerNorm}\big(u^t_{[2]} + u^t_{[3]} \,;\, \tilde{\gamma}, \tilde{\beta}\big),$$
where here $\tilde{\gamma}, \tilde{\beta} \in \mathbb{R}^p$ are trainable parameters for this layer normalization.
Note that when considered as a variation of a neural network, one may view a transformer
block as “wide and shallow”. Even though such a single transformer block is not deep, the
residual connections provide direct access to the previous levels of abstraction and enable
the levels above to infer more fine grained features without having to remember or store
previous ones.
Let us summarize the learned parameters of a single transformer block with dimensions $p$
for the vector length, $T$ for the sequence length, $H$ for the number of self attention heads,
$m_v$ for the value dimension inside each self attention block, and $m$ for the dimension of the query
and key inside each self attention block. In this case, the total number of parameters is,
$$\underbrace{4p}_{\gamma,\,\beta,\,\tilde{\gamma},\,\tilde{\beta}} \;+\; \underbrace{2 N_1 p}_{W_{[1]},\,W_{[2]}} \;+\; \underbrace{N_1 + p}_{b_{[1]},\,b_{[2]}} \;+\; H \times \big(\underbrace{2 m p}_{W^h_k \text{ and } W^h_q} + \underbrace{2 m_v p}_{W^h_v \text{ and } W^h_c}\big). \qquad (7.32)$$
As a quantitative example, agreeing with the first transformer architecture introduced in
2017,$^{12}$ let us consider a case with $p = 512$, $N_1 = 2048$, $m = m_v = 64$, and $H = 8$. In
this case there are just over 3 million learnable parameters for such a transformer block.
Specifically, (7.32) evaluates to $3{,}150{,}336$. As we see now, multiple transformer blocks are
typically connected, yielding architectures with many millions of parameters.
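The count (7.32) is easy to verify programmatically; the following short sketch (the function name is ours) reproduces the quantitative example.

```python
def transformer_block_params(p, N1, m, m_v, H):
    """Parameter count of a single (encoder) transformer block, as in (7.32)."""
    layer_norms = 4 * p                             # gamma, beta and their tilde versions
    ffn_weights = 2 * N1 * p                        # W_[1] and W_[2]
    ffn_biases = N1 + p                             # b_[1] and b_[2]
    self_attention = H * (2 * m * p + 2 * m_v * p)  # W_k, W_q, W_v, W_c per head
    return layer_norms + ffn_weights + ffn_biases + self_attention

print(transformer_block_params(p=512, N1=2048, m=64, m_v=64, H=8))  # 3150336
```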
Putting the Bits Together Into an Encoder-Decoder Framework
Now that we have seen the design of the transformer block, we are ready to interconnect
such blocks in an encoder, and also interconnect variations of this block in a decoder.
We now describe a basic transformer encoder-decoder architecture using such blocks and
interconnections. It is useful to recall the more classical encoder-decoder architectures as
appearing in (a) and (b) of Figure 7.9. A transformer architecture is somewhat similar to
the architecture in (b), since the output of the encoder is fed into each of the decoder steps.
$^{11}$ Incidentally, this two-layer architecture with an activation function only in the single hidden layer, is the same network used in Theorem ?? of Chapter ??.
$^{12}$ See the “Attention is all you need” paper [51], as well as other notes and references at the end of the chapter.
However, unlike the architectures in Figure 7.9, transformers do not process one word at a
time. A schematic of a transformer encoder-decoder architecture is in Figure 7.16.
Figure 7.16: An encoder-decoder transformer architecture. The encoder is composed of multiple
transformer blocks, and the decoder is composed of multiple transformer decoder blocks. Each block
in the decoder is fed the code $z$. The loop from the output of the decoder going back into the input
of the decoder illustrates the auto-regressive application of transformer decoders.
Transformer encoders simply stack the transformer blocks in series, where each block has
exactly the specifications described above as in Figure 7.15 (a). The first block is fed with
the positional encoded input, and the output of that block goes into the second block, and so
forth. The output of the encoder is $a^1_{\text{out}}, \ldots, a^T_{\text{out}}$, resulting from the last transformer block.
We also denote this output via $z$ as it describes a code, and thus for each position $t$ we
denote the encoder output via $z^t$.
Transformer decoders use a variation of the transformer block which we call the transformer
decoder block, illustrated in Figure 7.15 (b). This block architecture differs from the
transformer block specified above in two ways. First, the multi-head self attention layer is
causal. This is implemented via masking as in (7.27). Such causality prevents attendance
of future positions, as suitable for auto-regressive prediction. Second, an additional layer,
called a cross attention layer, is used between the causal multi-head self attention and the
feedforward layer. The new cross attention layer is handled with layer normalization and
residual connections, similar to the other two layers.
In addition to the flow of information within the transformer decoder block, the cross
attention layer is fed with the encoder output $z$. This is similar to the encoder-decoder
architecture in Figure 7.9 (b), and it allows the transformer decoder block to directly
incorporate the encoder's code. Transformer decoders are constructed by stacking several
transformer decoder blocks in sequence, similarly to the stacking in the encoder, where
each block gets the same $z$. Similar to the encoder, the first block operates on positionally
embedded inputs. Further discussion of what these inputs are is in the following subsection.
On top of the final transformer decoder block (not illustrated in Figure 7.16), we add an
additional layer transforming each output vector to a token. This is similar to the outputs
of other encoder-decoder architectures. Often it is simply a linear layer with a softmax (i.e.,
a multinomial regression as in Section ??).
The cross attention layer inside each transformer decoder block is in fact a multi-head cross
attention layer and follows equations similar to (7.25) and (7.26), with the difference being
that the key and value inputs are the encoder output $z$. Specifically, if we denote the input
to the multi-head cross attention layer from earlier layers as $\tilde{u}^1, \ldots, \tilde{u}^T$, then the self
attention equation (7.25) is now modified to have cross attention (between $z$ and the input
to the layer $\tilde{u}$). This attention computation for head $h$ is then,
$$u^{h,t} = \sum_{\tau=1}^{T} \alpha^{h,t}_\tau\, W^h_v z^\tau, \qquad \text{with} \qquad \alpha^{h,t}_\tau = \frac{e^{s(W^h_q \tilde{u}^t,\, W^h_k z^\tau)}}{\sum_{\tau'=1}^{T} e^{s(W^h_q \tilde{u}^t,\, W^h_k z^{\tau'})}}, \qquad (7.33)$$
followed by a combination of the heads using (7.26).
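A single head of (7.33) can be sketched in NumPy as follows; we assume here that the score function $s(\cdot,\cdot)$ of (7.25) is the scaled dot product $s(q, k) = q^\top k / \sqrt{m}$, and the function and variable names are ours.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())          # shift for numerical stability
    return e / e.sum()

def cross_attention_head(U, Z, Wq, Wk, Wv):
    """One head of cross attention as in (7.33): queries come from the
    decoder-side activations U (columns u~^1, ..., u~^T); keys and values
    come from the encoder output Z (columns z^1, ..., z^T)."""
    m = Wq.shape[0]
    out = np.empty((Wv.shape[0], U.shape[1]))
    for t in range(U.shape[1]):
        q = Wq @ U[:, t]             # query from the decoder-side input
        scores = np.array([q @ (Wk @ Z[:, tau]) for tau in range(Z.shape[1])])
        alpha = softmax(scores / np.sqrt(m))     # weights over encoder positions
        out[:, t] = sum(alpha[tau] * (Wv @ Z[:, tau]) for tau in range(Z.shape[1]))
    return out

rng = np.random.default_rng(7)
p, T, m, m_v = 6, 4, 3, 5
U = rng.normal(size=(p, T))          # decoder-side inputs (queries)
Z = rng.normal(size=(p, T))          # encoder output (keys and values)
Wq, Wk = rng.normal(size=(m, p)), rng.normal(size=(m, p))
Wv = rng.normal(size=(m_v, p))
out = cross_attention_head(U, Z, Wq, Wk, Wv)   # shape (m_v, T)
```

A sanity check of the weighting: if all columns of `Z` are identical, then the attention weights average identical values and each output column equals $W_v z$.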
Observe that this pattern of cross attention is similar to the earlier encoder-decoder with
an attention mechanism architecture presented in Figure 7.10. In the earlier architecture,
the proxy vector $z^t_q$ used the previous decoder state, somewhat similarly to $W^h_q \tilde{u}^t$ in (7.33).
Further, in the earlier architecture, the proxy vector $z^t_k$ is the same vector used as input to
the attention mechanism. This agrees with using the encoder output $z$ as key and value
inputs in (7.33).
Now, after highlighting the differences between a transformer block as in Figure 7.15 (a)
and the transformer decoder block in Figure 7.15 (b), we can briefly summarize the layers
and steps of the transformer decoder block. The matrix of inputs to each block, with each
input vector denoted $a^t_{\text{in}}$, is first processed with causal multi-head self attention to yield
$u^1_{[1]}, \ldots, u^T_{[1]}$. Now, exactly as in (7.31), layer normalization and residual connections yield
$u^1_{[2]}, \ldots, u^T_{[2]}$. Then this sequence of vectors (or matrix) is processed via the multi-head cross
attention layer using (7.33) and (7.26), where $\tilde{u}^t$ is $u^t_{[2]}$, and the encoder output $z$ is put
to use. The result is $u^1_{[3]}, \ldots, u^T_{[3]}$. Then layer normalization and residual connections are
applied again, yielding $u^1_{[4]}, \ldots, u^T_{[4]}$. This sequence is now fed into the feedforward layer, to
yield $u^1_{[5]}, \ldots, u^T_{[5]}$. Finally, each $u^t_{[5]}$ is again applied with layer normalization and residual
connections to yield the output $a^t_{\text{out}}$. There are 6 steps here, in comparison to the 4 steps
used in the transformer block of the encoder.$^{13}$
$^{13}$ We count layer normalization and residual connection as a single step.
The parameters of each transformer decoder block include those of the transformer block
from Figure 7.15, as well as parameters resulting from the multi-head cross attention layer
and its normalization. Specifically, for the decoder, the parameter count in (7.32) needs to
be augmented with an additional $H \times (2 m p + 2 m_v p)$ term for the multi-head cross attention,
as well as $2p$ for the additional normalization parameters. As an example, using the same
dimensions as above, we now have over 4 million parameters, or $4{,}199{,}936$ exactly, for a
transformer decoder block.
If we now consider an encoder-decoder transformer architecture with 6 transformer blocks in
the encoder and 6 transformer decoder blocks in the decoder, then the number of parameters
in the encoder is about 19 million, and the number of parameters in the decoder is about
25 million. As mentioned above, we also require an additional layer on top of the decoder,
transforming each output vector to a token (for example to a natural language word or part
of it). This is simply a linear layer with a softmax (i.e., a multinomial regression). This
type of layer is needed at the end of any pipeline that generates text. If we assume that
the number of word tokens$^{14}$ is $d_V \approx 37{,}000$, then the number of parameters of the final
multinomial regression is $d_V \times p + p$, which is about 19 million parameters in our case. Hence,
putting the pieces together, we have about 63 million parameters in the whole model.$^{15}$
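This full tally can be checked with a short computation; the helper below repeats the block count of (7.32), and the resulting figures match the approximations in the text.

```python
def transformer_block_params(p, N1, m, m_v, H):
    # (7.32): layer norms + feedforward weights and biases + multi-head self attention
    return 4 * p + 2 * N1 * p + (N1 + p) + H * (2 * m * p + 2 * m_v * p)

p, N1, m, m_v, H, d_V = 512, 2048, 64, 64, 8, 37_000
enc_block = transformer_block_params(p, N1, m, m_v, H)
dec_block = enc_block + H * (2 * m * p + 2 * m_v * p) + 2 * p  # cross attention + extra norm
softmax_layer = d_V * p + p                                    # final multinomial regression
total = 6 * enc_block + 6 * dec_block + softmax_layer
print(enc_block, dec_block, total)  # 3150336 4199936 63046144
```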
The encoder-decoder transformer model is the cornerstone of large language models, and
indeed by 2023, models with up to half a trillion parameters are already in use. Such a
parameter count is roughly 8,000 times more parameters than the size of the transformer
discussed above.
Using the Encoder-Decoder in Production and Training
In our description of the encoder-decoder architecture above, except for indicating that
positional embedding is applied, we did not specify the decoder inputs. The way that decoder
inputs are used depends on the task at hand, and the form of the inputs varies between
production and training. We now present the details.
First, in production (inference or test-time), note that we use the decoder in an auto-regressive
manner as in Figure 7.17. Specifically, the code from the encoder $z$ is presented to the
decoder and we iterate executions of the decoder until a <stop> token (or word) is realized.
In the first iteration we set the input sequence to the decoder to only have a <start> token
embedding, and then with every iteration we present the output sequence up to the previous
iteration as input to the decoder. That is, at iteration $t$, the decoder computation can be
represented as
$$\hat{y}^t = f_{\text{decoder}}\big(z, (\tilde{\hat{y}}^1, \ldots, \tilde{\hat{y}}^{t-1})\big), \qquad (7.34)$$
where $z$ is the code vector from the encoder, and $\tilde{\hat{y}}^t$ is an embedded and positionally
embedded vector, resulting from the decoder output token $\widehat{Y}^t$. The transformation from the
decoder output $\hat{y}^t$ to the decoder output token $\widehat{Y}^t$ can naively be done via an argmax as
in (??) in Chapter ??, or by sampling tokens according to the probability output $\hat{y}^t$. More
advanced multi-step techniques such as beam search can also be employed, where several
consecutive tokens are considered together; we omit the details. In summary, as we see in
(7.34), the transformer decoder output at time $t$ is a function of its previous outputs, while
the first input $\tilde{\hat{y}}^1$ is an embedded and positionally embedded representation of <start>.
$^{14}$ This is a common tokenization dimension and was used in the original transformers paper.
$^{15}$ The first introduced transformer architecture, [51], is estimated to have used about 65 million parameters. Such relatively small discrepancies are due to implementation.
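The production loop around (7.34) can be sketched generically; in this sketch `f_decoder` and `embed` are stand-ins for a trained decoder and for the (positional) embedding step, and we use the naive argmax selection.

```python
def generate(f_decoder, embed, z, start_token, stop_token, max_len=50):
    """Greedy auto-regressive decoding in the spirit of (7.34): feed the
    outputs so far (embedded with their positions) back into the decoder
    until the stop token appears."""
    tokens = [start_token]
    while len(tokens) < max_len:
        # Embed every token produced so far, with its position t = 1, 2, ...
        embedded = [embed(tok, t) for t, tok in enumerate(tokens, start=1)]
        y_hat = f_decoder(z, embedded)   # probabilities over the vocabulary
        next_tok = max(range(len(y_hat)), key=y_hat.__getitem__)  # naive argmax
        if next_tok == stop_token:
            break
        tokens.append(next_tok)
    return tokens[1:]                    # drop the <start> token

# Toy demonstration with a hypothetical decoder that deterministically
# emits tokens 2, 3 and then the stop token 1:
target = [2, 3, 1]
toy_decoder = lambda z, embedded: [1.0 if k == target[len(embedded) - 1] else 0.0
                                   for k in range(5)]
toy_embed = lambda tok, t: (tok, t)
print(generate(toy_decoder, toy_embed, z=None, start_token=0, stop_token=1))  # [2, 3]
```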
(Diagram: the transformer encoder processes the English input “We love deep learning <stop>” once, producing the code $z$; the transformer decoder is then applied iteratively to “<start>”, then “<start> nous”, then “<start> nous aimons”, and so on, with $z$ fed in at every iteration.)
Figure 7.17: Auto-regressive application of a transformer decoder block for machine translation.
The transformer encoder runs once on the whole input English sentence, creating the code $z$. This
code is then used with every iteration of the transformer decoder block, where with each iteration an
additional word (or token) is created, and the previous output is fed as input in an auto-regressive
manner. Generation stops (not illustrated in the figure) when a <stop> token appears at the output.
Now considering training, a natural naive approach is to use (7.34) directly in each forward
pass and backward pass iteration when computing gradients, namely, to use backward
propagation through time. However, (7.34) naturally lends itself to teacher forcing,
similarly to the use of teacher forcing when training other encoder-decoder models, as
discussed at the end of Section 7.4. With this approach, (7.34) is converted to
$$\hat{y}^t = f_{\text{decoder}}\big(z, (\tilde{y}^1, \ldots, \tilde{y}^{t-1})\big),$$
where now the training data one-hot encoded labels $y^1, \ldots, y^t$, in their embedded and
positionally embedded form $\tilde{y}^1, \ldots, \tilde{y}^t$, are used as input to the transformer decoder
instead of the predictions. This teacher forcing technique accelerates training by removing
accumulated prediction errors during early phases of the process. Further, an important
difference in teacher forcing of transformers vs. teacher forcing of recurrent encoder-decoder
frameworks is that with transformers we can exploit parallelization. Specifically, for the
forward pass, we may compute each of these in parallel:
$$\hat{y}^2 = f_{\text{decoder}}\big(z, (\tilde{y}^1)\big),$$
$$\hat{y}^3 = f_{\text{decoder}}\big(z, (\tilde{y}^1, \tilde{y}^2)\big),$$
$$\hat{y}^4 = f_{\text{decoder}}\big(z, (\tilde{y}^1, \tilde{y}^2, \tilde{y}^3)\big),$$
$$\vdots$$
$$\hat{y}^T = f_{\text{decoder}}\big(z, (\tilde{y}^1, \tilde{y}^2, \ldots, \tilde{y}^{T-1})\big).$$
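In an implementation, this parallel teacher-forced pass rests on the causal mask inside the decoder's self attention: query position $t$ receives zero weight on positions beyond $t$, so one pass over the whole sequence respects every prefix. A minimal NumPy sketch of such a masked softmax:

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax under a causal mask: the weight of position tau
    for query position t is forced to zero whenever tau > t."""
    T = scores.shape[0]
    masked = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

A = causal_softmax(np.zeros((4, 4)))
# Row t is a distribution over positions 1..t only: the first row puts all
# weight on position 1, and the last row is uniform over all 4 positions.
```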
Note that while transformers were introduced for machine translation and are currently
the powerhouse of large language models for generative text modeling, adaptations of
transformers have also been successful for other non-language tasks, including image tasks. In
fact, transformer models compete with convolutional models, and in certain cases outperform
convolutional models on images, especially in the presence of huge training datasets.
Notes and References
A useful applied introductory text about time-series sequence data analysis is [24] and a more
theoretical book is [6]. Yet, while these are texts about sequence models, the traditional statistical
and forecasting time-series focus is on cases where each $x^t$ is a scalar or a low dimensional vector.
Neural network models, the topic of this chapter, are very different, and for an early review of
recurrent neural networks and generalizations see chapter 10 of [17] and the many references therein,
where key references are also listed below.
As the most common application of sequence models is textual data, let us mention early texts on
natural language processing (NLP). General early approaches to NLP are summarized in [35] and
[26], where the topic is tackled via rule-based approaches based on the statistics of grammar. A
much more modern summary of applications is [30] and a review of applications of neural networks
for NLP is in [16], yet this field is quickly advancing at the time of publishing of this current book.
See also chapter 7 of the book [1] for a comprehensive discussion of RNNs as well as their long short
term memory (LSTM) and gated recurrent unit (GRU) generalizations.
Recurrent neural networks (RNN) are useful for broad applications such as DNA sequencing, see
for example [46], image captioning as in [23], time series prediction as in [5], sentiment analysis
as in [34], speech recognition as in [19], and many other applications. Possibly one of the first
constructions of recurrent neural networks (RNN) in their modern form appeared in [15] and is
sometimes referred to as an Elman network. Yet this was not the inception of ideas for recurrent
neural networks, and earlier ideas appeared in several influential works over the previous decades.
See [44] for an historical account, with notable earlier publications including [2] in 1972, and [22]
and [43] in the 1980's.
The introduction of bidirectional RNNs is in [45]. The introduction of long short term memory
(LSTM) models in the late 1990's was in [21]. Gated recurrent units (GRUs) are much more recent
concepts and were introduced in [9] and [11], after the big spark of interest in deep learning occurred.
An empirical comparison of these various approaches is in [25]. A more contemporary review of
LSTM is in [54]. These days, for advanced NLP tasks LSTMs and GRUs are generally outperformed
by transformer models, yet in non-NLP applications we expect to see LSTMs remain a useful tool
for many years to come. Some recent example application papers include [28], [37], [42], and [55],
among many others.
Moving onto textual data, the idea of word embeddings is now standard in the world of NLP. The
key principle originates with the word2vec work in [36]. Word embedding was further developed with
GloVe in [38]. These days, when considering dictionaries, lexicons, tokenizations, and word embeddings,
one may often use dedicated libraries such as, for example, those supplied (and contributed to) with
HuggingFace.$^{16}$ An applied book in this domain is [50] and since the field is moving quickly, many
others are to appear as well.
The modern neural encoder-decoder approach was pioneered by [27] and then, in the context of
machine translation, influential works are [10] and [48]. The idea of using attention in recent times,
first for handwriting recognition, was proposed in [18] and then the work in [4] extended the idea
and applied it to machine translation as we illustrate in our Figure 7.10. A massive advance was
with the 2017 paper, “Attention is all you need” [51], which introduced the transformer architecture,
the backbone of almost all of today's leading large language models. Ideas of layer normalization are
from [3]. Further details of transformers can be found in [39], and a survey of variants of transformers
as well as non-NLP applications can be found in [31].
At the time of publishing of this book, the hottest topics in the world of deep learning are large
language models and their multi-modal counterparts. A recent comprehensive survey is in [56], and
other surveys are [8] and [20]. We should note that as this particular field is moving very rapidly at
the time of publication of the book, there will surely be significant advances in the years to come.
Multimodal models are being developed and deployed as well, and these models have images as
input and output in addition to text; see [52] for a survey. Indeed, beyond the initial task of machine
translation, transformers have also been applied to images with incredible success. A first landmark
paper on this avenue is [13]. See also the survey papers [29] and [33].
$^{16}$ https://huggingface.co.
While this list is certainly non-exhaustive, we also mention some of the key large language model
architectures that emerged following the “Attention is all you need” paper [51]. Some of these are
BERT [12], RoBERTa [32], XLNet [53], GPT-2 [40], and GPT-3 [7]. Other notable LLMs are GLaM [14],
Gopher [41], Chinchilla [41], Megatron-Turing NLG [47], and LaMDA [49]. The topic of training and
using these models is beyond our scope.
Bibliography
[1] C. C. Aggarwal. Neural Networks and Deep Learning. Springer, 2018.
[2] S. Amari. Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Transactions on Computers, 1972.
[3] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv:1607.06450, 2016.
[4] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
[5] W. Bao, J. Yue, and Y. Rao. A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PLoS ONE, 2017.
[6] P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods. Springer Science & Business Media, 1991.
[7] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020.
[8] Y. Chang, X. Wang, J. Wang, Y. Wu, K. Zhu, H. Chen, L. Yang, X. Yi, C. Wang, Y. Wang, et al. A survey on evaluation of large language models. arXiv:2307.03109, 2023.
[9] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259, 2014.
[10] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.
[11] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.
[12] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.
[13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929, 2020.
[14] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, et al. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, 2022.
[15] J. L. Elman. Finding structure in time. Cognitive Science, 1990.
[16] Y. Goldberg. Neural Network Methods for Natural Language Processing. Springer Nature, 2022.
[17] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[18] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
[19] A. Graves, A. R. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
[20] M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh, N. Akhtar, J. Wu, S. Mirjalili, et al. Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. Authorea Preprints, 2023.
[21] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[22] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 1982.
[23] M. Z. Hossain, F. Sohel, M. Shiratuddin, and H. Laga. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 2019.
[24] R. J. Hyndman and G. Athanasopoulos. Forecasting: Principles and Practice. OTexts, 3rd edition, 2021.
[25] R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network architectures. In International Conference on Machine Learning, 2015.
[26] D. Jurafsky and J. H. Martin. Speech and Language Processing. Pearson, 2000.
[27] N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013.
[28] Z. Karevan and J. A. K. Suykens. Transductive LSTM for time-series prediction: An application to weather forecasting. Neural Networks, 2020.
[29] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah. Transformers in vision: A survey. ACM Computing Surveys (CSUR), 2022.
[30] D. Khurana, A. Koli, K. Khatter, and S. Singh. Natural language processing: State of the art, current trends and challenges. Multimedia Tools and Applications, 2023.
[31] T. Lin, Y. Wang, X. Liu, and X. Qiu. A survey of transformers. AI Open, 2022.
[32]
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettle-
moyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach.
arXiv:1907.11692, 2019.
[33]
Y. Liu, Y. Zhang, Y. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. Shi, J. Fan, and
Z. He. A survey of visual transformers. IEEE Transactions on Neural Networks and
Learning Systems, 2023.
[34]
A. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning
word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human Language Technologies, 2011.
[35]
C. Manning and H. Schutze. Foundations of Statistical Natural Language Processing.
MIT press, 1999.
[36]
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representa-
tions in vector space. arXiv:1301.3781, 2013.
[37]
A. Moghar and M. Hamiche. Stock market prediction using lstm recurrent neural
network. Procedia Computer Science, 2020.
[38]
J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word represen-
tation. In Proceedings of the 2014 conference on empirical methods in natural language
processing (EMNLP), 2014.
[39]
M. Phuong and M. Hutter. Formal algorithms for transformers. arXiv:2207.09238,
2022.
[40]
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models
are unsupervised multitask learners. OpenAI blog, 2019.
[41]
J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides,
S. Henderson, R. Ring, S. Young, et al. Scaling language models: Methods, analysis &
insights from training gopher. arXiv:2112.11446, 2021.
[42]
J. Rodriguez-Perez, C. Leigh, B. Liquet, C. Kermorvant, E. Peterson, D. Sous, and
K. Mengersen. Detecting technical anomalies in high-frequency water-quality data using
artificial neural networks. Environmental Science & Technology, 2020.
[43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986.
[44] J. Schmidhuber. Annotated history of modern AI and deep learning. arXiv:2212.11279, 2022.
[45] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.
[46] Z. Shen, W. Bao, and D. S. Huang. Recurrent neural network for predicting transcription factor binding sites. Scientific Reports, 2018.
[47] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti, et al. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv:2201.11990, 2022.
[48] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 2014.
[49] R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al. LaMDA: Language models for dialog applications. arXiv:2201.08239, 2022.
[50] L. Tunstall, L. V. Werra, and T. Wolf. Natural Language Processing with Transformers. O'Reilly Media, Inc., 2022.
[51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
[52] P. Xu, X. Zhu, and D. A. Clifton. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[53] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, 2019.
[54] Y. Yu, X. Si, C. Hu, and J. Zhang. A review of recurrent neural networks: LSTM cells and network architectures. Neural Computation, 2019.
[55] N. Zhang, S. Shen, A. Zhou, and Y. Jin. Application of LSTM approach for modelling stress–strain behaviour of soil. Applied Soft Computing, 2021.
[56] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J. Y. Nie, and J. R. Wen. A survey of large language models. arXiv:2303.18223, 2023.