Mathematical Engineering
of Deep Learning
Benoit Liquet, Sarat Moka and Yoni Nazarathy
March 3, 2026
4 Optimization Algorithms
The field of optimization is an important sub-field of applied mathematics. In the context
of deep learning it is just as important. Any form of training process for a deep learning
model requires optimization to minimize the loss. The decision variables are the learned
model parameters, and the data is typically considered fixed from the point of view of
optimization. In this chapter we explore general optimization methods and results, focusing
on the essential tools for optimization in the process of training deep learning models.
In Section 4.1 we formulate optimization problems in a general setup, discuss their forms,
and review local extrema, global extrema, and convexity. In Section 4.2 we specialize to
optimization of learned parameters for deep learning. We discuss the nature of such problems,
and common techniques including stochastic gradient descent, tracking performance measures,
and early stopping. In Section 4.3 we discuss various forms of first-order methods which
extend the basic gradient descent algorithm. The focus is on the ADAM optimizer,
which has become very popular for deep learning. In Section 4.4 we introduce
automatic differentiation, a computational paradigm for efficient evaluation of derivatives.
Here we present both forward and backward mode automatic differentiation in a general
context, and later in the next chapter we specialize to deep learning. We continue with
Section 4.5 where additional first-order methods are presented beyond gradient descent
and ADAM. In Section 4.6 we introduce and present an overview of second-order methods.
These powerful methods are sometimes used in deep learning, and when applicable they can
perform very well. Note that on a first reading of the book, one may focus on Sections 4.1,
4.2, 4.3, and 4.4 before continuing to understand the deep learning models of Chapter ??.
4.1 Formulation of Optimization
Optimization algorithms, such as gradient descent described in Algorithm ?? of Chapter ??,
primarily aim to minimize an objective. We have already seen in previous chapters that
“learning” essentially means “optimization”. Hence, to do well in deep learning, it is important
to understand the implementation of optimization algorithms as they play a crucial role in
minimizing the training loss.
The General Setup
Consider a multivariate real-valued function C : ℝ^d → ℝ. An optimization problem can be
written as

    minimize_θ C(θ)   subject to θ ∈ Θ,    (4.1)
where Θ ⊆ ℝ^d is called the feasible set or the constraint set. In the context of learning, Θ
represents the parameter space and the objective function C(θ) is the loss function.

In other words, the aim of the optimization problem is to find a point θ ∈ Θ where C(θ) is
minimum. Such a point, denoted θ*, is called a solution or minimizer of the optimization
problem (4.1). That is,

    C(θ*) ≤ C(θ),  for all θ ∈ Θ.    (4.2)

There can be multiple solutions to this problem and the set of all the solutions θ* is usually
denoted by

    argmin_{θ ∈ Θ} C(θ).
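For a finite feasible set, the definitions of a minimizer and of argmin can be illustrated directly by brute-force enumeration. The following Python sketch is our own illustration (the objective and the grid are arbitrary choices, not from the text): it evaluates C(θ) = θ₁² + θ₂² over a grid Θ and collects the set of minimizers.

```python
import numpy as np

def C(theta):
    """Objective C(theta) = theta_1^2 + theta_2^2 (the convex bowl of Figure 4.1 (a))."""
    return theta[0] ** 2 + theta[1] ** 2

# A finite feasible set Theta: a coarse grid over the square [-1, 1]^2.
grid = np.linspace(-1.0, 1.0, 21)
Theta = [np.array([a, b]) for a in grid for b in grid]

# argmin over the finite set: collect all points attaining the minimal value.
values = np.array([C(theta) for theta in Theta])
minimizers = [Theta[i] for i in np.flatnonzero(values == values.min())]
```

Here the set of minimizers contains the single point (0, 0), matching (4.2) applied over this grid.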
The formulation (4.1) is general, in the sense that any optimization problem can be written
in this form. In particular, the maximization problem

    maximize_θ C(θ)   subject to θ ∈ Θ

can be stated in the form (4.1) by replacing C(θ) in (4.1) with −C(θ). We say that the
optimization problem is unconstrained if Θ = ℝ^d; otherwise, it is a constrained optimization
problem.
Our focus is primarily on unconstrained optimization and hence the “subject to” component
of (4.1) may be omitted. This is the common case in the context of deep learning, and in
particular it is the case for the optimization problems that we already encountered in the
context of linear regression (Section ??), logistic regression (sections ?? and ??), multinomial
regression (Section ??), and most cases of autoencoders (Section ??).
General optimization theory and practice comes in many shapes and forms where different
methods can be used for different sub-types of the general problem (4.1). In the context
of deep learning we typically make use of multivariate calculus¹ and assume (or construct)
the objective C(·) to have desirable properties, such as continuity, differentiability (almost
everywhere), and other smoothness properties. With this type of optimization, the gradient
∇C and Hessian ∇²C are typically used. The former is the key component of first-order
methods as described in sections 4.3 and 4.5. The latter sometimes plays a role in second-order
methods as described in Section 4.6. Throughout this chapter, when we make use of
the gradient or Hessian, we are implicitly assuming that these objects exist and are well
defined for the problem at hand.
Local and Global Minima
A solution of an unconstrained minimization problem is also known as a global minimum
because of (4.2). However, there can be other points that are not solutions to the optimization
problem but exhibit similar properties within their vicinity. Such points are known as local
minima. In particular, a point θ ∈ ℝ^d is called a local minimum of a continuous function C
if there is an r > 0 such that C(θ) ≤ C(ϕ) for all ϕ ∈ B(θ, r), where B(θ, r) is an open ball
with center θ and radius r > 0 defined by

    B(θ, r) = {ϕ ∈ ℝ^d : ∥θ − ϕ∥ < r}.
¹ Refer to Appendix ?? for a review.
In other words, if θ is a local minimum of C, we cannot decrease the value of C by moving
an infinitesimal distance from θ in any direction; see Figure 4.1 for illustrations of global
and local minima.
Suppose that the objective function C is differentiable at a local minimum θ. Then, it is
well known that the gradient ∇C(θ) = 0. This is a necessary condition for a point to be a
local minimum, but not sufficient. For example, the condition ∇C(θ) = 0 also holds if θ is a
local maximum of C, that is, if θ is a local minimum of −C. Further, the condition ∇C(θ) = 0
can hold even at a point θ that is neither a local minimum nor a local maximum; such
a point is called a saddle point, where the objective function slopes up in one direction and
slopes down in another direction (for smooth functions we define it below using the Hessian
matrix). See Figure 4.1 for examples.
Figure 4.1: Two dimensional smooth functions. (a) The function C(θ) = θ₁² + θ₂² is convex with
a unique global minimum. (b) A well-known function called the peaks function is an example of a
non-convex function with several local minima, local maxima, and saddle points. Around each local
minimum, the function is locally convex. On the figure, two saddle points are marked with red dots.
Convexity and Saddle Points
There are classes of optimization problems for which a local minimum is also guaranteed to
be a global minimum. One such popular class is the family of convex problems. While most
deep learning based optimization problems are not convex, understanding convexity helps to
better understand optimization concepts.
We can establish stronger necessary conditions for a point to be a local minimum via
convexity. A set Θ_C ⊆ ℝ^d is said to be a convex set if for any θ, ϕ ∈ Θ_C and λ ∈ [0, 1],

    λθ + (1 − λ)ϕ ∈ Θ_C.

That is, for a convex set, the line segment connecting any two points in the set remains in
the set. An example of a convex set is an open ball B(θ, r) for any θ ∈ ℝ^d and r > 0.
We are now ready to define convex functions. Suppose that Θ_C ⊆ ℝ^d is a convex set. Then,
a function C : ℝ^d → ℝ is called a convex function on Θ_C if for every θ, ϕ ∈ Θ_C and λ ∈ [0, 1],
we have

    C(λθ + (1 − λ)ϕ) ≤ λC(θ) + (1 − λ)C(ϕ).    (4.3)

Furthermore, we say that the function C is strictly convex on Θ_C if the inequality in (4.3)
holds with strict inequality for all λ ∈ (0, 1) and for every choice of θ, ϕ ∈ Θ_C with θ ≠ ϕ.
If the function C is convex on the entire space ℝ^d, then all the local minima are global
minima. Furthermore, if C is strictly convex on ℝ^d, then there is a unique global minimum.
See Figure 4.1 (a) for an example of a convex function.
If we further assume that C is twice differentiable, then we can characterize convexity and strict
convexity in terms of the Hessian matrix. In particular, C is convex on Θ_C if and only if the
Hessian ∇²C(θ) is positive semidefinite² at every interior point of Θ_C. Furthermore, C is
strictly convex on Θ_C if the Hessian ∇²C(θ) is positive definite at every interior
point of Θ_C (the converse does not hold in general; consider C(θ) = θ⁴ at the origin).
Note that if a point θ is a local minimum of a twice differentiable function C, then there
exists some r > 0 such that C is convex on the open ball B(θ, r), and hence we sometimes
say that C is locally convex around θ. See Figure 4.1 (b) for an example of a function with
several local minima with local convexity around each minimum point.
For twice differentiable functions, it is easy to define saddle points. A point θ with ∇C(θ) = 0
is called a saddle point if the Hessian ∇²C(θ) at θ is neither positive semidefinite nor negative
semidefinite.
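The Hessian-based classification of critical points can be turned into a small numerical check. The sketch below is our own illustration (the helper name, tolerance, and example Hessians are arbitrary choices): it classifies a critical point by inspecting the eigenvalues of ∇²C(θ).

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-10):
    """Classify a critical point (where the gradient vanishes) by the
    eigenvalues of the (symmetric) Hessian, following the text's definitions."""
    eig = np.linalg.eigvalsh(hessian)
    if np.all(eig > tol):
        return "local minimum"       # positive definite
    if np.all(eig < -tol):
        return "local maximum"       # negative definite
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"        # neither PSD nor NSD
    return "inconclusive"            # semidefinite boundary case

# Hessians at theta = 0 for C(theta) = theta_1^2 + theta_2^2 and for
# C(theta) = theta_1^2 - theta_2^2 (illustrative examples).
H_min = np.array([[2.0, 0.0], [0.0, 2.0]])
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])
```

For H_saddle the eigenvalues have mixed signs, so the origin of θ₁² − θ₂² is a saddle point in the sense defined above.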
Objective Functions in Deep Learning
In the context of deep learning, the typical form of the objective function is

    C(θ) := C(θ ; D) = (1/n) Σ_{i=1}^n C_i(θ),    (4.4)
where n is the number of data samples in a given dataset D. We have already seen this
form in (??) where C_i(θ) is the loss computed for the i-th data sample. Specific forms for
this i-th data sample loss include the square loss, cross entropy loss, and other forms that
have already appeared in chapters ?? and ??. In this chapter the data samples play a more
minor role, and hence from now onwards we suppress D from the notation and use C(θ).
The optimization problem that corresponds to the objective function (4.4) is often called
the finite sum problem. For such problems, it is relatively easy to study the properties
of C(θ) by studying the corresponding properties of each C_i(θ), its gradient ∇C_i(θ), and
Hessian ∇²C_i(θ). Most importantly, the form (4.4) plays an important role in finding the
descent direction for optimization methods such as stochastic gradient descent, described
in Section 4.2. When each C_i(θ) is twice differentiable, the gradient and Hessian of the
² See (??) in Appendix ?? for the definition of positive semidefinite and positive definite matrices.
objective C(θ) are respectively given by

    ∇C(θ) = (1/n) Σ_{i=1}^n ∇C_i(θ),  and  ∇²C(θ) = (1/n) Σ_{i=1}^n ∇²C_i(θ).    (4.5)
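The averaging structure in (4.5) is easy to verify numerically. In the sketch below, a hedged illustration of our own (the quadratic per-sample losses C_i(θ) = ∥θ − u_i∥² and the random data are arbitrary choices), the average of the per-sample gradients is checked against the closed-form gradient of the total loss C.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
U = rng.normal(size=(n, d))  # per-sample "targets": C_i(theta) = ||theta - u_i||^2

def grad_Ci(theta, i):
    return 2.0 * (theta - U[i])            # gradient of the i-th sample loss

def grad_C(theta):
    # (4.5): the gradient of the finite sum is the average of per-sample gradients.
    return np.mean([grad_Ci(theta, i) for i in range(n)], axis=0)

theta = rng.normal(size=d)
direct = 2.0 * (theta - U.mean(axis=0))    # gradient of C computed in closed form
```

For this choice of C_i, the average in (4.5) collapses to 2(θ − ū) with ū the mean of the u_i, and the two computations agree to machine precision.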
A local minimum of C(·) at θ can be characterized via ∇²C(θ) being positive semidefinite
in addition to ∇C(θ) = 0. It is useful to note that ∇²C(θ) is positive semidefinite at θ if
each ∇²C_i(θ) is positive semidefinite at θ, because for any ϕ ∈ ℝ^d,

    ϕ⊤ ∇²C(θ) ϕ = (1/n) Σ_{i=1}^n ϕ⊤ ∇²C_i(θ) ϕ.    (4.6)
Convexity of Certain Shallow Neural Networks
While most deep learning optimization problems are not convex, the loss functions of the
logistic regression and multinomial regression models of Chapter ?? are convex. The same
is true for the linear regression model of Chapter ??. For completeness, we now establish
convexity properties of these simple models. The details of this subsection may be skipped
on a first reading.
We begin with the linear regression model of Section ?? with the standard quadratic loss. In
this case the associated optimization problem is (??) and we may compute the Hessian for
the objective C(θ) = ∥y − Xθ∥² to be

    ∇²C(θ) = 2X⊤X,

where we use the notation of the design matrix X from (??). The Hessian is thus a Gram
matrix and this implies it is positive semidefinite.³ Hence the optimization problem is convex.
Further, if X has linearly independent columns then ∇²C(θ) is positive definite and the
problem is strictly convex and has a unique optimal point.
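The claim that the Hessian 2X⊤X is a positive semidefinite Gram matrix, and positive definite when X has linearly independent columns, can be checked numerically. A minimal sketch with a random design matrix of our own choosing (a Gaussian matrix has linearly independent columns with probability one):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 4
X = rng.normal(size=(n, d))     # a random design matrix (columns independent a.s.)

H = 2.0 * X.T @ X               # Hessian of C(theta) = ||y - X theta||^2

# A Gram matrix is PSD: phi^T (2 X^T X) phi = 2 ||X phi||^2 >= 0, and here
# the eigenvalues are in fact strictly positive (positive definite).
eigs = np.linalg.eigvalsh(H)
```

All eigenvalues come out strictly positive, confirming strict convexity of the quadratic loss for this design matrix.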
We now continue to logistic regression of sections ?? and ??. An expression for the gradient
is in (??) and we may follow up with an expression for the Hessian. Specifically, the d × d
Hessian matrix of each C_i(θ) is

    ∇²C_i(θ) = σ_Sig(b + w⊤x^(i)) (1 − σ_Sig(b + w⊤x^(i))) [1  x^(i)⊤]⊤ [1  x^(i)⊤],    (4.7)

which is evidently a product of a strictly positive scalar and a Gram matrix. Thus it is always
positive semidefinite, following the same argument used for linear regression with quadratic
loss.⁴
Now using the argument at (4.6), the matrix ∇²C(θ) is also positive semidefinite and
thus the total loss function C(θ) is convex. Further, in cases where the vectors (1, x^(i)) for
i = 1, . . . , n span ℝ^d (this can only hold when d ≤ n), we cannot find a non-zero vector
ϕ ∈ ℝ^d orthogonal to all the vectors (1, x^(i)). Thus, for any vector ϕ ≠ 0, there exists at
least one i such that ϕ⊤ ∇²C_i(θ) ϕ > 0, and hence ∇²C(θ) is positive definite, making the
total loss function C(θ) strictly convex.
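To illustrate (4.7) and the rank-one observation of the footnote, the following sketch (the data, variable names, and parameter values are our own illustrative choices) builds the per-sample Hessian and confirms that it is positive semidefinite of rank one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian_Ci(b, w, x):
    """Per-sample logistic-regression Hessian as in (4.7): a positive scalar
    sigma(1 - sigma) times the rank-one Gram matrix v v^T with v = (1, x)."""
    v = np.concatenate(([1.0], x))
    s = sigmoid(b + w @ x)
    return s * (1.0 - s) * np.outer(v, v)

rng = np.random.default_rng(2)
x = rng.normal(size=3)
H = hessian_Ci(b=0.5, w=rng.normal(size=3), x=x)
```

Since a single ∇²C_i(θ) has rank one, positive definiteness of the total Hessian can only come from summing over samples whose vectors (1, x^(i)) span ℝ^d, as argued above.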
³ For any matrix A, the associated Gram matrix is A⊤A, and thus ϕ⊤A⊤Aϕ = ∥Aϕ∥² ≥ 0 for any vector
ϕ. Further, with linearly independent columns, Aϕ ≠ 0 for any ϕ ≠ 0 and thus ∥Aϕ∥² > 0, implying that A⊤A is
positive definite.
⁴ Note however that ∇²C_i(θ) is never (strictly) positive definite because ∇²C_i(θ) is a rank one matrix.
We now move on to multinomial regression, as well as linear regression with other loss functions
introduced in Section ??. The optimization problems of these models also enjoy convexity
properties, yet expressions of the Hessian are not easy to work with. We thus use other
means to argue for convexity.
For multinomial regression, recall C_i(θ) from (??), which we can represent as

    C_i(θ) = log( Σ_{j=1}^K e^{b_j + w^(j)⊤ x^(i)} ) − (b_k + w^(k)⊤ x^(i)),    (4.8)

when y^(i) = k. Recall here that θ includes both bias terms (scalars) b_1, . . . , b_K and weight
vectors w^(1), . . . , w^(K); see (??).
As a building block we first use the log-sum-exp function, h : ℝ^K → ℝ, of the form

    h(v) = log Σ_{j=1}^K e^{v_j}.
We can show that h(v) is convex by explicitly calculating its Hessian expression and seeing
that it is positive semidefinite⁵ for all v ∈ ℝ^K. Further, using first principles of the definition
of convexity, (4.3), we can show that for any convex function h : ℝ^K → ℝ, h(θ⊤u_1, . . . , θ⊤u_K)
is convex in θ ∈ ℝ^d for any fixed u_1, . . . , u_K ∈ ℝ^d. Since θ is the vector consisting of all the
weights and the biases, by using multiple zeros in each u_j ∈ ℝ^d, we can construct u_j from
the p-dimensional vector x^(i) such that

    v_j = θ⊤u_j = b_j + w^(j)⊤ x^(i).

As a consequence, the first term of (4.8) is convex in θ. Further, the second term is also
convex as it is an affine function of θ. Hence (4.8) is a sum of convex functions and is thus
convex.
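Convexity of the log-sum-exp function can also be spot-checked directly against the definition (4.3). The sketch below is our own illustration, not a proof: it samples random pairs of points and mixing weights and verifies the convexity inequality, using the standard max-subtraction trick for numerical stability.

```python
import numpy as np

def logsumexp(v):
    # Numerically stable log-sum-exp: subtract the max before exponentiating.
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

# Spot-check the convexity inequality (4.3) for h on random pairs of points.
rng = np.random.default_rng(3)
K = 4
convex_ok = True
for _ in range(100):
    v, w = rng.normal(size=K), rng.normal(size=K)
    lam = rng.uniform()
    lhs = logsumexp(lam * v + (1 - lam) * w)
    rhs = lam * logsumexp(v) + (1 - lam) * logsumexp(w)
    convex_ok &= lhs <= rhs + 1e-12
```

Every sampled pair satisfies h(λv + (1 − λ)w) ≤ λh(v) + (1 − λ)h(w), consistent with the Hessian argument of the footnote.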
We now again consider linear regression, but this time with the absolute error loss (??) and
the Huber loss (??). In both of these cases, like the standard quadratic loss, we have

    C_i(θ) = g(y^(i) − θ⊤x^(i)),

where g : ℝ → ℝ is a convex function; see Figure ?? (a) in Chapter ??. Here again we use
the fact that the composition of a convex function and an affine function is convex, and this
establishes convexity of C_i(θ) and thus of C(θ) for all these loss function alternatives.
Before we depart from this discussion of convexity, we suggest that the reader consider the
plots of Figure ?? of Chapter ??. In that figure, it is evident that logistic regression with “the
wrong” loss function, namely squared loss in this case, may yield a non-convex problem. We
highlight this here to emphasize that the convexity properties discussed above are more of
an exception than the rule. Indeed, most deep learning models of this book have non-convex
objectives.
⁵ The gradient of the log-sum-exp function is the softmax function, for which we present the Jacobian in
(??), and the Jacobian of the gradient of a function is equal to the Hessian of that function. It also turns out
that the matrix in (??) is the covariance matrix of a categorical distribution (multinomial distribution with
one trial), implying that it is positive semidefinite.
General Approach of Descent Direction Methods
We have seen that the loss functions of logistic regression and multinomial regression are
convex, and thus they have global minima. Unfortunately, even for these shallow neural
networks, despite being convex, there is no known analytical solution for the minimizer.
Further, in more general deep neural networks, we do not have such convexity properties
and the objective functions typically exhibit numerous local minima. Thus, in general, an
analytical solution for the optimization problems in deep learning is out of reach, and we
have to rely on more advanced numerical optimization methods.

All the popular optimization methods used in deep learning can be put under one general
framework called the descent direction method. It is an iterative method that attempts to
reduce the objective in each iteration. This might seem obvious; however, the reader should
keep in mind that in general optimization methods beyond those covered in this book, steps
may sometimes increase the objective before further reduction. In contrast, with the descent
direction method, while increases may occur in certain iterations, the overall goal is to have
a reduction of the objective in each iteration. This method is presented in Algorithm 4.1.
Algorithm 4.1: The descent direction method
Input: Dataset D = {(x^(1), y^(1)), . . . , (x^(n), y^(n))},
       objective function C(·) = C(·; D), and
       initial parameter vector θ_init
Output: Approximately optimal θ
1: θ ← θ_init
2: repeat
3:     Determine the descent step θ_s
4:     θ ← θ + θ_s
5: until termination condition is satisfied
6: return θ
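Algorithm 4.1 can be sketched in a few lines of Python. The rendering below is our own (the function names, the learning rate, and the absolute-improvement termination rule are illustrative choices): the descent step is a pluggable function, and gradient descent arises as the special case θ_s = −α∇C(θ).

```python
import numpy as np

def descent_direction_method(C, descent_step, theta_init, max_iter=10_000, eps=1e-8):
    """A sketch of Algorithm 4.1: repeatedly apply a user-supplied descent step
    until the absolute improvement falls below eps (one possible termination rule)."""
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(max_iter):
        step = descent_step(theta)
        improvement = C(theta) - C(theta + step)
        theta = theta + step
        if abs(improvement) < eps:
            break
    return theta

# Gradient descent as a special case: theta_s = -alpha * grad C(theta),
# demonstrated on the convex objective C(theta) = theta_1^2 + theta_2^2.
C = lambda th: th[0] ** 2 + th[1] ** 2
grad_C = lambda th: 2.0 * th
gd_step = lambda th: -0.1 * grad_C(th)

theta_opt = descent_direction_method(C, gd_step, theta_init=[1.0, -1.0])
```

On this bowl-shaped objective the iterates contract geometrically towards the global minimum at the origin.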
As is evident, Algorithm 4.1 relies on the specific way in which we compute the descent step
and the termination condition. For each iteration t = 0, 1, 2, . . . of Algorithm 4.1, denote
the values of the parameter vector θ and the descent step θ_s by θ^(t) and θ_s^(t), respectively.
With this notation, the sequence of points θ^(1), θ^(2), . . . evolves via

    θ^(t+1) = θ^(t) + θ_s^(t),  with θ^(0) = θ_init,

until θ^(t) satisfies the specified termination condition. Note that the magnitude of the descent
step ∥θ_s^(t)∥ denotes the step size in the t-th iteration.
The descent step θ_s^(t) at the t-th iteration is determined by information available in or until
iteration t, where the exact manner in which it is computed depends on the variant of
the algorithm used. One specific example that we have already seen is the gradient descent
method, Algorithm ?? of Section ??, which is a special case of Algorithm 4.1. With gradient
descent, the descent step is set at θ_s = −α∇C(θ), where α is the learning rate, and since
α > 0, at each step the gradient descent method moves along the steepest descent direction.⁶
⁶ Note that some authors use the term steepest descent to denote a more general class of algorithms of
which gradient descent can be taken as a special case. However, in this book “steepest descent” and “gradient
descent” are considered synonymous.
Other variants encountered in the sequel are stochastic gradient descent (SGD) presented in
Section 4.2, ADAM presented in Section 4.3, second-order methods presented in Section 4.6,
and other variants.
When the goal is to minimize the objective function C, ideally we would like to terminate
the algorithm only after reaching a minimum point. This convergence is difficult to guarantee
in general, as the updates of the algorithm can slow down or oscillate in the vicinity of the
minimum. Therefore, in practice there are several termination conditions that allow some
tolerance so that the algorithm can output a point in the vicinity of the minimum. We now
list a few traditionally popular termination criteria. However, note that the next section
presents optimization in the context of deep learning, where other criteria are often used.
Most termination conditions use a small tolerance value, ε > 0. One standard termination
condition tracks absolute improvement, where we terminate when the change in the function
value is smaller than the tolerance. Namely,

    C(θ^(t)) − C(θ^(t+1)) < ε.
An alternative is to use the gradient magnitude or the step size, where we terminate
respectively if

    ∥∇C(θ^(t+1))∥ < ε,  or  ∥θ_s^(t)∥ < ε.
When the region around the (local) minimum is shallow, it is sometimes useful to use the
relative improvement criterion, where we terminate based on the relative change in the
function value. Specifically, we terminate if

    |C(θ^(t)) − C(θ^(t+1))| / |C(θ^(t))| < ε.
In addition to these criteria, which track the state of the optimization procedure, an alternative
is to simply run for a fixed number of iterations t = 0, . . . , T, or similarly to limit the running
(clock) time of the algorithm to a desired predefined value.
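The termination criteria above can be written as small predicates that a training loop consults after each update. The sketch below is our own (the function names are arbitrary, and the demo objective C_demo(θ) = ∥θ∥² is an illustrative choice):

```python
import numpy as np

# The three termination criteria described above, as predicates of the
# current iterate theta_t and the next iterate theta_next.
def absolute_improvement(C, theta_t, theta_next, eps):
    return C(theta_t) - C(theta_next) < eps

def gradient_magnitude(grad_C, theta_next, eps):
    return np.linalg.norm(grad_C(theta_next)) < eps

def relative_improvement(C, theta_t, theta_next, eps):
    return abs(C(theta_t) - C(theta_next)) / abs(C(theta_t)) < eps

# Illustrative objective for trying the predicates out: C(theta) = ||theta||^2.
C_demo = lambda th: float(np.sum(np.asarray(th) ** 2))
grad_demo = lambda th: 2.0 * np.asarray(th)
```

In practice one would pick a single criterion (or a disjunction of several) and pass it to Step 5 of Algorithm 4.1.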
Note that the termination condition at Step 5 of Algorithm 4.1 can differ between
methods. The output point obtained by the algorithm may not necessarily minimize
the objective function C(·); rather, it depends on the termination condition. For instance,
as we see in Section 4.2, the final θ returned by Algorithm 4.1 could be chosen based on
validation set accuracy, and thus not be a minimizer of the objective function C(·).
4.2 Optimization in the Context of Deep Learning
In the previous section we saw that certain optimization problems are tractable in the sense
that a local minimum is also the global minimum. Specifically, the loss functions of linear
regression, logistic regression, and multinomial regression are convex. However, in most
deep learning models, starting with the models that we cover in the next chapter, the loss
functions are not as “blessed”. For such deep learning models, the number of parameters is
huge and the problems are non-convex, often having a large number of local minima.
Nevertheless, basic optimization techniques are very useful for deep learning with the
compromise of not finding the global minimum of the loss, but rather finding a point close
to some local minimum with satisfactory performance. That is, the obtained point has
performance which is in general not far off from the performance at the global minimum.
Indeed, in deep learning, we use optimization algorithms to try to minimize the training
loss function C(θ), while our ultimate goal is finding θ that has good performance on
unseen data; see Section ?? for a discussion of performance on unseen data. That is, the
goal is to actually minimize expectations or averages of performance functions⁷ P(θ), such
as the validation set error rate. We then hope that this minimization resembles performance
on unseen data. Hence, as the reader studies the optimization methods of this chapter, they
should keep in mind that minimization of the loss C(θ) is often a proxy for minimization
of P(θ).
Challenges Posed by Basic Gradient Descent
Almost all cases where optimization is used for deep learning involve some variant of gradient
descent; see Algorithm ?? of Section ?? for an introduction to gradient descent. However,
basic gradient descent poses a few important challenges that can make the algorithm
impractical in the deep learning context.

The primary challenge is that in each iteration, the gradient descent algorithm uses the
entire dataset to compute the gradient ∇C(θ). In the context of deep learning, the sample
size n of the training data can be large or the dimension d of the parameter θ can be huge,
or both. As a consequence, the operation of computing the gradient using the entire dataset
for every update can be very slow, or even intractable in terms of memory requirements.
Further, as already mentioned above, loss functions of deep neural networks tend to be
highly non-convex, with multiple local minima. For such functions, the a priori selection
of a good learning rate α is very difficult. A low learning rate can lead to increased
computational cost as a result of slow convergence of the algorithm, and it can lead to the
algorithm getting stuck at local minima or flat sections of the loss landscape, eliminating
the chance to explore other local minima. On the other hand, a large learning rate can result
in the objective function values and decision variables fluctuating in a haphazard manner.
Even for simple convex models, as we saw in Figure ?? of Section ??, the choice of the
learning rate affects learning performance. As a second elementary example with d = 2,
assume hypothetically that the loss landscape takes the form of the following Rosenbrock
function,

    C(θ) = (1 − θ₁)² + 100(θ₂ − θ₁²)²,    (4.9)

which is non-convex yet has a unique global minimum at θ = (1, 1).
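To experiment with the behaviour that the upcoming figure illustrates, the Rosenbrock objective and its gradient are easy to code directly. The sketch below is our own (the iteration count and the smaller learning rate are deliberate choices to keep the iteration stable; the exponential decay factor mirrors the schedule discussed next):

```python
import numpy as np

def rosenbrock(theta):
    # C(theta) = (1 - theta_1)^2 + 100 (theta_2 - theta_1^2)^2, cf. (4.9)
    return (1 - theta[0]) ** 2 + 100 * (theta[1] - theta[0] ** 2) ** 2

def rosenbrock_grad(theta):
    d1 = -2 * (1 - theta[0]) - 400 * theta[0] * (theta[1] - theta[0] ** 2)
    d2 = 200 * (theta[1] - theta[0] ** 2)
    return np.array([d1, d2])

def gradient_descent(alpha, decay=1.0, iters=5000, theta_init=(0.2, 0.2)):
    theta = np.array(theta_init, dtype=float)
    for t in range(iters):
        # alpha^(t) = alpha * decay^t: a predefined learning rate schedule.
        theta -= alpha * (decay ** t) * rosenbrock_grad(theta)
    return theta

theta_fixed = gradient_descent(alpha=0.001)               # fixed learning rate
theta_decay = gradient_descent(alpha=0.001, decay=0.99)   # decaying learning rate
```

With the fixed small α the iterates crawl slowly along the curved valley of (4.9), and with the decaying schedule the effective step size vanishes and the trajectory stalls early, reflecting the difficulty of tuning α discussed in the text.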
Figure 4.2 illustrates the performance of gradient descent on (4.9). Here we also introduce
the option of varying the learning rate as optimization iterations progress, based on a
predefined learning rate schedule in this case. Specifically, at each iteration t, instead of using
the fixed learning rate α we use α^(t), which is indexed by iteration t and parameterized by
the original α. This strategy is sometimes used in practice, yet as we see in this example, it
does not work well for the parameters that we choose, and in none of the four trajectories
starting at an initial point θ_init do we find the minimum. Note that in this specific example
it is not hard to fine-tune the learning rate α and/or the exponential decay parameter of
⁷ Note that in Section ?? we denoted the performance measure function as a function of ŷ and y for each
observation in D, whereas here we are taking it as a function of the parameters θ.
α^(t), which is set at 0.99. However, this is a simple, hypothetical, d = 2 dimensional loss
landscape, whereas in actual problems with d > 2, tuning α^(t) to achieve efficient learning
(with basic gradient descent) is often very difficult or impossible.
Figure 4.2: Attempting minimization of a Rosenbrock function with a minimum at θ = (1, 1)
via basic gradient descent for two values of α (0.005 and 0.0025) with 10⁵ iterations and
θ_init = (0.2, 0.2). (a) Fixed learning rate: α^(t) = α. (b) Exponentially decaying learning rate:
α^(t) = α · 0.99^t.
Instead of adjusting the learning rate according to a predefined schedule, other variants
try to adjust the learning rate according to other criteria, such as the objective function
crossing some threshold values. Further, in some implementations, α^(t) is optimized in each
iteration so that the function C(θ) decreases maximally along the chosen descent direction; see
for example the line search methods described in Section 4.5. However, such strategies are again
not fool-proof on their own, or are often not practical for deep learning.
Furthermore, even if a magical (“optimal”) learning rate α can be suggested for each iteration,
the basic gradient descent method applies the same learning rate to all parameter updates. It
turns out that it is better to learn the parameters θ₁, . . . , θ_d at different rates in an adaptive
manner, because larger rates on rarely changing parameter coordinates may result in faster
convergence of the algorithm. To handle such cases, it is common practice to use the ADAM
algorithm. This variant of gradient descent, together with its building blocks, is the focus
of Section 4.3.
Prior to exploring ADAM and related gradient descent adaptations, let us return to the key
issue mentioned above, which is the computational complexity of obtaining exact gradients
∇C(θ). This issue is typically handled by computing approximate gradients, either via
stochastic gradient descent, which we describe now, or more practically with the use of mini-batches
described in the sequel. Importantly, both of these approaches yield fast computable
approximate gradients. These approaches also yield parameter trajectories that manage to
escape local minima, flat regions, or saddle points by introducing randomness or noise into
the optimization process.
Stochastic Gradient Descent
Recall that the loss function C(θ) for deep neural networks is typically a finite sum of the
form (4.4), with C_i(θ) being the loss computed for the i-th data sample. The method of
stochastic gradient descent exploits this structure by computing a noisy gradient using only
one randomly selected training sample. That is, it evaluates the gradient for a single C_i(θ)
instead of the gradient for all of C(θ).
More precisely, the algorithm operates similarly to Algorithm ?? (or variations of it), yet in
the t-th iteration of stochastic gradient descent, an index variable I_t is randomly selected
from the set {1, . . . , n} and the gradient ∇C_{I_t}(θ) is computed only for the I_t-th data sample
in place of Step 3 of Algorithm ??. The update rule for the decision variable is then

    θ^(t+1) = θ^(t) − α ∇C_{I_t}(θ^(t)).    (4.10)
This stochastic algorithm exhibits some interesting properties. Most importantly, the
execution of each iteration can be much faster than that of basic gradient descent because
only one sample is used at a time (the gain is of order n). Further, if the index variable I_t is
selected uniformly over {1, . . . , n}, the parameter update (4.10) is unbiased in the sense that
the expected descent step in each iteration t is the same as that of gradient descent. That
is, in each iteration we have

    E[∇C_{I_t}(θ)] = (1/n) Σ_{i=1}^n ∇C_i(θ) = ∇C(θ),  for a fixed θ,    (4.11)

and hence the stochastic step (4.10) is “on average correct”.⁸
While being unbiased, the stochastic nature of (4.10) introduces fluctuations in the descent
direction. Such noisy trajectories may initially appear to be a drawback, yet one advantage
is that they enable the algorithm to escape flat regions, saddle points, and local minima.
Thus, while the fluctuations make it difficult to guarantee convergence, in the context of
deep learning their effect is often considered desirable in view of highly non-convex loss
landscapes.
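Both the update (4.10) and the unbiasedness property (4.11) can be demonstrated on a toy finite-sum problem. In the sketch below, our own illustration (the quadratic per-sample losses, the seed, and the learning rate are arbitrary choices), a single uniformly sampled index drives each update, and the average of per-sample gradients is compared with the full gradient at a fixed point:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 2
U = rng.normal(size=(n, d))             # per-sample minimizers u_i

def grad_Ci(theta, i):
    return 2.0 * (theta - U[i])         # gradient of C_i(theta) = ||theta - u_i||^2

def full_grad(theta):
    return 2.0 * (theta - U.mean(axis=0))

# A run of SGD as in (4.10), with a uniformly sampled index per iteration.
theta = np.zeros(d)
alpha = 0.05
for t in range(2000):
    i = rng.integers(n)
    theta = theta - alpha * grad_Ci(theta, i)

# Empirical check of the unbiasedness property (4.11) at a fixed point theta0:
# averaging the per-sample gradients over all i recovers the full gradient.
theta0 = np.array([1.0, -1.0])
avg = np.mean([grad_Ci(theta0, i) for i in range(n)], axis=0)
```

The iterates do not converge to a single point: they keep fluctuating around the minimizer of C (here, the mean of the u_i), which foreshadows the behaviour shown for the example (4.12) below.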
Importantly, stochastic gradient descent is generally able to direct the parameters θ towards
the region where minima of C_i(θ) are located. To get a feel for this attribute of stochastic
gradient descent, we consider another hypothetical example where, for i = 1, . . . , n,

    C_i(θ) = a_i(θ₁ − u_{i,1})² + (θ₂ − u_{i,2})²,    (4.12)

for some constant a_i > 0 and u_i = (u_{i,1}, u_{i,2}) ∈ ℝ². Here u_i is the unique minimizer of C_i(θ).
These individual loss functions for each observation are convex, and hence the total loss
function C(θ) of (4.4) is also convex.⁹ Let S be the convex hull of u_1, . . . , u_n, that is, S is
the smallest convex set that contains {u_1, . . . , u_n}. It follows that the global minimum
of C also lies in S.
⁸ The reader might be tempted to conclude that E[θ^(t)_SGD] = θ^(t)_GD for all time t, where the subscript SGD
is for stochastic gradient descent starting at the same initial condition as GD (“gradient descent”). However,
in general this is not correct when the gradient ∇C(θ) is non-linear in θ.
⁹ This example does not resemble a general case with multiple local minima since it is convex. Nevertheless,
it is useful for understanding some of the behaviour of stochastic gradient descent.
In Figure 4.3 we plot the contours of such a function for n = 5, where in Figure 4.3 (a) the
points u_1, . . . , u_5 are distinct and in (b) these points are identical. The convex hull is marked
by the region bounded in green. The figure also plots the evolution of gradient descent and
stochastic gradient descent in each of the cases. As is evident, while the stochastic gradient
descent trajectory is noisier, outside of the convex hull S its trajectory is similar to that of
gradient descent.
Interestingly, in Figure 4.3 (a), once the trajectory hits S it mostly stays within S, but with a lot of fluctuations. The reason for this behavior is that at any point θ not in S, the descent directions for both gradient descent and stochastic gradient descent are towards the set S, as it contains the minima of C(θ) and every C_i(θ). To make this argument concrete, note that the descent step in the t-th iteration of stochastic gradient descent is

    −α∇C_{I_t}(θ^(t)) = −α Σ_{i=1}^{n} 1{I_t = i} ∇C_i(θ^(t)).
Since every C_i(θ) has its minimum in the set S, if θ^(t) ∉ S, then every −∇C_i(θ^(t)), the steepest descent direction of C_i(θ) at θ^(t), guides the algorithm towards the set S. Therefore, irrespective of the choice of I_t, the update moves θ^(t+1) towards S.
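To make this tangible, the following Python sketch runs stochastic gradient descent on a hypothetical instance of (4.12); the constants a_i, the minimizers u_i, the step size, and the initial point are all made up for illustration. Starting far from the convex hull of the u_i, the iterate is pulled towards it and then fluctuates nearby.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of (4.12) with n = 5: per-observation losses
# C_i(theta) = a_i (theta_1 - u_{i,1})^2 + (theta_2 - u_{i,2})^2,
# where the minimizers u_i and constants a_i > 0 are made up here.
n = 5
u = rng.uniform(1.0, 4.0, size=(n, 2))   # minimizers u_1, ..., u_n
a = rng.uniform(0.5, 2.0, size=n)        # constants a_i > 0

def grad_Ci(theta, i):
    """Gradient of C_i at theta."""
    return np.array([2 * a[i] * (theta[0] - u[i, 0]),
                     2 * (theta[1] - u[i, 1])])

alpha = 0.02
theta = np.array([-2.0, -2.0])           # initial point far from the hull
for t in range(1000):
    i = rng.integers(n)                  # I_t uniform on {0, ..., n-1}
    theta = theta - alpha * grad_Ci(theta, i)

# After many iterations the iterate fluctuates near the minimizers,
# i.e. within (or very close to) the convex hull of u_1, ..., u_n.
print(theta)
```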
Figure 4.3: A hypothetical example where C(θ) is composed of individual C_i(θ) as in (4.12). Comparison of gradient descent (GD) and stochastic gradient descent (SGD). (a) A case where the minimizers u_1, . . . , u_n of C_i(θ) are different and are each marked by green dots in the plot. (b) A case where the minimizers u_1, . . . , u_n are the same.
Working with Mini-batches and the Concept of Epochs
On one extreme, computing the full gradient for basic gradient descent is computationally
challenging but the obtained gradients are noiseless. On the other extreme, stochastic
gradient descent reduces the computational complexity of each update but the updates are
noisy. One middle ground approach is to work with mini-batches. Almost any practical deep
learning training uses some variant of this approach.
A mini-batch is simply a small subset of the data points, of size n_b. Mini-batch sizes are typically quite small in comparison to the size of the dataset and are often chosen based on computational capabilities and constraints. In practice, a mini-batch size n_b is often selected so that the computation of gradients associated with n_b observations can fit in GPU memory. This generally makes computation more efficient in comparison to stochastic gradient descent, which cannot take full advantage of parallel computing or GPUs since the gradient evaluation for a single observation index I_t is not easy to parallelize.
When using mini-batches, for each iteration t, a mini-batch with indices I_t ⊂ {1, . . . , n} is used to approximate the gradient via,

    (1/n_b) Σ_{i∈I_t} ∇C_i(θ^(t)).        (4.13)
This estimated gradient is then used as part of gradient descent or one of its variants (such as ADAM which we present in the sequel); for example it is used in place of Step 3 of Algorithm ??. One common practical approach with mini-batches is to shuffle the training data a priori and then use the shuffled indices sequentially. With this process there are n/n_b mini-batches¹⁰ in the training dataset and each mini-batch is a distinct random subset of the indices. In such a case, when using some variant of gradient descent, every pass over the full dataset takes n/n_b iterations, with one iteration per mini-batch. Such a pass over all the data via n/n_b iterations is typically called an epoch. The training process then involves multiple epochs. It is quite common to diagnose and track the training process in terms of epochs. Note that one may also randomly shuffle the data (assign new mini-batches) after every epoch to reduce the correlation between the epochs and reduce bias in the optimization process.
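The shuffling and epoch structure just described can be sketched as follows; the sizes n = 1,000 and n_b = 10 are arbitrary, and the gradient step itself is elided.

```python
import numpy as np

rng = np.random.default_rng(1)

n, n_b, num_epochs = 1000, 10, 3         # hypothetical dataset and batch sizes
seen = []

for epoch in range(num_epochs):
    perm = rng.permutation(n)            # re-shuffle the indices every epoch
    for start in range(0, n, n_b):
        batch = perm[start:start + n_b]  # indices I_t of this mini-batch
        # ... here one would average grad C_i over i in batch, as in (4.13),
        # and take a descent step with the result ...
        seen.append(batch)

# Each epoch consists of n / n_b = 100 iterations, and together the
# mini-batches of one epoch visit every observation exactly once.
print(len(seen))   # 300
```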
Similarly to (4.11), when using the mini-batch approach the estimated gradient is unbiased in the sense that

    E[(1/n_b) Σ_{j=1}^{n_b} ∇C_{I_{t,j}}(θ)] = (1/n_b) Σ_{j=1}^{n_b} (1/n) Σ_{i=1}^{n} ∇C_i(θ) = ∇C(θ),   for a fixed θ.
Here I_{t,j} denotes the j-th element of I_t and the first equality holds because shuffling makes each I_{t,j} uniform over {1, 2, . . . , n}. To get a feel for the performance of learning with mini-batches, a mathematical simplification is to consider a variant where I_t = (I_{t,1}, . . . , I_{t,n_b}) is a collection of randomly selected indices on {1, . . . , n} with replacement.¹¹ As a consequence, each I_{t,j} is not only uniform over {1, 2, . . . , n}, but also independent of all the rest. This manner of computing gradients is an approximation to the above mentioned mini-batches with shuffling of the data after every epoch.
Importantly, in this setup, we notice that the noise of the gradients decreases as the mini-batch size n_b increases. To see this, recall that for each t, the index variables I_{t,1}, . . . , I_{t,n_b} are chosen independently. Thus, the variance of the gradient estimate is

    Var[(1/n_b) Σ_{j=1}^{n_b} ∇C_{I_{t,j}}(θ^(t))] = (1/n_b) Var[∇C_{I_{t,1}}(θ^(t))].
¹⁰We assume here that n_b divides n. If this is not the case then there are ⌈n/n_b⌉ mini-batches, with the last one having fewer than n_b samples.
¹¹Some authors refer to this as mini-batch gradient descent.
Since Var[∇C_{I_{t,1}}(θ^(t))] is the variance of the gradient for an arbitrary data sample, we see that mini-batch gradient descent has n_b times smaller variance than that of stochastic gradient descent.
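This 1/n_b variance reduction is easy to verify empirically. The sketch below uses made-up scalar "gradients" and draws mini-batch indices uniformly with replacement, as in the simplified variant above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy per-observation "gradients": n fixed scalar values of dC_i/dtheta.
n, reps = 50, 100_000
g = rng.normal(size=n)

# reps independent mini-batch gradient estimates for n_b = 1 and n_b = 16,
# with indices I_{t,1}, ..., I_{t,n_b} drawn with replacement.
est_1 = g[rng.integers(n, size=(reps, 1))].mean(axis=1)
est_16 = g[rng.integers(n, size=(reps, 16))].mean(axis=1)

# The n_b = 16 estimator should have about 16 times smaller variance.
ratio = est_1.var() / est_16.var()
print(ratio)
```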
Loss Minimization is a Proxy for Optimal Performance
Recall the discussion in Section ?? where we saw that the ultimate focus of learning is to optimize the expected performance on unseen data. This is typically quantified in terms of some error metric which we wish to minimize. In this section we simply refer to the estimated expected performance function as P(θ), keeping in mind that it is typically an average over the validation set. When training, it is often more tractable to minimize the loss C(θ); however, the true goal is minimization of P(θ).
In deep learning, the model choice and training process typically involve choosing a model with a loss function C(θ), such that loss minimization is computationally easy to carry out and yields low P(θ). In that sense, the loss function C(θ) is often much more amenable to derivative computations and gradient based learning, in comparison to trying to use the performance function P(θ) directly. Further, by tracking a performance function¹² on a validation dataset not used for training, we are often able to make sure that overtraining (also called overfitting) is avoided. This interplay of the loss function C(θ) and the performance function P(θ) implies that deep learning optimization is not traditional optimization where one seeks θ that truly optimizes the loss, but is rather a means of using optimization of C(θ) as a surrogate for good performance on P(θ).
We now explore this setting of deep learning optimization in two ways. In one direction we discuss a notion of stopping criteria that differs from the termination conditions presented at the end of Section 4.1. In another direction we highlight that the actual model choice affects the loss function C(θ), and some loss functions are more amenable to optimization in comparison to others.
First, in terms of stopping criteria, as already mentioned above, for training a deep learning model, standard practice is to use mini-batches and epochs. This means that some form of the general Algorithm 4.1 is executed where multiple iterations are grouped into an epoch. As training progresses (this is sometimes a process of hours, days, or more), we track the evolution of the loss,¹³ C(θ^(t)), as well as one or more performance functions P(θ^(t)), typically focusing on iterations t that are at the end of an epoch. In that sense, a common stopping criterion is to stop at the epoch where P(θ^(t)) is small when evaluated over the validation set.
A popular strategy for this approach is called early stopping. With this strategy, training often continues for multiple epochs beyond the epoch where P(θ^(t)) is minimized, since at that epoch it is still not evident that the minimum was reached. However, as the training loop progresses, the model parameters for the best epoch are stored so that once training is complete the "best model" in terms of P(θ) on the validation set can be used. A typical trajectory of the loss and performance metric is displayed in Figure 4.4. With such a typical trajectory, increasing the number of epochs of training is somewhat similar to the increase of model complexity discussed in Section ??; see Figure ?? of that section. Running the model beyond the minimum point of P(θ) is called over-training, similar to overfitting discussed in
¹²In practice we may often track multiple performance functions.
¹³In this context it is often called the training loss as it is the loss over the training set.
Section ??. The early stopping strategy is then a method that uses validation performance to try and avoid such over-training. Importantly, the actual minimum point, say θ*, of C(·) is not a desirable point for this purpose since it is typically a point of over-training and P(θ*) is sub-optimal in terms of the performance function.
Figure 4.4: Training on a small subset of the MNIST dataset. The validation error rate is tracked
together with the training loss. The early stopping strategy uses the model from epoch 76 where
the validation performance is best.
As further details for Figure 4.4, we note that it is based on a deep fully connected network (see Chapter ??) with 5 layers and hundreds of thousands of parameters. We train the model using only 1,000 MNIST images and use 500 additional images for the validation set. The performance function used on the validation set is one minus the accuracy, and as is evident, an accuracy of about 90% is obtained. Mini-batch sizes are taken at n_b = 10. As we can see, the model is trained for 200 epochs but retrospectively, the early stopping strategy pulls the parameters θ associated with epoch 76. Note that our chosen model is far from a realistic model which one would use for such a low-data scenario, yet we present this example here to illustrate the idea of early stopping.
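The bookkeeping behind early stopping can be sketched as follows. Here train_one_epoch and validation_error are hypothetical stand-ins for a real training pass and a real performance function P(θ); the toy versions at the bottom merely mimic a validation error that dips and then rises again.

```python
import copy

def early_stopping_loop(theta, train_one_epoch, validation_error,
                        num_epochs=200):
    """Train for num_epochs, but remember the best validation model."""
    best_theta, best_err, best_epoch = copy.deepcopy(theta), float("inf"), -1
    for epoch in range(num_epochs):
        theta = train_one_epoch(theta)       # one pass over all mini-batches
        err = validation_error(theta)        # P(theta) on the validation set
        if err < best_err:                   # store the best model so far
            best_theta = copy.deepcopy(theta)
            best_err, best_epoch = err, epoch
    return best_theta, best_epoch            # retrospectively "best" model

# Toy stand-ins: "training" just counts epochs, and the hypothetical
# validation error is U-shaped with its minimum reached at epoch 76.
step = lambda th: th + 1
dip = lambda th: (th - 77) ** 2
best, when = early_stopping_loop(0, step, dip)
print(when)   # 76
```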
The second way in which deep learning optimization differs from general optimization is that
the model’s loss function itself is often specifically engineered for efficient optimization. In
classical applications of optimization appearing in applied mathematics, operations research,
and other fields, one often needs to optimize a given objective without the ability to modify
the objective. In contrast, deep learning models are versatile enough so that the actual
model, and subsequently the loss function, can be chosen with the goal of facilitating efficient
optimization. Discussions of this nature appear in the sequel where for example the choice of
activation functions within a deep neural network significantly affects the nature of learning;
see for example the discussion of vanishing gradients in Chapter ??.
On this matter, we have already seen in Figure ?? of Section ?? how the cross entropy loss for logistic regression presents a much better loss landscape for optimization in comparison with the squared loss. In the case of logistic regression, the cross entropy loss even implies having a convex loss function, whereas that property is lost with the squared loss function. Further, in more complex models using cross entropy is also beneficial in terms of the loss landscape even if it does not guarantee convexity per se. We now present one such illustrative example.
Figure 4.5: The loss landscape of a very simple two-layer neural network. (a) Using the squared loss C_i(θ) = (y^(i) − ŷ^(i))². (b) Using the binary cross entropy loss C_i(θ) = CE_binary(y^(i), ŷ^(i)).
Similarly to the synthetic data used for the logistic regression example yielding Figure ??, let us use the same synthetic data on the model,

    f(x) = σ_Sig(w_2 σ_Sig(w_1 x)) = 1 / (1 + e^{−w_2 / (1 + e^{−w_1 x})}).
In the context of deep neural networks presented in Chapter ??, this is a two-layer network without bias terms, a single neuron per layer, and sigmoid activation functions. It is not a realistic model one would use in practice, yet for our expository purposes it has the benefit of having only two parameters, and hence we may visualize the loss surface as a function of w_1 and w_2.
Interestingly, as we see in Figure 4.5, using the cross entropy loss presents a much smoother loss landscape in comparison to the squared error loss. Hence using gradient based optimization with cross entropy would be much more efficient and would make the algorithm less likely to get stuck in local minima or saddle points. Further, in much more realistic models with thousands or millions of parameters, while not possible to fully visualize, using cross entropy often yields more sensible loss landscapes as well. Hence in summary we see that when handling optimization in the context of deep learning, there are often other considerations at play, beyond standard optimization.
4.3 Adaptive Optimization with ADAM
Essentially, all deep learning optimization techniques can be cast as some form of the descent
direction method outlined in Algorithm 4.1. Basic gradient descent (Algorithm
??
) is one
specific form of this descent direction method, and as we outlined in the previous section, it
suffers from multiple drawbacks. While some of these drawbacks are alleviated when using
noisy gradients via stochastic gradient descent or mini-batches, there is still much more
room for improvement.
In this section we present one of the most popular improvements over basic gradient descent, the ADAM algorithm, which is short for adaptive moment estimation. This algorithm has become the default practical choice for training deep learning models. Like basic gradient descent, ADAM is considered a first-order method since it uses gradients (first derivatives) to determine the descent step used in Algorithm 4.1. For this, ADAM combines the ideas of momentum and adaptive learning rates per coordinate into a simple mechanism for computing the descent step. Hence to understand ADAM, we study these two concepts, keeping in mind that the associated algorithms can also be viewed as independent methods in their own right. Other variations of these concepts, together with additional first-order techniques that are not employed in ADAM, are further summarized in Section 4.5.
Adaptive Optimization and Exponential Smoothing
A basic theme of ADAM and many other first-order methods has to do with adapting to the local nature of the loss landscape as iterations progress. For simple d = 2 examples as visualized in Figure 4.2 and Figure 4.5, we can quite easily get a taste for the nature of the loss landscape, and it is generally not difficult to choose descent steps that yield good overall performance. However, for practical examples with larger values of d, at any iteration t of the general Algorithm 4.1, the only information at our disposal is the history of θ^(0), . . . , θ^(t−1) and the associated gradients. It thus makes sense to try and "adapt" the descent step based on this history, perhaps with more emphasis on iterations from the near-history. One may loosely call such an approach adaptive optimization.
One simple principle used in adaptive optimization is exponential smoothing. Beyond optimization, exponential smoothing is popular in time series analysis and many other branches of data science and applied mathematics. In general, exponential smoothing operates on a sequence of vectors u^(0), u^(1), u^(2), . . . . In the context of this section, the sequence may be a sequence of gradients or a sequence of the squares of gradients.
The exponentially smoothed sequence is denoted ū^(0), ū^(1), . . . and is computed via,

    ū^(t+1) = β ū^(t) + (1 − β) u^(t),   with ū^(0) = 0,        (4.14)
and β ∈ [0, 1). Thus, for t ≥ 0, each smoothed vector ū^(t+1) is a convex combination of the previous smoothed vector and the most recent (non-smoothed) vector u^(t). When β = 0, no smoothing takes place,¹⁴ and when β is a high value near 1, the new vector u^(t) only has a small effect on the new smoothed vector ū^(t+1).
It is simple to use the update formula (4.14) to represent the smoothed vector ū^(t) as a convex combination of the complete history. Specifically, by iterating (4.14) we have,

    ū^(t+1) = (1 − β) Σ_{τ=0}^{t} β^{t−τ} u^(τ),   for t = 0, 1, 2, . . . ,        (4.15)
and thus each of u^(0), . . . , u^(t) contributes to ū^(t+1), with exponentially decaying weights. As we see now, within the context of adaptive optimization, such straightforward exponential smoothing can go a long way.
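As a quick illustration, the recursion (4.14) and its closed form (4.15) can be checked against each other on an arbitrary made-up sequence of vectors:

```python
import numpy as np

def exp_smooth(us, beta):
    """Return the smoothed vectors u_bar^(1), u_bar^(2), ... from (4.14)."""
    u_bar = np.zeros_like(us[0])          # initial condition u_bar^(0) = 0
    out = []
    for u in us:
        u_bar = beta * u_bar + (1 - beta) * u
        out.append(u_bar)
    return out

rng = np.random.default_rng(3)
us = [rng.normal(size=2) for _ in range(10)]   # made-up input vectors
beta = 0.9
smoothed = exp_smooth(us, beta)

# Closed form (4.15): u_bar^(t+1) = (1 - beta) sum_tau beta^(t - tau) u^(tau)
t = 9
direct = (1 - beta) * sum(beta ** (t - tau) * us[tau] for tau in range(t + 1))
print(np.allclose(smoothed[t], direct))   # True
```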
¹⁴Note that with our convention of indices, when β = 0 the exponentially smoothed sequence is actually a time shift of the original sequence.
Momentum
Recall that the descent step of basic gradient descent is
α
times the negative gradient. This
implies that in flat regions of the loss landscape, gradient descent essentially stops. Further,
at saddle points or local minima, gradient descent stops completely. Such drawbacks may be
alleviated with the use of momentum.
To get an intuitive feel for the use of momentum in optimization, consider an analogy of rolling a ball downhill on the loss landscape. It gains momentum as it rolls downhill, becoming increasingly faster until it reaches the bottom of a valley. Clearly, the ball keeps memory of the past forces in the form of acceleration. This motivates us to use exponential smoothing on the gradients to obtain an extra step called the momentum update. It is then used as follows,
    v^(t+1) = β v^(t) + (1 − β) ∇C(θ^(t))   (Momentum Update),        (4.16)
    θ^(t+1) = θ^(t) − α v^(t+1)   (Parameter Update),        (4.17)

for scalar momentum parameter β ∈ [0, 1) and learning rate α > 0, starting with v^(0) = 0.
Similar to the general exponential smoothing of (4.14), the momentum update¹⁵ is an exponential smoothing of the gradient vectors. When β = 0, we have the basic gradient descent method, and for larger β, information from previous gradients plays a more significant role. In practice, for deep neural networks, the gradient ∇C(θ^(t)) is replaced with a noisy gradient based on stochastic gradient descent or mini-batches.
The vector v^(t) in (4.16) is called the momentum¹⁶ at the t-th update. Using (4.15), we have

    v^(t+1) = (1 − β) Σ_{τ=0}^{t} β^{t−τ} ∇C(θ^(τ)),   for t = 0, 1, 2, . . . .        (4.18)
That is, the momentum accumulates all the past (weighted) gradients, providing acceleration to the parameter θ updates on downward surfaces. Therefore, for β > 0, the next step taken via (4.17) is not necessarily in the steepest descent direction; instead the direction is dictated by the (exponentially smoothed) average of all the gradients up to the current iteration. Figure 4.6 compares the performance of the gradient descent method with momentum for different values of β on the Rosenbrock function (4.9). We observe that for large β, the momentum method accelerates as it takes downward steps.
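A minimal sketch of the momentum updates (4.16)-(4.17) is given below, mirroring the learning rate scaling α = 0.001/(1 − β) used in Figure 4.6. It assumes the standard form (1 − θ_1)² + 100(θ_2 − θ_1²)² for the Rosenbrock function (4.9); the initial point and iteration count are our own choices.

```python
import numpy as np

def rosenbrock_grad(th):
    """Gradient of (1 - th1)^2 + 100 (th2 - th1^2)^2."""
    t1, t2 = th
    return np.array([-2 * (1 - t1) - 400 * t1 * (t2 - t1 ** 2),
                     200 * (t2 - t1 ** 2)])

def momentum_descent(beta, iters=10_000):
    alpha = 0.001 / (1 - beta)           # scaling as in Figure 4.6
    theta = np.array([-0.2, 0.8])        # arbitrary initial point
    v = np.zeros(2)                      # v^(0) = 0
    for _ in range(iters):
        v = beta * v + (1 - beta) * rosenbrock_grad(theta)   # (4.16)
        theta = theta - alpha * v                            # (4.17)
    return theta

# With beta = 0.9 the iterate typically ends up much closer to the
# optimum (1, 1) than plain gradient descent (beta = 0) does.
print(momentum_descent(0.0), momentum_descent(0.9))
```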
Adaptive Learning Rates per Coordinate
Observe that the updates of the gradient descent method, with or without momentum, use the same learning rate α for all components of θ. However, it is often better to learn the parameters θ_1, . . . , θ_d with potentially a specific learning rate for each coordinate θ_i.
¹⁵One may alternatively parameterize the momentum update (4.16) as v^(t+1) = β v^(t) + ∇C(θ^(t)). This alternative parameterization has the benefit of allowing one to adjust the momentum parameter β without having significant effects on the step size exhibited in (4.17). Our choice of the parameterization as in (4.16) is for consistency with ADAM presented below.
¹⁶In physics, momentum is defined as the product of the mass of an object and its velocity vector. The object's momentum points in the direction of the object's movement. In contrast, here the momentum vector v^(t) points in the direction opposite to the step taken.
Figure 4.6: Application of momentum to the Rosenbrock function (4.9) for three different values of β with learning rate α = 0.001/(1 − β) and a fixed number of iterations. Note that β = 0 is basic gradient descent.
To get a feel for this issue, let us assume a situation where specific parameters are associated with specific features in the data.¹⁷ For instance, in applications of deep neural networks for natural language processing (see also Chapter ??), there are typically words that are less frequent than others. For example, in most texts on trees the word chaulmoogra most likely appears less frequently than the word coconut, even though both are names of tree species. Now assume that there are associated parameters θ_chaulmoogra and θ_coconut, and consider how these would be updated during learning.
What is likely to occur is that each mini-batch would have on average far fewer occurrences of chaulmoogra in comparison to coconut. However, if we update all the parameters at the same learning rate, then θ_coconut would be updated more than θ_chaulmoogra. Hence with a fixed learning rate α, the θ_coconut parameter is likely to converge quickly, while the θ_chaulmoogra parameter will be slow to respond.
A general approach for overcoming this issue is to scale each coordinate of the descent step with a different factor. That is, instead of considering a basic gradient descent step, θ_s = −α∇C(θ), consider descent steps of the form,

    θ_s = −α r ⊙ ∇C(θ),        (4.19)
¹⁷In some simple models such as linear regression, logistic regression, or multinomial regression this is exactly the case for categorical one-hot encoded variables. In more advanced models there is a more complicated implicit mapping between parameters and features.
where r is some vector of positive entries which recalibrates the descent steps and ⊙ is the element-wise product of two vectors.¹⁸ That is, for a parameter such as θ_chaulmoogra we would expect the coordinate r_chaulmoogra to be high in comparison to the coordinate r_coconut which is associated with θ_coconut. Such recalibration of the descent steps would result in large steps for θ_chaulmoogra whenever chaulmoogra appears in a mini-batch, and smaller steps for θ_coconut.
However, in general, finding a good fixed vector r for (4.19) is difficult, especially when the number of parameters is huge and the actual relationship between parameters and features is not clear. Instead, we use adaptive methods that adjust the descent step during the optimization process.
We now introduce two such approaches called adaptive subgradient (or Adagrad) and root mean square propagation (or RMSprop). In both of these approaches the update of coordinate i in iteration t can be represented as,

    θ_i^(t+1) = θ_i^(t) − (α / √(s_i^(t+1) + ε)) ∂C(θ^(t))/∂θ_i,        (4.20)
where s_i^(0), s_i^(1), s_i^(2), . . . is a sequence of non-negative values that are updated adaptively and ε is taken to be a small value of order 1 × 10⁻⁸ to avoid division by zero. That is, with this representation, the descent step at time t represented in terms of (4.19) has r_i = (√(s_i^(t+1) + ε))⁻¹. Generally we aim for s_i to be some smoothed representation of the square of the derivative ∂C(θ^(t))/∂θ_i. Hence, r_i is roughly the inverse of the magnitude of the derivative, and r_i is low when the changes for coordinate i are steep and vice versa.
Vector-wise, the parameter update (4.20) can be represented as

    θ^(t+1) = θ^(t) − (α / √(s^(t+1) + ε)) ⊙ ∇C(θ^(t)),        (4.21)

where s^(t) = (s_1^(t), . . . , s_d^(t)), ε is considered a vector, and the addition, division, and square-root operations are all element-wise operations.
Using the representation (4.20) or (4.21), let us now define how the sequence {s_i^(t)} is computed both for Adagrad and RMSprop. Specifically,

    s_i^(t+1) = Σ_{τ=0}^{t} (∂C(θ^(τ))/∂θ_i)²,                 for Adagrad,
    s_i^(t+1) = γ s_i^(t) + (1 − γ) (∂C(θ^(t))/∂θ_i)²,         for RMSprop,        (4.22)
where the recursive relationship for RMSprop has the initial value s_i^(0) = 0. RMSprop is also parameterized by a decay parameter γ ∈ [0, 1), with typical values at γ = 0.999, implying that for RMSprop the sequence {s_i^(t)} is a relatively slow exponential smoothing of the square of the derivative.
¹⁸The element-wise product is also called the Hadamard product or Schur product.
Adagrad may appear simpler than RMSprop since it does not involve any recursive step or any hyper-parameter like γ; it is simply an accumulation of the squares of the derivatives. However, the crucial drawback of Adagrad is that for each i, the sequence {s_i^(t)} is non-decreasing. As a consequence, the effective learning rate decreases during training, often making it infinitesimally small before convergence. RMSprop was introduced in the context of deep learning after Adagrad, and overcomes this problem via exponential smoothing of the squares of the partial derivatives.
It is also useful to use (4.15) to see that for RMSprop, the explicit (all-time) representation of the vector s^(t+1) is,

    s^(t+1) = (1 − γ) Σ_{τ=0}^{t} γ^{t−τ} ∇C(θ^(τ)) ⊙ ∇C(θ^(τ)),        (4.23)
while for Adagrad a similar representation holds without the (1 − γ) and γ^{t−τ} factors in the formula.
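The contrast between the two accumulators in (4.22) can be seen on a made-up stream of scalar derivatives: Adagrad's s_i only grows, shrinking the effective learning rate α/√(s_i + ε) without bound, while RMSprop's s_i tracks a moving average of recent squared derivatives.

```python
import numpy as np

rng = np.random.default_rng(4)
grads = rng.normal(size=1000)        # made-up stream of dC/dtheta_i values

gamma = 0.9                          # RMSprop decay (illustrative value)
s_adagrad, s_rmsprop = 0.0, 0.0
for g in grads:
    s_adagrad += g ** 2                                    # Adagrad: running sum
    s_rmsprop = gamma * s_rmsprop + (1 - gamma) * g ** 2   # RMSprop: smoothing

# Adagrad's accumulator grows with t (roughly t * E[g^2] here), while
# RMSprop's stays on the scale of E[g^2], which is about 1 here.
print(s_adagrad, s_rmsprop)
```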
Bias Correction for Exponential Smoothing
We have seen that both momentum and RMSprop involve exponential smoothing of the gradients and squares of the gradients, respectively. In both of these cases, we set zero initial conditions at iteration t = 0. Specifically, for momentum we have v^(0) = 0 and for RMSprop we have s^(0) = 0. In the absence of any better initial conditions this is a sensible choice, yet such initial conditions may introduce some initial (short-term) bias. This bias would eventually "wash away" as t grows, but it can play a significant role for short term t. The effect of such bias is especially pronounced when the respective exponential smoothing parameters β or γ are close to 1.
To mitigate such temporal bias, a common technique is to use bias correction. To see this, first observe from (4.14) that since the initial value ū^(0) = 0, the first element of the exponential smoothing is

    ū^(1) = (1 − β) u^(0).

Since β is usually set near 1, ū^(1) is close to zero even when u^(0) is not close to 0. Consequently, ū^(2) stays close to zero even if u^(1) is away from 0, because

    ū^(2) = β ū^(1) + (1 − β) u^(1).

Thus, initial values of the exponential smoothing sequence ū^(0), ū^(1), . . . remain close to 0 even when the elements of the original sequence u^(0), u^(1), . . . are far from 0.
One way to handle such bias is to consider the special case where the input vectors in (4.15) are fixed at u^(t) = u. In such a case, via simple evaluation of a geometric sum we have,

    ū^(t+1) = (1 − β) Σ_{τ=0}^{t} β^{t−τ} u = (1 − β) Σ_{τ=0}^{t} β^{τ} u = (1 − β^{t+1}) u.
Hence if we were to divide the exponentially smoothed value ū^(t) by 1 − β^t, we would get constant vectors when u^(t) = u (constant). Further, in (more realistic) cases where the underlying input sequence u^(0), u^(1), u^(2), . . . is not constant but close to a constant, the bias corrected sequence {ū^(t)/(1 − β^t)} may still do a better job at mitigating initial effects. Clearly as t grows, β^t → 0, and the effect of this bias correction disappears.
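The effect of dividing by 1 − β^t is easy to see on a constant input sequence u^(t) = u, as in the derivation above:

```python
# Smoothing the constant sequence u^(t) = u = 3 with beta = 0.9: the raw
# smoothed values start near (1 - beta) u = 0.3, while the bias corrected
# values u_bar^(t) / (1 - beta^t) recover u exactly at every step.
beta, u = 0.9, 3.0
u_bar = 0.0                                     # initial condition (4.14)
raw, corrected = [], []
for t in range(5):
    u_bar = beta * u_bar + (1 - beta) * u       # smoothing recursion (4.14)
    raw.append(u_bar)
    corrected.append(u_bar / (1 - beta ** (t + 1)))   # bias correction

print(raw[0], corrected)   # 0.3 (far from u), then exactly 3.0 each time
```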
Now we may apply such bias correction to momentum or RMSprop, where we have the specific forms (4.18) or (4.23), respectively. In these cases, the bias corrected updates for v and s are,

    v̂^(t+1) = v^(t+1) / (1 − β^{t+1})   and   ŝ^(t+1) = s^(t+1) / (1 − γ^{t+1}),   t = 0, 1, 2, . . . .        (4.24)
Putting the Pieces Together: ADAM
Now that we understand ideas of momentum, RMSprop, and bias correction, we can piece
these together into a single algorithm, namely the adaptive moment estimation method, or
simply ADAM. Metaphorically, if the execution of the momentum method is a ball rolling
down a slope, the execution of ADAM can be seen as a heavy ball with friction rolling down
the slope.
The key update formula for ADAM is

    θ^(t+1) = θ^(t) − α (1 / √(ŝ^(t+1) + ε)) ⊙ v̂^(t+1),        (4.25)

where all vector operations (division, square root, and addition of ε) are interpreted element-wise. Here v̂^(t+1) and ŝ^(t+1) are bias corrected exponential smoothings of the gradient and the squared gradients as given in (4.24).
Algorithm 4.2: ADAM
Input: Dataset D = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}, objective function C(·) = C(·; D), and initial parameter vector θ_init
Output: Approximately optimal θ
1:  t ← 0 (Initialize iteration counter)
2:  θ ← θ_init, v ← 0, and s ← 0 (Initialize state vectors)
3:  repeat
4:      g ← ∇C(θ) (Compute gradient)
5:      v ← β v + (1 − β) g (Momentum update)
6:      s ← γ s + (1 − γ) (g ⊙ g) (Second moment update)
7:      v̂ ← v / (1 − β^{t+1}) (Bias correction)
8:      ŝ ← s / (1 − γ^{t+1}) (Bias correction)
9:      θ ← θ − α (1/√(ŝ + ε)) ⊙ v̂ (Update parameters)
10:     t ← t + 1
11: until termination condition is satisfied
12: return θ
With ADAM, α is still called the learning rate and is still the most important parameter which one needs to tune. The other parameters are β ∈ [0, 1) for the momentum exponential smoothing as used in (4.16) and γ ∈ [0, 1) for the RMSprop exponential smoothing as used in (4.22). The common defaults for these parameters are β = 0.9 and γ = 0.999.
ADAM is presented in Algorithm 4.2 where Step 4 is the gradient computation, steps 5 and
6 are exponential smoothing (momentum and RMSprop), steps 7 and 8 are bias corrections,
and finally Step 9 is the descent step.
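A direct transcription of Algorithm 4.2 (in its full-gradient form) is sketched below; the quadratic test objective and its target vector are made up for illustration.

```python
import numpy as np

def adam(grad, theta, alpha=0.01, beta=0.9, gamma=0.999, eps=1e-8,
         iters=5000):
    """Algorithm 4.2 with a fixed iteration count as termination condition."""
    v = np.zeros_like(theta)             # momentum state, v^(0) = 0
    s = np.zeros_like(theta)             # second-moment state, s^(0) = 0
    for t in range(iters):
        g = grad(theta)                              # Step 4: gradient
        v = beta * v + (1 - beta) * g                # Step 5: momentum
        s = gamma * s + (1 - gamma) * g * g          # Step 6: second moment
        v_hat = v / (1 - beta ** (t + 1))            # Step 7: bias correction
        s_hat = s / (1 - gamma ** (t + 1))           # Step 8: bias correction
        theta = theta - alpha * v_hat / np.sqrt(s_hat + eps)   # Step 9
    return theta

# Made-up convex test problem: C(theta) = ||theta - target||^2.
target = np.array([2.0, -1.0, 0.5])
grad = lambda th: 2 * (th - target)
theta = adam(grad, np.zeros(3))
print(theta)   # close to [2.0, -1.0, 0.5]
```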
Since the introduction of ADAM in 2014, this algorithm has become the most widely used algorithm (or "optimizer") in deep learning frameworks. In the next section we start to focus on what is typically the most costly computation within the algorithm, namely Step 4, where we evaluate the gradient. Further generalizations of ADAM and other variations of first-order methods are presented in Section 4.5.
4.4 Automatic Differentiation
The computation of gradient vectors is a crucial step in the ADAM algorithm or in any
other first-order optimization method. Practically, in deep learning, the most common
way for computing such gradients is called automatic differentiation and is embodied in
the backpropagation algorithm. In this section we introduce general concepts of automatic
differentiation and then in Section
??
we specialize to the backpropagation algorithm for
deep neural networks.
All popular methods for computing derivatives, gradients, Jacobians, and Hessians follow one of two approaches. In one approach, the algebraic expressions of the derivatives are computed first and then the derivatives for a particular given point are evaluated. For this approach the expressions of the derivatives are either derived manually, which is generally a tedious process, or using computer based symbolic differentiation methods. For example, we have seen explicit gradient expressions in (??) for linear regression and (??) for logistic regression. However, for general deep learning models, explicit expressions do not have such a compact form and instead suffer from the problem of expression swell. Thus an alternative approach is to compute the numerical values of the derivatives at a given point directly. This approach includes standard methods of numerical differentiation as well as automatic differentiation, which is our focus here.
In this section, we first present an overview of numerical and symbolic differentiation. We
then outline key ideas of differentiable programming which is a programming paradigm
that encompasses automatic differentiation. We then present forward mode automatic
differentiation which is a stepping stone towards understanding backward mode automatic
differentiation. Finally, we present backward mode automatic differentiation.
Numerical and Symbolic Differentiation
Suppose we would like to compute the gradient of the objective function C(θ) at a given
parameter vector θ. Numerical differentiation methods use the definition of the partial
derivative, see (??) in Section ??, to compute the gradient ∇C(θ) approximately. In particular,
the most basic form of numerical differentiation approximates the partial derivative via,

∂C(θ)/∂θ_i ≈ ( C(θ + h e_i) − C(θ) ) / h,    (4.26)
for a small constant h > 0, where e_i is the i-th unit vector. Thus to obtain a numerical
estimate of ∇C(θ) we need to evaluate C(·) at θ, and at θ + h e_i for i = 1, …, d, and each
time use (4.26).
A classic problem with numerical differentiation is selection of the constant h. While
mathematically, smaller h provides a better approximation, numerically, small h yields
numerical instability due to round-off errors.[19] Further, in the context of deep learning
where θ is of very high dimension, the major drawback of numerical differentiation is that it
requires us to perform an order of d function evaluations to compute the gradient at a point.
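To make this concrete, here is a minimal plain-Python sketch of forward-difference gradient estimation in the spirit of (4.26); the quadratic test function and the choice h = 1e-6 are our own illustrative assumptions, not from the text.

```python
def numerical_gradient(C, theta, h=1e-6):
    """Approximate the gradient of C at theta via forward differences (4.26).

    Requires d + 1 evaluations of C for a d-dimensional theta."""
    base = C(theta)
    grad = []
    for i in range(len(theta)):
        shifted = list(theta)
        shifted[i] += h            # theta + h * e_i
        grad.append((C(shifted) - base) / h)
    return grad

# Illustrative objective C(theta) = theta_1^2 + 3*theta_2, exact gradient (2*theta_1, 3).
C = lambda th: th[0] ** 2 + 3 * th[1]
g = numerical_gradient(C, [1.0, 2.0])  # approximately [2.0, 3.0]
```

In double precision, shrinking h much below about 1e-8 typically makes such estimates worse, not better, illustrating the round-off issue mentioned above.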
The basic alternative to numerical differentiation is symbolic differentiation. With this
paradigm, instead of directly obtaining numerical values of derivatives, we represent ex-
pressions using computer algebra systems and obtain mathematical expressions for the
derivatives using automated algorithms based on the rules of calculus. At its core, symbolic
differentiation is a very useful tool. However, with deep learning, trying to rely on symbolic
differentiation solely is not practical. A key problem is expression swell where the exact
mathematical representation of partial derivatives of loss functions associated with deep
neural networks may often require excessive resources due to the complexity of the resulting
expressions.
Since in deep learning we are mainly concerned with numerical values of the derivatives
but not with their analytical expressions, the application of symbolic differentiation can
be unwieldy. However, instead of trying to find the expression of the objective function
directly, if it is decomposed into several elementary operations, then we can numerically
compute the gradients of the objective function by obtaining symbolic derivative expressions
of these elementary operations while keeping track of the intermediate numerical values in a
sequential manner. This idea of interleaving between symbolic expressions and numerical
evaluations at elementary levels is the basis of automatic differentiation.
Overview of Differentiable Programming
Let us now introduce basic terminology of automatic differentiation within the world of
differentiable programming. While for deep learning the central application is associated
with the multi-input scalar-output loss C : R^d → R, to get a general feel for differentiable
programming and automatic differentiation, let us first consider general functions,
g : R^d → R^m. That is, when m > 1, the outputs are vector valued.
Differentiable programming refers to a programming paradigm where numerical computer
programs, or code for computer functions, are transformed into other functions which
execute the derivatives, gradients, Jacobians, Hessians, or higher-order derivatives of the
original computer functions. For example, consider a case with d = 3 and m = 2 and some
hypothetical computer programming language with some programmed function g() which
appears in code as follows:
[19] There are other similar schemes that generally exhibit less numerical error than (4.26),
yet any numerical differentiation scheme requires tuning of h.
g(t1, t2, t3) = (t1 - 7*t2 + 5*t3, t2*t3)
We interpret this code as a function which operates on three input numbers θ_1, θ_2, and θ_3
and then returns two numbers of which the first is a linear combination of the three inputs,
θ_1 − 7θ_2 + 5θ_3, and the second is a product of two of the inputs, θ_2θ_3. From a programming
perspective it is a pure function in the sense that it does not depend on any other variables
and does not change any other variables in the system. Thus, the computer function g()
implements a mathematical function g : R^3 → R^2 with,
g(θ_1, θ_2, θ_3) = ( g_1(θ_1, θ_2, θ_3), g_2(θ_1, θ_2, θ_3) ),

where g_1 : R^3 → R and g_2 : R^3 → R.
With differentiable programming and automatic differentiation, we are interested in evaluation
of partial derivatives such as ∂g_i/∂θ_j for i = 1, 2 and j = 1, 2, 3, at specific values of θ_1,
θ_2, and θ_3. For this simple example these derivatives are obviously known analytically and
constitute the Jacobian matrix of g; see (??) in Appendix ??. In our case, the Jacobian is,
J_g(θ_1, θ_2, θ_3) = [ 1    −7    5
                       0    θ_3   θ_2 ],

where the i, j element of the Jacobian J_g is ∂g_i/∂θ_j.
In its most basic form, a differentiable programming system provides constructs such as
partial_derivative(), which gives us the ability to write code such as the following:
partial_derivative(target=g, t=[2.5, 2.1, 1.4], i=2, j=3)
The meaning of such code is a request for the computer to evaluate the numerical value of the
Jacobian of the target function g at θ = (2.5, 2.1, 1.4) for i = 2 and j = 3. That is, we expect
such code to compute ∂g_2/∂θ_3 and return 2.1 as a result in this case. In contrast to symbolic
or numerical differentiation, this computation is to be carried out through mechanisms of
automatic differentiation.
In addition to such general partial derivatives via partial_derivative(), we may also
expect differentiable programming frameworks to expose software constructs such as
gradient() for computing gradient vectors of functions g : R^d → R, jacobian() for
computing the Jacobian matrix of functions g : R^d → R^m, and hessian() for Hessian
matrices of functions g : R^d → R. In any case, the mechanism of computation of these
operations typically involves a transformation of the source code of the original functions,
or of some data structures that represent those functions, into new code or data structures
that encode how derivative evaluation is carried out. See the notes and references at the
end of the chapter for references to contemporary concrete software frameworks that
execute such operations.
Beyond partial_derivative(), gradient(), jacobian(), and hessian(), there are also
two other basic constructs in a differentiable programming framework. These are Jacobian
vector products, jvp(), and vector Jacobian products, vjp(). A Jacobian vector product is
based on θ ∈ R^d and v ∈ R^d and is the evaluation of J_g(θ)v, which results in a vector
in R^m. Similarly a vector Jacobian product is based on θ ∈ R^d and u ∈ R^m and is the
evaluation of uᵀ J_g(θ), which results in a row vector whose transpose is in R^d. Due to
the way in which automatic differentiation operates, jvp() and vjp() are in fact the basic
operations that one may expect from a differentiable programming framework, whereas the
other operations, partial_derivative(), etc., are all built on top of either jvp() or vjp().
Automatic differentiation generally works in two types of modes, namely forward mode,
which is the algorithm behind jvp(), and backward mode, also known as reverse mode,
which is the algorithm behind vjp(). In both cases the multivariate chain rule plays a key
role; see Section ?? in the appendix for a review of the chain rule. The general principle of
automatic differentiation is based on computing intermediate partial derivatives and
incorporating these computed values in an iterative computation. Forward mode automatic
differentiation is a straightforward application of the multivariable chain rule while backward
mode is slightly more involved. Note that the backpropagation algorithm for deep learning
is a special case of backward mode. The subsections below focus on the inner workings of
forward mode and backward mode automatic differentiation, but first let us see how jvp()
and vjp() are generally used in differentiable programming.
As stated above, the basic service provided by forward mode automatic differentiation is the
computation of Jacobian vector products. For instance, returning to the simple d = 3, m = 2
example above, we may invoke forward mode automatic differentiation via code such as,
jvp(target=g, t=[2.5, 2.1, 1.4], v=[0,1,0])
Here we choose θ = (2.5, 2.1, 1.4) as before and v to be the unit vector e_2. The result in
this case is [-7, 1.4]. Setting v as a unit vector[20] can be useful because in general when
using v = e_j, the output J_g(θ)e_j is the j-th column of the Jacobian matrix. Hence in this
example of jvp(), we obtain as output the effect of θ_2 on the two outputs g_1(θ) and g_2(θ),
which is the vector (∂g_1/∂θ_2, ∂g_2/∂θ_2) = (−7, 1.4).
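For this specific g, the Jacobian vector product can be written out by hand. The following plain-Python sketch (our own illustration, not a real framework API) mimics the jvp() call above.

```python
def jvp_g(theta, v):
    """Jacobian vector product J_g(theta) v for the example
    g(t1, t2, t3) = (t1 - 7*t2 + 5*t3, t2*t3)."""
    t1, t2, t3 = theta
    v1, v2, v3 = v
    # The rows of the Jacobian are (1, -7, 5) and (0, t3, t2).
    return [1 * v1 - 7 * v2 + 5 * v3, t3 * v2 + t2 * v3]

out = jvp_g([2.5, 2.1, 1.4], [0, 1, 0])  # v = e_2: second column of the Jacobian
```

Here the call with v = [0, 1, 0] returns [-7.0, 1.4], matching the jvp() output discussed above.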
This shows that to implement partial_derivative() for some ∂g_i/∂θ_j at θ, we can compute
J_g(θ)e_j (the j-th column of the Jacobian matrix) using jvp() and take the i-th index of
the output. Specifically a single application of jvp() on v = e_j provides us with the partial
derivatives ∂g_i/∂θ_j for all i = 1, …, m. Hence Jacobian vector products provided by forward
mode automatic differentiation are useful when the output dimension m is large and we wish
to only see the effect of a single, or a few, input variables θ_j on the output.
However, note that in the deep learning loss function application, where m = 1 and d
is large, Jacobian vector products are not an efficient means for evaluating the gradient.
Specifically, when we wish to evaluate gradient() using jvp() we need to invoke jvp() d
separate times, each time with v = e_j for j = 1, …, d. Thus using forward mode automatic
differentiation (Jacobian vector products) is not efficient for deep learning loss landscape
gradient evaluation.
The alternative to jvp() is to use vector Jacobian products as exposed via backward mode
automatic differentiation. With this approach we may invoke code such as,
vjp(target=g, t=[2.5, 2.1, 1.4], u=[1,0])
In this example we use the same θ as above and set u = e_1 ∈ R^2. Since the vector Jacobian
product evaluates uᵀ J_g(θ), the output is a row vector of dimension d = 3 which is the first
row of J_g(θ), namely ∇g_1(θ)ᵀ. More generally, applying vector Jacobian products on
u = e_i yields the transposed gradient ∇g_i(θ)ᵀ.
In the deep learning loss minimization scenario d is very large and m = 1. Thus the Jacobian
is simply the transpose of the gradient and is a long row vector. In such a case, computing
[20] We can also use more general (non unit) vectors and obtain directional derivatives, yet
in the remainder of our discussion the inputs to Jacobian vector products (v) and vector
Jacobian products (u) are unit vectors.
the vector Jacobian product with u = 1 (scalar) yields the gradient of the loss function
directly. Thus, since backward mode automatic differentiation implements vjp(), using it on
the loss function is directly suited for deep learning.
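Analogously to the jvp() sketch earlier, a hand-written vector Jacobian product for the same example g can be written as follows (again our own plain-Python illustration, not a framework implementation).

```python
def vjp_g(theta, u):
    """Vector Jacobian product u^T J_g(theta) for the example
    g(t1, t2, t3) = (t1 - 7*t2 + 5*t3, t2*t3)."""
    t1, t2, t3 = theta
    u1, u2 = u
    # Combine the Jacobian rows (1, -7, 5) and (0, t3, t2) with weights u1, u2.
    return [u1 * 1 + u2 * 0, u1 * (-7) + u2 * t3, u1 * 5 + u2 * t2]

row = vjp_g([2.5, 2.1, 1.4], [1, 0])  # u = e_1: first row of the Jacobian
```

Here the call with u = [1, 0] returns [1, -7, 5], the gradient of g_1 transposed; with u = [0, 1] it would instead return [0, 1.4, 2.1], the gradient of g_2 transposed.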
We now explore the inner workings of forward mode automatic differentiation followed by
backward mode automatic differentiation. In both cases we restrict to m = 1. Understanding
forward mode is a useful stepping stone to understanding backward mode. Note that forward
mode and backward mode involve internal implicit computations of Jacobian vector products
or vector Jacobian products, respectively.[21]
The Computational Graph and Forward Mode Automatic Differentiation
Computational graphs play a key role in the implementation of automatic differentiation. A
computational graph for a multivariate function C : R^d → R is a directed graph where the
nodes correspond to elementary operations involved in the mathematical expression of C(θ).
Inputs to the graph are the variables θ_1, …, θ_d and constants if required, and the output is
the function value C(θ).
Figure 4.7: The computational graph of (4.27) with nodes numbered 1 to 6. The primal a_i
of each node is the result of an elementary operation applied on its inputs.
As a simple example with d = 3, consider,

C(θ) = sin(θ_1 θ_2) + exp(θ_2 + θ_3) − 2θ_3.    (4.27)
The computational graph for this function, presented in Figure 4.7, is composed of nodes with
inputs and outputs. The output of each node is an elementary operation on the inputs to the
[21] It may also be useful to see the mathematical representations of Jacobian vector
products and vector Jacobian products through a composition of functions. For this we may
see (??), (??), and (??), in Appendix ??.
node. These elementary operations include summation, multiplication, sin(·), cos(·), exp(·),
etc. Nodes are numbered 1, 2, …, and we denote the output of node i with a_i = f_i(u_1, …, u_{ℓ_i})
when there are ℓ_i inputs to node i, denoted as u_1, …, u_{ℓ_i}. In this example there are 6 nodes,
and taking node 6 as an illustration, we have a_6 = f_6(a_3, a_4, a_5) where ℓ_6 = 3 and f_6(·) is
the summation of its input arguments. Note that a_i is sometimes called the primal of that
node in a given computation.
The evaluation at each node in the computational graph is executed at the instance in which
inputs to the node become available. This then implies the notion of iterations, where a node
is considered to be evaluated at a given iteration if the inputs to that node are all available
at that iteration but not previously. As an example, with θ = (π/6, 2, 5), the iterations for
(4.27) are summarized in Table 4.1.
Table 4.1: Evaluation of the value of C(θ) in (4.27) at an example θ via the computational graph
in Figure 4.7, computing the primals a_i for i = 1, …, 6.

                        General expressions of primals       Values of primals at specific θ
Input                   θ = (θ_1, θ_2, θ_3)                  θ = (π/6, 2, 5)
Iteration 1             a_1 = θ_1 θ_2                        a_1 = π/3
                        a_2 = θ_2 + θ_3                      a_2 = 7
                        a_3 = −2θ_3                          a_3 = −10
Iteration 2             a_4 = sin(a_1)                       a_4 = √3/2
                        a_5 = exp(a_2)                       a_5 = e^7
Iteration 3 (output)    a_6 = a_4 + a_5 + a_3 = C(θ)         a_6 = C(θ) = √3/2 + e^7 − 10
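The forward evaluation of Table 4.1 can be mimicked in a few lines of plain Python (our own sketch; the node operations follow Figure 4.7, with the third node computing −2θ_3).

```python
import math

def primals(theta):
    """Forward evaluation of the computational graph of (4.27)."""
    t1, t2, t3 = theta
    a1 = t1 * t2        # iteration 1
    a2 = t2 + t3
    a3 = -2 * t3
    a4 = math.sin(a1)   # iteration 2
    a5 = math.exp(a2)
    a6 = a4 + a5 + a3   # iteration 3: the output C(theta)
    return [a1, a2, a3, a4, a5, a6]

a = primals([math.pi / 6, 2, 5])  # a[5] is C(theta) = sqrt(3)/2 + e^7 - 10
```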
As mentioned above, forward mode automatic differentiation implements Jacobian vector
products. In our exposition here, since the target function C(·) is scalar valued (m = 1),
the Jacobian is the transpose of the gradient and, with v ∈ R^d, the Jacobian vector product,
J_C(θ)v, is a scalar.[22] Further, our exposition focuses on v = e_j, implying that the output
of forward mode automatic differentiation is ∂C(θ)/∂θ_j. Hence for our exposition here, j is
fixed.
The key idea of forward mode automatic differentiation is to maintain a record of intermediate
derivatives as computation progresses through the computational graph. For this we denote,

ȧ_i = ∂a_i/∂θ_j,    (4.28)
and call ȧ_i the tangent at node i, for fixed j. Now since a_i = f_i(u_1, …, u_{ℓ_i}), using the
multivariable chain rule (see Appendix ??), we have

ȧ_i = Σ_{k=1}^{ℓ_i} ( ∂f_i(u_1, …, u_{ℓ_i}) / ∂u_k ) u̇_k = ∇f_i(u_1, …, u_{ℓ_i})ᵀ u̇,    (4.29)
[22] It is in fact the directional derivative of C(·) in the direction v.
with u̇_k denoting the tangent of u_k and with u̇ denoting the vector of these tangents.
Returning to Figure 4.7 as an example, we have that ∇f_6(u_1, u_2, u_3) = (1, 1, 1)ᵀ and thus
ȧ_6 = ȧ_3 + ȧ_4 + ȧ_5. Similarly, in that figure, f_4′(u_1) = cos(u_1) and thus ȧ_4 = cos(a_1) ȧ_1.
Since (4.29) defines a recursive computation of the tangents, we also require the inputs to
the function, θ_1, …, θ_d, to have tangents. We set these as the elements of the vector v ∈ R^d
used in the Jacobian vector product. Specifically, here we focus on Jacobian vector products
with v = e_j since we are seeking the derivative with respect to a specific input θ_j. Hence we
set the tangent values for the inputs of the function as θ̇_j = 1 and θ̇_i = 0 for i ≠ j.
For forward mode automatic differentiation, we progress on the computational graph in
the forward direction from the input to the output, evaluating both the primal a_i and
the tangent ȧ_i at each node. As we progress, in each iteration, we consider the set of all
nodes i whose predecessors already have primal values and tangent values available, and for
each of these nodes we compute the primal values as was already shown in Table 4.1, and
importantly also compute the tangent values via (4.29). This computation may be broken
into iterations, in a similar way to the iterations of the function evaluation computed in
Table 4.1.
In particular, since f_i(·) denotes an elementary operation, such as the function exp(·) at
node 5 in Figure 4.7, we assume the availability of a symbolic expression for each
∂f_i(u_1, …, u_{ℓ_i})/∂u_k, or alternatively an ability to numerically compute it exactly.
Therefore, since we know the numerical primal values u_1, …, u_{ℓ_i} and the numerical
tangent values u̇_1, …, u̇_{ℓ_i}, we can easily compute ȧ_i using (4.29). Consequently, the
numerical value of ∂C(θ)/∂θ_j can be computed at the end of forward mode as it is the
tangent value of the output node of the computational graph.
Table 4.2 illustrates forward mode automatic differentiation using the example function C(θ)
in (4.27). The first column presents the general expressions and the other column computes
the partial derivative of C(θ) with respect to θ_2 at a specific point θ.
Table 4.2: Forward mode automatic differentiation for the example function (4.27) and j = 2,
i.e., we seek the derivative with respect to θ_2. The right column illustrates computation[23] of the
derivative ∂C(θ)/∂θ_2 at θ = (π/6, 2, 5), which is obtained via the final tangent value
ȧ_6 = π/12 + e^7.
              General expressions of primals and tangents        Values of primals and tangents at specific θ for j = 2
Start         θ = (θ_1, θ_2, θ_3), θ̇ = (θ̇_1, θ̇_2, θ̇_3)            θ = (π/6, 2, 5), θ̇ = (0, 1, 0)
Iteration 1   (a_1, ȧ_1) = (θ_1θ_2, θ_1θ̇_2 + θ̇_1θ_2)              (a_1, ȧ_1) = (π/3, π/6)
              (a_2, ȧ_2) = (θ_2 + θ_3, θ̇_2 + θ̇_3)                 (a_2, ȧ_2) = (7, 1)
              (a_3, ȧ_3) = (−2θ_3, −2θ̇_3)                         (a_3, ȧ_3) = (−10, 0)
Iteration 2   (a_4, ȧ_4) = (sin(a_1), cos(a_1) ȧ_1)               (a_4, ȧ_4) = (√3/2, π/12)
              (a_5, ȧ_5) = (exp(a_2), exp(a_2) ȧ_2)               (a_5, ȧ_5) = (e^7, e^7)
Iteration 3   (a_6, ȧ_6) = (a_3 + a_4 + a_5, ȧ_3 + ȧ_4 + ȧ_5)     (a_6, ȧ_6) = (√3/2 + e^7 − 10, π/12 + e^7)
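A plain-Python sketch of this forward mode pass for (4.27) follows (our own illustration, hard-coded for this graph; each node carries a (primal, tangent) pair exactly as in Table 4.2).

```python
import math

def forward_mode(theta, j):
    """Forward mode AD on the graph of (4.27); returns dC/dtheta_j (j is 0-indexed)."""
    dt = [0.0, 0.0, 0.0]
    dt[j] = 1.0                                # seed tangents: v = e_j
    t1, t2, t3 = theta
    d1, d2, d3 = dt
    a1, da1 = t1 * t2, t1 * d2 + d1 * t2       # product rule
    a2, da2 = t2 + t3, d2 + d3
    a3, da3 = -2 * t3, -2 * d3
    a4, da4 = math.sin(a1), math.cos(a1) * da1
    a5, da5 = math.exp(a2), math.exp(a2) * da2
    a6, da6 = a3 + a4 + a5, da3 + da4 + da5    # output node
    return da6

d2 = forward_mode([math.pi / 6, 2, 5], j=1)    # derivative w.r.t. theta_2
```

Running it with j = 0 and j = 2 recovers the remaining entries of the gradient, at the cost of one extra pass per coordinate.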
Backward Mode Automatic Differentiation
As mentioned above, backward mode automatic differentiation implements vector Jacobian
products. This allows us to use a single run of backward mode automatic differentiation
to compute the gradient vector for C : R^d → R. For deep learning this is a significant
improvement over forward mode automatic differentiation, which would require d executions
to obtain such a gradient.
For a given point θ, backward mode automatic differentiation is executed in two phases, often
called the forward pass and the backward pass. The forward pass is executed before the
backward pass. In the forward pass, the algorithm simply evaluates the computational
graph to compute the values of the intermediate primal variables a_i while recording the
dependencies of these variables in a bookkeeping manner, similar to the evaluation in
Table 4.1.
In the backward pass, all the derivatives of the objective function are computed by using
intermediate derivatives called adjoints. Specifically, for each intermediate variable a_i, the
adjoint is

ζ_i = ∂C/∂a_i.    (4.30)
The adjoint ζ_i captures the rate of change in the final output C with respect to the variable a_i.
Contrast this with the tangent ȧ_i from forward mode automatic differentiation, as in (4.28),
which captures the rate of change of a_i with respect to an input variable.
In the backward pass we populate the values of the adjoint variables ζ_i. Specifically, we
progress on the computational graph in the backward direction, starting from the output
node and propagating towards the input. For this, say that the output of node i is input to
ĩ nodes with indices i_1, …, i_ĩ. Then the only way that a_i can influence the output C is by
influencing a_{i_1}, …, a_{i_ĩ}. Now using the multivariable chain rule,

∂C/∂a_i = Σ_{k=1}^{ĩ} ( ∂a_{i_k}/∂a_i ) ( ∂C/∂a_{i_k} ),
and thus using the notation of (4.30),

ζ_i = Σ_{k=1}^{ĩ} ( ∂a_{i_k}/∂a_i ) ζ_{i_k}.    (4.31)
The relationship (4.31) is used in the iterations of the backward pass. In every iteration, the
adjoints ζ_{i_k} in the right hand side of (4.31) are available from previous iterations and are
used to compute ζ_i. In particular, at the start of the backward pass we initialize the adjoint
of the output node at 1. Further, recalling the notation used for forward mode automatic
differentiation, a_{i_k} = f_{i_k}(u_1, …, u_{ℓ_{i_k}}) where u_1, …, u_{ℓ_{i_k}} are the inputs to node i_k and f_{i_k}(·)
has readily computable derivatives with respect to its inputs. Hence, since the primals of
u_1, …, u_{ℓ_{i_k}} were already populated during the forward pass, we compute ∂a_{i_k}/∂a_i locally at
node i and use it in (4.31).
[23] To verify the computation recall the basic linearity and product rules of scalar
differentiation. Also, remember that the derivative of sin(·) is cos(·), that the derivative of
exp(·) is exp(·), and that cos(π/3) = 1/2.
Table 4.3: The backward pass of backward mode automatic differentiation for the example function
(4.27) to compute its gradient at θ = (π/6, 2, 5). The result is ∇C(θ) = (1, π/12 + e^7, e^7 − 2).
This is computed after the forward pass is executed; see Table 4.1.
              General expressions of adjoints                                              Values of adjoints at specific θ
Start         ζ_6 = ∂C(θ)/∂a_6                                                             ζ_6 = 1
Iteration 1   ζ_4 = (∂a_6/∂a_4) ζ_6 = 1 × ζ_6                                              ζ_4 = 1
              ζ_5 = (∂a_6/∂a_5) ζ_6 = 1 × ζ_6                                              ζ_5 = 1
              ζ_3 = (∂a_6/∂a_3) ζ_6 = 1 × ζ_6                                              ζ_3 = 1
Iteration 2   ζ_1 = (∂a_4/∂a_1) ζ_4 = cos(a_1) ζ_4                                         ζ_1 = cos(π/3) = 1/2
              ζ_2 = (∂a_5/∂a_2) ζ_5 = exp(a_2) ζ_5                                         ζ_2 = e^7
Iteration 3   ∂C(θ)/∂θ_1 = (∂a_1/∂θ_1) ζ_1 = θ_2 ζ_1                                       ∂C(θ)/∂θ_1 = 2 × 1/2 = 1
              ∂C(θ)/∂θ_2 = (∂a_1/∂θ_2) ζ_1 + (∂a_2/∂θ_2) ζ_2 = θ_1 ζ_1 + 1 × ζ_2           ∂C(θ)/∂θ_2 = π/6 × 1/2 + e^7 = π/12 + e^7
              ∂C(θ)/∂θ_3 = (∂a_2/∂θ_3) ζ_2 + (∂a_3/∂θ_3) ζ_3 = 1 × ζ_2 − 2 × ζ_3           ∂C(θ)/∂θ_3 = e^7 − 2
The elements of the gradient ∇C(θ) depend on the adjoints in a similar manner to (4.31).
Specifically,

∂C/∂θ_j = Σ_k ( ∂a_{j_k}/∂θ_j ) ζ_{j_k},    (4.32)

where the summation is over all nodes, indexed via j_k, that take θ_j as input in the
computational graph. Thus, in the final iteration of the backward pass, we use (4.32) to
obtain the gradient ∇C(θ).
Let us return to the example in (4.27) with the computational graph of Figure 4.7 evaluated
at θ = (π/6, 2, 5). Evaluation of ∇C(θ) via backward mode automatic differentiation first
executes a forward pass as previously illustrated in Table 4.1. This populates the primals
a_1, …, a_6. Then the backward pass progresses via repeated application of (4.31) and with
application of (4.32) in the final iteration. This backward pass computation is summarized
in Table 4.3.
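The two passes can be sketched in plain Python as follows (our own illustration of Tables 4.1 and 4.3, hard-coded for the graph of (4.27)).

```python
import math

def backward_mode(theta):
    """Backward mode AD for (4.27): one forward pass and one backward pass,
    returning the full gradient of C."""
    t1, t2, t3 = theta
    # Forward pass: populate the primals (Table 4.1).
    a1, a2, a3 = t1 * t2, t2 + t3, -2 * t3
    a4, a5 = math.sin(a1), math.exp(a2)
    a6 = a3 + a4 + a5
    # Backward pass: propagate adjoints via (4.31) (Table 4.3).
    z6 = 1.0                  # adjoint of the output node
    z4 = 1.0 * z6             # da6/da4 = 1
    z5 = 1.0 * z6             # da6/da5 = 1
    z3 = 1.0 * z6             # da6/da3 = 1
    z1 = math.cos(a1) * z4    # da4/da1 = cos(a1)
    z2 = math.exp(a2) * z5    # da5/da2 = exp(a2)
    # Final iteration, as in (4.32): gather contributions of each theta_j.
    return [t2 * z1, t1 * z1 + 1.0 * z2, 1.0 * z2 - 2.0 * z3]

grad = backward_mode([math.pi / 6, 2, 5])  # (1, pi/12 + e^7, e^7 - 2)
```

One forward and one backward sweep produce all three partial derivatives, in contrast to the three separate forward mode passes needed above.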
4.5 Additional Techniques for First-Order Methods
In Section 4.3 we have seen one of the most popular deep learning optimization algorithms,
ADAM, as well as key milestone techniques that have led to its development. Then in
Section 4.4 we explored ways to compute the gradient ∇C(θ). In this section, we explore
other first-order optimization ideas that are less popular in practical deep learning, but still
embody useful principles.
We begin the discussion by considering a modification of the momentum technique to a
variant called Nesterov momentum.[24] The idea of this method has also found its way into
an algorithm called Nadam (Nesterov ADAM). We continue by considering variants of
Adagrad and RMSProp for adaptive learning rates per coordinate. The variants we describe
are called Adadelta and Adamax. While as of today, these additional techniques are not
common in deep learning practice, they have made significant impact earlier on and may
be fruitful in the future as well. In a sense, momentum (covered in Section 4.3), ADAM,
Nesterov momentum, Nadam, Adagrad (covered in Section 4.3), Adadelta, and Adamax are
all variants and improvements of basic gradient descent. Thus, our exposition in this section
aims to complete the picture on how basic gradient descent may be improved.
We close this section with an overview of line search techniques. With such techniques,
once a descent direction is determined, we seek to move in that direction with a step size
which minimizes loss over the next step in that direction. There are several variants for this
approach, and we present an overview of these variants.
Nesterov Momentum and the Nadam Algorithm
We have seen that the momentum updates as in (4.16) and (4.17) accelerate like a ball
rolling downhill. An issue with this method is that the steps do not slow down after reaching
the bottom of a valley, and hence there is a tendency to overshoot the valley floor. Therefore,
if we know approximately where the ball will be after each step, we can slow down the ball
before the hill slopes up again.
The idea of Nesterov momentum tries to improve on standard momentum by using the
gradient at the predicted future position instead of the gradient of the current position. The
update equations are,
v^(t+1) = β v^(t) + (1 − β) ∇C( θ^(t) − αβ v^(t) ),    (4.33)

θ^(t+1) = θ^(t) − α v^(t+1),    (4.34)
for constants β ∈ [0, 1) and α > 0, with v^(1) = 0. Compare (4.33) and (4.34) with (4.16)
and (4.17). The difference here is that the gradient is computed at a predicted next point
θ^(t) − αβv^(t), which is a proxy for the (unseen) actual next point θ^(t+1) = θ^(t) − αv^(t+1)
when β ≈ 1.
Implementing Nesterov momentum via (4.33) and (4.34) requires evaluation of the gradient
at the predicted points instead of the actual points. This may incur overheads, especially
when incorporated as part of other algorithms. For example, if we also require ∇C(θ^(t)) at
each iteration for purposes such as RMSprop, then the gradient needs to be computed twice
instead of once per iteration; once for ∇C(θ^(t)) and once for ∇C(θ^(t) − αβv^(t)). For this
[24] Nesterov momentum is also sometimes called Nesterov acceleration or Nesterov
accelerated gradient.
reason, and also for the simplicity of evaluating gradients only at θ^(t), a look-ahead
momentum mechanism is sometimes used in place of (4.33) and (4.34).
To see how this mechanism works, first revisit the basic momentum update equations (4.16)
and (4.17) and observe that the parameter update can be represented as,

θ^(t+1) = θ^(t) − α( β v^(t) + (1 − β) ∇C(θ^(t)) ).    (4.35)
Observe that v^(t) is based on gradients at the points θ^(0), …, θ^(t−1), but not on the gradient
at θ^(t). Hence a way to incorporate this last gradient is with look-ahead momentum, where
we replace v^(t) of (4.35) by v^(t+1). This achieves behaviour similar to Nesterov momentum.
With such a replacement, we arrive at update equations of the form,

v^(t+1) = β v^(t) + (1 − β) ∇C(θ^(t)),    (4.36)

θ^(t+1) = θ^(t) − α( β v^(t+1) + (1 − β) ∇C(θ^(t)) ),    (4.37)

where the bracketed term β v^(t+1) + (1 − β) ∇C(θ^(t)) in (4.37) is the look-ahead momentum.
Using (4.36) and (4.37) aims to provide similar behaviour to the Nesterov momentum
equations (4.33) and (4.34), yet only requires evaluation of gradients at θ^(t) as opposed
to a shifted point as in (4.33). While look-ahead momentum is not equivalent to Nesterov
momentum, the gist of both methods is similar.
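As a concrete sketch, here is one Nesterov momentum iteration per (4.33) and (4.34) in plain Python, applied to a toy quadratic; the test function and the constants α = 0.1, β = 0.9 are our own illustrative choices.

```python
def nesterov_step(theta, v, grad, alpha=0.1, beta=0.9):
    """One Nesterov momentum update following (4.33) and (4.34)."""
    look = [t - alpha * beta * vi for t, vi in zip(theta, v)]  # predicted next point
    g = grad(look)                                             # gradient at the prediction
    v = [beta * vi + (1 - beta) * gi for vi, gi in zip(v, g)]
    theta = [t - alpha * vi for t, vi in zip(theta, v)]
    return theta, v

# Toy objective C(theta) = theta_1^2 + theta_2^2 with gradient 2*theta.
grad = lambda th: [2 * t for t in th]
theta, v = [1.0, -1.0], [0.0, 0.0]
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad)
# theta is now close to the minimizer (0, 0)
```

Swapping `grad(look)` for `grad(theta)` in the sketch recovers basic momentum, which makes the single difference between the two methods explicit.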
Now with update equations like (4.36) and (4.37), it is straightforward to adapt ADAM to
include behavior similar to Nesterov momentum. This is done using similar steps to those
used to construct ADAM using momentum, RMSprop, and bias correction in Section 4.3.
The difference here is that we use the look-ahead momentum equations (4.36) and (4.37).
This algorithm is called Nadam.
In short, Nadam can be viewed exactly as the ADAM procedure in Algorithm 4.2, yet with
line 7 replaced by,

v̂ ← ( β / (1 − β^(t+2)) ) v + ( (1 − β) / (1 − β^(t+1)) ) g.
In this case, v̂ is not just a bias corrected version of v as in ADAM, since it also incorporates
look-ahead momentum, which is the key extra feature that Nadam adds to ADAM.
Adadelta
Recall the computation of s^(t+1) for the Adagrad method as in (4.22). When Adagrad's
s^(t+1) is used in (4.21), it has the problem of monotonically decreasing effective learning rates.
This is one of the reasons that eventually RMSprop became more popular than Adagrad.
However, there are other alternatives that are also very popular. One such alternative is the
Adadelta method, which uses exponential smoothing of the squared gradients as in RMSprop,
but also uses another exponentially smoothed sequence of the squares of the descent steps.
A key motivation for Adadelta is the observation that the update equation (4.21) for
RMSprop or Adagrad uses a unit-less quantity as the descent step. Specifically, in (4.21) the
only unit-full quantity in the coefficient multiplying ∇C(θ^(t)) is the reciprocal of √(s^(t+1)).
Hence that coefficient has units which are the inverse of the gradient and these cancel out
the units of the gradient implying that the descent step is unit-less.
With Adadelta, the update equation (4.21) is modified so that the descent step maintains
the same units as the gradient, via

θ^(t+1) = θ^(t) − ( √(∆θ^(t) + ε) / √(s^(t+1) + ε) ) ∇C(θ^(t)),    (4.38)

where the entire term subtracted from θ^(t) is the descent step, denoted ∇̃C(θ^(t)).
In (4.38), ∆θ^(t) is adaptively adjusted yet has the same units as s^(t+1), making the
coefficient of the gradient unit-free. Compare (4.38) with (4.21) to observe that ∆θ^(t) + ε
replaces the learning rate α. This also means that Adadelta is “learning rate free”. Instead,
a potentially more robust parameter ρ ∈ [0, 1) is used, similarly to the γ parameter for
RMSprop. This parameter specifies how to exponentially smooth the squares of the descent
steps.
Using the descent step of (4.38), the update equation for ∆θ^(t) is,

∆θ^(t) = ρ ∆θ^(t−1) + (1 − ρ) ∇̃C(θ^(t−1)) ⊙ ∇̃C(θ^(t−1)),
starting with ∆θ^(0) = 0. Then at iteration t, updating θ^(t+1) via (4.38), we use both ∆θ^(t)
and s^(t+1), where the latter is updated via the RMSprop update equation as in (4.22).
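A single Adadelta iteration can be sketched as follows (plain Python, our own illustration; the choices ρ = 0.95 and ε = 1e-6 and the toy quadratic are assumed values, and all smoothing is done elementwise).

```python
def adadelta_step(theta, s, delta, grad, rho=0.95, eps=1e-6):
    """One Adadelta update per (4.38): note the absence of a learning rate.

    s smooths squared gradients as in RMSprop; delta smooths squared descent steps."""
    g = grad(theta)
    s = [rho * si + (1 - rho) * gi * gi for si, gi in zip(s, g)]
    step = [((di + eps) ** 0.5) / ((si + eps) ** 0.5) * gi
            for di, si, gi in zip(delta, s, g)]
    theta = [t - st for t, st in zip(theta, step)]
    delta = [rho * di + (1 - rho) * st * st for di, st in zip(delta, step)]
    return theta, s, delta

grad = lambda th: [2 * t for t in th]  # toy quadratic C(theta) = theta_1^2 + theta_2^2
theta, s, delta = adadelta_step([1.0, -1.0], [0.0, 0.0], [0.0, 0.0], grad)
```

Note the ordering: the squared-step accumulator `delta` from the previous iteration is used to form the current step, and only afterwards is it updated with the new step.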
Other Norms and Adamax
We have already seen that the quantities s_i^(t) used in RMSprop or Adadelta as in (4.22)
are useful for adaptive learning rates per coordinate. With these methods, we may loosely
interpret √(s_i^(t)) as an estimate of the standard (L_2) Euclidean norm of the sequence of
gradients for coordinate i, smoothed up to time t.
In principle, one may wish to use other L_p norms (see Appendix ??) in place of the Euclidean
norm. For this we may modify the RMSprop recursion as in (4.22) to the form,
s_i^(t+1) = γ^p s_i^(t) + (1 − γ^p) |g_i^(t)|^p,  with  g_i^(t) = ∂C(θ^(t))/∂θ_i,    (4.39)
and then interpret ( s_i^(t) )^(1/p) as an estimate of the smoothed L_p norm of the sequence
of gradients for coordinate i up to time t. Note that for convenience of the calculations below,
we reparameterize the exponential smoothing parameter to be γ^p in place of γ.
With a recursion such as (4.39) we can obtain an “ADAM like” algorithm that updates
parameters via an update rule such as,

θ^(t+1) = θ^(t) − α ( 1 / ( (ŝ^(t+1))^(1/p) + ε ) ) v̂^(t+1),    (4.40)
where v̂^(t+1) is some bias-corrected momentum value, and ŝ^(t+1) is a bias corrected value
obtained from (4.39); compare (4.40) to (4.25). This additional hyper-parameter p,
determining which L_p norm to use, can then potentially be tuned. However, for non-small p,
computations may often exhibit excessive numerical error due to overflow since we are raising
the gradient coordinate value to high powers and then taking a low power of it.
Adamax is a variant of this approach which, instead of using the L_p norm, uses the L_∞ norm. Specifically, in place of (s_i^(t))^{1/p} we use r_i^(t), which is recursively defined as,

    r_i^(t+1) = max[ γ r_i^(t), |g_i^(t)| ],    with    r_i^(0) = 0.        (4.41)
Then in place of (4.40) we use,

    θ^(t+1) = θ^(t) − α (1 / (r^(t+1) + ε)) ⊙ v̂^(t+1).        (4.42)
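A per-coordinate sketch of the Adamax updates (4.41) and (4.42) may look as follows. The hyper-parameter values, the bias correction convention for the momentum term, and the quadratic test objective are illustrative choices of ours, not a definitive implementation.

```python
def adamax_step(theta, grad, v, r, t, alpha=0.01, beta=0.9, gamma=0.999, eps=1e-8):
    v = beta * v + (1 - beta) * grad           # smoothed (momentum) gradient
    r = max(gamma * r, abs(grad))              # L-infinity accumulator, as in (4.41)
    v_hat = v / (1 - beta ** t)                # bias-corrected momentum
    theta = theta - alpha * v_hat / (r + eps)  # update, as in (4.42)
    return theta, v, r

# Minimize C(theta) = theta^2 (gradient 2 * theta), starting from theta = 5.
theta, v, r = 5.0, 0.0, 0.0
for t in range(1, 2001):
    theta, v, r = adamax_step(theta, 2 * theta, v, r, t)
```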
Note that the updates (4.41) and (4.42) are well justified as approximations of (4.39) and (4.40). To see this, recall that a sequence of L_p norms converges to the L_∞ norm, and thus,

    r_i^(t+1) = lim_{p→∞} (s_i^(t+1))^{1/p}
              = lim_{p→∞} (1 − γ^p)^{1/p} ( Σ_{τ=0}^{t} γ^{p(t−τ)} |g_i^(τ)|^p )^{1/p}
              = lim_{p→∞} ( Σ_{τ=0}^{t} γ^{p(t−τ)} |g_i^(τ)|^p )^{1/p}
              = max_{τ=0,...,t} γ^{(t−τ)} |g_i^(τ)|
              = max[ γ max_{τ=0,...,t−1} γ^{(t−1−τ)} |g_i^(τ)|, |g_i^(t)| ].

Here, moving from the first line to the second line, we made use of the general representation of exponential smoothing (4.15), and moving from the third line to the fourth line we use the fact that L_p vector norms converge to the L_∞ vector norm as p → ∞. This then justifies (4.41) as a limiting case.
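The limiting relationship can also be checked numerically: running the recursion (4.39) with a large but finite p alongside the max recursion (4.41) on a made-up scalar gradient sequence, the two quantities nearly coincide. (The gradient values and the choice p = 200 below are our own illustrative choices.)

```python
gamma, p = 0.9, 200.0
grads = [0.3, -1.2, 0.7, 0.05, -0.4]   # a made-up gradient sequence

s, r = 0.0, 0.0
for g in grads:
    s = gamma ** p * s + (1 - gamma ** p) * abs(g) ** p   # (4.39), parameter gamma^p
    r = max(gamma * r, abs(g))                            # (4.41)

lp_estimate = s ** (1.0 / p)   # nearly equals r when p is large
```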
Line Search
So far all the methods we considered were variants of gradient descent. When these are viewed as a special case of the general descent direction method of Algorithm 4.1, updates are of the form θ^(t+1) = θ^(t) + θ_s^(t). In the case of basic gradient descent we have descent steps θ_s^(t) = −α∇C(θ^(t)), yet in general we can represent the descent step as θ_s^(t) = α θ_d^(t), where we call θ_d^(t) a descent direction. With basic gradient descent, θ_d^(t) = −∇C(θ^(t)). In any case, for some prescribed descent direction, the descent step θ_s^(t) is determined by the choice of α.
Line search is an approach where we seek the best step size for a given descent direction in each iteration. Specifically, in iteration t, given a descent direction θ_d^(t), we determine the α for that iteration, denoted as α^(t), via,

    α^(t) = argmin_α C(θ^(t) + α θ_d^(t)).        (4.43)
That is, each iteration involves a one dimensional minimization problem determining α^(t), which is used for the update θ^(t+1) = θ^(t) + α^(t) θ_d^(t). Hence with line search, the next point θ^(t+1) is a minimizer of C(·) along the ray,

    {θ^(t) + α θ_d^(t) : α > 0}.        (4.44)
When we pick α^(t) to exactly optimize (4.43), the method is called exact line search. In very specific cases exact line search can be achieved explicitly with closed form formulas. However, in most cases, if we wish to carry out exact line search, we need to numerically solve the one dimensional optimization problem (4.43) for the best α. An alternative to exact line search is inexact line search, where we only probe the loss on the ray (4.44) for a few values of α. Details are in the sequel.
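For intuition, exact line search along a given ray can be carried out with any univariate minimizer. The sketch below uses a simple golden-section search on an illustrative two dimensional quadratic loss; the function, starting point, and bracketing interval are our own choices and are not from the book.

```python
import math

def golden_section(f, a, b, tol=1e-10):
    # Golden-section search for the minimizer of a unimodal f on [a, b].
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return 0.5 * (a + b)

# Exact line search for C(theta) = theta_1^2 + 10 * theta_2^2 at theta = (2, 1),
# along the steepest descent direction d = -grad C(theta) = (-4, -20).
C = lambda t1, t2: t1 ** 2 + 10 * t2 ** 2
theta, d = (2.0, 1.0), (-4.0, -20.0)
phi = lambda a: C(theta[0] + a * d[0], theta[1] + a * d[1])
alpha = golden_section(phi, 0.0, 1.0)   # for this quadratic, the optimum is 416/8032
```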
Figure 4.8: One iteration of line search applied to the Rosenbrock function of (4.9). (a) With the current point θ^(t) = (0.8, 0) and the descent direction θ_d^(t) = −∇C(θ^(t)), we optimize over a given ray. (b) The value of the loss function C(θ^(t) + α θ_d^(t)) along the ray, with the optimal α^(t) ≈ 0.00165.
Note that in practice, for deep learning, neither exact nor inexact line search is used directly. One reason is that loss function evaluations C(·) are computationally demanding, and are of a similar order to the computational cost of gradient evaluation. Hence practice has shown that one might as well update descent directions instead of optimizing the loss over a fixed ray (4.44). A second reason is that with deep learning we almost always have noisy gradients, due to stochastic gradient descent or mini-batches, and hence searching for an optimal α along a noisy direction can be practically less useful. Empirical evidence has shown that in such cases, line search techniques are seldom useful in their own right. Nevertheless, understanding line search is important as part of a general background for optimization, and it is also used in the quasi-Newton methods described in Section 4.6, which are sometimes used in deep learning.
As a basic illustration of line search, return to the Rosenbrock function of (4.9) with contours plotted in Figure 4.8 (a). In this plot we are at a point θ^(t) with the descent direction θ_d^(t) = −∇C(θ^(t)). The plot in Figure 4.8 (b) presents the one dimensional cross section along the ray (4.44) with that descent direction, where we see that the optimal (exact numerical line search) α is at α^(t).
It is interesting to analyze the case θ_d^(t) = −∇C(θ^(t)), which we call basic gradient descent with exact line search. In this case, trajectories of such optimization result in "zig-zagging". Specifically, every two consecutive descent steps are orthogonal. Namely,

    (θ_s^(t))ᵀ θ_s^(t+1) = 0,    or    (g^(t))ᵀ g^(t+1) = 0,

with g^(t) denoting ∇C(θ^(t)). This property is illustrated in Figure 4.9, where we consider gradient descent with exact line search applied to the Rosenbrock function (4.9). We start with initial condition θ^(0) = (0.4, 0.2), from which the first two steps take very long jumps. However, afterwards, once the search is in a narrow valley, the zig-zagging phenomenon is apparent and incurs a significant effective slow down on the rate of advance towards the optimal point.
Figure 4.9: Evolution of basic gradient descent with exact line search. We see the zig-zagging
property which slows down progress in narrow valleys.
To see the derivation of the zig-zagging property, observe that at the α which optimizes (4.43), we have,

    ∂C(θ^(t) − α g^(t)) / ∂α = 0,

or, written in terms of the basic definition of the derivative, we have,

    lim_{h→0} [ C(θ^(t) − (α + h) g^(t)) − C(θ^(t) − α g^(t)) ] / h = 0.        (4.45)
We now use the first-order Taylor series expansion as in (??) of C(·) around the point θ^(t) − α g^(t) for the point θ^(t) − α g^(t) − h g^(t). This is,

    C(θ^(t) − α g^(t) − h g^(t)) = C(θ^(t) − α g^(t)) − h (g^(t))ᵀ ∇C(θ^(t) − α g^(t)) + O(h²),

where O(h²) is a function such that O(h²)/h² goes to a constant as h → 0. Substituting in (4.45) we obtain,

    0 = lim_{h→0} [ −h (g^(t))ᵀ ∇C(θ^(t) − α g^(t)) + O(h²) ] / h
      = −(g^(t))ᵀ ∇C(θ^(t) − α g^(t))
      = −(g^(t))ᵀ g^(t+1),

where the last equality holds because g^(t+1) = ∇C(θ^(t) − α g^(t)).
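This orthogonality of consecutive gradients is easy to verify numerically. For a quadratic loss C(θ) = (1/2) θᵀAθ, exact line search along −g has the closed form α = gᵀg / (gᵀAg), and consecutive gradients then come out orthogonal up to floating point error. The matrix and starting point below are illustrative choices of ours.

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 20.0]])   # Hessian of the quadratic loss
theta = np.array([2.0, 1.0])
grads = []
for _ in range(5):
    g = A @ theta                          # gradient of C at theta
    alpha = (g @ g) / (g @ (A @ g))        # closed-form exact line search
    theta = theta - alpha * g
    grads.append(g)

inner_products = [grads[i] @ grads[i + 1] for i in range(len(grads) - 1)]
```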
We now visit a powerful numerical technique called the conjugate gradient method. This is a
topic which may receive special treatment in a general optimization text, yet here we only
present it in brief as an application of exact line search. Conjugate gradient is not typically
used directly for deep learning, yet since it is useful for approximately solving certain large
systems of linear equations, it sometimes finds its way as an aid to other deep learning
algorithms.
Suppose the objective function is of the following quadratic form,

    C(θ) = (1/2) θᵀ A θ − bᵀ θ,        (4.46)
where A is a d × d dimensional symmetric positive definite matrix, and b ∈ ℝ^d. We have the Hessian ∇²C(θ) = A for every θ, and since A is positive definite, C(·) is strictly convex and there is a unique global minimum. We further have ∇C(θ) = Aθ − b, and this means that the equation ∇C(θ) = 0, with the unique solution θ* = A⁻¹b, minimizes C(θ). Hence the system of equations,

    Aθ = b,    with A positive definite,        (4.47)

can be solved by solving the optimization problem of minimizing (4.46). Thus one may view the conjugate gradient method as an algorithm for (approximately) solving a linear system of equations such as (4.47). In particular, when the dimension d is excessively large, the conjugate gradient method can be efficient while standard techniques such as Gaussian elimination may fail. We have already seen one application of such equations in Chapter 2, with the normal equations (??) where A = XᵀX and b = Xᵀy.
When carrying out line search on an objective function as (4.46), the cross section above the ray (4.44) for any direction is a parabola, and hence exact line search has an explicit solution for (4.43). Conjugate gradient carries out iterations of such line search, where in each iteration the search direction θ_d^(t) is wisely chosen.
Specifically, at iteration t with current point θ^(t) and search direction θ_d^(t), the next point determined via (4.43) with (4.46) is,

    θ^(t+1) = θ^(t) + α^(t) θ_d^(t),    with    α^(t) = − (θ_d^(t))ᵀ (A θ^(t) − b) / ( (θ_d^(t))ᵀ A θ_d^(t) ).
Then the next search direction is determined as a linear combination of the next gradient and the current search direction. Specifically,

    θ_d^(t+1) = −∇C(θ^(t+1)) + β^(t) θ_d^(t),    with    β^(t) = ( ∇C(θ^(t+1)) )ᵀ A θ_d^(t) / ( (θ_d^(t))ᵀ A θ_d^(t) ).
This coefficient β^(t) is designed such that the current search direction θ_d^(t) and the next search direction θ_d^(t+1) are conjugate with respect to A, in the sense that,

    (θ_d^(t+1))ᵀ A θ_d^(t) = 0.

This conjugacy ensures desirable properties for the method, and in particular implies that a minimum is reached in d iterations. Analysis and further motivation of conjugate gradient are beyond our scope. Our presentation here is merely to show that conjugate gradient is an application of exact line search.²⁵
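The iterations just described can be sketched directly in code. The following is a plain textbook-style implementation for a small system Aθ = b, not an optimized solver; the example matrix and right hand side are our own.

```python
import numpy as np

def conjugate_gradient(A, b, theta):
    # Each iteration: exact line search along d, then update d so that it is
    # A-conjugate to the previous direction; terminates in len(b) iterations.
    d = -(A @ theta - b)                       # initial direction: -grad C
    for _ in range(len(b)):
        g = A @ theta - b                      # gradient at the current point
        alpha = -(d @ g) / (d @ (A @ d))       # exact line search along d
        theta = theta + alpha * d
        g_new = A @ theta - b
        beta = (g_new @ (A @ d)) / (d @ (A @ d))
        d = -g_new + beta * d                  # next (conjugate) direction
    return theta

A = np.array([[4.0, 1.0], [1.0, 3.0]])         # symmetric positive definite
b = np.array([1.0, 2.0])
theta = conjugate_gradient(A, b, np.zeros(2))  # solves A theta = b
```

For this 2 × 2 system the method reaches the exact solution (1/11, 7/11) after d = 2 iterations, illustrating the finite termination property mentioned above.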
Inexact Line Search
As discussed above, one may use exact line search to solve the univariate optimization (4.43) optimally in each iteration. An alternative is to only probe a few specific values of α^(t) according to some predefined set of rules and stop at the best probed value. This is called inexact line search, and it is especially attractive when there is no analytical solution of (4.43). Inexact line search may significantly reduce the computational cost per iteration in comparison to the repetitive application of a univariate numerical optimization technique.

A typical approach for inexact line search is backtracking (also known as backtracking line search). The idea is to start with some predetermined maximal α₀ > 0 and then decrease it multiplicatively until a stopping condition is met. Specifically, we evaluate C(·) on the ray (4.44) with candidate α values such as,

    α₀,  (2/3) α₀,  (2/3)² α₀,  (2/3)³ α₀,  . . . ,

where the decrease factor of 2/3 can be tuned to be any factor in (0, 1). We then stop at the instant at which the stopping condition is met. Note that this search takes place at every iteration t.
One very common stopping condition is known as the Armijo condition or the first Wolfe condition. Under this condition, at each iteration t, we decrease α values as above until it
²⁵ Note that in each iteration, the conjugate gradient traverses the steepest direction under the norm ∥u∥_A := √(uᵀ A u), u ∈ ℝ^d.
satisfies,

    C(θ^(t) + α θ_d^(t)) ≤ C(θ^(t)) + β α ∂_d C(θ^(t)),        (4.48)

where β ∈ [0, 1] is a parameter and ∂_d is shorthand notation for the directional derivative in the direction θ_d^(t); for directional derivatives recall (??) from Appendix ??. That is,

    ∂_d C(θ^(t)) = ∇_{θ_d^(t)} C(θ^(t)) = (θ_d^(t))ᵀ ∇C(θ^(t)).
To get an intuition for (4.48), suppose that β = 1; then the right hand side of (4.48) is equal to the first-order Taylor approximation of C(θ^(t) + α θ_d^(t)). Therefore, the decrease in the objective suggested by the first Wolfe condition is at least as good as the prediction by the first-order approximation. On the other extreme, if β = 0, we select α such that C(θ^(t) + α θ_d^(t)) ≤ C(θ^(t)), which is also acceptable as it guarantees that the next step at least does not increase the objective.
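A minimal sketch of backtracking under the first Wolfe condition (4.48) follows. The objective, starting point, and the particular parameter values (β = 0.3, decrease factor 2/3) are illustrative choices only.

```python
def backtracking_alpha(C, theta, d, dir_deriv, alpha=1.0, factor=2/3, beta=0.3):
    # Shrink alpha multiplicatively until the Armijo condition (4.48) holds.
    while C(theta + alpha * d) > C(theta) + beta * alpha * dir_deriv:
        alpha *= factor
    return alpha

# Univariate illustration: C(theta) = theta^2 at theta = 1, with d = -C'(1) = -2.
C = lambda th: th ** 2
theta, d = 1.0, -2.0
dir_deriv = d * (2 * theta)          # directional derivative d * C'(theta)
alpha = backtracking_alpha(C, theta, d, dir_deriv)
```

In this run α = 1 fails the condition and a single shrink to α = 2/3 succeeds.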
Beyond the condition (4.48), there are additional common stopping conditions that one may employ together with backtracking line search. One such condition, known as the second Wolfe condition, is,

    ∂_d C(θ^(t) + α θ_d^(t)) ≥ γ ∂_d C(θ^(t)),        (4.49)

where γ ∈ (0, 1). Another common condition, known as the strong Wolfe condition, is,

    |∂_d C(θ^(t) + α θ_d^(t))| ≤ γ |∂_d C(θ^(t))|.        (4.50)

This condition forces α^(t) to lie close to the solution of exact line search. The first Wolfe condition and the above strong Wolfe condition together are called the strong Wolfe conditions. These conditions are sometimes used with the quasi-Newton methods described below.
4.6 Concepts of Second-Order Methods
All the optimization methods we studied so far were based on gradient evaluation and implicitly use the first-order Taylor approximation of the objective function C(θ). In particular, the gradient of the objective function helps us determine the direction of the next step. However, by using curvature information captured via the second derivatives, or Hessian ∇²C(θ), algorithms can take steps that move farther and are more precise.

In this section, we present some well-known second-order Taylor approximation based optimization methods, simply referred to as second-order methods. These methods also fall within the general descent direction framework of Algorithm 4.1, yet their evaluation of the descent step θ_s^(t) either uses the Hessian explicitly, or employs some approximated estimate for Hessian vector products.
While today’s deep learning practice rarely involves second-order methods, these methods
have huge potential. Indeed, if employed efficiently, their application in general deep learning
training can yield significant performance improvement. Hence, understanding basic principles
of second-order techniques is useful for the mathematical engineering of deep learning.
This section starts by exploring simple concepts of second-order univariate optimization
where we introduce Newton’s method which is based on the first and second derivative of
the objective function. We also present the secant method which is only based on the first
derivative. We then move on to Newton’s method in the general multivariate case where the
Hessian matrix plays a key role. Indeed, for simple neural networks such as logistic regression
of Chapter ??, a method such as Newton is highly applicable.
For general deep learning with θ ∈ ℝ^d, where d is typically huge and the loss function is complicated, it is hopeless to use explicit expressions or automatic differentiation for the Hessian ∇²C(θ). In such a case, one may use quasi-Newton methods that only approximate the Hessian via gradients. Such application of second-order methods to deep learning
the Hessian via gradients. Such application of second-order methods to deep learning
is a contemporary active area of research which we are not able to cover fully. Instead,
our presentation builds up towards a popular method called the limited-memory BFGS
(Broyden–Fletcher–Goldfarb–Shanno) algorithm, or L-BFGS for short. With this we aim to
open up horizons for the reader to the multitude of optimization technique variations that
one may consider.
The Univariate Case
Newton's method is an iterative method that uses the quadratic approximation, also called the second-order Taylor approximation (??), of the objective function around the current point, and finds the next step by optimizing the quadratic function. For ease of understanding, first consider the univariate case where θ ∈ ℝ. Assume that the objective function C(θ) is twice differentiable. Suppose we are at θ^(t) in the t-th iteration and let Q^(t)(θ) be the quadratic approximation of the objective function C(θ) around θ^(t), given by

    C(θ) ≈ Q^(t)(θ) = C(θ^(t)) + (θ − θ^(t)) C′(θ^(t)) + ((θ − θ^(t))² / 2) C″(θ^(t)),

where C′(·) and C″(·) are the first and second derivatives respectively.
Note that Q^(t)(θ) depends on θ^(t). Further observe that C(θ) and Q^(t)(θ) take the same value when θ = θ^(t). To find the next step, θ^(t+1), we minimize Q^(t)(θ) by setting its derivative to zero and obtain,

    C′(θ^(t)) + (θ − θ^(t)) C″(θ^(t)) = 0,
with the corresponding solution (for the variable θ),

    θ^(t+1) = θ^(t) − C′(θ^(t)) / C″(θ^(t)),        (4.51)

whenever C″(θ^(t)) ≠ 0. Newton's method (for univariate function optimization) starts at some initial point θ^(0) and then iterates (4.51) until some specified stopping criterion is met. If converged to some point θ*, then C′(θ*) = 0, implying that the point is a local minimum, local maximum, or saddle point.²⁶
Any quadratic function is either strictly convex or strictly concave, depending on the leading coefficient. Therefore, if C(θ) is itself a strictly convex quadratic function, it has a unique global minimum, and since the third order derivative of C(θ) is zero, the quadratic approximation
²⁶ Note that when considered in the context of general root finding of the form F(θ) = 0, beyond the case of F(·) = C′(·), the method is sometimes called Newton–Raphson.
Q^(t)(θ) is identically equal to C(θ). In such a case, Newton's method finds the minimum in a single iteration of (4.51) for any initial point θ^(0). This property hints at the speed at which Newton's method can move towards the minimum, since any smooth function is well approximated by a quadratic function in the neighbourhood of a local minimum.
Figure 4.10: Illustration of updates in Newton's method. (a) The approximating quadratic function Q^(t)(θ) is concave. (b) The approximating quadratic function Q^(t)(θ) is convex.
While potentially very efficient, in its basic form Newton's method exhibits several instability issues. First observe that the method can end up in a local maximum even though the goal is minimization. This is made apparent in Figure 4.10, which illustrates single step updates with Newton's method on a function that has both a local maximum and a local minimum. In (a) a step is taken and it approaches (yet overshoots) the local maximum. In (b) a step is taken towards (and also overshoots) the local minimum. As is evident, the initial points θ^(t) for both (a) and (b) are near each other, yet the subtle difference in the sign of C″(θ^(t)) implies movement in completely different directions. In both cases, with another iteration the extremum point would have been nearly reached. That is, (a) would end up very near the local maximum, and similarly with another iteration (b) ends up very near the local minimum.
In general, the Newton update (4.51) is very sensitive to the second derivative, and if C″(θ^(t)) is very close to zero, then θ^(t+1) can be too far from θ^(t), and thus the quadratic approximation is not appropriate. That is, inflection points²⁷ on the θ ∈ ℝ axis cause havoc with Newton's method. Similarly, as we see in the sequel for the multivariate case, points θ ∈ ℝ^d where the Hessian ∇²C(θ) is nearly singular (eigenvalues near zero) are also problematic in the same way.
To gain insights into some of the issues that may occur with Newton's method, let us consider some examples. The sensitivity of the method can cause overshoot of the minimum. Consider for example,

    C(θ) = (θ − 3)² (θ + 3).        (4.52)
We have C′(θ) = 3(θ − 3)(θ + 1), C″(θ) = 6(θ − 1), and the function has a local minimum at θ* = 3 with C(θ*) = 0. The descent step resulting from (4.51) is thus,

    θ_s^(t) = − C′(θ^(t)) / C″(θ^(t)) = (1/2) (θ^(t) − 3)(θ^(t) + 1) / (1 − θ^(t)).
²⁷ An inflection point for a twice differentiable function C : ℝ → ℝ is a point θ where C″(θ) = 0.
Say we start at θ^(0) = 1.5, where C(θ^(0)) = 10.125. At this point the descent step is θ_s^(0) = 3.75, implying that θ^(1) = θ^(0) + θ_s^(0) = 5.25. This overshoot of the local minimum yields C(θ^(1)) ≈ 41.77, which is about a four-fold increase in the loss value. Note though that in this case (as the reader may readily verify via simple calculations), the iterations that follow (presented to 4 decimal points precision) are,

    θ^(2) = 3.5956,    θ^(3) = 3.0683,    θ^(4) = 3.0011,

and in fact the error for θ^(6) is already less than 10⁻¹³. Thus we see in this example that there is an initial significant overshoot which is later corrected and followed by very quick convergence.
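The iterates above are easy to reproduce with a few lines of code implementing (4.51) for this specific function:

```python
def newton_univariate(theta, n_steps):
    # Iterates (4.51) for C(theta) = (theta - 3)^2 (theta + 3), where
    # C'(theta) = 3(theta - 3)(theta + 1) and C''(theta) = 6(theta - 1).
    trajectory = [theta]
    for _ in range(n_steps):
        theta = theta - (3 * (theta - 3) * (theta + 1)) / (6 * (theta - 1))
        trajectory.append(theta)
    return trajectory

traj = newton_univariate(1.5, 6)   # traj[1] is the overshoot to 5.25
```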
Worse than an overshoot is a situation which results in moving away from the local minimum
of interest. We have already seen an example of this in Figure 4.10 (a) where starting to the
left of the inflection point yields convergence to a local maximum. Further, for the example
function (4.52), this will occur if θ^(0) < 1.
An additional phenomenon that may occur is oscillation. As an extreme example, consider,

    C(θ) = (1/4) θ⁴ − (5/2) θ²,    which implies    θ_s^(t) = θ^(t) ((θ^(t))² − 5) / (5 − 3 (θ^(t))²).

If θ^(t) = 1, then we get θ_s^(t) = −2 and thus θ^(t+1) = −1. Further, in the next step we get θ_s^(t+1) = 2, which brings us back to θ^(t+2) = 1. Such undesirable perfect oscillation is a singular phenomenon and is not likely to occur in practice. However, approximate oscillation may sometimes occur and persist for multiple iterations until dampening.
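This perfect oscillation is straightforward to reproduce; the iterates alternate exactly between −1 and 1:

```python
theta = 1.0
iterates = []
for _ in range(6):
    # Newton step for C(theta) = theta^4/4 - (5/2) theta^2:
    # C'(theta) = theta^3 - 5 theta, C''(theta) = 3 theta^2 - 5.
    theta = theta - (theta ** 3 - 5 * theta) / (3 * theta ** 2 - 5)
    iterates.append(theta)
# iterates is [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
```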
In practice, and especially when considering multi-dimensional generalizations, trust region
methods can mitigate problems such as overshoot, movement in the wrong direction, or
oscillations. These methods incorporate conditions which disallow steps of large step size
beyond some predefined or adaptive thresholds. Trust region methods also incorporate
first-order methods as a backup. We do not cover the details further.
A final concern with Newton's method is that it requires both the first derivative C′(·) and the second derivative C″(·). In some scenarios these derivatives are easy to obtain, yet in some other cases the univariate objective may be a complicated function and even the application of automatic differentiation techniques may be inadequate for evaluating C″(·). This motivates the introduction of the secant method, which is a modification of Newton's method where the second derivative in each iteration is approximated using the last two iterations. Namely, we use the approximation,
    C″(θ^(t)) ≈ ( C′(θ^(t)) − C′(θ^(t−1)) ) / ( θ^(t) − θ^(t−1) ),        (4.53)
which becomes precise when θ^(t) and θ^(t−1) are not too far from each other. With this approximation, updates of the secant method are given by,

    θ^(t+1) = θ^(t) − ( (θ^(t) − θ^(t−1)) / (C′(θ^(t)) − C′(θ^(t−1))) ) C′(θ^(t)),
where we initialize θ^(0) and θ^(1) to be close but not equal. In general, the secant method may require more iterations for convergence than Newton's method. The study of general numerical analysis and optimization theory contains an in-depth analysis of such convergence rates. This is a topic that we do not cover further.
The secant method also suffers from the problems associated with Newton's method mentioned above, including overshoot, moving in the wrong direction, and oscillation. Nevertheless, like Newton's method, the secant method is generally very powerful and arrives at a local minimum as long as the initial points are within the vicinity of the minimum.
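A sketch of the secant iteration, applied to the same example function C(θ) = (θ − 3)²(θ + 3) used above for Newton's method; the two starting points and the stopping guard are our own choices.

```python
def secant_minimize(dC, theta_prev, theta, n_steps):
    # Newton's update (4.51) with C'' replaced by the approximation (4.53);
    # dC is the first derivative C'.
    for _ in range(n_steps):
        denom = dC(theta) - dC(theta_prev)
        if denom == 0:
            break   # converged (or stalled): avoid dividing by zero
        theta_prev, theta = theta, theta - (theta - theta_prev) / denom * dC(theta)
    return theta

# C(theta) = (theta - 3)^2 (theta + 3), with C'(theta) = 3(theta - 3)(theta + 1);
# the local minimum is at theta = 3, and we start nearby at 2.4 and 2.5.
dC = lambda th: 3 * (th - 3) * (th + 1)
theta_min = secant_minimize(dC, 2.4, 2.5, 25)
```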
Importantly, note that the secant method can be viewed as a quasi-Newton method since it
approximates the Hessian (second derivative of a scalar function) with an expression that
only depends on the first derivative. As we now generalize to multivariate optimization
we should keep in mind that in the context of deep learning, almost any second-order
method that one may wish to use should be a quasi-Newton method based on first-order
derivatives (gradients), as opposed to explicit or automatically differentiated Hessians (second
derivatives). Nevertheless, let us first study Newton’s method for multivariate functions.
The Multivariate Case and Hessians
For a twice differentiable multivariate objective function C : ℝ^d → ℝ, the update of Newton's method is very similar to the univariate case of (4.51). Specifically, the gradient and the Hessian play the roles of the first and second-order derivatives, respectively. We can represent this update as,

    θ^(t+1) = θ^(t) − H_t⁻¹ ∇C(θ^(t)),        (4.54)
where H_t = ∇²C(θ^(t)) is the Hessian of the objective function at θ^(t), and where we assume H_t is non-singular; compare (4.54) with (4.51).
To see the development of (4.54), as in the univariate case, consider the quadratic approximation Q^(t)(θ) of C(θ) around θ^(t), given by

    C(θ) ≈ Q^(t)(θ) = C(θ^(t)) + (θ − θ^(t))ᵀ ∇C(θ^(t)) + (1/2) (θ − θ^(t))ᵀ H_t (θ − θ^(t)).        (4.55)
Then, similar to the univariate case, the next point θ^(t+1) can be determined as the θ such that the gradient of Q^(t)(θ) is zero. Observe that this gradient is,

    ∇Q^(t)(θ) = ∇C(θ^(t)) + H_t (θ − θ^(t)),

and by equating ∇Q^(t)(θ) to 0 and representing θ − θ^(t) as the descent step θ_s^(t), we obtain,

    H_t θ_s^(t) = −∇C(θ^(t)).        (4.56)
This linear system of equations for θ_s^(t) can be represented in terms of the elegant update equation (4.54) with descent step θ_s^(t) = −H_t⁻¹ ∇C(θ^(t)), in case H_t is non-singular. However, in practice, we typically try to solve (4.56) using a suitable linear solver.²⁸ Sometimes
²⁸ Suitability of an algorithm depends on the size of d and any special structure in H_t that we can expect. For example, for small d one may use LU decomposition (Gaussian elimination) or QR factorization based methods, yet for large d conjugate gradient or other Krylov subspace methods may be suitable.
approximate solutions for (4.56) are also an option. For example, one may use the conjugate gradient method when d is large and run it for a small number of iterations to only approximately solve (4.56).
In summary, Newton's algorithm for multivariate C(·) starts with some θ^(0) ∈ ℝ^d. Then at each iteration t in the algorithm with current point θ^(t), we (in principle) evaluate the gradient and Hessian at that point. We then solve or approximately solve (4.56) to obtain θ_s^(t) for the update θ^(t+1) = θ^(t) + θ_s^(t).
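A sketch of this loop follows, with one multivariate Newton iteration solving (4.56) via a generic dense linear solver. The objective and its closed-form gradient and Hessian are illustrative choices of ours.

```python
import numpy as np

def newton_step(grad, hess, theta):
    # Solve (4.56) for the descent step, then update the parameters.
    step = np.linalg.solve(hess(theta), -grad(theta))
    return theta + step

# Illustrative objective C(theta) = (theta_1 - 1)^4 + (theta_1 - 1)^2 + theta_2^2,
# which has its unique minimum at (1, 0).
grad = lambda th: np.array([4 * (th[0] - 1) ** 3 + 2 * (th[0] - 1), 2 * th[1]])
hess = lambda th: np.array([[12 * (th[0] - 1) ** 2 + 2, 0.0], [0.0, 2.0]])

theta = np.array([3.0, 2.0])
for _ in range(20):
    theta = newton_step(grad, hess, theta)
```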
In principle, in each iteration, solving (4.56) requires the Hessian matrix H_t = ∇²C(θ^(t)), either via a closed form formula as for example in the case of logistic regression (4.7), or more generally via automatic differentiation. However, with large d, computing the Hessian or Hessian vector products via automatic differentiation is generally not feasible. Hence, in general, Newton's method on its own is not suitable for deep learning. Nevertheless, quasi-Newton methods such as the L-BFGS method, presented below, or various adaptations can be employed.
We also mention that all of the potential problems discussed above for the univariate case can appear in the multivariate case. Overshoot, moving in the wrong direction, or oscillation can occur for reasons similar to the univariate case. Specifically, at points θ^(t) that are far from the local minimum, the quadratic approximation Q^(t)(θ) may be far from C(θ). Further, there are problems that may occur due to a singular, near-singular, or ill conditioned Hessian matrix H_t. In this case the descent step solution to (4.56) may be very noisy.²⁹ Nevertheless, Newton's method is very powerful, and often, when improved via trust region methods and similar techniques, it substantially outperforms first-order methods.
To build intuition for why Newton's method often outperforms first-order methods, let us first explore a relationship between the two approaches. If we use the constant diagonal matrix (1/α)I in place of the Hessian H_t, then Newton's method reduces to the basic gradient descent method with update (??), where the descent step is θ_s^(t) = −α∇C(θ). To see this, return to the quadratic approximation (4.55) with H_t replaced by (1/α)I, to obtain,
    Q̃^(t)(θ) = C(θ^(t)) + (θ − θ^(t))ᵀ ∇C(θ^(t)) + (1/(2α)) (θ − θ^(t))ᵀ (θ − θ^(t)),        (4.57)

and thus,

    ∇Q̃^(t)(θ) = ∇C(θ^(t)) + (1/α) (θ − θ^(t)).
Equating this gradient expression to 0 yields the update (??). Hence, we notice that the inverse of H_t in Newton's method plays the role of the learning rate α of basic gradient descent. As a consequence, with this basic form of Newton's method, there is no need to calibrate the learning rate, as we do in most gradient descent approaches.³⁰
Finally, let us get intuition for why Newton's method generally converges to a (local) minimum θ* faster than gradient descent when starting at a point close to θ*. For this, return to Figure ?? which compares two loss landscapes, one of which is circular and one of which is elliptical. While that figure is created via a quadratic form where Newton's method would
²⁹ In practice one often monitors the level of ill-conditioning via the condition number of H_t, or similar measures.
³⁰ Note that Newton's method can also be cast as a steepest descent method like basic gradient descent, but with descent direction chosen with respect to the Hessian norm, ∥u∥_{H_t} = √(uᵀ H_t u), u ∈ ℝ^d.
reach the minimum in a single iteration, in more realistic cases when functions are not quadratic forms, they can still be locally well approximated by quadratic forms in the vicinity of θ*, and hence considering Figure ?? as a general plot is valid.
It is known that with gradient descent, highly elliptical loss landscapes are difficult for optimization since descent steps may be trapped in valleys, and thus, as argued in Section ?? in view of Figure ??, one often tries to carry out variable transformations to alleviate such problems. In a sense, the gradient descent quadratic approximation Q̃^(t)(·) in (4.57) assumes that the loss landscape is already perfectly circular, since it uses a constant diagonal matrix in place of the Hessian matrix. This is the best that basic gradient descent can offer. However, when using Newton's method, the correct local quadratic approximation Q^(t)(·) in (4.55) already adapts to the curvature and potential true elliptic nature of the loss landscape. With such an approximation Q^(t)(·), minimization over quadratic loss landscapes occurs in a single iteration, and since locally all smooth functions are well approximated by quadratic loss landscapes, when θ^(t) is in the vicinity of θ*, convergence is very quick.
Quasi-Newton Methods
In deep learning, the dimension of θ is very large and it is thus difficult to solve the linear equations (4.56) as required in each iteration of Newton's method. Even in dimensions where these equations are solvable, there is still the basic problem of computing the Hessian matrix, a feat which is typically intractable for mid-size problems and beyond. A popular alternative is offered by the use of quasi-Newton methods, where either the Hessian or the inverse of the Hessian matrix is approximated with each iteration. While the study of quasi-Newton methods is mature, and these are used in many areas of applied mathematics and engineering, the development of best practices for application in deep learning is still an active area of research. Here we simply outline the key ideas of such methods.
The central idea in the type of quasi-Newton methods that we cover is based on an evolving sequence of d × d matrices, B₀, B₁, B₂, . . .. At each iteration, some update rule specifies how B_t is updated based on B_{t−1} and other quantities from the current and previous iteration. By the design of the update rule, each such B_t is kept positive definite (and hence also non-singular), and in our presentation, B_t⁻¹ is ideally a good approximation³¹ for the Hessian matrix H_t = ∇²C(θ^(t)) associated with the current point θ^(t) of iteration t.
With such a B_t matrix available, we now use the quadratic approximation,

    Q^(t)(θ) = C(θ^(t)) + (θ − θ^(t))ᵀ ∇C(θ^(t)) + (1/2) (θ − θ^(t))ᵀ B_t⁻¹ (θ − θ^(t)),        (4.58)
which is similar to Newton's method quadratic approximation (4.55), with the difference that now B_t⁻¹ plays the role of the Hessian. Similarly to the case in Newton's method, Q^(t)(θ) of (4.58) can be minimized, where now the unique minimizer is θ^(t) − B_t ∇C(θ^(t)). Hence, using such a quadratic approximation, we have a descent direction,

    θ_d^(t) = −B_t ∇C(θ^(t)).        (4.59)
With quasi-Newton methods, we take a step in the direction specified by $\theta_d^{(t)}$, but we also
scale the step with a scalar $\alpha^{(t)} > 0$ which controls the step size. Hence the key update rule
$^{31}$ We note that alternative presentations may use $B_t$ as an approximation for the Hessian itself.
4.6 Concepts of Second-Order Methods
for quasi-Newton methods can be stated as,
$$\theta^{(t+1)} = \theta^{(t)} - \alpha^{(t)} B_t \nabla C(\theta^{(t)}), \qquad (4.60)$$
which is similar to (4.54). The determination of $\alpha^{(t)}$ is typically obtained via a line search
technique such as inexact line search using the second Wolfe condition (4.49).
Algorithm 4.3: Quasi-Newton
Input: Dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$, objective function $C(\cdot) = C(\cdot\,;\mathcal{D})$, and initial parameter vector $\theta_{\text{init}}$
Output: Approximately optimal $\theta$
1: $\theta \leftarrow \theta_{\text{init}}$
2: $B \leftarrow B_{\text{init}}$
3: repeat
4: $\quad g \leftarrow \nabla C(\theta)$ (Compute gradient)
5: $\quad \theta_d \leftarrow -Bg$ (Determine descent direction)
6: $\quad$ Determine $\alpha$ using $\theta$ and $\theta_d$ (Line search)
7: $\quad \theta \leftarrow \theta + \alpha\theta_d$ (Update parameters)
8: $\quad$ Update $B$ (Update approximation of $\nabla^2 C(\theta)^{-1}$)
9: until termination condition is satisfied
10: return $\theta$
The typical quasi-Newton method is presented in Algorithm 4.3. As in any gradient based
algorithm, computation of the gradient is needed, and this is specified in Step 4. The
computation of the descent direction (4.59) is in Step 5. In Step 6 we use line search, and
Step 7 updates the current parameter based on (4.60). Finally, Step 8 is the update rule,
specific to the type of quasi-Newton method used. In the sequel we present BFGS and
L-BFGS, each with their own specification of this step.
As this method falls within the realm of a descent direction approach as in Algorithm 4.1,
ideally we wish to set $\theta_{\text{init}}$ in Step 1 at a point close to a minimum, just like all the other
descent direction approaches in this chapter. As for the initialization matrix $B_{\text{init}}$ in Step 2,
ideally we would have a matrix whose inverse is the Hessian at $\theta_{\text{init}}$. However, since evaluation
of the Hessian matrix or its inverse is typically computationally costly or intractable, we
often initialize with $B_{\text{init}} = I$, the $d \times d$ identity matrix.
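Putting the steps together, Algorithm 4.3 can be sketched as follows. This is a minimal numpy illustration that we add for concreteness: the names are our own, a simple backtracking (Armijo-style) line search stands in for a Wolfe-condition search, and the Step 8 update rule is left as a pluggable `update_B` argument.

```python
import numpy as np

def quasi_newton(C, grad_C, theta_init, update_B, max_iter=100, tol=1e-8):
    """Skeleton of Algorithm 4.3 with a basic backtracking line search."""
    theta = np.asarray(theta_init, dtype=float)
    B = np.eye(len(theta))                        # Step 2: B_init = I
    for _ in range(max_iter):
        g = grad_C(theta)                         # Step 4: gradient
        if np.linalg.norm(g) < tol:               # termination condition
            break
        theta_d = -B @ g                          # Step 5: descent direction
        alpha = 1.0                               # Step 6: backtracking search
        while (C(theta + alpha * theta_d)
               > C(theta) + 1e-4 * alpha * (g @ theta_d)) and alpha > 1e-10:
            alpha *= 0.5
        theta_new = theta + alpha * theta_d       # Step 7: parameter update
        B = update_B(B, theta_new - theta, grad_C(theta_new) - g)  # Step 8
        theta = theta_new
    return theta
```

With `update_B` returning `B` unchanged, this reduces to gradient descent with backtracking; the BFGS and L-BFGS rules presented below slot into the same place.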
Now with the general outline of the quasi-Newton approach presented, one may seek good
update rules for Step 8. Before we see concrete formulas for these updates within the BFGS
and L-BFGS algorithms, let us touch on theoretical aspects that helped develop these
methods, and can be used for further analysis of their performance and properties.
Since the Hessian $H_t$ captures the curvature information of the objective function, we may
consider the latest two points $\theta^{(t-1)}$ and $\theta^{(t)}$ together with the gradients at these points,
$\nabla C(\theta^{(t-1)})$ and $\nabla C(\theta^{(t)})$, for the approximate equality,
$$H_t(\theta^{(t)} - \theta^{(t-1)}) \approx \nabla C(\theta^{(t)}) - \nabla C(\theta^{(t-1)}). \qquad (4.61)$$
4 Optimization Algorithms
This approximation follows from (??) of Appendix ??, and is a good approximation when
$\theta^{(t)}$ and $\theta^{(t-1)}$ are close. It generalizes (4.53) used in the univariate secant method. Note
that (4.61) holds with equality if $C(\theta)$ is a quadratic function. Further, with the notation,
$$\theta_s^{(t-1)} = \theta^{(t)} - \theta^{(t-1)}, \qquad \text{and} \qquad g_s^{(t-1)} = \nabla C(\theta^{(t)}) - \nabla C(\theta^{(t-1)}), \qquad (4.62)$$
we have that (4.61) is represented as $H_t \theta_s^{(t-1)} \approx g_s^{(t-1)}$. Now inspired by this approximation,
if we are able to select the $d \times d$ positive definite matrix $B_t$ such that $B_t^{-1}$ is an approximation
of $H_t$, then by considering the approximation as an equality,
$$\theta_s^{(t-1)} = B_t\, g_s^{(t-1)}. \qquad (4.63)$$
The equation (4.63) is known as the secant equation and it implies that $B_t$ maps the change of
gradients $g_s^{(t-1)}$ into the descent direction $\theta_s^{(t-1)}$. A condition related to the secant equation
is,
$$g_s^{(t-1)\top}\, \theta_s^{(t-1)} > 0, \qquad (4.64)$$
which is known as the curvature condition. Importantly, it can be shown that when the
curvature condition is satisfied, there is always a positive definite $B_t$ such that the secant
equation is satisfied.$^{32}$ In fact, there can be infinitely many solutions. In general, quasi-Newton
methods are designed to maintain the curvature condition (4.64) and to provide a
unique update rule for $B_t$.
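Before moving on, the secant relationship and the curvature condition are easy to check numerically. The following short sketch (our illustration, not from the text) verifies that for a quadratic objective the approximation (4.61) holds with equality, and that (4.64) holds whenever the Hessian is positive definite:

```python
import numpy as np

rng = np.random.default_rng(0)
# A positive definite Hessian for the quadratic C(theta) = 0.5 * theta^T H theta
M = rng.standard_normal((3, 3))
H = M @ M.T + 3 * np.eye(3)
grad = lambda theta: H @ theta               # gradient of the quadratic

theta_prev = rng.standard_normal(3)
theta_curr = rng.standard_normal(3)
theta_s = theta_curr - theta_prev            # step, as in (4.62)
g_s = grad(theta_curr) - grad(theta_prev)    # gradient change, as in (4.62)

# (4.61) holds with equality for a quadratic: H theta_s == g_s
assert np.allclose(H @ theta_s, g_s)
# Curvature condition (4.64): g_s^T theta_s = theta_s^T H theta_s > 0
assert g_s @ theta_s > 0
```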
The curvature condition (4.64) also allows us to gain insight into why the line search in
Step 6 of Algorithm 4.3 is needed. In simple or synthetic cases where the loss function $C(\cdot)$
is strictly convex, it can be shown that the curvature condition (4.64) is always satisfied.
However, loss functions in deep learning are typically highly non-convex, and thus there is
no guarantee of existence of $B_t$ such that the secant equation is satisfied, unless we impose
an additional condition on $\alpha^{(t)}$ of (4.60).
If in each iteration $t$ we determine $\alpha^{(t)}$ using inexact line search with the second Wolfe
condition (4.49), then the curvature condition (4.64) is guaranteed to hold.$^{33}$ To see this,
suppose that the second Wolfe condition is applied. Then from $\theta^{(t)} = \theta^{(t-1)} + \alpha^{(t-1)}\theta_d^{(t-1)}$,
we obtain
$$\nabla C(\theta^{(t)})^\top \theta_d^{(t-1)} \ge \gamma\, \nabla C(\theta^{(t-1)})^\top \theta_d^{(t-1)}, \qquad (4.65)$$
where we used the relationship between the gradient and the directional derivative shown in
(??) of Appendix ??. Now the inner product of the curvature condition (4.64) satisfies,
$$\begin{aligned}
g_s^{(t-1)\top}\, \theta_s^{(t-1)} &= \nabla C(\theta^{(t)})^\top \theta_s^{(t-1)} - \nabla C(\theta^{(t-1)})^\top \theta_s^{(t-1)} \\
&\ge \gamma\,\alpha^{(t-1)}\, \nabla C(\theta^{(t-1)})^\top \theta_d^{(t-1)} - \nabla C(\theta^{(t-1)})^\top \theta_s^{(t-1)} \\
&= (\gamma - 1)\,\alpha^{(t-1)}\, \nabla C(\theta^{(t-1)})^\top \theta_d^{(t-1)} \\
&= (1 - \gamma)\,\alpha^{(t-1)}\, \nabla C(\theta^{(t-1)})^\top B_{t-1}\, \nabla C(\theta^{(t-1)}) \\
&> 0.
\end{aligned}$$
$^{32}$ It can be shown that if $u, v \in \mathbb{R}^d$ satisfy $u^\top v > 0$, then there exists a symmetric positive definite matrix $B \in \mathbb{R}^{d \times d}$ such that $Bu = v$.
$^{33}$ This also holds for the strong Wolfe condition (4.50).
The first equality uses the definition of $g_s^{(t-1)}$ and the step resulting from (4.60),
$\theta_s^{(t-1)} = \alpha^{(t-1)}\theta_d^{(t-1)}$. The inequality that follows is due to (4.65). Finally, the last inequality
holds because $\gamma < 1$ and $B_{t-1}$ is assumed to be positive definite based on the update rule. We
thus see that the second Wolfe condition is a means to ensure that the curvature condition
is satisfied.
The BFGS and L-BFGS Update Rules
Let us now focus on the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm and its
variant, the limited-memory BFGS (L-BFGS) algorithm. In practice, some small deep
learning applications already make efficient use of L-BFGS. We mention that one common
approach in training is to first carry out training using standard gradient descent or a
variant, and then apply L-BFGS to “complete” the training process. Nevertheless, to date,
the application of L-BFGS and other variants of advanced quasi-Newton methods is still not
widespread in large scale deep learning.
We present the update rules for these two algorithms, which specify Step 8 of Algorithm 4.3,
keeping in mind that $B_0 = B_{\text{init}}$. Each quasi-Newton iteration of Algorithm 4.3 already bears
the computational cost of computing the gradient, the cost of several function evaluations
within the line search, and other costs associated with determining the descent direction and
updating the parameters. The computational cost of the update rule in Step 8 is generally
heavy. We contrast the computational cost of the BFGS update rule and that of the L-BFGS
update rule below.
The BFGS update rule is,
$$B_t = V_{t-1}^\top B_{t-1} V_{t-1} + \rho^{(t-1)}\, \theta_s^{(t-1)} \theta_s^{(t-1)\top}, \qquad (4.66)$$
where the positive scalar $\rho^{(t-1)}$ and the matrix $V_{t-1}$ are respectively given by,
$$\rho^{(t-1)} = \frac{1}{g_s^{(t-1)\top}\, \theta_s^{(t-1)}} \qquad \text{and} \qquad V_{t-1} = I - \rho^{(t-1)}\, g_s^{(t-1)} \theta_s^{(t-1)\top}. \qquad (4.67)$$
Observe the use of the rank-one matrices $\theta_s^{(t-1)} \theta_s^{(t-1)\top}$ and $g_s^{(t-1)} \theta_s^{(t-1)\top}$ based on (4.62). The
positivity of $\rho^{(t-1)}$ is maintained by the curvature condition (4.64). Importantly, it can
be shown that each matrix in the sequence $B_0, B_1, \ldots$ resulting from this update rule is
positive definite. We omit the details.
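In code, (4.66) and (4.67) are a few lines. The following numpy sketch is our illustration, with `s` and `y` denoting $\theta_s^{(t-1)}$ and $g_s^{(t-1)}$; it also checks that the updated matrix satisfies the secant equation and remains positive definite:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update (4.66)-(4.67) for the inverse-Hessian approximation B.

    Here s stands for the step theta_s^(t-1) and y for the gradient
    change g_s^(t-1); the curvature condition y @ s > 0 is assumed.
    """
    rho = 1.0 / (y @ s)                        # rho^(t-1) of (4.67)
    V = np.eye(len(s)) - rho * np.outer(y, s)  # V_{t-1} of (4.67)
    return V.T @ B @ V + rho * np.outer(s, s)  # (4.66)

# Sanity check: after one update the secant equation (4.63) holds exactly,
# and positive definiteness is preserved under the curvature condition.
s = np.array([1.0, 0.0, 1.0])
y = np.array([0.5, 0.2, 0.8])                  # y @ s = 1.3 > 0, so (4.64) holds
B = bfgs_update(np.eye(3), s, y)
assert np.allclose(B @ y, s)                   # secant equation (4.63)
assert np.all(np.linalg.eigvalsh(B) > 0)       # B stays positive definite
```

The secant property $B_t\, g_s^{(t-1)} = \theta_s^{(t-1)}$ holds by construction, since $V_{t-1}\, g_s^{(t-1)} = 0$.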
Note that due to the matrix multiplication $V_{t-1}^\top B_{t-1} V_{t-1}$, the computational cost of
updating $B_t$ using (4.66) is $O(d^2)$. This is a reasonable cost for small $d$, yet for larger $d$ as
in some deep learning applications, it renders BFGS too costly.
We mention that the development of the update rule in (4.66) and (4.67) can be cast as a
minimization problem where we seek the closest possible positive definite $B_t$ to the previous
$B_{t-1}$ under a specific matrix norm. The minimization is subject to the constraint that the
secant equation (4.63) holds. We omit the details of this derivation yet mention that the
matrix norm used for BFGS is a weighted Frobenius norm. See the notes and references at
the end of the chapter for more information.
The L-BFGS algorithm overcomes the computational cost of BFGS. Its update rule does not
maintain the matrix $B_t$ in memory directly, but is rather based on keeping a limited
history (or limited memory) of the constituent vectors $\theta_s^{(t)}$ and $g_s^{(t)}$. Observe that in principle,
one can unroll the BFGS update rules (4.66) and (4.67) to obtain,
$$\begin{aligned}
B_t ={}& V_{t-1}^\top V_{t-2}^\top \cdots V_0^\top\, B_{\text{init}}\, V_0 \cdots V_{t-2} V_{t-1} \\
&+ \rho^{(t-1)}\, \theta_s^{(t-1)} \theta_s^{(t-1)\top} \\
&+ \rho^{(t-2)}\, V_{t-1}^\top\, \theta_s^{(t-2)} \theta_s^{(t-2)\top}\, V_{t-1} \\
&\;\;\vdots \\
&+ \rho^{(0)}\, V_{t-1}^\top V_{t-2}^\top \cdots V_1^\top\, \theta_s^{(0)} \theta_s^{(0)\top}\, V_1 \cdots V_{t-2} V_{t-1}.
\end{aligned} \qquad (4.68)$$
Further observe that this representation of $B_t$ is a function of the sequence $\{\theta_s^{(k)}, g_s^{(k)}\}$ for
$k = 0, 1, \ldots, t-1$, and of $B_{\text{init}}$. Hence in principle, assuming that $B_{\text{init}}$ is diagonal, if we wish to
compute the matrix-vector product $B_t \nabla C(\theta^{(t)})$, we can do so with all multiplications being
vector-vector multiplications. This is due to the fact that each $V_k$ matrix is composed of the
outer product $g_s^{(k)} \theta_s^{(k)\top}$. Yet obviously, for non-small $t$ such multiplications become
inefficient as the number of terms in the expression (4.68) grows.
The L-BFGS idea is to use an approximate version $\widetilde{B}_t$ of $B_t$ that can be stored indirectly
by storing only the latest $m$ pairs of vectors $\{g_s^{(k)}, \theta_s^{(k)}\}$, for $k = t-m, \ldots, t-1$. The
hyper-parameter $m$ is the length of the history (or memory) and is typically chosen to
be much smaller than $d$. We modify (4.68) with an approximation which limits the history
used to the last $m$ iterations. For this we drop the last $t-m$ terms of (4.68), and
in the first term $V_{t-1}^\top V_{t-2}^\top \cdots V_0^\top\, B_{\text{init}}\, V_0 \cdots V_{t-2} V_{t-1}$, we replace the middle multiplicands,
$V_{t-m-1}^\top \cdots V_0^\top\, B_{\text{init}}\, V_0 \cdots V_{t-m-1}$, with a symmetric positive definite matrix $B_{\text{init}}^{(t)}$. This
modification yields $\widetilde{B}_t \approx B_t$, with the form,
$$\begin{aligned}
\widetilde{B}_t ={}& V_{t-1}^\top V_{t-2}^\top \cdots V_{t-m}^\top\, B_{\text{init}}^{(t)}\, V_{t-m} \cdots V_{t-2} V_{t-1} \\
&+ \rho^{(t-1)}\, \theta_s^{(t-1)} \theta_s^{(t-1)\top} \\
&+ \rho^{(t-2)}\, V_{t-1}^\top\, \theta_s^{(t-2)} \theta_s^{(t-2)\top}\, V_{t-1} \\
&\;\;\vdots \\
&+ \rho^{(t-m)}\, V_{t-1}^\top V_{t-2}^\top \cdots V_{t-m+1}^\top\, \theta_s^{(t-m)} \theta_s^{(t-m)\top}\, V_{t-m+1} \cdots V_{t-2} V_{t-1}.
\end{aligned} \qquad (4.69)$$
Note that $B_{\text{init}}^{(t)}$ is selected for each iteration $t$ and is usually set as a diagonal matrix to
reduce the storage and computational costs. A practically effective choice is
$$B_{\text{init}}^{(t)} = \frac{g_s^{(t-1)\top}\, \theta_s^{(t-1)}}{g_s^{(t-1)\top}\, g_s^{(t-1)}}\, I.$$
With this choice for $B_{\text{init}}^{(t)}$, the expression (4.69) allows us to compute $\widetilde{B}_t \nabla C(\theta^{(t)})$ efficiently. In
particular, $\widetilde{B}_t \nabla C(\theta^{(t)})$ can be computed using $O(m)$ inner products between $d$-dimensional
vectors, and thus the per-iteration computational cost for the L-BFGS algorithm is $O(md)$,
while requiring to store only the latest $m$ pairs of vectors $\{g_s^{(k)}, \theta_s^{(k)}\}$, $k = t-m, \ldots, t-1$,
which takes $O(md)$ storage space. With $m$ much smaller than $d$, the L-BFGS algorithm
requires far less storage than the BFGS algorithm. Finally, we note that in terms of Algorithm 4.3,
the update rule in Step 8 simply needs to push the vectors $\{g_s^{(t)}, \theta_s^{(t)}\}$ onto a queue
while popping the vectors $\{g_s^{(t-m)}, \theta_s^{(t-m)}\}$ out of that queue. The bulk of the algorithm is
in Step 5 with the efficient matrix-vector multiplication using the expressions (4.69) and
(4.67).
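In practice, the product $\widetilde{B}_t \nabla C(\theta^{(t)})$ is computed without ever forming (4.69) explicitly, via what is known as the two-loop recursion. The following numpy sketch is our illustration of this standard computation; `s_hist` and `y_hist` hold the latest $m$ pairs $\theta_s^{(k)}$ and $g_s^{(k)}$ (oldest first), and $B_{\text{init}}^{(t)}$ is the scaled identity described above.

```python
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist):
    """Compute the product of (4.69) with grad via the two-loop recursion.

    s_hist and y_hist are lists of the latest m vectors theta_s^(k)
    and g_s^(k) (oldest first); the cost is O(m d) inner products and
    no d x d matrix is ever formed.
    """
    q = np.array(grad, dtype=float)
    alphas = []                                   # stored newest-first
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        a = (s @ q) / (y @ s)
        q -= a * y
        alphas.append(a)
    s_m, y_m = s_hist[-1], y_hist[-1]             # newest pair
    r = ((y_m @ s_m) / (y_m @ y_m)) * q           # apply B_init^(t) = gamma I
    for (s, y), a in zip(zip(s_hist, y_hist), reversed(alphas)):
        b = (y @ r) / (y @ s)
        r += (a - b) * s
    return r
```

The result agrees with applying the BFGS updates (4.66) and (4.67), starting from $B_{\text{init}}^{(t)}$, to the stored pairs in order. For the queue of Step 8, something like a `collections.deque(maxlen=m)` does the push-and-pop in one operation.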
Notes and References
Mathematical optimization is a classical subject and some of the general texts on this subject are
[9] where linear optimization is covered, [47] where both linear and non-linear optimization methods
are covered, and [8] where the focus is on non-linear methods. Further texts focusing on convexity,
convex analysis, and convex optimization are [5], [7], and [17]. Many of the algorithms in these
texts describe descent direction methods, yet some examples of optimization algorithms that are
not descent direction methods include direct methods (also known as zero-order or black box methods) such
as coordinate and pattern search methods; see [41] or [50] for an overview. The Rosenbrock function
was developed in [58] within the context of multivariate function optimization and has since
become popular for basic optimization test cases and illustrations.
As mentioned in the Chapter ?? notes and references, gradient descent is attributed to Cauchy
from 1847 with [20]; see [42] for a historical account. Stochastic gradient descent was initially
introduced in 1951 with [57] in the context of optimizing the expectation of a random variable
over a parametric family of distributions. See [36], [37], [38], [44], and [56] for early references
using stochastic gradient descent approaches. After the initial popularity of this approach for
general optimization, other numerical optimization methods were developed and stochastic gradient
descent's popularity decreased. Years later, it became relevant again due to developments in deep
learning; see [12] and [16] for early references in the context of deep learning. Interesting works that
attempt to explain the suitability of stochastic gradient descent for deep learning are [13], [14], and
[15]. Relationships between the use of mini-batches and stochastic gradient descent are studied in
[34], [35], and [45]. Practical recommendations for gradient based methods can be found in [6].
The ADAM algorithm was introduced in [39]. See [10] and [55] for additional papers that study its
performance. Ideas of momentum appeared in optimization early on in [54]. RMSprop appeared in
lectures by Geoff Hinton in a 2014 course and has since become popular. Adagrad appeared earlier
in [26]. An interesting empirical study considering ADAM and related methods is [21]. See [41]
and [59] for a detailed study of momentum based methods.
The basic idea of automatic differentiation dates back to the 1950s; see [4] and [51]. The idea
of computing the partial derivatives using forward mode automatic differentiation is attributed
to Wengert [63]. Even though ideas of backward mode automatic differentiation first appeared
around 1960 and were used in control theory by the end of that decade, the 1980s paper [62] is
generally credited with presenting backward mode automatic differentiation in the form we know
it now. Backward mode automatic differentiation for deep neural networks is referred to as the
backpropagation algorithm, a concept that we cover in detail in Chapter ??. In fact, ideas for the
backpropagation algorithm were developed independently of some of the automatic differentiation
ideas; see [31] for a historical overview as well as the notes and references at the end of Chapter ??
for more details.
In recent years, with the deep learning revolution, automatic differentiation has gained major
popularity and is a core component of deep learning software frameworks such as TensorFlow [1],
PyTorch [53], and others. In parallel, the application of automatic differentiation for a variety of
scientific applications has also been popularized with technologies such as Flux.jl [33] in Julia and
JAX [19] in Python. See [3] for a survey in the context of machine learning and [30] for a more dated
review. Additional differentiable programming frameworks that implement automatic differentiation
include CasADi, Autograd, AutoDiff, JAX, Zygote.jl, Enzyme, and Chainer. Much of the complexity
of differentiable programming software frameworks is centered around the use of GPUs as well as
the placement and management of data in GPU memory.
Nesterov momentum was introduced in [48]. See [60], [61], and [66] for some comparisons of Nesterov
momentum with the standard momentum approach. The Nadam algorithm was introduced in [25]
to incorporate Nesterov momentum into ADAM and is now part of some deep learning frameworks.
The Adadelta method appeared in [68] prior to the appearance and growth in popularity of ADAM.
The Adamax method appeared in the same paper as ADAM, [39], as an extension. Line search
techniques are central in classical first-order optimization theory. Inexact line search techniques
using Wolfe conditions first appeared in [64] and [65]. Backtracking line search, also known as
Armijo line search, first appeared in [2]. See for example chapter 3 of [50] and chapter 4 of [41] for
overviews of exact and inexact line search algorithms. Concepts of the conjugate gradient method
first appeared in [32] for solving linear equations with symmetric positive definite matrices.
Since then, the method was incorporated in many numerical algorithms; see for example chapter 11
of [29] and chapter 5 of [50] for more details.
Despite the success of second-order methods in statistics, in general numerical optimization, and in
engineering, they are yet to make it into mainstream large scale deep learning. Perhaps one reason
for this is that when implemented without care, they are much slower than first-order methods for
high-dimensional models such as deep neural networks. Nevertheless, their efficient incorporation in
deep learning is still an active area of research. The origins of Newton's method, also known as the
Newton–Raphson method, can be dated back to the 17th century; see chapter 2 of [40] for an account
of the history of optimization methods. For a detailed overview see [11].
Even though the secant method can be thought of as an approximation to Newton's method, from
a historical perspective it is several decades older than Newton's method; see for example [52] for
origins and an account of the evolution of the secant method. Quasi-Newton methods are more
modern, with the first technique, the Davidon–Fletcher–Powell (DFP) method, appearing in 1959 as
a technical report [23] and later published in 1991 as [24]. In the interim, the DFP method was
further modified in the early 1960s with [28]. The BFGS method materialized in the 1960s and
1970s, and the L-BFGS extension was introduced in 1980 with [49]. See also [27] for an overview of
these methods. See [22] and [46] for a discussion of convergence properties of BFGS and L-BFGS
respectively. Information about the weighted Frobenius norm used for constructing BFGS updates
is in chapter 6 of [50]. A related second-order method is the Levenberg–Marquardt algorithm; see
chapter 18 of [18] for an introduction. Trust region methods are believed to have originated from
[43]; see [67] for a survey. Finally, we suggest chapter 5 of [50] as a review of quasi-Newton methods.
Bibliography
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
[2] L. Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 1966.
[3] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research, 2018.
[4] L. M. Beda, L. N. Korolev, N. V. Sukkikh, and T. S. Frolova. Programs for automatic differentiation for the machine BESM (in Russian). Technical report, Institute for Precise Mechanics and Computation Techniques, Academy of Science, Moscow, USSR, 1959.
[5] A. Ben-Tal and A. Nemirovski. Lectures on modern convex optimization. SIAM, Philadelphia, PA; MPS, Philadelphia, PA, 2001.
[6] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade. 2012.
[7] D. Bertsekas, A. Nedić, and A. E. Ozdaglar. Convex analysis and optimization. Athena Scientific, 2003.
[8] D. P. Bertsekas. Nonlinear programming. Athena Scientific, third edition, 2016.
[9] D. Bertsimas and J. N. Tsitsiklis. Introduction to linear optimization. Athena Scientific, 1997.
[10] S. Bock and Weiß. A proof of local convergence for the Adam optimizer. In 2019 International Joint Conference on Neural Networks (IJCNN), 2019.
[11] J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. Numerical optimization: Theoretical and practical aspects. Springer Science & Business Media, 2006.
[12] L. Bottou. Online algorithms and stochastic approximations. Online Learning in Neural Networks, 1998.
[13] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010: 19th International Conference on Computational Statistics, 2010.
[14] L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade. 2012.
[15] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 2018.
[16] L. Bottou and Y. LeCun. Large scale online learning. In Advances in Neural Information Processing Systems, 2003.
[17] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[18] S. Boyd and L. Vandenberghe. Introduction to applied linear algebra: Vectors, matrices, and least squares. Cambridge University Press, 2018.
[19] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: Composable transformations of Python+NumPy programs. http://github.com/google/jax, 2018.
[20] A. Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris, 1847.
[21] D. Choi, C. J. Shallue, Z. Nado, J. Lee, C. J. Maddison, and G. E. Dahl. On empirical comparisons of optimizers for deep learning. arXiv:1910.05446, 2019.
[22] Y. H. Dai. Convergence properties of the BFGS algorithm. SIAM Journal on Optimization, 2002.
[23] W. C. Davidon. Variable metric method for minimization. Technical report, Argonne National Lab., Lemont, Ill., 1959.
[24] W. C. Davidon. Variable metric method for minimization. SIAM Journal on Optimization, 1991.
[25] T. Dozat. Incorporating Nesterov momentum into Adam. International Conference on Learning Representations (ICLR) Workshop, 2016.
[26] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.
[27] R. Fletcher. Practical methods of optimization. John Wiley & Sons, 2013.
[28] R. Fletcher and M. J. D. Powell. A rapidly convergent descent method for minimization. The Computer Journal, 1963.
[29] G. H. Golub and C. F. Van Loan. Matrix computations. JHU Press, 2013.
[30] A. Griewank and A. Walther. Evaluating derivatives: Principles and techniques of algorithmic differentiation. SIAM, 2008.
[31] R. Hecht-Nielsen. Theory of the backpropagation neural network. In Neural Networks for Perception. 1992.
[32] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 1952.
[33] M. Innes. Flux: Elegant machine learning with Julia. Journal of Open Source Software, 2018.
[34] S. Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. Advances in Neural Information Processing Systems, 2017.
[35] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
[36] J. Kiefer. Sequential minimax search for a maximum. Proceedings of the American Mathematical Society, 1953.
[37] J. Kiefer. Optimum experimental designs. Journal of the Royal Statistical Society: Series B (Methodological), 1959.
[38] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 1952.
[39] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
[40] S. Kiranyaz, T. Ince, and M. Gabbouj. Optimization techniques: An overview. Multidimensional Particle Swarm Optimization for Machine Learning and Pattern Recognition, 2014.
[41] M. J. Kochenderfer and T. A. Wheeler. Algorithms for optimization. MIT Press, 2019.
[42] C. Lemaréchal. Cauchy and the gradient method. Doc. Math. Extra, 2012.
[43] K. Levenberg. A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics, 1944.
[44] H. Levitt. Transformed up-down methods in psychoacoustics. The Journal of the Acoustical Society of America, 1971.
[45] T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi. Don't use large mini-batches, use local SGD. In International Conference on Learning Representations, 2020.
[46] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large-scale optimization. Mathematical Programming, 1989.
[47] D. G. Luenberger and Y. Ye. Linear and nonlinear programming. Springer, Cham, fifth edition, 2021.
[48] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate $O(1/k^2)$. Dokl. Akad. Nauk SSSR, 1983.
[49] J. Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 1980.
[50] J. Nocedal and S. J. Wright. Numerical optimization. Springer, 1999.
[51] J. F. Nolan. Analytical differentiation on a digital computer. PhD thesis, Massachusetts Institute of Technology, 1953.
[52] J. M. Papakonstantinou and R. A. Tapia. Origin and evolution of the secant method in one dimension. The American Mathematical Monthly, 2013.
[53] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
[54] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964.
[55] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. arXiv:1904.09237, 2019.
[56] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952.
[57] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.
[58] H. H. Rosenbrock. An automatic method for finding the greatest or least value of a function. The Computer Journal, 1960.
[59] S. Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747, 2016.
[60] S. Sarao Mannelli and P. Urbani. Analytical study of momentum-based acceleration methods in paradigmatic high-dimensional non-convex problems. Advances in Neural Information Processing Systems, 2021.
[61] M. Sarigül and M. Avci. Performance comparison of different momentum techniques on deep reinforcement learning. Journal of Information and Telecommunication, 2018.
[62] B. Speelpenning. Compiling fast partial derivatives of functions given by algorithms. University of Illinois at Urbana-Champaign, 1980.
[63] R. E. Wengert. A simple automatic derivative evaluation program. Communications of the ACM, 1964.
[64] P. Wolfe. Convergence conditions for ascent methods. SIAM Review, 1969.
[65] P. Wolfe. Convergence conditions for ascent methods. II: Some corrections. SIAM Review, 1971.
[66] T. Yang, Q. Lin, and Z. Li. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv:1604.03257, 2016.
[67] Y. Yuan. Recent advances in trust region algorithms. Mathematical Programming, 2015.
[68] M. D. Zeiler. Adadelta: An adaptive learning rate method. arXiv:1212.5701, 2012.