i
i
i
i
i
i
i
i
Mathematical Engineering
of Deep Learning
Chapter 4
Benoit Liquet, Sarat Moka and Yoni Nazarathy
February 28, 2024
i
i
i
i
i
i
i
i
Contents
Preface 3
1 Introduction 1
1.1 The Age of Deep Learning ............................ 1
1.2 A Taste of Tasks and Architectures ....................... 7
1.3 Key Ingredients of Deep Learning ........................ 12
1.4 DATA, Data, data! ................................ 17
1.5 Deep Learning as a Mathematical Engineering Discipline ........... 20
1.6 Notation and Mathematical Background .................... 23
Notes and References .................................. 25
2 Principles of Machine Learning 27
2.1 Key Activities of Machine Learning ....................... 27
2.2 Supervised Learning ............................... 32
2.3 Linear Models at Our Core ........................... 39
2.4 Iterative Optimization Based Learning ..................... 48
2.5 Generalization, Regularization, and Validation ................ 52
2.6 A Taste of Unsupervised Learning ....................... 62
Notes and References .................................. 72
3 Simple Neural Networ ks 75
3.1 Logistic Regression in Statistics ......................... 75
3.2 Logistic Regression as a Shallow Neural Network ............... 82
3.3 Multi-class Problems with Softmax ....................... 86
3.4 Beyond Linear Decision Boundaries ....................... 95
3.5 Shallow Autoencoders .............................. 99
Notes and References .................................. 111
4 Optimization Algorithms 113
4.1 Formulation of Optimization .......................... 113
4.2 Optimization in the Context of Deep Learning ................ 120
4.3 Adaptive Optimization with ADAM ...................... 128
4.4 Automatic Dierentiation ............................ 135
4.5 Additional Techniques for First-Order Methods ................ 143
4.6 Concepts of Second-Order Methods ....................... 152
Notes and References .................................. 164
5 Feedforward Deep Networks 167
5.1 The General Fully Connected Architecture ................... 167
5.2 The Expressive Power of Neural Networks ................... 173
5.3 Activation Function Alternatives ........................ 180
5.4 The Backpropagation Algorithm ........................ 184
5.5 Weight Initialization ............................... 192
7
i
i
i
i
i
i
i
i
Contents
5.6 Batch Normalization ............................... 194
5.7 Mitigating Overfitting with Dropout and Regularization ........... 197
Notes and References .................................. 203
6 Convolutional Neural Networks 205
6.1 Overview of Convolutional Neural Networks .................. 205
6.2 The Convolution Operation ........................... 209
6.3 Building a Convolutional Layer ......................... 216
6.4 Building a Convolutional Neural Network ................... 226
6.5 Inception, ResNets, and Other Landmark Architectures ........... 236
6.6 Beyond Classification ............................... 240
Notes and References .................................. 247
7 Sequence Models 249
7.1 Overview of Models and Activities for Sequence Data ............. 249
7.2 Basic Recurrent Neural Networks ........................ 255
7.3 Generalizations and Modifications to RNNs .................. 265
7.4 Encoders Decoders and the Attention Mechanism ............... 271
7.5 Transformers ................................... 279
Notes and References .................................. 294
8 Specialized Architectures and Paradigms 297
8.1 Generative Modelling Principles ......................... 297
8.2 Diusion Models ................................. 306
8.3 Generative Adversarial Networks ........................ 315
8.4 Reinforcement Learning ............................. 328
8.5 Graph Neural Networks ............................. 338
Notes and References .................................. 353
Epilogue 355
A Some Multivariable Calculus 357
A.1 Vectors and Functions in R
n
........................... 357
A.2 Derivatives .................................... 359
A.3 The Multivariable Chain Rule .......................... 362
A.4 Taylor’s Theorem ................................. 364
B Cross Entropy and Other Expectations with Logarithms 367
B.1 Divergences and Entropies ............................ 367
B.2 Computations for Multivariate Normal Distributions ............. 369
Bibliography 399
Index 401
8
i
i
i
i
i
i
i
i
4 Optimization Algorithms
The field of optimization is an important sub-field of applied mathematics. In the context
of deep learning it is just as important. Any form of training process for a deep learning
model requires optimization to minimize the loss. The decision variables are the learned
model parameters, and the data is typically considered fixed from the point of view of
optimization. In this chapter we explore general optimization methods and results, focusing
on the essential tools for optimization in the process of training deep learning models.
In Section 4.1 we formulate optimization problems in a general setup, discuss their forms,
and review local extrema, global extrema, and convexity. In Se ction 4.2 we specialize to
optimization of learned parameters for deep learning. We discuss the nature of such problems,
and common techniques including stochastic gradient descent, tracking performance measures,
and early stopping. In Section 4.3 we discuss various forms of first-order methods which
extend the basic gradient descent algorithm. The focus is on the popular ADAM optimizer
which has grown to become very popular for deep learning. In Section 4.4 we introduce
automatic dierentiation, a computational paradigm for ecient evaluation of derivatives.
Here we present both forward and backward mode automatic dierentiation in a general
context, and later in the next chapter we specialize to deep learning. We continue with
Section 4.5 where additional first-order methods are presented be yond gradient descent
and ADAM. In Section 4.6 we introduce and present an overview of second-order methods.
These powerful methods are sometimes used in deep learning, and when applicable they can
perform very well. Note that on a first reading of the book, one may focus on Sections 4.1,
4.2, 4.3, and 4.4 b e fore continuing to understand the deep learning models of Chapter 5.
4.1 Formulation of Optimization
Optimization algorithms, such as gradient descent described in Algorithm 2.1 of Chapter 2,
primarily aim to minimize an objective . We have already seen in previous chapters that
“learning”
¥
“optimization”. Hence to do well in deep learning, it is important to understand
the implementation of optimization algorithms as they play a crucial role in minimizing the
training loss.
The General Setup
Consider a multivariate real-valued function
C
:
R
d
æ R
. An optimization problem can be
written as
minimize
C()
sub ject to œ ,
(4.1)
113
i
i
i
i
i
i
i
i
4 Optimization Algorithms
where
R
d
is called the feasible set or the constraint set. In the context of learning,
represents the parameter space and the objective function C() is the loss function.
In other words, the aim of the optimization problem is to find a point
œ
where
C
(
) is
minimum. Such a point, denoted
ú
, is called a solution or minimizer of the optimization
problem (4.1). That is,
C(
ú
) Æ C(), for all œ . (4.2)
There can be multiple solutions to this problem and the set of all the solutions
ú
is usually
denoted by
argmin
œ
C().
The formulation
(4.1)
is general, in the sense that any optimization problem can be written
in this form. Particularly, the maximization problem
maximize
C
(
) subject to
œ
can be stated in the form
(4.1)
by replacing
C
(
) in
(4.1)
with
C
(
). We say that the
optimization problem is unconstrained if =
R
d
; otherwise, it is a constrained optimization
problem.
Our focus is primarily on unconstrained optimization and hence the “subject to” component
of
(4.1)
may be omitted. This is the c ommon case in the context of deep learning, and
in particular it is the case for the optimization problems that we already encountered
in the context of linear regression (Section 2.3), logistic regression (sec tions 3.1 and 3.2),
multinomial regression (Section 3.3), and most cases of autoencoders (Section 3.5).
General optimization theory and practice comes in many shapes and forms where dierent
methods can be used for dierent sub-types of the general problem
(4.1)
. In the context
of deep learning we typically make use of multivariate calculus
1
and assume (or construct)
the ob jective
C
(
·
) to have desirable properties, such as continuity, dierentiability (almost
everywhere), and other smoothness properties. With this type of optimization, the gradient
ÒC
and Hessian
Ò
2
C
are typically used. The former is the key component of first-order
methods as desc ribed in sections 4.3 and 4.5. The latter sometimes plays a role in second-
order methods as described in Section 4.6. Throughout this chapter when we make use of
the gradient or Hessian then we are implicitly assuming that these objects exist and are well
defined for the problem at hand.
Local and Global Minima
A solution of an unconstrained minimization problem is also known as a global minimum
because of
(4.2)
. However, there can be other points that are not solutions to the optimization
problem but exhibit similar properties within their vicinity. Such points are known as local
minima. In particular, a point
œ R
d
is called a local minimum of a continuous function
C
if there is an
r>
0 such that
C
(
)
Æ C
(
) for all
œ B
(
,r
),where
B
(
,r
) is an open ball
with center and radius r>0 define d by
B(,r)={ œ R
d
: Î Î <r}.
1
Refer to Appendix A for a review.
114
i
i
i
i
i
i
i
i
4.1 Formulation of Optimization
In other words, if
is a local minimum of
C
, we cannot decrease the value of
C
by moving
an infinitesimal distance from
in any direction; see Figure 4.1 for illustrations of global
and local minima.
Supp ose that the ob jec tive function
C
is dierentiable at a local minimum
.Then,itis
well known that the gradient
ÒC()=0
. This is a necessary condition for a point to be a
local minimum, but not sucient. For example, the condition
ÒC
(
)=0also holds if
is a
local maximum of
C
, that is,
is a local minimum of
C
. Further the condition
ÒC()=0
can hold even at a point
that is neither a local minimum nor a local maximum, and such
a point is called a saddle point where the objective function slopes up in one direction and
slopes down in another direction (for smo oth functions we define it below using the Hessian
matrix). See Figure 4.1 for examples.
(a)
8
6
4
2
0

l
-1
(b)
Figure 4.1:
Two dimensional smooth functions. (a) The function
C
(
)=
2
1
+
2
2
is convex with
a unique global minimum. (b) A well-known function called the peaks function is an example
non-convex function with several local minima, local maxima, and saddle points. Around each local
minimum, the function is locally convex. On the figure, two saddle points are marked with red dots.
Convexity and Saddle Points
There are classes of optimization problems for which a local minimum is also guaranteed to
be a global minimum. One such popular class is the family of convex problems. While most
deep learning based optimization problems are not convex, understanding convexity helps to
better understand optimization concepts.
We can establish stronger necessary conditions for a point to be a local minimum via
convexity.Aset
C
R
d
is said to be a convex set if for any , œ
C
and œ [0, 1],
⁄◊ +(1) œ
C
.
That is, for a convex set, the line segment connecting any two points in the set, remains in
the set. An example of a convex set is an open ball B(,r) for any œ R
d
and r>0.
We are now ready to define convex functions. Suppose that
C
R
d
is a convex set. Then,
a function
C
:
R
d
æ R
is called a convex function on
C
if for every
, œ
C
and
œ
[0
,
1],
115
i
i
i
i
i
i
i
i
4 Optimization Algorithms
we have
C
!
⁄◊ +(1)
"
Æ C()+(1 )C(). (4.3)
Furthermore, we say that the function
C
is strictly convex on
C
if the inequality in
(4.3)
holds with strict inequality for all œ (0, 1) and for every choice of , œ
C
with = .
If the function
C
is convex on the entire space
R
d
, then all the local minima are global
minima. Furthermore, if
C
is strictly convex on
R
d
, then there is a unique global minimum.
See Figure 4.1 (a) for an example of a convex function.
If we further assume that
C
is twice dierentiable, then we can define convexity and strict
convexity in terms of the Hessian matrix. In particular,
C
is convex on
C
if and only if the
Hessian
Ò
2
C
(
) is positive semidefinite
2
at every inte rior point on
C
. Furthermore,
C
is
strictly convex on
C
if and only if the Hessian
Ò
2
C
(
) is positive definite at every interior
point on
C
.
Note that if a point
is a local minimum of a twice dierentiable function
C
,thenthere
exists some
r>
0 such that
C
is convex on the open ball
B
(
,r
), and hence we sometimes
say that
C
is locally convex around
. See Figure 4.1 (b) for an example of a function with
several local minima with local convexity around each minimum point.
For twice dierentiable functions, it is easy to define saddle points. A point
with
ÒC
(
)=0
is called a saddle point if the Hessian
Ò
2
C
(
) at
is neither positive semidefinite nor negative
semidefinite.
Objective Functions in Deep Learning
In the context of deep learning, the typical form of the ob je ctive function is
C():=C( ; D)=
1
n
n
ÿ
i=1
C
i
(), (4.4)
where
n
is the number of data samples in a given dataset
D
. We have already seen this
form in
(2.11)
where
C
i
(
) is the loss computed for the
i
-th data sample. Specific forms for
this
i
-th data sample loss include the square loss, cross entropy loss, and other forms that
have already appeared in chapters 2 and 3. In this chapter the data samples play a more
minor role and hence from now onwards in this chapter we suppress
D
from the notation
and use C().
The optimization problem that corresponds to the objective function
(4.4)
is often called
the finite sum problem. For such problems, it is relatively easy to study the properties
of
C
(
) by studying the corresponding properties of each
C
i
(
), its gradient
ÒC
i
(
) and
Hessian
Ò
2
C
i
(
). Most importantly, the form
(4.4)
plays an important role in finding the
descent direction for optimization methods such as stochastic gradient descent, described
in Section 4.2. When each
C
i
(
) is twice dierentiable, the gradient and Hessian of the
2
See (A.12)inAppendixA for definition of positive semidefinite and positive definite matrices.
116
i
i
i
i
i
i
i
i
4.1 Formulation of Optimization
objective C() are respectively given by
ÒC()=
1
n
n
ÿ
i=1
ÒC
i
(), and Ò
2
C()=
1
n
n
ÿ
i=1
Ò
2
C
i
(). (4.5)
A lo cal minimum of
C
(
·
) at
can b e characterized via
Ò
2
C
(
) b e ing positive sem idefinite
in addition to
ÒC
(
)=0. It is useful to note that
Ò
2
C
(
) is positive semidefinite at
if
each Ò
2
C
i
() is positive semidefinite at , because for any œ R
d
,
Ò
2
C() =
1
n
n
ÿ
i=1
!
Ò
2
C
i
()
"
. (4.6)
Convexity of Certain Shallow Neural Networks
While most deep learning optimization problems are not convex, the loss functions of the
logistic regression and multinomial regression models of Chapter 3 are convex. The same
is true for the linear regression model of Chapter 2. For completeness, we now establish
convexity properties of these simple models. The details of this subsection may be skipped
on a first reading.
We begin with the linear regression model of Section 2.3 with the standard quadratic loss.
In this case the associated optimization problem is
(2.13)
and we may compute the Hessian
for the objective C()=Îy XÎ
2
to be
Ò
2
C()=2X
X,
where we use the notation of the design matrix
X
from
(2.10)
. The Hessian is thus a Gram
matrix and this implies it is positive s em idefinite.
3
Hence the optimization problem is convex.
Further if
X
has linearly independent columns then
Ò
2
C
(
) is positive definite and the
problem is strictly convex and has a unique optimal p oint.
We now continue to logistic regression of sections 3.1 and 3.2. An expression for the gradient
is in
(3.17)
and we may follow up with an expression for the Hessian. Specifically, the
d d
Hessian matrix of each C
i
() is
Ò
2
C
i
()=
Sig
(b + w
x
(i)
)
!
1
Sig
(b + w
x
(i)
)
"
5
1
x
(i)
6
Ë
1 x
(i)
È
, (4.7)
which is evidently a product of a strictly positive s calar and a Gram matrix. Thus it is always
positive semidefinite following the same argument used for linear regression with quadratic
loss.
4
Now using the argument at
(4.6)
, the matrix
Ò
2
C
(
) is also positive semidefinite and
thus the total loss function
C
(
) is convex. Further, in cases where the vectors (1
,x
(i)
) for
i
=1
,...,n
span
R
d
(this can only hold when
d Æ n
) then we cannot find a non-zero vector
œ R
d
orthogonal to all the vectors (1
,x
(i)
). Thus, for any vector
=0, there exists at
least one
i
such that
Ò
2
C
i
() >
0 and hence
Ò
2
C
(
) is positive definite, making the
total loss function C() strictly convex.
3
For any matrix
A
,theassociatedGrammatrixis
A
A
and thus
A
A
=
ÎAÎ
2
Ø
0 for any vector
.Furtherwithlinearlyindependentcolumns
A
=0for any
=0and thus
ÎAÎ
2
>
0 implying that
A
is
positive denite.
4
Note however that
Ò
2
C
i
()
is never (strictly) positive definite because
Ò
2
C
i
(
) is a rank one matrix.
117
i
i
i
i
i
i
i
i
4 Optimization Algorithms
We now move onto multinomial regression as well as linear regression with other loss functions
introduced in Section 2.3. The optimization problems of these models also enjoy convexity
prop e rties, yet expressions of the Hessian are not easy to work with. We thus use other
means to argue for convexity.
For multinomial regression recall C
i
() from (3.33) which we can represent as,
C
i
() = log
K
ÿ
j=1
e
b
j
+w
(j)
x
(i)
(b
k
+ w
(k)
x
(i)
), (4.8)
when
y
(i)
=
k
. Recall here that
includes both bias terms (scalars)
b
1
,...,b
K
and weight
vectors w
(1)
,...,w
(K)
;see(3.23).
As a building block we first use the log-sum-exp function, h : R
K
æ R of the form
h(v) = log
K
ÿ
j=1
e
v
j
.
We can show that
h
(
v
) is convex by e xplicitly c alculating its Hessian expression and seeing
that it is positive semidefinite
5
for all
v œ R
K
. Further using first principles of the definition
of convexity,
(4.3)
, we can show that for any convex function
h
:
R
K
æ R
,
h
(
u
1
,...,
u
K
)
is convex in
œ R
d
for any fixed
u
1
,...,u
K
œ R
d
.Since
is the vector consisting of all the
weights and the biases, by using multiple zeros in each
u
j
œ R
d
, we can construct
u
j
from
the p-dimensional vector x
(i)
such that
v
j
=
u
j
= b
j
+ w
(j)
x
(i)
.
As a consequence, the first term of
(4.8)
is convex in
. Further, the second term is also
convex as it is an ane function of . Hence (4.8) is a sum of convex functions and is thus
convex.
We now again consider linear regression, but this time with the absolute error loss
(2.18)
and Huber loss (2.19). In both of these cases, like the standard quadratic loss, we have,
C
i
()=g (y
(i)
x
(i)
),
where
g
:
R æ R
is a convex function; see Figure 2.6 (a) in Chapter 2. Here again we use
the fact that composition of a convex function and an ane function is convex and this
establishes convexity of C
i
() and thus of C() for all these loss function alternatives.
Before we depart from this discussion of convexity we suggest that the reader consider the
plots of Figure 3.4 of Chapter 3. In that figure, it is evident that logistic regression with “the
wrong” loss function, namely squared loss in this case, may yield a non-convex problem. We
highlight this here to emphasize that the convexity properties discussed above are more of
an exception than the rule. Indeed, most deep learning models of this book have non-convex
objectives.
5
The gradient of the log-sum-exp function is the softmax function for which we present the Jacobian in
(5.19)
,andtheJacobianofthegradientofafunctionisequaltotheHessianofthatfunction.Italsoturns
out that the matrix in
(5.19)
is the covariance matrix of a categorical distribution (multinomial distribution
with one trial) implying that it is positive semidefinite.
118
i
i
i
i
i
i
i
i
4.1 Formulation of Optimization
General Approach of Descent Direction Methods
We have seen that the loss functions of logistic regression and multinomial regression are
convex, and thus they have global minima. Unfortunately, even for these shallow neural
networks, despite being convex, there is no known analytical solution for the minimizer.
Further, in more general deep neural networks, we do not have such convexity properties
and the ob jective functions typically exhibit numerous local minima. Thus in general, it is
impossible to think of an analytical solution for the optimization problems in deep learning,
and thus we have to relay on more advanced numerical optimization methods.
All the popular optimization methods used in deep learning can be put under one general
framework called the descent direction method. It is an iterative method that attempts to
reduce the objective in each iteration. This might seem obvious, howe ver the reader should
keep in mind that in general optimization methods beyond those covered in this book, steps
may sometimes increase the objective before further reduction. In contrast, with the descent
direction method, while increases may occur in certain iterations, the overall goal is to have
a reduction of the objective in each iteration. This method is presented in Algorithm 4.1.
Algorithm 4.1: The descent direction method
Input: Dataset D = {(x
(1)
,y
(1)
),...,(x
(n)
,y
(n)
)},
objective function C(·)=C(·; D), and
initial parameter vector
init
Output: Approximately optimal
1 Ω
init
2 repeat
3 Determine the descent step
s
4 Ω +
s
5 until termination condition is satisfied
6 return
As is evident, Algorithm 4.1 relies on the sp ec ific way in which we compute the descent step
and the termination condition. For each iteration
t
=0
,
1
,
2
,...
of Algorithm 4.1, denote
the values of the parameter vector
and the descent step
s
by
(t)
and
(t)
s
,respectively.
With this notation, the sequence of p oints
(1)
,
(2)
,..., evolves via,
(t+1)
=
(t)
+
(t)
s
, with
(0)
=
init
,
until
(t)
satisfies the specified termination condition. Note that the magnitude of the descent
step Î
(t)
s
Î denotes the step size in the t-th iteration.
The descent step
(t)
s
at the
t
-th iteration is determined by information available in or until
the iteration
t
, where the exact manner in which it is computed depends on the variant of
the algorithm used. One specific example that we have already seen is the gradient descent
method, Algorithm 2.1 of Section 2.4, which is a special case of Algorithm 4.1. With gradient
descent, the descent step is set at
s
=
ÒC
(
) where
is the learning rate and since
>
0, at each step the gradient descent method moves along the steepest descent direction.
6
6
Note that some authors use the term steepest descent to denote a more general class of algorithms of
which gradient descent can be taken as a special case. However, in this book “steepest descent” and “gradient
descent” are considered synonymous.
119
i
i
i
i
i
i
i
i
4 Optimization Algorithms
Other variants encountered in the sequel are stochastic gradient descent (SGD) presented in
Section 4.2, ADAM presented in Section 4.3, second-order methods presented in Section 4.6,
and other variants.
When the goal is to minimize the objective function
C
, ideally, we would like to terminate
the algorithm only after reaching a minimum point. This convergence is dicult to guarantee
in general as the updates of the algorithm can slow down or oscillate in the vicinity of the
minimum. Therefore, in practice there are several termination conditions that allow some
tolerance so that the algorithm can output a point in the vicinity of the minimum. We now
list a few traditionally popular termination criteria. However, note that the next section
presents optimization in the context of deep learning where other criteria are often used.
Most termination conditions use a small tolerance value,
Á >
0. One standard termination
condition tracks absolute improvement, where we terminate when the change in the function
value is smaller than the tolerance. Namely,
-
-
C(
(t)
) C(
(t+1)
)
-
-
< Á.
An alternative is to use the gradient magnitude or step size where we terminate respectively
if,
ÎÒC(
(t+1)
)Î < Á, or Î
(t)
s
Î < Á.
When the region around the (local) minimum is shallow, it is sometimes useful to use the
relative improvement criterion where we terminate based on the relative change in the
function value. Specifically, we terminate if
|C(
(t)
) C(
(t+1)
)|
|C(
(t)
)|
< Á.
In addition to these criteria which track the state of the optimization procedure, an alternative
is to simply run for a fixed number of iterations
t
=0
,...,T
, or similarly to limit the running
(clock) time of the algorithm to a desired predefined value.
Similarly, the termination condition at Step 5 of Algorithm 4.1 can be dierent for dierent
methods. The optimal output point obtained by the algorithm may not nece ss arily minimize
the objective function
C
(
·
). Rather it depends on the termination condition. For instance,
as we see in Section 4.2, the final
returned by Algorithm 4.1 could be a minimizer of the
validation set accuracy, but not a minimizer of the objective function C(·).
4.2 Optimization in the Context of Deep Learning
In the previous section we saw that certain optimization problems are tractable in the sense
that a local minimum is also the global minimum. Specifically, the loss functions of linear
regression, logistic regression, and multinomial regression are convex. However, in most
deep learning models, starting with the models that we cover in the next chapter, the loss
functions are not as “blessed”. For such deep learning models, the number of parameters is
huge and the problems are non-convex, often having a large number of local minima.
Nevertheless, basic optimization techniques are very useful for deep learning with the
compromise of not finding the global minimum of the loss, but rather finding a point close
to some local minimum with satisfactory performance. That is, the obtained p oint has
120
i
i
i
i
i
i
i
i
4.2 Optimization in the Context of Deep Learning
performance which is in general not far o from the performance at the global minimum.
Indeed, in deep learning, we use optimization algorithms trying to minimize the training
loss function
C
(
) while our ultimate goal is finding
that has good performance on the
unseen data; see Section 2.5 for a discussion of performance on unseen data. That is, the
goal is to actually minimize expectations or averages of performance functions
7
P
(
),such
as the validation set error rate. We then hope that this minimization resembles performance
on unseen data. Hence, as the reader studies the optimization methods of this chapter, they
should ke ep in mind that minimization of the loss
C
(
) is often a proxy for minimization
of P().
Challenges Posed by Basic Gradient Descent
Almost all cases where optimization is used for deep learning involve some variant of gradient
descent; see Algorithm 2.1 of Section 2.4 for an introduction to gradient descent. However,
basic gradient desc ent poses a few important challenges that can make the algorithm
impractical in the deep learning context.
The primary challenge is that in each iteration, the gradient descent algorithm uses the
entire dataset to compute the gradient
ÒC
(
). In the context of deep learning, the sample
size
n
of the training data can be large or the dimension
d
of the parameter
can be huge,
or both. As a consequence, the op e ration of computing the gradient using the entire dataset
for every update can be very slow, or even intractable in terms of memory requirements.
Further, as already mentioned above, loss functions of deep neural networks tend to be
highly non-convex, with multiple lo c al minima. For such functions, the apriori selection
of a good learning rate
is very dicult. A low learning rate can lead to an increas ed
computational cost as a result of slow convergence of the algorithm, and it can lead to the
algorithm getting stuck at local minima or flat sections of the loss landscape, eliminating
the chance to explore other local minima. On the other hand, a large learning rate can result
in the objective function values and decision variables fluctuating in an haphazard manner.
Even for simple convex models, as we saw in Figure 2.7 of Section 2.4, the choice of the
learning rate aects learning performance. As a second elementary example with
d
=2,
assume hypothetically that the loss landscape takes the form of the following Rosenbrock
function,
C()=(1
1
)
2
+ 100(
2
2
1
)
2
, (4.9)
which is non-convex yet has a unique global minimum at =(1, 1).
Figure 4.2 illustrates the p e rformance of gradient descent on
(4.9)
. Here we also introduce
the option of varying the learning rate as optimization iterations progress based on a
predefined learning rate schedule in this case. Specifically, at each iteration
t
, instead of using
the fixed learning rate
we use
(t)
which is indexed by iteration
t
and parameterized by
the original
. This strategy is sometimes used in practice, yet as we see in this example, it
does not work well for the parameters that we choose, and in none of the four trajectories
starting at an initial point
init
, do we find the minimum. Note that in this specific example
it is not hard to fine-tune the learning rate
and/or the exponential decay parameter of
7
Note that in Section 2.5 we denoted the performance measure function as a function of
ˆy
and
y
for
each observation in D where as here we are taking it as a function of the parameters .
121
i
i
i
i
i
i
i
i
4 Optimization Algorithms
(t)
, which is set at 0
.
99. However, this is a simple, hypothetical,
d
=2dimensional loss
landscape, whereas in actual problems with
d>
2,tuning
(t)
to achieve ecient learning
(with basic gradient descent) is often very dicult or impossible.
(a) (b)
Figure 4.2:
Attempting minimization of a Rosenbrock function with a minimum at
=(1
,
1)
via basic gradient descent for two values of
with 10
5
iterations and
init
=(0
.
2
,
0
.
2).(a)Fixed
learning rate:
(t)
= . (b) Exponentially decaying learning rate:
(t)
= 0.99
t
.
Instead of adjusting the learning rate according to a predefined schedule, other variants
try to adjust the learning rate according to other criteria such as the objective function
crossing some threshold values. Further, in some implementations,
(t)
is optimized in each
iteration, so that the function
C
(
) decreases maximally in the direction of the step size; see
for example line search methods described in Section 4.5. However, such strategies are again
not fool-proof on their own, or are often not practical for deep learning.
Furthermore, even if a magical (“optimal”) learning rate
can be suggested for each iteration,
the basic gradient descent method applies the same learning rate to all parameter updates. It
turns out that it is better to learn the parameters
1
,...,
d
at dierent rates in an adaptive
manner because larger rates on rarely changing parameter coordinates may result in faste r
convergence of the algorithm. To handle such cases, it is common practice to use the ADAM
algorithm. This variant of gradient descent, together with its building blocks, are the focus
of Section 4.3.
Prior to exploring ADAM and related gradient descent adaptations, let us return to the key
issue mentioned above which is the computational complexity of obtaining exact gradients
ÒC
(
). This issue is typically handled by computing approximate gradients either via
stochastic gradient descent, which we describe now, or more practically with the use of mini-
batches described in the sequel. Importantly, both of these approaches yield fast computable
approximate gradients. These approaches also yield parameter trajectories that manage to
escape local minima, flat regions, or saddle points by introducing randomness or noise into
the optimization pro c es s.
122
i
i
i
i
i
i
i
i
4.2 Optimization in the Context of Deep Learning
Stochastic Gradient Descent
Recall that the loss function
C
(
) for deep neural networks is typically a finite sum of the
form (4.4)with
C
i
(
) being the loss computed for the
i
-th data sample. The method of
stochastic gradient descent exploits this structure by computing a noisy gradient using only
one randomly selected training sample. That is, it evaluates the gradient for a single
C
i
(
)
instead of the gradient for all of C( ).
More precisely, the algorithm operates similarly to Algorithm 2.1 (or variations of it), yet in
the
t
-th iteration of stochastic gradient descent, an index variable
I
t
is randomly selected
from the set
{
1
,...,n}
and the gradient
ÒC
I
t
(
) is computed only for the
I
t
-th data sample
in place of Step 3 of Algorithm 2.1. The update rule for the decision variable is then,
(t+1)
=
(t)
ÒC
I
t
(
(t)
). (4.10)
This stochastic algorithm exhibits some interesting properties. Most importantly, the ex-
ecution of each iteration can be much faster than that of basic gradient descent because
only one sample is used at a time (the gain is of order
n
). Further, if the index variable
I
t
is
selected uniformly over
{
1
,...,n}
, the parameter update
(4.10)
is unbiased in the sense that
the expected descent s tep for e ach iteration
t
is the same as that of gradient descent. That
is, in each iteration, we have
E [ÒC
I
t
()] =
1
n
n
ÿ
i=1
ÒC
i
()=ÒC(), for a fixed . (4.11)
and hence the stochastic step (4.10) is “on average correct”.
8
While being unbiased, the stochastic nature of
(4.10)
introduces fluctuations in the descent
direction. Such noisy trajectories may initially appear like a drawback, yet one advantage
is that they enable the algorithm to escape flat regions, saddle points, and local minima.
Thus, while the fluctuations make it dicult to guarantee convergence, in the context of
deep learning their eect is often considered desirable in view of highly non-convex loss
landscapes.
Importantly, stochastic gradient descent is generally able to direct the parameters
towards
the region where minima of
C
i
(
) are located. To get a feel for this attribute of stochastic
gradient descent, we consider another hypothetical example where for i =1,...,n,
C
i
()=a
i
(
1
u
i,1
)
2
+(
2
u
i,2
)
2
, (4.12)
for some constant
a
i
>
0 and
u
i
=(
u
i,1
,u
i,2
)
œ R
2
. He re
u
i
is the unique minimizer of
C
i
(
).
These individual loss functions for each observation are convex and hence the total loss
function
C
(
) of
(4.4)
is also convex.
9
Let
S
be the convex hull of
u
1
,...,u
n
, that is,
S
is
the smallest convex set that contains
{u
1
,...,u
n
}
. It is now obvious that the global minima
of C is also in S.
8
The reader might be tempted to conclude that
E
[
(t)
SGD
]=
(t)
GD
for all tim e
t
,wherethesubscriptSGD
is for sto chastic gradient descent starting at the same initial condition as GD (“Gradient Descent”). However,
in general this is not correct when ÒC() is non-linear in .
9
This example does not resemble a general case with multiple lo cal minima since it is convex. Nevertheless,
it is useful for understanding some of the behaviour of stochastic gradient descent.
123
i
i
i
i
i
i
i
i
4 Optimization Algorithms
In Figure 4.3 we plot the contours of such a function when
n
=5where in Figure 4.3 (a) the
points
u
1
,...,u
5
are distinct and in (b) these points are identical. The convex hull is marked
by the region bounded in green. The figure also plots the evolution of gradient descent and
stochastic gradient desc ent in each of the c ase s. As is evident, while the stochastic gradient
descent trajectory is noisier, outside of the convex hull
S
, its trajectory is similar to that of
gradient descent.
Interestingly, in Figure 4.3 (a), once the trajectory hits
S
it mostly stays within
S
but with
a lot of fluctuations. The reason for this behavior is that at any point
not in
S
,thedescent
directions for both gradient descent and stochastic gradient descent are towards the set
S
as
it contains the minima of
C
(
) and every
C
i
(
). To make this a concrete argument, note
that the descent step in the t-th iteration of the stochastic gradient descent is
ÒC
I
t
(
(t)
)=
n
ÿ
i=1
1{I
t
= i}ÒC
i
(
(t)
).
Since every
C
i
(
) has its minimum in the set
S
,if
(t)
/œ S
,thenevery
≠ÒC
i
(
(t)
),the
steepest direction of
C
i
(
) at
(t)
, guides the algorithm towards the set
S
. Therefore,
irrespective of the choice of I
t
, the update moves
(t+1)
towards S.
(a) (b)
Figure 4.3:
An hypothetical example where
C
(
) is composed of individual
C
i
(
) as in
(4.12)
.
Comparison of gradient descent (GD) and stochastic gradient descent (SGD). (a) A case where the
minimizers
u
1
,...,u
n
of
C
i
(
) are dierent and are each marked by green dots in the plot. (b) A
case where the minimizers u
1
,...,u
n
are the same.
Working with Mini-batches and the Concept of Epochs
On one extreme, computing the full gradient for basic gradient descent is computationally
challenging but the obtained gradients are noiseless. On the other extreme, stochastic
gradient descent reduces the computational complexity of each update but the updates are
noisy. One middle ground approach is to work with mini-batches. Almost any practical deep
learning training uses some variant of this approach.
124
i
i
i
i
i
i
i
i
4.2 Optimization in the Context of Deep Learning
A mini-batch is simply a small subset of the data points, of size
n
b
. Mini-batch sizes are
typically quite small in comparison to the size of the dataset and are often chosen based on
computational capabilities and constraints. In practice, a mini-batch size
n
b
is often s elec ted
so the computation of gradients associated with
n
b
observations can fit on GPU memory.
This generally makes computation more ecient in comparison to stochastic gradient descent
which cannot take full advantage of parallel computing or GPUs since the gradient evaluation
for a single observation index I
t
is not easy to parallelize.
When using mini-batches, for each iteration
t
, a mini-batch with indices
I
t
{
1
,...,n}
is
used to approximate the gradient via,
1
n
b
ÿ
iœI
t
ÒC
i
(
(t)
). (4.13)
This estimated gradient is then used as part of gradient descent or one of its variants (such
as ADAM which we present in the sequel); for example it is used in place of Step 3 of
Algorithm 2.1. One common practical approach with mini-batches is to shue the training
data apriori and then use the shued indices sequentially. With this process there are
n/n
b
mini-batches
10
in the training dataset and each mini-batch is a distinct random subset of the
indices. In such a case, when using some variant of gradient descent, every pass on the full
dataset has
n/n
b
iterations, with one iteration per mini-batch. Such a pass on all the data
via
n/n
b
iterations is typically called an epoch. The training process then involves multiple
epochs. It is quite common to diagnose and track the training process in terms of epochs.
Note that one may also randomly shue the data (assign new mini-batches) after every
epoch to reduce the correlation between the epochs and reduce bias in the optimization
process.
Similarly to
(4.11)
when using the mini-batch approach, the estimated gradient is unbiased
in the sense that
E
S
U
1
n
b
n
b
ÿ
j=1
ÒC
I
t,j
()
T
V
=
1
n
b
n
b
ÿ
j=1
1
n
n
ÿ
i=1
ÒC
i
()=ÒC(), for a fixed .
Here
I
t,j
denotes the
j
-th element in
I
t
and the first equality holds because shuing
makes each
I
t,j
to be uniform over
{
1
,
2
,...,n}
. To get a feel for the performance of
learning with mini-batches, a mathematical simplification is to consider a variant where
I
t
=
(
I
t,1
,...,I
t,n
b
) is a collection of randomly selected indices on
{
1
,...,n}
with replacement.
11
As a consequence, each
I
t,j
is not only uniform over
{
1
,
2
,...,n}
, but also independent of all
the rest. This manner of computing gradients is an approximation to the above mentioned
mini-batches with shuing the data after every epoch.
Importantly, in this setup, we notice that the noise of the gradients decreases as the mini-
batch size
n
b
increases. To see this, recall that for each
t
, the index variables
I
t,1
,...,I
t,n
b
are chosen independently. Thus, the variance of the gradient estimate is
Var
Q
a
1
n
b
n
b
ÿ
j=1
ÒC
I
t,j
(
(t)
)
R
b
=
1
n
b
Var
1
ÒC
I
t,1
(
(t)
)
2
.
10
We assume here that
n
b
divides
n
.Ifitisnotthecasethenthereare
Án/n
b
Ë
mini-batches with the last
one having less than n
b
samples.
11
Some authors refer to this as mini-batch gradient descent.
125
i
i
i
i
i
i
i
i
4 Optimization Algorithms
Since
Var
!
ÒC
I
t,1
(
(t)
)
"
is the variance of the gradient for an arbitrary data sample, we
see that mini-batch gradient descent has
n
b
times smaller variance than that of stochastic
gradient descent.
Loss Minimization is a Proxy for Optimal Performance
Recall the discussion in Section 2.5 where we saw that the ultimate focus of learning is to
optimize the expected performance on unseen data. This is typically quantified in terms
of some error metric which we wish to be minimal. In this section we simply refer to the
estimated expected performance function as
P
(
), keeping in mind that it is typically an
average over the validation set. When training, it is often more tractable to minimize the
loss C(), however, the true goal is minimization of P().
In deep learning, the model choice and training process typically involves choosing a model
with a loss function
C
(
), such that loss minimization is computationally easy to carry out
and yields low
P
(
). In that sense, the loss function
C
(
) is often much more amenable to
derivative computations and gradient based learning, in comparison to trying to use the
performance function
P
(
) directly. Further, by tracking a performance function
12
on a
validation dataset not used for training, we are often able to make sure that overtraining (also
called overfitting) is avoided. This interplay of the loss function
C
(
) and the pe rformance
function
P
(
) implies that deep learning optimization is not traditional optimization where
one seeks
ú
that truly optimizes the loss, but is rather a means of using optimization of
C() as a surrogate to good performance on P().
We now explore this setting of deep learning optimization in two ways. In one direction we
discuss a notion of stopping criteria that diers from the termination conditions presented
at the end of Section 4.1. In another direction we highlight that the actual model choice
aects the loss function
C
(
), and some loss functions are more amenable to optimization in
comparison to others.
First, in terms of stopping criteria, as already mentioned above, for training a deep learning
model, standard practice is to use mini-batches and epochs. This me ans that some form
of the general Algorithm 4.1 is executed where multiple iterations are grouped into an
epoch. As training progresses (this is sometimes a process of hours, days, or more), we track
the evolution of the loss,
13
C
(
(t)
), as well as one or more performance functions
P
(
(t)
),
typically focusing on iterations t that are at the end of an epoch. In that sense, a common
stopping criterion is to stop at the epoch where
P
(
(t)
) is small when evaluated over the
validation set.
A popular strategy for this approach is called early stopping. With this strategy, training
often continues for multiple epochs beyond the epo ch where
P
(
(t)
) is minimized since at
that epoch it is still not evident that the minimum was reached. However, as the training
loop progresses, the model parameters for the best epoch are stored such that once training
is complete the “b est model” in terms of
P
(
) on the validation set can be used. A typical
trajectory of the loss and performance m etric is displayed in Figure 4.4. With such a typical
trajectory, increasing the number of ep ochs of training is somewhat similar to the increase of
model complexity discussed in Section 2.5; see Figure 2.9 of that section. Running the model
beyond the minimum point of
P
(
) is called over-training, similar to overfitting discussed in
12
In p r actice we may often track multiple performance functions.
13
In t h is context it is often called the training loss as it is the loss over the training set.
126
i
i
i
i
i
i
i
i
4.2 Optimization in the Context of Deep Learning
Section 2.5. The early stopping strategy is then a method to use validation performance to
try and avoid such over-training. Importantly, the actual minimum point, say
ú
, of
C
(
·
) is
not a desirable point for this purpose since it is typically a point of over-training and
P
(
ú
)
is sub-optimal in terms of the performance function.
Figure 4.4:
Training on a small subset of the MNIST dataset. The validation error rate is tracked
together with the training loss. The e arly stopping strategy uses the model from epoch 76 where
the validation performance is best.
As further details for Figure 4.4, we note that it is based on a deep fully connected network
(see Chapter 5)with5 layers and hundreds of thousands of parameters. We train the model
using only 1
,
000 MNIST images and use 500 additional images for the validation set. The
performance function used on the validation set is one minus the accuracy, and as is evident,
an accuracy of about 90% is obtained. Mini-batch sizes are taken at
n
b
= 10. As we can see,
the mo de l is trained for 200 epochs but retrospectively, the early stopping strategy pulls the
parameters
associated with e poch 76. Note that our chosen model is far from a realistic
model which one would use for such a low data scenario, yet we present this example here
to illustrate the idea of early stopping.
The second way in which deep learning optimization diers from general optimization is that
the mode l’s loss function itself is often specifically engineered for ecient optimization. In
classical applications of optimization appearing in applied mathematics, operations research,
and other fields, one often needs to optimize a given objective without the ability to modify
the objective. In contrast, deep learning models are versatile enough so that the actual
model, and subsequently the loss function can be chosen with the goal of facilitating ecient
optimization. Discussions of this nature appear in the sequel where for example the choice of
activation functions within a deep neural network significantly aects the nature of learning;
see for example the discussion of vanishing gradients in Chapter 5.
On this matter, we have already seen in Figure 3.4 of Section 3.2 how the cross entropy loss
for logistic regression presents a much b etter loss landscape for optimization in comparison
with the squared loss. In the case of logistic regression, cross entropy loss even implies having
a convex loss function, whereas that property is lost with the squared loss function. Further,
in more complex models using cross entropy is also beneficial in terms of the loss landscape
even if it does not guarantee convexity per-se. We now present one such illustrative example.
127
i
i
i
i
i
i
i
i
4 Optimization Algorithms
(a) (b)
Figure 4.5:
The loss landscape of a very simple two-layer neural network. (a) Using the squared
loss C
i
( )=(y
(i)
ˆy
(i)
)
2
. (b) Using the binary cross entropy loss C
i
( )=CE
binary
(y
(i)
, ˆy
(i)
).
Similarly to the synthetic data used for the logistic regression example yielding Figure 3.4,
let us use the same synthetic data on the model,
f(x)=
Sig
!
w
2
Sig
(w
1
x)
"
=
1
1+e
w
2
1+e
w
1
x
.
In the context of deep neural networks presented in Chapter 5, this is a two-layer network
without bias terms, a single neuron per layer, and sigmoid activation functions. It is not a
realistic model one would use in practice, yet for our expository purposes it has the benefit
of having only two parameters, and hence we may visualize the loss surface as a function of
w
1
and w
2
.
Interestingly, as we see in Figure 4.5, using the cross entropy loss presents a much smoother
loss landscap e in comparison to the squared error loss. Hence using gradient based opti-
mization with cross entropy would be much more ecient and have the algorithm less likely
to get stuck in local minima or saddle points. Further, in much more realistic models with
thousands or millions of parameters, while not possible to fully visualize, using cros s entropy
often yields more sensible loss landscapes as well. Hence in summary we see that when
handling optimization in the context of deep learning, there are often other considerations
at play, beyond standard optimization.
4.3 Adaptive Optimization with ADAM
Essentially, all deep learning optimization techniques can be cast as some form of the desc ent
direction method outlined in Algorithm 4.1. Basic gradient descent (Algorithm 2.1) is one
specific form of this descent direction method, and as we outlined in the previous section, it
suers from multiple drawbacks. While some of these drawbacks are alleviated when using
noisy gradients via stochastic gradient descent or mini-batches, there is still much more
room for improvement.
128
i
i
i
i
i
i
i
i
4.3 Adaptive Optimization with ADAM
In this section we present one of the most popular improvements over basic gradient descent
called the ADAM algorithm which is short for adaptive moment estimation. This algorithm
has become the default practical choice for training deep learning models. Like basic gradient
descent, ADAM is considered as a rst-order method since it uses gradients (first derivatives)
to determine the descent step used in Algorithm 4.1. For this, ADAM combines the ideas of
momentum, and adapt ive learning rates per coordinate into a simple mechanism of computing
the descent step. Hence to understand ADAM, we study these two concepts, keeping in
mind that the associated algorithms can also be viewed as independent methods in their
own right. Other variations of these concepts together with additional first-order techniques
that are not employed in ADAM are further summarized in Section 4.5.
Adaptive Optimization and Exponential Smoothing
A basic theme of ADAM and many other first-order methods has to do with adapting to
the local nature of the loss landscape as iterations progress. For simple
d
=2examples
as visualized in Figure 4.2 and Figure 4.5, we can quite easily get a taste for the nature
of the loss landscape, and it is generally not dicult to choose descent steps that yield
good overall performance. However, for practical examples with larger values of
d
, at any
iteration
t
of the general Algorithm 4.1, the only information available at our disposal is
the history of
(0)
,...,
(t1)
and the associated gradients. It thus makes sense to try and
“adapt” the descent step based on this history, perhaps with more emphasis on iterations
from the near-history. One may loosely call such an approach adaptive optimization.
One simple principle used in adaptive optimization is exponential smoothing. Beyond opti-
mization, exponential smoothing is popular in time series analysis and m any other branches
of data science and applied mathematics. In general, exp onential smoothing operates on a
sequence of vectors
u
(0)
,u
(1)
,u
(2)
...
. In the context of this section, the sequence may be a
sequence of gradients or a sequence of the squares of gradients.
The exponentially smoothed sequence is denoted as ¯u
(0)
, ¯u
(1)
,... and is computed via,
¯u
(t+1)
= ¯u
(t)
+(1)u
(t)
, with ¯u
(0)
=0, (4.14)
and
œ
[0
,
1). Thus, for
t Ø
0, each smoothed vector
¯u
(t+1)
is a convex combination of the
previous smoothed vector and the most recent (non-smoothed vector)
u
(t)
.When
=0,no
smoothing takes place,
14
and when
is a high value near 1, the new vector
u
(t)
only has a
small eect on the new smoothed vector ¯u
(t+1)
.
It is simple to use the update formula
(4.14)
to represent the smoothed vector
¯u
(t)
as a
convex combination of the complete history. Specifically, by iterating (4.14)wehave,
¯u
(t+1)
=(1 )
t
ÿ
·=0
t·
u
(·)
, for t =0, 1, 2,..., (4.15)
and thus each
u
(0)
,...,u
(t)
contributes to
¯u
(t+1)
, with exponentially decaying weights. As
we see now, within the context of adaptive optimization, such straightforward exponential
smoothing can go a long way.
14
Note that with our convention of indices, when
=0the exponentiall y smoothed sequence is actually
atimeshiftoftheoriginalsequence.
129
i
i
i
i
i
i
i
i
4 Optimization Algorithms
Momentum
Recall that the descent step of basic gradient descent is
times the negative gradient. This
implies that in flat regions of the loss landscape, gradient descent essentially stops. Further,
at saddle points or local minima, gradient descent stops completely. Such drawbacks may be
alleviated with the use of momentum.
To get an intuitive feel for the use of momentum in optimization, consider an analogy
of rolling a ball downhill on the loss landscape. It gains momentum as it roles downhill,
becoming increasingly faster until it reaches the bottom of a valley. Clearly, the ball keeps
memory of the past forces in the form of acceleration. This motivates us to use exponential
smoothing on the gradients to obtain an extra step called the momentum update.Itisthen
used as follows,
v
(t+1)
= v
(t)
+(1)ÒC(
(t)
)(Momentum Update), (4.16)
(t+1)
=
(t)
v
(t+1)
(Parameter Update), (4.17)
for scalar momentum parameter œ [0, 1) and learning rate > 0, starting with v
(0)
=0.
Similar to the general exponential smoothing of
(4.14)
, the momentum update
15
is an
exponential smoothing of the gradient vectors. When
=0, we have the basic gradient
descent method and for larger
, information of previous gradients plays a more significant
role. In practice, for deep neural networks, the gradient
ÒC
(
(t)
) is replaced with a noisy
gradient based on stochastic gradient descent or mini-batches.
The vector
v
(t)
in
(4.16)
is called the momentum
16
at the
t
-th update. Using
(4.15)
,wehave
v
(t+1)
=(1 )
t
ÿ
·=0
t·
ÒC(
(·)
), for t =0, 1, 2,.... (4.18)
That is, the momentum accumulates all the past (weighted) gradients, providing acceleration
to the parameter
updates on downward surfaces. Therefore, for
>
0, the next step take n
via
(4.17)
is not necessarily taken in the steep es t desce nt, instead the direction is dictated
by the (exponential smoothing) average of all the gradients up to the current iteration.
Figure 4.6 compares the performance of the gradient descent method with momentum for
dierent values of
on the Rosenbrock function
(4.9)
. We observe that for large
,the
momentum method accelerates as it takes downward steps.
Adaptive Learning Rates p er Coordinate
Observe that the updates of the gradient descent method, with or without momentum, use
the same learning rate
for all components of
. However, it is often better to learn the
parameters
1
,...,
d
with potentially a specific learning rate for each coordinate
i
.
15
One may alternatively parameterize the momentum update
(4.16)
as
v
(t+1)
=
v
(t)
+
ÒC
(
(t)
).This
alternative parameterization has the benefit of allowing one to adjust the momentum parameter
without
having significant eects on the step size exhibited in
(4.17)
.Ourchoiceoftheparameterizationasin
(4.16)
is for consistency with ADAM presented below.
16
In physics, momentum is defined as the product of the mass of an object and i t s velocity vector. The
object’s momentum points in the direction of an object’s movement. In contrast, here the momentum vector
v
(t)
points in the direction opposite to the step taken.
130
i
i
i
i
i
i
i
i
4.3 Adaptive Optimization with ADAM
Figure 4.6:
Application of momentum to the Rosenbrock function
(4.9)
for three dierent values
of
with learning rate
=0
.
001
/
(1
) and a fixed number of iterations. Note that
=0is basic
gradient descent.
To get a feel for this issue, let us assume a situation where specific parameters are associated
with spec ific features in the data.
17
For instance, in applications of deep neural networks
for natural language processing (see also Chapter 7), there are typically words that are less
frequent than others. For example in most texts on trees, the word
chaulmoogra
most likely
appears less frequently than the word
coconut
, even though both are names of tree species.
Now assume that there are associated paramete rs
chaulmoogra
and
coconut
, and consider
how these would be updated during learning.
What is likely to o c cur is that each mini-batch would have on average much fewer occurrences
of
chaulmoogra
in comparison to
coconut
. However, if we update all the parameters at the
same learning rate, then
coconut
would be up dated more than
chaulmoogra
. Hence with fixed
learning rate
,the
coconut
parameter is likely to converge quickly, while the
chaulmoogra
parameter will be slow to respond.
A general approach for overcoming this issue is to scale each coordinate of the desce nt
step with a dierent factor. That is, instead of considering a basic gradient descent step,
s
= ÒC(), consider descent steps of the form,
s
= r §ÒC(), (4.19)
17
In some simple models such as linear regression, logistic regression, or multinomial regression this
is exactly the case for categorical one-hot encoded variables. In more advanced models there is a more
complicated implicit mapping between parameters and features.
131
i
i
i
i
i
i
i
i
4 Optimization Algorithms
where
r
is some vector of positive entries which recalibrates the descent steps and
§
is
the element-wise product of two vectors.
18
That is, for a parameter such as
chaulmoogra
we
would expect the coordinate
r
chaulmoogra
to be high in comparison to the coordinate
r
coconut
which is associated with
coconut
. Such recalibration of the descent steps would result in
large steps for
chaulmoogra
whenever
chaulmoogra
appears in a mini-batch, and smaller
steps for
coconut
.
However, in general, finding a fixed good vector
r
for
(4.19)
is dicult, especially when the
number of parameters is huge and the actual relationship between parameters and features
is not clear. Instead, we use adaptive methods that adjust the descent step during the
optimization process.
We now introduce two such approaches called adaptive subgradient (or Adagrad) and root
mean square propagation (or RMSprop). In both of these approaches the update of coordinate
i
in iteration t can b e represented as,
(t+1)
i
=
(t)
i
Ò
s
(t+1)
i
+ Á
ˆC(
(t)
)
ˆ◊
i
, (4.20)
where
s
(0)
i
,s
(1)
i
,s
(2)
i
,...
is a sequence of non-negative values that are updated adaptively
and
Á
taken to be a small value of order 1
10
8
to avoid division by zero. That is,
with this representation, the descent step at time
t
represented in terms of
(4.19)
has
r
i
=
3
Ò
s
(t+1)
i
+ Á
4
1
. Generally we aim for
s
i
to be some smo othed re presentation of the
square of the derivative
ˆC(
(t)
)/ˆ◊
i
. Hence,
r
i
is roughly the inverse of the magnitude of
the derivative, and r
i
is low when the changes for coordinate i are steep and vise-versa.
Vector-wise, the parameter update (4.20) can be represented as
(t+1)
=
(t)
Ô
s
(t+1)
+ Á
§ÒC(
(t)
), (4.21)
where
s
(t)
=(
s
(t)
1
,...,s
(t)
d
),
Á
is considered a vector, and the addition, division, and square-
root operations are all element-wise operations.
Using the representation
(4.20)
or
(4.21)
, let us now define how the sequence of
{s
(t)
i
}
is
computed both for Adagrad and RMSprop. Specifically,
s
(t+1)
i
=
Y
_
]
_
[
q
t
·=0
1
ˆC(
(· )
)
ˆ◊
i
2
2
, for Adagrad,
s
(t)
i
+(1)
1
ˆC(
(t)
)
ˆ◊
i
2
2
, for RMSprop,
(4.22)
where the recursive relationship for RMSprop has the initial value at
s
(0)
i
=0. RMSprop is
also parameterized by a decay parameter
œ
[0
,
1) with typical values at
=0
.
999 implying
that for R MSprop the sequence
{s
(t)
i
}
is a relatively slow exponential smoothing of the
square of the derivative.
18
The element-wise product is also called the Hadamard product or Schur product.
132
i
i
i
i
i
i
i
i
4.3 Adaptive Optimization with ADAM
Adagrad may appear simpler than RMSprop since it do e s not involve any recursive step or
any hyper-parameter like
and it is simply an accumulation of the squares of the derivatives.
However, the crucial drawback of Adagrad is that for each
i
,thesequence
{s
(t)
i
}
is strictly
non-deceasing. As a consequence, the e ec tive learning rate decreases during training, often
making it infinitesimally small before convergence. RMSprop was introduced in the context
of deep learning after Adagrad, and overcomes this problem via exponential smoothing of
the squares of the partial derivatives.
It is also useful to use
(4.15)
to see that for RMSprop, the explicit (all-time) representation
of the vector s
(t+1)
is,
s
(t+1)
=(1 )
t
ÿ
·=0
t·
1
ÒC(
(·)
) §ÒC(
(·)
)
2
, (4.23)
while for Adagrad a similar representation holds without the (1
) and
t·
elements in
the formula.
Bias Correction for Exp onenti al Smoothing
We have seen that both momentum and RMSprop involve exponential smoothing of the
gradients and squares of the gradients respectively. In both of these cases, we set zero initial
conditions at iteration
t
=0. Specifically, for momentum, we have
v
(0)
=0and for RMSprop,
we have
s
(0)
=0. In the absence of any better initial conditions, this is a sensible choice,
yet such initial conditions may introduce some initial (short-term) bias. This bias would
eventually “wash away” as
t
grows but it can play a significant role for short term
t
.The
eect of such bias is esp e cially pronounced when the res pective exponential smoothing
parameters or are close to 1.
To mitigate such temporal bias, a common technique is to use bias correction.Toseethis,first
observe from
(4.14)
that since the initial value
¯u
(0)
=0, the first element of the exponential
smoothing is
¯u
(1)
=(1 )u
(0)
.
Since
is usually set near 1,
¯u
(1)
is close to zero even when
u
(0)
is not close to 0. Consequently,
¯u
(2)
stays close to zero even if u
(1)
is away from 0 because
¯u
(2)
= ¯u
(1)
+(1)u
(1)
.
Thus, initial values of the exponential smoothing sequence
¯u
(0)
, ¯u
(1)
,...
remain close to 0
even when the elements of original sequence u
(0)
,u
(1)
,... are far from 0.
One way to handle such bias is to consider the special case where the input vectors in
(4.15)
are fixed at u
(t)
= u. In such a case, via simple evaluation of a geometric sum we have,
¯u
(t+1)
=(1 )
1
t
ÿ
·=0
t·
2
u =(1 )
1
t
ÿ
·=0
·
2
u =(1
t+1
)u.
Hence if we were to divide the exponentially smoothed value
¯u
(t)
by 1
t+1
, we would
get constant vectors when
u
(t)
=
u
(constant). Further, in (more realistic) cases where the
underlying input sequence
u
(0)
,u
(1)
,u
(2)
...
is not constant but close to a constant, the bias
133
i
i
i
i
i
i
i
i
4 Optimization Algorithms
corrected sequence
{¯u
(t)
/
(1
t
)
}
may still do a better job at mitigating initial e ects.
Clearly as t grows,
t
æ 0, and the eect of this bias correction disappears.
Now we may apply such bias correction to momentum or RMSprop where we have the
specific forms
(4.18)
or
(4.23)
, respectively. In these cases, the bias corrected updates for
v
and s are,
ˆv
(t+1)
=
1
1
t+1
v
(t+1)
and ˆs
(t+1)
=
1
1
t+1
s
(t+1)
,t=0, 1, 2,.... (4.24)
Putting the Pieces Together: ADAM
Now that we understand ideas of momentum, RMSprop, and bias correction, we can piece
these together into a single algorithm, namely the adaptive moment estimation method, or
simply ADAM. Metaphorically, if the execution of the momentum method is a ball rolling
down a slope, the execution of ADAM can be seen as a heavy ball with friction rolling down
the slope.
The key update formula for ADAM is
(t+1)
=
(t)
1
Ô
ˆs
(t+1)
+ Á
ˆv
(t+1)
, (4.25)
where all vector ope rations (division, square root, and addition of
Á
) are interpreted element
wise. Here
ˆv
(t+1)
and
ˆs
(t+1)
are bias corrected exponential smoothing of the gradient and
the squared gradients as given in (4.24).
Algorithm 4.2: ADAM
Input: Dataset D = {(x
(1)
,y
(1)
),...,(x
(n)
,y
(n)
)},
objective function C(·)=C(·; D), and
initial parameter vector
init
Output: Approximatly optimal
1 t Ω 0 (Initialize iteration counter)
2 Ω
init
, v Ω 0, and s Ω 0 (Initialize state vectors)
3 repeat
4 g ΩÒC () (Compute gradient)
5 v Ω v +(1) g (Momentum update)
6 s Ω s +(1 )(g § g) (Second moment update)
7 ˆv Ω
v
1
t+1
(Bias correction)
8 ˆs Ω
s
1
t+1
(Bias correction)
9 Ω
1
Ô
ˆs + Á
ˆv (Update parameters)
10 t Ω t +1
11 until termination condition is satisfied
12 return
134
i
i
i
i
i
i
i
i
4.4 Automatic Dierentiation
With ADAM,
is still called the learning rate and is still the most important parameter
which one needs to tune. The other parameters are
œ
[0
,
1) for the momentum exponential
smoothing as used in
(4.16)
and
œ
[0
,
1) for the RMSprop exponential smoothing as is
used in (4.22). The common defaults for these parameters are =0.9 and =0.999.
ADAM is presented in Algorithm 4.2 where Step 4 is the gradient computation, steps 5 and
6 are exponential smoothing (momentum and RMSprop), steps 7 and 8 are bias corrections,
and finally Step 9 is the descent step.
Since the introduction of ADAM in 2014, this algorithm became the most widely used
algorithm (or “optimizer”) in deep learning frameworks. In the next section we start to
focus on what is typically the most costly computation within the algorithm, namely Step 4,
where we evaluate the gradient. Further generalizations of ADAM and other variations of
first-order methods are presented in Section 4.5.
4.4 Automatic Dierentiation
The computation of gradient vectors is a crucial ste p in the ADAM algorithm or in any
other first-order optimization method. Practically, in deep learning, the most common
way for computing such gradients is called automatic dierentiation and is embodied in
the backpropagation algorithm. In this section we introduce general concepts of automatic
dierentiation and then in Section 5.4 we specialize to the backpropagation algorithm for
deep neural networks.
All popular methods for computing derivatives, gradients, Jacobians, and Hessians follow
one of two approaches. In one approach, the algebraic expressions of the derivatives are
computed first and then the derivatives for a particular given point are evaluated. For this
approach the expressions of the derivatives are either derived manually, which is generally a
tedious process, or using computer based symbolic dierentiation methods. For example, we
have seen explicit gradient expressions in
(2.21)
for linear regression and
(3.17)
for logistic
regression. However, for general deep learning models, explicit expressions are not with such
compact form and instead suer from the problem of expression swell. Thus an alternative
approach is to compute the numerical values of the derivatives at a given point directly.
This approach includes standard methods of numerical dierentiation as well as automatic
dierentiation which is our focus here.
In this section, we first present an overview of numerical and symbolic dierentiation. We
then outline key ideas of dierentiable programming which is a programming paradigm
that e ncompass es automatic dierentiation. We then present forward mode automatic
dierentiation which is a stepping stone towards understanding backward mode automatic
dierentiation. Finally, we present backward mode automatic dierentiation.
Numerical and Symbolic Dierentiation
Supp ose we would like to compute the gradient of the objective function
C
(
) at a given
parameter vector
. Numerical dierentiation methods use the definition of the partial
derivative, see
(A.5)
in Section A.2, to compute the gradient
ÒC
(
) approximately. In
particular, the most basic form of numerical die rentiation approximates the partial derivative
135
i
i
i
i
i
i
i
i
4 Optimization Algorithms
via,
ˆC()
ˆ◊
i
¥
C( + he
i
) C()
h
, (4.26)
for a small constant
h>
0,where
e
i
is the
i
-th unit vector. Thus to obtain a numerical
estimate of
ÒC
(
) we need to evaluate
C
(
·
) at
, and
+
he
i
for
i
=1
,...,d
, and e ach time
use (4.26).
A classic problem with numerical dierentiation is selection of the constant
h
.While
mathematically, smaller
h
provides a better approximation, numerically, small
h
yields
numerical instability due to round-o errors.
19
Further, in the context of deep learning
where
is of very high dimension, the major drawback of numerical dierentiation is that it
requires us to perform an order of
d
function evaluations to compute the gradient at a point.
The basic alternative to numerical dierentiation is symbolic dierentiation. With this
paradigm, instead of directly obtaining numerical values of derivatives, we represent ex-
pressions using computer algebra systems and obtain mathematical expressions for the
derivatives using automated algorithms based on the rules of calculus. At its core, symbolic
dierentiation is a very useful tool. However, with deep learning, trying to rely on symbolic
dierentiation solely is not practical. A key problem is expression swell where the exact
mathematical representation of partial derivatives of loss functions ass ociated with deep
neural networks may often require excessive resources due to the complexity of the resulting
expressions.
Since in deep le arning we are mainly concerned with numerical values of the derivatives
but not with their analytical expressions, the application of symbolic dierentiation can
be unwieldy. However, instead of trying to find the expression of the objective function
directly, if it is decomposed into several elementary operations, then we can numerically
compute the gradients of the objec tive function by obtaining symbolic derivative expressions
of these elementary operations while keeping track of the intermediate numerical values in a
sequential manner. This idea of interleaving between symbolic expressions and numerical
evaluations at elementary levels is the basis of automatic dierentiation.
Overview of Dierentiable Programming
Let us now introduce basic terminology of automatic dierentiation within the world of
dierentiable programming. While for deep learning, the central application is associated
with the multi-input scalar-output loss
C
:
R
d
æ R
, to get a general feel for dierentiable
programming and automatic dierentiation, let us first consider general functions,
g
:
R
d
æ
R
m
. That is, when m>1, the outputs are vector valued.
Dierentiable programming refers to a programming paradigm where numerical computer
programs, or code for computer functions, are transformed into other functions which
execute the derivatives, gradients, Jacobians, Hessians, or higher-order derivatives of the
original computer functions. For example, consider a case with
d
=3and
m
=2and some
hypothetical computer programming language with some programmed function
g()
which
appears in code as follows:
g(t1, t2, t3) = (t1 - 7*t2 + 5*t3, t2*t3)
19
There are other similar schemes that generally exhibit less numerical error than
(4.26)
,yetanynumerical
dierentiation scheme requires tuning of h.
136
i
i
i
i
i
i
i
i
4.4 Automatic Dierentiation
We interpret this co de as a function which operates on three input numbers
1
,
2
, and
3
and then returns two numbers of which the first is a linear combination of the three inputs,
1
7
2
+5
3
, and the second is a product of two of the inputs,
2
3
. From a programming
perspective it is a pure function in the sense that it does not depend on any other variables
and does not change any other variables in the system. Thus, the computer function
g()
implements a mathematical function g : R
3
æ R
2
with,
g(
1
,
2
,
3
)=
!
g
1
(
1
,
2
,
3
),g
2
(
1
,
2
,
3
)
"
,
where g
1
: R
3
æ R and g
2
: R
3
æ R.
With dierentiable programming and automatic dierentiation, we are interes ted in evaluation
of partial derivatives such as
ˆg
i
/ˆ◊
j
for
i
=1
,
2 and
j
=1
,
2
,
3, at specific values of
1
,
2
, and
3
. For this simple example these derivatives are obviously known analytically and
constitute the Jacobian matrix of g;see(A.9) in Appendix A. In our case, the Jacobian is,
J
g
(
1
,
2
,
3
)=
5
1 75
0
3
2
6
,
where the i, j element of the Jacobian J
g
is ˆg
i
/ˆ◊
j
.
In its most basic form, a dierentiable programming system provides constructs such as
partial_derivative() which gives us the ability to write code such as for example:
partial_derivative(target=g, t=[2.5, 2.1, 1.4], i=2, j=3)
The meaning of such code is a request of the computer to evaluate the numerical value of the
Jacobian of the target function
g
at
=(2
.
5
,
2
.
1
,
1
.
4) for
i
=2and
j
=3. That is, we expect
such code to compute
ˆg
2
/ˆ◊
3
and return 2
.
1 as a result in this case. In contrast to symbolic
or numerical dierentiation, this computation is to be carried out through mechanisms of
automatic dierentiation.
In addition to such general partial derivatives via
partial_derivative()
, we may also ex-
pect dierentiable programming frameworks to expose software constructs such as
gradient()
for computing gradient vectors of functions
g
:
R
d
æ R
,
jacobian()
for computing the
Jacobian matrix of functions
g
:
R
d
æ R
m
, and
hessian()
for Hessian matrices of functions
g
:
R
d
æ R
. In any case, the mechanism of computation of these operations typically involves
a transformation of the source code of the original functions, or some data structures that
represent those functions, into new code or data s tructures that encode how derivative
evaluation is carried out. See the notes and references at the end of the chapter for references
to contemporary concrete software frameworks that execute such op erations.
Beyond
partial_derivative()
,
gradient()
,
jacobian()
, and
hessian()
, there are also
two other basic constructs in a dierentiable programming framework. These are jacobian
vector products,
jvp()
, and vector Jacobian products,
vjp()
. A Jacobian vector product is
based on
œ R
d
and
v œ R
d
and is the evaluation of
J
g
(
)
v
which results in a vector in
R
m
.
Similarly a vector Jacobian product is based on
œ R
d
and
u œ R
m
and is the evaluation of
u
J
g
(
) which results in a row vector whose transpose is in
R
d
.Duetothewayinwhich
automatic dierentiation operates,
jvp()
and
vjp()
are in fact the basic operations that
one may expect from a dierentiable programming framework. Whereas the other operations,
partial_derivative(), etc. are all built on top of either jvp() or vjp().
137
i
i
i
i
i
i
i
i
4 Optimization Algorithms
Automatic dierentiation generally works in two types of modes, namely forward mode which
is the algorithm behind
jvp()
and backward mode, also known as reverse mo de, which is
the algorithm behind
vjp()
. In both cases the multivariate chain rule plays a key role; see
Section A.3 in the appendix for a review of the chain rule. The general principle of automatic
dierentiation is based on computing intermediate partial derivatives and incorporating these
computed values in an iterative computation. Forward mode automatic dierentiation is a
straightforward application of the multivariable chain rule while backward mode is slightly
more involved. Note that the backpropagation algorithm for dee p learning is a special case
of backward mode. The subsections below focus on the inner workings of forward mode
and backward mode automatic dierentiation, but first let us see how
jvp()
and
vjp()
are
generally used in dierentiable programming.
As stated above, the basic service provided by forward mode automatic dierentiation is the
computation of Jacobian vector products. For instance, returning to the simple
d
=3,
m
=2
example above, we may invoke forward mode automatic dierentiation via code such as,
jvp(target=g, t=[2.5, 2.1, 1.4], v=[0,1,0]),
Here we choose
=(2
.
5
,
2
.
1
,
1
.
4) as before and
v
to be the unit vector
e
2
.Theresultisin
this case is
[-7, 1.4]
.Setting
v
as a unit vector
20
can be useful because in general when
using
v
=
e
j
, the output
J
g
(
)
e
j
is the
j
-th column of the Jacobian matrix. Hence in this
example of
jvp()
, we obtain as output the e e ct of
2
on the two outputs
g
1
(
) and
g
2
(
)
which is the vector (ˆg
1
/ˆ◊
2
, ˆg
2
/ˆ◊
2
)=(7, 1.4).
This shows that to implement
partial_derivative
for some
ˆg
i
/ˆ◊
j
at
, we can compute
J
g
(
)
e
j
(the
j
-th column of the Jacobian matrix) using
jvp()
and take the
i
-th index of
the output. Specifically a single application of
jvp()
on
v
=
e
j
provides us with the partial
derivatives
ˆg
i
/ˆ◊
j
for all
i
=1
,...,m
. Hence vector Jacobian products provided by forward
mode automatic dierentiation are useful when the output
m
is large and we wish to only
see the eect of a single, or a few input variables
j
on the output.
However, note that in the deep learning loss function application where
m
=1and
d
is large, vector Jacobian products are not an ecient means for evaluating the gradient.
Specifically, when we wish to evaluate
gradient()
using
jvp()
we need to invoke
jvp() d
separate times, each time with
v
=
e
j
for
j
=1
,...,d
. Thus using forward mode automatic
dierentiation (Jacobian vector products) is not ecient for deep learning loss landscape
gradient evaluation.
The alternative to
jvp()
is to use vector Jacobian products as exposed via backward mode
automatic dierentiation. With this approach we may invoke code such as,
vjp(target=g, t=[2.5, 2.1, 1.4], u=[1,0])
In this example we use the same
as above and set
u
=
e
1
œ R
2
. Since the vector Jacobian
product evaluates
u
J
g
(
) the output is a row vector of dimension
d
=3which is the first
row of
J
g
(
), namely
Òg
1
(
)
. More generally applying Jacobian vector products on
u
=
e
i
yields the transposed gradient Òg
i
()
.
In the deep learning loss minimization scenario
d
is very large and
m
=1. Thus the Jacobian
is simply the transpose of the gradient and is a long row vector. In such a case, computing
20
We can also use more general (non unit vectors) and obtain directional derivatives, yet in the remainder
of our discussion inputs to Jacobian vector products (
v
)andvectorJacobianproducts(
u
)areunitvectors.
138
i
i
i
i
i
i
i
i
4.4 Automatic Dierentiation
the Jacobian vector product with
u
=1(scalar) yields the gradient of the loss function
directly. Thus since backward mode automatic dierentiation implements
vjp()
, using it on
the loss function, is directly suited for deep learning.
We now explore the inner workings of forward mode automatic dierentiation followed by
backward mode automatic dierentiation. In b oth cases we restrict to
m
=1. Understanding
forward mode is a useful stepping stone to understanding backward mode. Note that forward
mode and backward mode involve internal implicit computations of Jacobian vector products
or vector Jacobian products, respectively.
21
The Computational Graph and Forward Mode Automatic Dierentiation
Computational graphs play a key role in the implementa tion of automatic dierentiation. A
computational graph for a multivariate function
C
:
R
d
æ R
is a directed graph where the
nodes correspond to elementary operations involved in the mathematical expression of
C
(
).
Inputs to the graph are the variables
1
,...,
d
and constants if required, and the output is
the function value C().
1
sin
2
+
exp
+ C()
3
2
1
2
3
4
5 6
a
1
a
4
a
2
a
5
a
6
a
3
Figure 4.7:
The computational graph of
(4.27)
with nodes numbered 1 to 6. The primal
a
i
of each
node is the result of an elementary operation applied on its inputs.
As a simple example with d =3, consider,
C()=sin(
1
2
)+exp(
2
+
3
) 2
3
. (4.27)
The computational graph for this function, presented in Figure 4.7, is c omposed of nodes with
inputs and outputs. The output of each node is an elementary operation on the inputs to the
21
It may be also useful to see the mathematical representations of Jacobian vector products and vector
Jacobian products through a comp osition of functions. For this we may see
(A.16)
,
(A.17)
,and
(A.18)
,in
Appendix A.
139
i
i
i
i
i
i
i
i
4 Optimization Algorithms
node. These elem entary operations include summation, multiplication,
sin
(
·
)
, cos
(
·
)
, exp
(
·
),
etc. Nodes are numbered 1
,
2
,...
, and we denote the output of node
i
with
a
i
=
f
i
(
u
1
,...,u
¸
i
)
when there are
¸
i
inputs to node
i
, denoted as
u
1
,...,u
¸
i
. In this example there are 6 nodes,
and taking node 6, as an illustration, we have
a
6
=
f
6
(
a
3
,a
4
,a
5
) where
¸
6
=3and
f
6
(
·
) is
the summation of its input arguments. Note that
a
i
is sometimes called the primal of that
node in a given computation.
The evaluation at each node in the computational graph is executed at the instance in which
inputs to the node become available. This then implies the notion of iterations, where a node
is considered to be evaluated at a given iteration if the inputs to that node are all available
at that iteration but not previously. As an example, with
=(
/
6
,
2
,
5), the iterations for
(4.27) are summarized in Table 4.1.
Tab le 4. 1:
Evaluation of the value of
C
(
) in
(4.27)
at an example
via the computational graph
in Figure 4.7 computing the primals a
i
for i =1,...,6.
General expressions of primals Values of primals at specific
Input =(
1
,
2
,
3
) =(/6, 2, 5)
Iteration 1
a
1
=
1
2
a
2
=
2
+
3
a
3
= 2
3
a
1
= /3
a
2
=7
a
3
= 10
Iteration 2
a
4
=sin(a
1
)
a
5
=exp(a
2
)
a
4
=
Ô
3/2
a
5
=e
7
Iteration 3 (output)
a
6
= a
4
+ a
5
+ a
3
= C()
a
6
= C()
=
Ô
3/2+e
7
10
As mentioned above, forward mode automatic dierentiation implements Jacobian vector
products. In our exposition here, since the target function
C
(
·
) is scalar valued (
m
=1),
the Jacobian is the transpose of the gradient and with
v œ R
d
, the Jacobian vector product,
J
C
(
)
v
, is a scalar.
22
Further, our exposition focuses on
v
=
e
j
, implying that the output of
forward mode automatic dierentiation is
ˆC
(
)
/ˆ◊
j
. Hence for our exposition here,
j
is
fixed.
The key idea of forward mode automatic dierentiation is to maintain a record of intermediate
derivatives as computation progresses through the computational graph. For this we denote,
˙a
i
=
ˆa
i
ˆ◊
j
, (4.28)
and call
˙a
i
the tangent at no de
i
, for fixed
j
. Now since
a
i
=
f
i
(
u
1
,...,u
¸
i
),usingthe
multivariable chain rule (see Appendix A.3), we have
˙a
i
=
¸
i
ÿ
k=1
ˆf
i
(u
1
,...,u
¸
i
)
ˆu
k
˙u
k
= Òf
i
(u
1
,...,u
¸
i
)
˙u, (4.29)
22
It i s in fact the directional derivative of C(·) in the dire ction v.
140
i
i
i
i
i
i
i
i
4.4 Automatic Dierentiation
with
˙u
k
denoting the tangent of
u
k
and with
˙u
denoting the vector of these tangents.
Returning to Figure 4.7 as an example, we have that
Òf
6
(
u
1
,u
2
,u
3
)=(1
,
1
,
1) and thus,
˙a
6
a
3
a
4
a
5
. Similarly, in that figure, Òf
4
(u
1
) = cos(u
1
) and thus ˙a
4
= cos(a
1
a
1
.
Since
(4.29)
defines a recursive computation of the tangents, we also require the inputs to
the function
1
,...,
d
to have tangents. We set these as the elements of the vector
v œ R
d
used in the Jacobian vector product. Specifically, here we focus on Jacobian vector products
with
v
=
e
j
since we are seeking the derivative with respect to a specific input
j
. Hence we
set the tangent values for the inputs of the function as
˙
j
=1and
˙
i
=0for i = j.
For forward mode automatic dierentiation, we progress on the computational graph in
the forward direction from the input to the output, evaluating both the primal
a
i
and
the tangent
˙a
i
at each node. As we progress, in each iteration, we consider the set of all
nodes
i
whose predecessors already have primal values and tanget values available, and for
each of these nodes we compute the primal values as was already shown in Table 4.1, and
importantly also compute the tangent values via
(4.29)
. This computation may be broken
into iterations, in a similar way to the iterations of the function evaluation computed in
Table 4.1.
In particular, since
f
i
(
·
) denotes an elementary operation, such as the function
exp
(
·
) at
node 5 in Figure 4.7, we assume the availability of a symbolic expression for each
ˆf
i
(u
1
,...,u
¸
i
)
ˆu
k
,
or alternatively an ability to numerically compute it exactly. Therefore, since we know the
numerical primal values
u
1
,...,u
¸
i
and the numerical tangent values
˙u
1
,..., ˙u
¸
i
, we can
easily compute
˙a
i
using
(4.29)
. Consequently, the numerical value of
ˆC()
ˆ◊
j
can be computed
at the end of forward mode as it is the tangent value of the output node of the computational
graph.
Table 4.2 illustrates forward mode automatic dierentiation using the example function
C
(
)
in
(4.27)
. The first column focuses on the general expression and the other column computes
the partial derivative of C() with respect to
2
at a specific point .
Tab le 4. 2:
Forward mo de automatic dierentiation for the example function
(4.27)
and
j
=2,
i.e., we seek the derivative with respect to
2
. The right column illustrates computation
23
of the
derivative
ˆC()
ˆ◊
2
at =(/6, 2, 5) which is obtained via the final tangent value ˙a
6
= /12 + e
7
.
General expressions of Values of primals and tangents at
primals and tangents specific for j =2
Start
=(
1
,
2
,
3
)
˙
=(
˙
1
,
˙
2
,
˙
3
)
=(/6, 2, 5)
˙
=(0, 1, 0)
Iteration 1
(a
1
, ˙a
1
)=(
1
2
,
1
˙
2
+
˙
1
2
)
(a
2
, ˙a
2
)=(
2
+
3
,
˙
2
+
˙
3
)
(a
3
, ˙a
3
)=(2
3
, 2
˙
3
)
(a
1
, ˙a
1
)=(/3, /6)
(a
2
, ˙a
2
)=(7, 1)
(a
3
, ˙a
3
)=(10, 0)
Iteration 2
(a
4
, ˙a
4
)=(sin(a
1
), cos(a
1
a
1
)
(a
5
, ˙a
5
)=(exp(a
2
), exp(a
2
a
2
)
(a
4
, ˙a
4
)=(
Ô
3/2, /12)
(a
5
, ˙a
5
)=(e
7
, e
7
)
Iteration 3
(a
6
, ˙a
6
)=
5
a
3
+ a
4
+ a
5
˙a
3
a
4
a
5
6
(a
6
, ˙a
6
)=
5
Ô
3/2+e
7
10
/12 + e
7
6
141
i
i
i
i
i
i
i
i
4 Optimization Algorithms
Backward Mode Automatic Dierentiation
As mentioned above, backward mode automatic dierentiation implements vector Jacobian
products. This allows us to use a single run of backward mode automatic dierentiation
to compute the gradient vector for
C
:
R
d
æ R
. For deep learning this is a significant
improvement over forward mode automatic dierentiation which would require
d
executions
to obtain such a gradient.
For a given point
, backward mode automatic dierentiation is executed in two phases often
called the forward pass and backward pass. The forward pass phase is executed before the
backward pass phase. In the forward pass, the algorithm simply evaluates the computational
graph to compute the values of the intermediate primal variables
a
i
while recording the
dependencies of these variables in a bookkeeping manner, similar to the evaluation in
Table 4.1.
In the backward pass, all the derivatives of the objective function are computed by using
intermediate derivatives called adjoints. Sp e cifically, for each intermediate variable
a
i
,the
adjoint is
i
=
ˆC
ˆa
i
. (4.30)
Adjoint
i
captures the rate of change in the final output
C
with respect to variable
a
i
.
Contrast this with the tangent
˙a
i
from forward mode automatic dierentiation, as in
(4.28)
,
which captures the rate of change of a
i
with respect to an input variable.
In the backward pass we populate the values of the adjoint variables
i
. Specifically, we
progress on the computational graph in the backward direction starting from the output
node and propagating towards the input. For this, say that the output of node
i
is input to
˜
¸
i
nodes with indices
i
1
,...,i
˜
¸
i
. Then the only way that
a
i
can influence the output
C
is by
influencing a
i
1
,...,a
i
˜
¸
i
. Now using the multivariable chain rule,
ˆC
ˆa
i
=
˜
¸
i
ÿ
k=1
ˆa
i
k
ˆa
i
ˆC
ˆa
i
k
,
and thus using the notation of (4.30),
i
=
˜
¸
i
ÿ
k=1
ˆa
i
k
ˆa
i
i
k
. (4.31)
The relationship
(4.31)
is used in iterations of the backward pass. In every iteration, the
adjoints
i
k
in the right hand side of
(4.31)
are available from previous iterations and are
used to compute
i
. In particular at the start of the backward pass we initilize the adjoint
of the output node at 1. Further, recalling the notation used for forward mode automatic
dierenation,
a
i
k
=
f
i
k
(
u
1
,...,u
¸
i
k
) where
u
1
,...,u
¸
i
k
are the inputs to no de
i
k
and
f
i
k
(
·
)
has readily computable derivatives with respect to its inputs. Hence, since the primals of
u
1
,...,u
¸
i
k
were already populated during the forward pass, we compute
ˆa
i
k
ˆa
i
locally at
node i and use it in (4.31).
23
To verify the computation recall the basic linearity and pro duct rules of scalar dierentiation. Also,
remember that the derivative of
sin
(
·
) is
cos
(
·
),thatthederivativeof
exp
(
·
) is
exp
(
·
),andthat
cos
(
/
3) = 1
/
2.
142
i
i
i
i
i
i
i
i
4.5 Additional Techniques for First-Order Methods
Tab le 4.3 :
The backward pass of backward mode automatic dierentiation for the example function
(4.27)
to compute its gradient at
=(
/
6
,
2
,
5). The result is
ÒC
(
)=(1
, /
12 +
e
7
,e
7
2). This
is computed after the forward pass is executed; see Table 4.1.
General expressions of adjoints Values of adjoints at specific
Start
6
=
ˆC()
ˆa
6
6
=1
Iteration 1
4
=
ˆa
6
ˆa
4
6
=1
6
5
=
ˆa
6
ˆa
5
6
=1
6
3
=
ˆa
6
ˆa
3
6
=1
6
4
=1
5
=1
3
=1
Iteration 2
1
=
ˆa
4
ˆa
1
4
= cos(a
1
)
4
2
=
ˆa
5
ˆa
2
5
=exp(a
2
)
5
1
= cos(/3) = 1/2
2
=e
7
Iteration 3
ˆC()
ˆ◊
1
=
ˆa
1
ˆ◊
1
1
=
2
1
ˆC()
ˆ◊
2
=
ˆa
1
ˆ◊
2
1
+
ˆa
2
ˆ◊
2
2
=
1
1
+1
2
ˆC()
ˆ◊
3
=
ˆa
2
ˆ◊
3
2
+
ˆa
3
ˆ◊
3
3
=1
2
2
3
ˆC()
ˆ◊
1
=2 1/2=1
ˆC()
ˆ◊
2
= /6 1/2+e
7
= /12 + e
7
ˆC()
ˆ◊
3
=e
7
2
The elements of the gradient
ÒC
(
) depend on the adjoints in a similar manner to
(4.31)
.
Specifically,
ˆC
ˆ◊
j
=
ÿ
k
ˆa
j
k
ˆ◊
j
j
k
, (4.32)
where the summation is over all nodes indexed via
j
k
that take
j
as input in the computa-
tional graph. Thus, in the final iteration of the backward pass, we use
(4.32)
to obtain the
gradient ÒC().
Let us return to the example in
(4.27)
with computational graph
(4.7)
evaluated at
=
(
/
6
,
2
,
5). Evaluation of
ÒC
(
) via backward mode automatic dierentiation first executes
a forward pass as previously illustrated in Table 4.1. This populates the primals
a
1
,...,a
6
.
Then the backward pass progresses via repeated application of
(4.31)
and with application
of
(4.32)
in the final iteration. This backward pass computation is summarized in Table 4.3.
4.5 Additional Techniques for First-Order Methods
In Section 4.3 we have seen one of the most popular deep learning optimization algorithms,
ADAM, as well as key milestone techniques that have led to its development. Then in
143
i
i
i
i
i
i
i
i
4 Optimization Algorithms
Section 4.4 we explored ways to compute the gradient
ÒC
(
). In this section, we explore
other first-order optimization ideas that are less popular in practical deep learning, but still
embody useful principles.
We begin the discussion by considering a modification of the momentum technique, to a
variant called Nesterov momentum.
24
The idea of this method has also found its way into
an algorithm called Nadam (Nesterov ADAM). We continue by considering variants of
Adagrad and RMSProp for adaptive learning rates per coordinate. The variants we describe
are called Adadelta and Adamax. While as of today, these additional techniques are not
common in deep learning practice, they have made significant impact earlier on and may
be fruitful in the future as well. In a sense, momentum (covered in Section 4.3), ADAM,
Nesterov momentum, Nadam, Adagrad (covered in Section 4.3), Adadelta, and Adamax are
all variants and improvements of basic gradient descent. Thus, our exposition in this section
aims to complete the picture on how basic gradient descent may be improved.
We close this section with an overview of line search techniques. With such techniques,
once a descent direction is determined, we seek to move in that direction with a step size
which minimizes loss over the next step in that direction. There are several variants for this
approach, and we present an overview of these variants.
Nesterov Momentum and the Nadam Algorithm
We have seen that the momentum updates as in
(4.16)
and
(4.17)
accelerate like a ball
rolling downhill. An issue with this method is that the steps do not slow down after reaching
the b ottom of a valley, and hence, there is a tendency to overshoot the valley floor. Therefore,
if we know approximately where the ball will be after each step, we can slow down the ball
before the hill slopes up again.
The idea of Nesterov momentum tries to improve on standard momentum by using the
gradient at the predicted future position instead of the gradient of the current position. The
update equations are,
v
(t+1)
= v
(t)
+(1)ÒC(
(t)
–—v
(t)
¸ ˚˙ ˝
predicted
next point
), (4.33)
(t+1)
=
(t)
v
(t+1)
¸ ˚˙ ˝
actual
next point
, (4.34)
for constants
œ
[0
,
1) and
>
0,with
v
(1)
=0. Compare
(4.33)
and
(4.34)
with
(4.16)
and
(4.17)
.Thedierence here is that the gradient is computed at a predicted next point
(t)
–—v
(t)
, which is a proxy for the (unseen) actual next point
(t+1)
=
(t)
v
(t+1)
when ¥ 1.
Implementing Nesterov momentum via
(4.33)
and
(4.34)
requires evaluation of the gradient
at the predicted points instead of actual points. This may incur overheads, especially when
incorporated as part of other algorithms. For example, if we also require
ÒC
(
(t)
) at each
iteration for purpuses such as RMSprop, then the gradient needs to be computed twice
instead of once per iteration; once for
ÒC
(
(t)
) and once for
ÒC
(
(t)
–—v
(t)
). For this
24
Nesterov momentum is also sometimes called Nesterov acceleration or Nesterov accelerated gradient.
144
i
i
i
i
i
i
i
i
4.5 Additional Techniques for First-Order Methods
reason, and also for simplicity of implementing gradients only at
(t)
,alook-ahead momentum
mechanism is sometimes used in place of (4.33) and (4.34).
To see how this mechanism works, first revisit the basic momentum update equations
(4.16)
and (4.17) and observe that the parameter up date c an be represented as,
(t+1)
=
(t)
!
v
(t)
+(1)ÒC(
(t)
)
"
. (4.35)
Observe that
v
(t)
is based on gradients at points
(0)
,...,
(t1)
, but not on the gradient at
(t)
. Hence a way to incorporate this last gradient is with look-ahead momentum where we
replace v
(t)
of (4.35)byv
(t+1)
. This achieves behaviour similar to Nesterov momentum.
With such a replacement, we arrive at update equations of the form,
v
(t+1)
= v
(t)
+(1)ÒC
!
(t)
"
, (4.36)
(t+1)
=
(t)
!
v
(t+1)
+(1)ÒC
!
(t)
"
¸ ˚˙ ˝
look-ahead
momentum
"
. (4.37)
Using
(4.36)
and
(4.37)
aims to provide similar behaviour to the Nesterov momentum
equations
(4.33)
and
(4.34)
, yet only requires evaluation of gradients at
(t)
as opposed
to a shifted point as in
(4.33)
. While look-ahead momentum is not equivalent to Nesterov
momentum, the gist of both methods is similar.
Now with update equations like
(4.36)
and
(4.37)
, it is straightforward to adapt ADAM to
include be havior similar to Nesterov momentum. This is done using similar steps to those
used to construct ADAM using momentum, RMSprop, and bias correction in Section 4.3.
The dierence here is that we use the look-ahead momentum equations
(4.36)
and
(4.37)
.
This algorithm is called Nadam.
In short, Nadam can be viewed exactly as the ADAM procedure in Algorithm 4.2,yetwith
line 7 replaced by,
ˆv Ω
1
t+2
v +
1
1
t+1
g.
In this cas e,
ˆv
is not just a bias corrected version of
v
as in ADAM since it also incorporates
look-ahead momentum, which is the key e xtra feature that Nadam adds to ADAM.
Adadelta
Recall the computation of
s
(t+1)
for the Adagrad method as in
(4.22)
. When Adagrad’s
s
(t+1)
is used in
(4.21)
, it has the problem of monotonically decreasing eective learning rates.
This is one of the reasons that eventually RMSprop became more popular than Adagrad.
However, there are other alternatives that are also very popular. One such alternative is the
Adadelta method which uses exponential smoothing of the squared gradients as in RMSprop,
but also uses another exponentially smoothed sequence of the descent step’s squares.
A key motivation for Adadelta is the observation that the update equation
(4.21)
for
RMSprop or Adagrad uses a unit-less quantity as the descent step. Specifically, in
(4.21)
the
only unit-full quantity in the c oecient multiplying
ÒC
(
(t)
) is the reciprocal of
Ô
s
(t+1)
.
145
i
i
i
i
i
i
i
i
4 Optimization Algorithms
Hence that coecient has units which are the inverse of the gradient and these cancel out
the units of the gradient implying that the descent step is unit-less.
With Adadelta, the update equation
(4.21)
is modified so that the descent step maintains
the same units of the gradient, via
(t+1)
=
(t)
Ô
(t)
+ Á
Ô
s
(t+1)
+ Á
§ÒC(
(t)
)
¸ ˚˙ ˝
Descent step
Â
ÒC(
(t)
)
, (4.38)
where
(t)
is adaptively adjusted yet has the same units as
s
(t+1)
, making the coecient of
the gradient unit-free. Compare
(4.38)
with
(4.21)
to observe that
Ô
(t)
+
Á
replaces the
learning rate
. This also means that Adagrad is “learning rate free”. Instead, a potentially
more robust parameter
œ
[0
,
1) is used similarly to the
parameter for RMSProp. This
parameter specifies how to exponentially s mooth squares of the descent steps.
Using the descent step of (4.38), the update equation for
(t)
is,
(t)
=
(t1)
+(1)
1
Â
ÒC(
(t1)
) §
Â
ÒC(
(t1)
)
2
,
starting with
(0)
=0. Then at iteration
t
, updating
(t+1)
via
(4.38)
, we use both
(t)
and s
(t+1)
, where the latter is updated via the RMSprop update equation as in (4.22).
Other Norms and Adamax
We have already seen that the quantities
s
(t)
i
used in RMSprop or Adadelta as in
(4.22)
are useful for adaptive learning rates per coordinate. With these methods, we may loosley
interpret
Ò
s
(t)
i
as an estimate of the standard (
L
2
) Euclidean norm of the s equence of
gradients for coordinate i, smoothed up to time t.
In principal, one may wish to use other
L
p
norms (see Appendix A.1) in place of the
Euclidean norm. For this we may modify the RMSprop recursion as in (4.22) to the form,
s
(t+1)
i
=
p
s
(t)
i
+(1
p
)|g
(t)
i
|
p
, with g
(t)
i
=
ˆC(
(t)
)
ˆ◊
i
, (4.39)
and then interpret
!
s
(t)
i
"
1
p
as an estimate of the smoothed
L
p
norm of the sequence of
gradients for coordinate
i
up to time
t
. Note that for convenience of the calculations below,
we reparameterize the exponential smoothing parameter to be
p
in place of .
With a recursion such as
(4.39)
we can obtain an “ADAM like” algorithm that updates
parameters via an update rule such as,
(t+1)
=
(t)
1
!
ˆs
(t+1)
"
1/p
+ Á
ˆv
(t+1)
, (4.40)
where
ˆv
(t+1)
is some bias-corrected momentum value, and
ˆs
(t+1)
is a bias corrected value
obtained from
(4.39)
; compare
(4.40)
to
(4.25)
. This additional hyper-parameter
p
,deter-
mining which
L
p
norm to use, can then potentially be tuned. However, for non-small
p
146
i
i
i
i
i
i
i
i
4.5 Additional Techniques for First-Order Methods
computations may often exhibit excessive numerical error due to overflow since we are raising
the gradient coordinate value to high powers and then taking a low power of it.
Adamax is a variant of this approach which instead of using the
L
p
norm uses the
L
Œ
norm.
Specifically in place of (s
(t)
i
)
1
p
we use r
(t)
i
which is recursively defined as,
r
(t+1)
i
= max
Ë
r
(t)
i
,g
(t)
i
È
, with r
(0)
i
=0. (4.41)
Then in place of (4.40)weuse
(t+1)
=
(t)
1
r
(t+1)
+ Á
ˆv
(t+1)
. (4.42)
Note that the updates
(4.41)
and
(4.42)
are well justified as approximations of
(4.39)
and
(4.40). To see this recall that a sequence of L
p
norms converges to the L
Œ
norm and thus,
r
(t+1)
i
=lim
pæŒ
1
s
(t+1)
i
2
1
p
=lim
pæŒ
S
U
(1
p
)
1/p
A
t
ÿ
·=0
p(t·)
|g
(·)
i
|
p
B
1
p
T
V
=lim
pæŒ
A
t
ÿ
·=0
1
(t·)
|g
(·)
i
|
2
p
B
1
p
= max
·=0,...,t
(t·)
-
-
g
(·)
i
-
-
= max
5
max
·=0,...,t1
(t1·)
|g
(·)
i
|, |g
(t)
i
|
6
.
Here, moving from the first line to the second line, we made use of the general representation
of exponential smoothing
(4.15)
, and moving from the third line to the fourth line we use
the fact that
L
p
vector norms converge to the
L
Œ
vector norm as
p æŒ
.Thisthenjusties
(4.41) as a limiting case.
Line Search
So far all the methods we considered were variants of gradient descent. When these are
viewed as a special case of the general descent direction method of Algorithm 4.1, updates
are of the form
(t+1)
=
(t)
+
(t)
s
. In the case of basic gradient descent we have descent
steps
(t)
s
=
ÒC
(
(t)
), yet in general we can represent the descent step as
(t)
s
=
–◊
(t)
d
where we call
(t)
d
a descent direction. With basic gradient descent
(t)
d
=
≠ÒC
(
(t)
). In any
case, for some prescribed descent direction, the step size
Î
(t)
s
Î
is determined by the choice
of .
Line search is an approach where we seek the best step size for a given descent direction in
each iteration. Specifically, in iteration
t
, given a descent direction
(t)
d
,wedeterminethe
for that iteration, denoted as
(t)
, via,
(t)
= arg min
C
1
(t)
+ –◊
(t)
d
2
. (4.43)
147
i
i
i
i
i
i
i
i
4 Optimization Algorithms
That is, each iteration involves a one dimensional minimization problem determining
(t)
which is used for the update
(t+1)
=
(t)
+
(t)
(t)
d
.
Hence with line search, the next point
(t+1)
is a minimizer of C(·) along the ray,
{
(t)
+ –◊
(t)
d
: > 0}. (4.44)
When we pick
(t)
to exactly optimize
(4.43)
, the method is called exact line search.Invery
specific cases exact line search can be achieved explicitly with closed form formulas. However
in most cases, if we wish to carry out exact line search, we need to numerically optimize the
one dimensional optimization problem
(4.43)
for the b es t
. An alternative to exact line
search is inexact line search where we only probe the loss on the ray
(4.44)
for a few values
of . Details are in the sequel.
(a) (b)
Figure 4.8:
One iteration of line search applied to a Rosenbrook function of
(4.9)
. (a) With the
current point
(t)
=(0
.
8
,
0) and the descent direction
(t)
d
= ≠ÒC(
(t)
)
, we optimize over a given
ray. (b) The value of the loss function along the ray with optimal
(t)
¥ 0.00165.
Note that in practice, for deep learning, neither exact nor inexact line search are used
directly. One reason is that loss function evaluations
C
(
·
) are computationally demanding
and are of a similar order to the computational cost of gradient evaluation. Hence practice
has shown that one might as well update descent directions instead of optimizing loss over a
fixed ray
(4.44)
. A second reason is that with deep learning we almost always have noisy
gradients using stochastic gradient descent or mini-batches, and hence searching for an
optimal
on a noisy direction can be practically less useful. Empirical evidence has shown
that in such cases, line search techniques are seldom useful on their own right. Nevertheless,
understanding line search is important as part of a general background for optimization and
is also used in quasi-Newton methods described in Section 4.6 which are sometimes used in
deep learning.
As a basic illustration of line search, return to the Rosenbrock function of
(4.9)
with contours
plotted in Figure 4.8 (a). In this plot we are at a point
(t)
with the des ce nt direction
148
i
i
i
i
i
i
i
i
4.5 Additional Techniques for First-Order Methods
(t)
d
=
≠ÒC
(
(t)
). The plot in Figure 4.8 (b) presents the one dimensional cross section along
the ray
(4.44)
with that descent direction where we see that the optimal (exact numerical
line search) is at
(t)
.
It is interesting to analyze the case
(t)
d
= ≠ÒC(
(t)
)
which we call basic gradient descent
with exact line search. In this case, trajectories of such optimization result in “zig-zagging”.
Specifically we have that every two consecutive descent steps are orthogonal. Namely,
(t)
s
(t+1)
s
=0 or g
(t)
g
(t+1)
=0,
with
g
(t)
denoting
ÒC
(
(t)
). This property is illustrated in Figure 4.9 where we consider
gradient descent with exact line search applied to the Rosenbrock function
(4.9)
. We start
with initial condition
(0)
=(0
.
4
,
0
.
2) from which the first two steps take very long jumps.
However, afterwards once the search is in a narrow valley, the zig-zagging phenomenon is
apparent and incurs a significant eective slow down on the rate of advance towards the
optimal point.
Figure 4.9:
Evolution of basic gradient descent with exact line search. We see the zig-zagging
property which slows down progress in narrow valleys.
To see the derivation of the zig-zagging property, observe that at
which optimizes
(4.43)
,
we have,
ˆC
!
(t)
g
(t)
"
ˆ–
=0,
149
i
i
i
i
i
i
i
i
4 Optimization Algorithms
or written in terms of the basic definition of the derivative we have,
lim
hæ0
C
!
(t)
( + h)g
(t)
"
C
!
(t)
g
(t)
"
h
=0. (4.45)
We now use the first-order Taylor series expansion as in
(A.22)
of
C
(
·
) around the point
(t)
g
(t)
for the point
(t)
g
(t)
hg
(t)
.Thisis,
C(
(t)
g
(t)
hg
(t)
)=C(
(t)
g
(t)
) hg
(t)
ÒC
1
(t)
g
(t)
2
+ O(h
2
),
where
O
(
h
2
) is a function such that
O
(
h
2
)
/h
2
goes to a constant as
h æ
0.Substitutingin
(4.45) we obtain,
0 = lim
hæ0
hg
(t)
ÒC
!
(t)
g
(t)
"
+ O(h
2
)
h
= g
(t)
ÒC
1
(t)
g
(t)
2
= g
(t)
g
(t+1)
,
where the last equality holds because g
(t+1)
= ÒC
!
(t)
g
(t)
"
.
We now visit a powerful numerical technique called the conjugate gradient method.Thisisa
topic which may receive special treatment in a general optimization text, yet here we only
present it in brief as an application of exact line search. Conjugate gradient is not typically
used directly for deep learning, yet since it is useful for approximately solving certain large
systems of linear equations, it sometimes finds its way as an aid to other deep learning
algorithms.
Suppose the objective function is of the following quadratic form,
C()=
1
2
A b
, (4.46)
where
A
is a
d d
dimensional symmetric pos itive definite matrix, and
b œ R
d
.Wehave
the Hessian
Ò
2
C
(
)=
A
for every
and since
A
is positive definite,
C
(
·
) is strictly convex
and there is a unique global minimum. We further have
ÒC
(
)=
A b
and this means
that the equation
ÒC
(
)=0with the unique solution
ú
=
A
1
b
minimizes
C
(
). Hence
the system of equations,
A = b, with A p os itive definite, (4.47)
can be solved by solving the optimization problem of minimizing
(4.46)
. Thus one may
view the conjugate gradient method as an algorithm for (approximately) solving a linear
system of equations such as
(4.47)
. In particular when the dimension
d
is excessively large,
the conjugate gradient method can be ecient while standard techniques s uch as Gaussian
elimination may fail. We have already seen one application of such equations in Chapter 2,
with the normal equations (2.15)whereA = X
X and b = X
y.
When carrying out line search on an objective function as
(4.46)
, the cross section above
the ray
(4.44)
for any direction is a parabola and hence exact line search has an explicit
150
i
i
i
i
i
i
i
i
4.5 Additional Techniques for First-Order Methods
solution for
(4.43)
. Conjugate gradient carries out iterations of such line search, where in
each iteration the search direction
(t)
d
is wisely chosen.
Specifically, at iteration
t
with current point
(t)
and search direction
(t)
d
, the next point
determined via (4.43)with(4.46)is,
(t+1)
=
(t)
+
(t)
(t)
d
, with
(t)
=
(t)
d
!
A
(t)
b
"
(t)
d
A
(t)
d
.
Then the next search direction is determined as linear combination of the next gradient and
the current search direction. Specifically,
(t+1)
d
= ≠ÒC(
(t+1)
)+
(t)
(t)
d
, with
(t)
=
ÒC(
(t+1)
)
A
(t)
d
(t)
d
A
(t)
d
.
This coecient
(t)
is designed such that the current search direction
(t)
d
and the next
search direction
(t+1)
d
are conjugate with respect to
A
in the sense that,
(t+1)
d
A
(t)
d
=0.
This conjugacy ensures desirable properties for the method and in particular implies that a
minimum is reached in
d
iterations. Analysis and further motivation of conjugate gradient
are beyond our scope. Our presentation here is merely to show that conjugate gradient is an
application of exact line search.
25
Inexact Line Search
As discussed above, one may use exact line search to solve the univariate optimization
(4.43)
optimally in each iteration. An alternative is to only probe a few specific values of
(t)
according to some predefined set of rules and stop at the best probed value. This is called
inexact line search and it is especially attractive when there is no analytical solution of
(4.43)
. Inexact line search may significantly reduce the computational cost per iteration in
comparison to the repetitive application of a univariate numerical optimization technique.
A typical approach for inexact line search is backtracking (also known as backtracking line
search). The idea is to start with some predetermined maximal
0
>
0 and then decrease it
multiplicatively until a stopping condition is met. Spec ifically, we evaluate
C
(
·
) on the ray
(4.44) with candidate values such as
0
,
2
3
0
,
1
2
3
2
2
0
,
1
2
3
2
3
0
,...,
where the decrease factor of
2
3
can be tuned to be any factor in (0
,
1). We then stop at the
instant at which the stopping condition is met. Note that this search takes place at every
iteration t.
One very common stopping condition is known as the Armijo condition or the first Wolfe
condition. Under this condition, at each iteration
t
, we decrease
values as above until it
25
Note that in each iteration, the conjugate gradient traverses the steepest direction under the norm
ÎuÎ
A
:=
Ô
u
Au, u œ R
d
.
151
i
i
i
i
i
i
i
i
4 Optimization Algorithms
satisfies
C
1
(t)
+ –◊
(t)
d
2
Æ C(
(t)
)+ —–Ò
d
C(
(t)
), (4.48)
where
œ
[0
,
1] is a parameter and
Ò
d
is shorthand notation for the directional derivative
in the direction
(t)
d
; for directional derivatives recall (A.8) from Appendix A. That is,
Ò
d
C(
(t)
)=Ò
(t)
d
C(
(t)
)=
(t)
d
ÒC(
(t)
).
To get an intuition for
(4.48)
, suppose that
=1, then the right hand side of
(4.48)
is
equal to the first-order Taylor approximation of
C
1
(t)
+ –◊
(t)
d
2
. Therefore, the decrease
in the ob jective suggested by the first Wolfe condition is at least as good as the prediction
by the first-order approximation. On the other extreme, if
=0,weselect
such that
C
1
(t)
+ –◊
(t)
d
2
Æ C
(
(t)
), which is also acce ptable as it guarantees that the next step at
least does not increase the objective.
Beyond the condition
(4.48)
, there are additional common stopping conditions that one may
employ together with backtracking line search. One such condition known as the second
Wolfe condition is,
Ò
d
C
1
(t)
+ –◊
(t)
d
2
Ø Ò
d
C(
(t)
), (4.49)
where œ (0, 1). Another common condition know n as the strong Wolfe condition is,
-
-
-
Ò
d
C
1
(t)
+ –◊
(t)
d
2
-
-
-
Æ≠Ò
d
C(
(t)
). (4.50)
This condition forces
(t)
to lie clos e to the solution of exact line search. The first Wolfe
condition and the above strong Wolfe condition together are called the strong Wolfe conditions .
These conditions are sometimes used with quasi-Newton methods described below.
4.6 Concepts of Second-Order Methods
All the optimization methods we studied so far were based on gradient evaluation and im-
plicitly use the first-order Taylor approximation of the objective function
C
(
). In particular,
the gradient of the objective function helps us determine the direction of the next step.
However, by using curvature information captured via the second derivatives, or Hessian
Ò
2
C(), algorithms can take steps that move farther and are more precise.
In this section, we present some well-known second-order Taylor approximation based
optimization methods, simply referred to as second-order methods. These methods also fall
within the general descent direction framework of Algorithm 4.1, yet their evaluation of the
descent step
(t)
s
either uses the Hessian explicitly, or employs some approximated estimate
for Hessian vector products.
While today’s deep learning practice rarely involves second-order methods, these methods
have huge potential. Indeed, if e mployed eciently, their application in general deep learning
training can yie ld significant performance improvement. Hence, understanding basic principles
of second-order techniques is useful for the mathematical engineering of deep learning.
152
i
i
i
i
i
i
i
i
4.6 Concepts of Second-Order Methods
This section starts by exploring simple concepts of second-order univariate optimization
where we introduce Newton’s method which is based on the first and second derivative of
the objective function. We also present the secant method which is only based on the first
derivative. We then move on to Newton’s method in the general multivariate case where the
Hessian matrix plays a key role. Indeed, for simple neural networks such as logistic regression
of Chapter 3, a method such as Newton is highly applicable.
For general deep learning with
œ R
d
,where
d
is typically huge and the loss function is
complicated, it is hopeless to use explicit expressions or automatic dierentiation for the
Hessian
Ò
2
C
(
). In such a case, one may use quasi-Newton methods that only approximate
the Hessian via gradients. Such application of second-order methods to deep learning
is a contemporary active area of research which we are not able to cover fully. Instead,
our presentation builds up towards a popular method called the limited-memory BFGS
(Broyden–Fletcher–Goldfarb–Shanno) algorithm, or L-BFGS for short. With this we aim to
open up horizons for the reader to the multitude of optimization technique variations that
one may consider.
The Univariate Case
Newton’s method is an iterative method that uses the quadratic approximation, also called
the second-order Taylor approximation
(A.22)
, of the objective function around the current
point and finds the next step by optimizing the quadratic function. For ease of understanding,
first consider the univariate case where
œ R
. Assume that the objective function
C
(
)
is twice dierentiable. Suppose we are at
(t)
in the
t
-th iteration and let
Q
(t)
(
) be the
quadratic approximation of the objec tive function C() around
(t)
, given by
C() ¥ Q
(t)
()=C(
(t)
)+(
(t)
)C
Õ
(
(t)
)+
(
(t)
)
2
2
C
ÕÕ
(
(t)
),
where C
Õ
(·) and C
ÕÕ
(·) are the first and second derivatives respectively.
Note that
Q
(t)
(
) depends on
(t)
. Further observe that
C
(
) and
Q
(t)
(
) take the same value
when
=
(t)
.Tofindthenextstep,
(t+1)
,weminimize
Q
(t)
(
) by setting its derivative to
zero and obtain,
C
Õ
(
(t)
)+(
(t)
)C
ÕÕ
(
(t)
)=0,
with the corresponding solution (for the variable ),
(t+1)
=
(t)
C
Õ
(
(t)
)
C
ÕÕ
(
(t)
)
, (4.51)
whenever
C
ÕÕ
(
(t)
)
=0. Newton’s method (for univariate function optimization) starts at
some initial point
(0)
and then iterates
(4.51)
until some specified stopping criterion is met.
If converged to some point
ú
,then
C
Õ
(
ú
)=0implying that the point is a local minimum,
local maximum, or saddle point.
26
Any quadratic function is either strictly convex or strictly concave depending on the leading
coecient. Therefore, if
C
(
) itself a strictly convex quadratic function, it has a unique global
minimum, and since the third order derivative of
C
(
) is zero, the quadratic approximation
26
Note that when considered in the context of general root finding of the form
F
(
)=0beyond the case
of F (·)=C
Õ
(·),themethodissometimescalledNewton-Raphson.
153
i
i
i
i
i
i
i
i
4 Optimization Algorithms
Q
(t)
(
) is identically equal to
C
(
). In such a case, Newton’s method finds the minimum
in a single iteration of
(4.51)
for any initial point
(0)
. This property hints at the speed at
which Newton’s method can move towards the minimum since any smooth function is well
approximated by a quadratic function in the neighbourhood of a local minimum.
(a) (b)
Figure 4.10:
Illustration of updates in Newton’s method. (a) The approximating quadratic function
Q
(t)
( ) is concave. (b) The approximating quadratic function Q
(t)
( ) is convex.
While potentially very ecient, in its basic form, Newton’s method exhibits several instability
issues. First observe that the method can end up in a local maximum even though the goal is
minimization. This is made apparent in Figure 4.10 which illustrates single step updates with
Newton’s method on a function that has both a local maximum and a local minimum. In (a)
a step is taken and it approaches (yet overshoots) the local maximum. In (b) a step is taken
(and also ove rshoots) the local minimum. As is evident, the initial point
(t)
for both (a) and
(b) are near each other, yet the subtle dierence in the sign of
C
ÕÕ
(
(t)
) implies movement in
completely dierent directions. In both cases, with another iteration the extremum point
would have been nearly reached. That is (a) would end up very near the local maximum
and similarly with another iteration (b) ends up very near the local minimum.
In general, the Newton update
(4.51)
is very sensitive to the second derivative and if
C
ÕÕ
(
(t)
) is very close to zero, then
(t+1)
can b e too far from
(t)
, and thus, the quadratic
approximation is not appropriate. That is, inflection points
27
on the
œ R
axis cause
havoc with Newton’s method. Similarly, as we see in the sequel for the multivariate case,
points
œ R
d
where the Hessian
Ò
2
(
) is nearly singular (eigenvalues near zero) are also
problematic in the same way.
To gain insights into some of the issues that may occur with Ne wton’s method, let us consider
some examples. The sensitivity of the method can cause overshoot of the minimum. Consider
for example
C()=( 3)
2
( + 3). (4.52)
We have
C
Õ
(
) = 3(
3)(
+ 1),
C
ÕÕ
(
) = 6(
1), and the function has a local minimum
at
ú
=3with C(
ú
)=0. The descent step resulting from (4.51)isthus,
(t)
s
=
C
Õ
(
(t)
)
C
ÕÕ
(
(t)
)
=
1
2
(
(t)
3)(
(t)
+ 1)
1
(t)
.
27
An inflection point for a twice dierentiable function C : R æ R is a point where C
ÕÕ
()=0.
154
i
i
i
i
i
i
i
i
4.6 Concepts of Second-Order Methods
Say we start at
(0)
=1
.
5 where
C
(
(0)
) = 10
.
125. At this point the descent step is
(0)
s
=3
.
75 implying that
(1)
=
(0)
+
(0)
s
=5
.
25. This overshoot of the local minimum
yields
C
(
(1)
)
¥
41
.
77 which is about a four-fold increase in the loss value. Note though that
in this case (as the reader may readily verify via simple calculations), the iterations that
follow (presented to 4 decimal points precision) are,
(2)
=3.5956,
(3)
=3.0683,
(4)
=3.0011,
and in fact the error for
(6)
is already less than 10
13
. Thus we see in this example that
there is initial significant overshoot which is later corrected and is followed by very quick
convergence.
Worse than an overshoot is a situation which results in moving away from the local minimum
of interest. We have already seen an example of this in Figure 4.10 (a) where starting to the
left of the inflection point yields convergence to a local maximum. Further, for the example
function (4.52)thiswilloccurif
(0)
< 1.
An additional phenomenon that may occur is oscillation. As an extreme example, consider,
C()=
1
4
4
+
5
2
2
, which implies
(t)
s
=
(t)
(
(t)
2
5)
5 3
(t)
2
.
If
(t)
=1, then we get
(t)
s
=
2 and thus
(t+1)
=
1. Further, in the next step we get
(t+1)
s
=2which brings us back to
(t+2)
=1. Such undesirable perfect oscillation is a
singular phenomenon and is not likely to occur in practice. However approximate oscillation
may sometimes occur and persist for multiple iterations until dampening.
In practice, and especially when considering multi-dimensional generalizations, trust region
methods can mitigate problems such as overshoot, movement in the wrong direction, or
oscillations. These methods incorporate conditions which disallow steps of large step size
beyond some predefined or adaptive thresholds. Trust region methods also incorporate
first-order methods as a backup. We do not cover the details further.
A final concern with Newton’s method is that it requires both the first derivative
C
Õ
(
·
)
and the second derivative
C
ÕÕ
(
·
). In some scenarios these derivatives are easy to obtain,
yet in some other cases the univariate objective may be a complicated function and even
the application of automatic dierentiation techniques may be inadequate for evaluating
C
ÕÕ
(
·
). This motivates the introduction of the secant method which is a modification of
Newton’s method where the second derivative in each iteration is approximated by the last
two iterations. Namely, we use the approximation,
C
ÕÕ
(
(t)
) ¥
C
Õ
(
(t)
) C
Õ
(
(t1)
)
(t)
(t1)
, (4.53)
which becomes precise when
(t)
and
(t1)
are not too far from each other. With this
approximation, up dates of the s ec ant method are given by
(t+1)
=
(t)
(t)
(t1)
C
Õ
(
(t)
) C
Õ
(
(t1)
)
C
Õ
(
(t)
),
155
i
i
i
i
i
i
i
i
4 Optimization Algorithms
where we initialize
(0)
and
(1)
to be close but not equal. In general, the secant method
may require more iterations for convergence than Newton’s method. The study of general
numerical analysis and optimization theory contains an in-depth analysis of such convergence
rates. This is a topic that we do not cover further.
The secant method also suers from the problems associated with Newton’s method men-
tioned above, including overshoot, moving in the wrong direction, and oscillation. Neverthe-
less, like Newton’s method, the secant method is generally very powerful and arrives to a
local minimum as long as the initial points are within the vicinity of the minimum.
Importantly, note that the secant method can be viewed as a quasi-Newton method since it
approximates the Hessian (second derivative of a scalar function) with an expression that
only depends on the first derivative. As we now generalize to multivariate optimization
we should keep in mind that in the context of deep learning, almost any second-order
method that one may wish to use should be a quasi-Newton method based on first-order
derivatives (gradients), as opposed to explicit or automatically dierentiated Hessians (second
derivatives). Nevertheless, let us first study Newton’s method for multivariate functions.
The Multivariate Case and Hessians
For a twice dierentiable multivariate objective function
C
:
R
d
æ R
the update of New ton’s
method is very similar to the univariate case of
(4.51)
. Specifically, the gradient and the
Hessian play the roles of the first and the second-order derivatives, respectivly. We can
represent this update as,
(t+1)
=
(t)
H
1
t
ÒC(
(t)
), (4.54)
where
H
t
=
Ò
2
C
(
(t)
) is the Hessian of the objective function at
(t)
, and when
H
t
is
non-singular; compare (4.54)with(4.51).
To see the development of
(4.54)
, as in the univariate case consider the quadratic approxi-
mation Q
(t)
() of C() around
(t)
given by
C() ¥ Q
(t)
()=C(
(t)
)+(
(t)
)
ÒC(
(t)
)+
1
2
(
(t)
)
H
t
(
(t)
). (4.55)
Then, similar to the univariate case, the next point
(t+1)
can be determined as the
such
that the gradient of Q
(t)
() is zero. Observe that this gradient is
ÒQ
(t)
()=ÒC(
(t)
)+H
t
(
(t)
),
and by equating ÒQ
(t)
() to 0 and representing
(t)
as the descent step
(t)
s
,
H
t
(t)
s
= ≠ÒC(
(t)
). (4.56)
This linear systems of equations for
(t)
s
can be represented in terms of the elegant update
equation
(4.54)
with descent step
(t)
s
=
H
1
t
ÒC
(
(t)
), in case
H
t
is non-singular. However,
in practice, we typically try to solve
(4.56)
using a suitable linear solver.
28
Sometimes
28
Suitability of an algorithm depends on the size of
d
and any special structure in
H
t
that we can expect.
For example for small
d
,onemayuseLUdecomposition(Gaussianeliminiation),orQRfactorizationbased
methods, yet for large d conjugate gradient or other Krylov subspace methods may b e suitable.
156
i
i
i
i
i
i
i
i
4.6 Concepts of Second-Order Methods
approximate solutions for
(4.56)
are also an option. For example one may use the conjugate
gradient method when
d
is large and run for a small number of iterations to only approximately
solve (4.56).
In summary, Newton’s algorithm for multivariate
C
(
·
) starts with some
(0)
œ R
d
.Then
at each iteration
t
in the algorithm with current point
(t)
, we (in principle) evaluate the
gradient and Hessian at that point. We then solve or approximately solve
(4.56)
to obtain
(t)
s
for the up date
(t+1)
=
(t)
+
(t)
s
.
In principle, in each iteration, solving
(4.56)
requires the Hessian matrix
H
t
=
Ò
2
C
(
(t)
)
either via a closed form formula as for example in the case of logistic regression
(4.7)
, or
more generally via automatic dierentiation. However with large
d
, computing the Hessian
or Hessian vector products via automatic dierentiation is ge nerally not feasible. Hence
in general, Newton’s method on its own is not suitable for deep learning. Nevertheless,
quasi-Newton methods such as the L-BFGS method, presented below, or various adaptations
can be employed.
We also mention that all of the potential problems discussed above for the univariate case
can appear in the multivariate case. Overshoot, moving in the wrong direction, or oscillation
can occur due to similar reasons as the univariate case. Specifically at points
(t)
that are far
from the local minimum, the quadratic approximation
Q
(t)
(
) may be far from
C
(
).Further,
there are problems that may occur due to a singular, near-singular, or ill conditioned Hessian
matrix
H
t
. In this case the descent step solution to
(4.56)
may be very noisy.
29
Nevertheless,
Newton’s method is very powerful and often when improved via trust region methods and
similar techniques, it outperforms first-order methods very well.
To build intuition of why Newton’s method often outperforms first-order methods, let us
first explore a relationship between the two approaches. If we use the constant diagonal
matrix
1
I
in place of the Hessian
H
t
, then Newton’s method reduces to the basic gradient
descent method with update
(2.20)
where the descent step is
(t)
s
=
ÒC
(
).Toseethis,
return to the quadratic approximation (4.55)withH
t
replaced by
1
I, to obtain,
Â
Q
(t)
()=C(
(t)
)+(
(t)
)
ÒC(
(t)
)+
1
2
(
(t)
)
(
(t)
), (4.57)
and thus,
Ò
Â
Q
(t)
()=ÒC(
(t)
)+
1
(
(t)
).
Equating this gradient expression to 0 yields the update
(2.20)
. Hence, we notice that the
inverse of
H
t
in Newton’s method plays the role of the learning rate
of basic gradient
descent. As a consequence, with this basic form of Newton’s method, there is no need to
calibrate the learning rate, as we do in most gradient descent approaches.
30
Finally let us get intuition of why Newton’s method generally converges to a (local) minimum
ú
faster than gradient descent when starting at a point close to
ú
. For this, return to
Figure 2.8 which compares two loss landscapes, one of which is circular and one of which is
elliptical. While that figure is created via a quadratic form where Netwon’s method would
29
In pract i ce one often monitors the level of ill-conditioning via the condition number of
H
t
,orsimilar
measures.
30
Note that Newton’s method can also be cast as a steepest descent method like basic gradient descent,
but with descent direction chosen with respect to the Hessian norm, ÎuÎ
H
t
=
u
H
t
u, u œ R
d
.
157
i
i
i
i
i
i
i
i
4 Optimization Algorithms
reach the minimum in a single iteration, in more realistic cases when functions are not
quadratic forms, they can still be locally well approximated by quadratic forms in the vicinity
of
ú
and hence considering Figure 2.8 as a general plot is valid.
It is known that with gradient descent, highly elliptical loss landscapes are dicult for
optimization since descent steps may be trapped in valleys, and thus as argued in Section 2.4
in view of Figure 2.8, one often tries to carry out variable transformations to alleviate such
problems. In a sense the gradient descent quadratic approximation
Â
Q
(t)
(
·
) in
(4.57)
assumes
that the loss landscape is already perfectly circular since it uses a constant diagonal matrix
in place of the Hessian matrix. This is the best that basic gradient descent can oer. Howe ver,
when using Newton’s method, the correct local quadratic approximation
Q
(t)
(
·
) in
(4.55)
already adapts to the curvature and potential true elliptic nature of the loss landscape.
With such an approximation
Q
(t)
(
·
), minimization over quadratic loss landscapes occurs in
a single iteration, and since locally all smooth functions are well approximated by quadratic
loss landscapes, when
(t)
is in the vicinity of
ú
, converge is very quick.
Quasi-Newton Methods
In deep learning, the dimension of
is very large and it is thus dicult to solve the linear
equations
(4.56)
as required in each iteration of Newton’s method. Even in dimensions
where these equations are solvable, there is still the basic problem of computing the Hes sian
matrix, a feat which is typically intractable for mid-size problems and beyond. A popular
alternative is oered by the use of quasi-Newton methods where either the Hessian or the
inverse of the He ssian matrix are approximated with each iteration. While the study of
quasi-Newton methods is mature, and these are used in many areas of applied mathematics
and engineering, the development of be st practices for application in deep learning is still an
active area of research. Here we simply outlines the key ideas of such methods.
The central idea in the type of quasi-Newton methods that we cover is based on an evolving
sequence of
d d
matrices,
B
0
,B
1
,B
2
,...
. At each iteration, some update rule specifies how
B
t
is up dated based on
B
t1
and other quantities from the current and previous iteration.
By the design of the update rule, each such
B
t
is kept as positive semidefinite (and hence
also non-singular), and in our presentation,
B
1
t
is ideally a good approximation
31
for the
Hessian matrix H
t
= Ò
2
C(
(t)
) associated with the current point
(t)
of iteration t.
With such a B
t
matrix available, we now use the quadratic approximation,
Q
(t)
()=C(
(t)
)+(
(t)
)
ÒC(
(t)
)+
1
2
(
(t)
)
B
1
t
(
(t)
), (4.58)
which is similar to Newton’s method quadratic approximation
(4.55)
with the dierence that
now
B
1
t
plays the role of the Hessian. Similarly to the case in Newton’s method,
Q
(t)
(
) of
(4.58)
can be minimized where now the unique minimizer is
B
t
ÒC
(
(t)
). Hence using such
a quadratic approximation we have a descent direction,
(t)
d
= B
t
ÒC(
(t)
). (4.59)
With quasi-Newton methods, we take a step in the direction specified by
(t)
d
, but we also
scale the step with a scalar
(t)
>
0 which c ontrols the step size. Hence the key update rule
31
We note that alternative presentations may use B
t
as an approximation for the Hessian itself.
158
i
i
i
i
i
i
i
i
4.6 Concepts of Second-Order Methods
for quasi-Newton methods can be stated as,
(t+1)
=
(t)
(t)
B
t
ÒC(
(t)
), (4.60)
which is similar to
(4.54)
. The determination of
(t)
is typically obtained via a line search
technique such as inexact line search using the second Wolfe condition (4.49).
Algorithm 4.3: Quasi-Newton
Input: Dataset D = {(x
(1)
,y
(1)
),...,(x
(n)
,y
(n)
)},
objective function C(·)=C(·; D), and
initial parameter vector
init
Output: Approximately optimal
1 Ω
init
2 B Ω B
init
3 repeat
4 g ΩÒC () (Compute gradient)
5
d
Ω≠Bg (Determine descent direction)
6 Determine using and
d
(Line search)
7 Ω + –◊
d
(Update parameters)
8 Update B (Update approximation of Ò
2
C()
1
)
9 until termination condition is satisfied
10 return
The typical quasi-Newton method is presented in Algorithm 4.3. As in any gradient based
algorithm, computation of the gradient is needed, and this is specified in Step 4. The
computation of the descent direction
(4.59)
is in Step 5. In Step 6 we use line search, and
Step 7 updates the current parameter based on
(4.60)
. Finally, Step 8 is the update rule,
specific to the type of quasi-Newton me thod used. In the sequel we present BFGS and
L-BFGS, each with their own specification of this step.
As this method falls within the realm of a descent direction approach as in Algorithm 4.1,
ideally we wish to s et
init
in Step 1, at a point close to a minimum, just like all the other
descent direction approaches in this chapter. As for the initialization matrix
B
init
in Step 2,
ideally we would have a matrix whose inverse is the Hessian at
init
. However, since evaluation
of the Hessian matrix or its inverse is typically computationally costly or intractable, we
often initialize with B
init
= I,ad d identity matrix.
Now with the general outline of the quasi-Newton approach presented, one may seek go od
update rules for Step 8. Before we see concrete formulas for these updates within the BFGS
and L-BFGS algorithms, let us touch on theoretical asp ec ts that helped develop these
methods, and can be used for further analysis of their performance and properties.
Since the Hessian
H
t
captures the curvature information of the objective function, we may
consider the latest two p oints
(t1)
and
(t)
together with the gradients at these points,
ÒC(
(t1)
) and ÒC(
(t)
), for the approximate equality,
H
t
(
(t)
(t1)
) ¥ÒC(
(t)
) ≠ÒC(
(t1)
). (4.61)
159
i
i
i
i
i
i
i
i
4 Optimization Algorithms
This approximation follows from
(A.24)
of Appendix A, and is a goo d approximation when
(t)
and
(t1)
are close. It generalizes
(4.53)
used in the univariate secant method. Note
that (4.61) holds with equality if C( ) is a quadratic function. Further, with the notation,
(t1)
s
=(
(t)
(t1)
), and g
(t1)
s
= ÒC(
(t)
) ≠ÒC(
(t1)
), (4.62)
we have that
(4.61)
is represented as
H
t
(t1)
s
¥ g
(t1)
s
. Now inspired by this approximation,
if we are able to select the
dd
positive definite matrix
B
t
, such that
B
1
t
is an approximation
of H
t
, then by considering the approximation as an equality,
(t1)
s
= B
t
g
(t1)
s
. (4.63)
The equation
(4.63)
is known as the secant equation and it implies that
B
t
maps the change of
gradients
g
(t1)
s
into the descent direction
(t1)
s
. A condition related to the secant equation
is,
g
(t1)
s
(t1)
s
> 0, (4.64)
which is known as the curvature condition. Importantly, it can be shown that when the
curvature condition is satisfied, there is always a positive definite
B
t
such that the s ec ant
equation is satisfied.
32
In fact, there can be infinitely many solutions. In general, quasi-
Newton methods are designed to maintain the curvature condition (4.64) and to provide a
unique up date rule for B
t
.
The curvature condition
(4.64)
allows us to also gain insight into why line search as in
Step 6 of Algorithm 4.3 is needed. In simple or synthetic cases where the loss function
C
(
·
)
is strictly convex, it can be shown that the curvature condition
(4.64)
is always satisfied.
However, loss functions in deep learning are typically highly non-convex, and thus there is
no guarantee of existence of
B
t
such that the secant equation is satisfied, unless we impose
an additional condition on
(t)
of (4.60).
If in each iteration
t
we determine
(t)
using inexact line search with the sec ond Wolfe
condition
(4.49)
, then the curvature condition
(4.64)
is guaranteed to hold.
33
To see this,
suppose that the second Wolfe condition is applied, then from
(t)
=
(t1)
+
(t1)
(t1)
d
,
we obtain
ÒC(
(t)
)
(t1)
d
Ø ÒC(
(t1)
)
(t1)
d
, (4.65)
where we used the relationship between the gradient and the directional derivative shown in
(A.8) of App e ndix A. Now the inner product of the curvature condition (4.64) satisfies,
g
(t1)
s
(t1)
s
= ÒC(
(t)
)
(t1)
s
≠ÒC(
(t1)
)
(t1)
s
Ø “–
(t1)
ÒC(
(t1)
)
(t1)
d
≠ÒC(
(t1)
)
(t1)
s
=( 1)
(t1)
ÒC(
(t1)
)
(t1)
d
= ( 1)
(t1)
ÒC(
(t1)
)
B
t1
ÒC(
(t1)
)
> 0.
32
It can be shown that if for
u, v œ R
d
,
u
v>
0,thenthereexistsasymmetricpositivedenitematrix
B œ R
dd
such that Bu = v.
33
This also holds for the strong Wolfe condition (4.50).
160
i
i
i
i
i
i
i
i
4.6 Concepts of Second-Order Methods
The first equality uses the definition of
g
(t1)
s
and the step size resulting from
(4.60)
,
(t1)
s
=
(t1)
(t1)
d
. The inequality that follows is due to
(4.65)
. Finally, the last inequality
is because
<
1 and
B
t1
is assumed to be positive definite based on the update rule. We
thus see that the second Wolfe condition is a means to ensure that the curvature condition
is satisfied.
The BFG S and L-BFGS Update Rules
Let us now focus on the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, and its
variant, the limited-memory BFGS (L-BFGS) algorithm. In practice, some small deep
learning applications already make ecient use of L-BFGS. We mention that one common
approach in training, is to first carry out training using standard gradient descent or a
variant, and then apply L-BFGS to “complete” the training process. Nevertheless, to date,
the application of L-BFGS and other variants of advanced quasi-Newton methods is still not
widespread in large scale deep learning.
We present the update rules for these two algorithms which specify Step 8 of Algorithm 4.3,
keeping in mind that
B
0
=
B
init
. Each quasi-Newton iteration of Algorithm 4.3 already bears
the computational cost of computing the gradient, the cost of several function evaluations
within the line search, and other costs associated with determining the descent direction and
updating of parameters. The computational cost of the update rule in Step 8 is generally
heavy. We contrast the computational cost of the BFGS update rule and that of the L-BFGS
update rule, below.
The BFGS update rule is,
B
t
= V
t1
B
t1
V
t1
+
(t1)
(t1)
s
(t1)
s
, (4.66)
where the positive scalar
(t1)
and the matrix V
t1
are respectively given by,
(t1)
=
1
g
(t1)
s
(t1)
s
and V
t1
= I
(t1)
g
(t1)
s
(t1)
s
. (4.67)
Observe the use of the rank one matrices
(t1)
s
(t1)
s
and
g
(t1)
s
(t1)
s
using
(4.62)
.The
positivity of
(t1)
is maintained by the curvature condition
(4.64)
. Importantly, it can
be shown that each matrix in the sequence
B
0
,B
1
,...
, resulting from this update rule is
positive definite. We omit the details.
Note that due to the matrix multiplication
V
t1
B
t1
V
t1
, the computational cost of
updating
B
t
using
(4.66)
is
O
(
d
2
). This is a reasonable cost for small
d
, yet for larger
d
as
in some deep learning applications, it renders BFGS to o c ostly.
We mention that the development of the update rule in
(4.66)
and
(4.67)
can be cast as a
minimization problem where we seek the closest possible positive definite
B
t
to the previous
B
t1
under a specific matrix norm. The minimization is subject to the constraint that the
secant equation
(4.63)
holds. We omit the details of this derivation yet mention that the
matrix norm used for BFGS is a weighted Frobenius norm. See the notes and references at
the end of the chapter for more information.
161
i
i
i
i
i
i
i
i
4 Optimization Algorithms
The L-BFGS algorithm overcomes the computational cost of BFGS. Its update rule is not
for maintaining the matrix
B
t
in memory directly, but is rather based on keeping a limited
history (or limited memory) of the constituent vectors
(t)
s
and
g
(t)
s
. Observe that in principle,
one can unroll the BFGS up date rules (4.66) and (4.67) to obtain,
B
t
= V
t1
V
t2
···V
0
B
init
V
0
···V
t2
V
t1
+
(t1)
(t1)
s
(t1)
s
+
(t2)
V
t1
(t2)
s
(t2)
s
V
t1
.
.
.
+
(0)
V
t1
V
t2
···V
0
(0)
s
(0)
s
V
0
···V
t2
V
t1
.
(4.68)
Further observe that this representation of
B
t
is a function of the sequence
{
(k)
s
,
g
(k)
s
}
for
k
=0
,
1
,...,t
1 and
B
init
. Hence in principle, assuming that
B
init
is diagonal, if we wish to
compute the matrix-vector product
B
t
ÒC
(
(t)
), we can do so with all multiplications being
vector-vector multiplications. This is due to the fact that each
V
t
matrix is composed of the
outer products
g
(t1)
s
(t1)
s
. Yet obviously, for non-small
t
such multiplications b e com e
inecient as the number of terms in the expression (4.68) grows.
The L-BFGS idea is to use an approximate version
Â
B
t
of
B
t
that can be stored indirectly
by storing only the late st
m
pairs of vec tors
{g
(k)
s
,
(k)
s
}
, for
k
=
t m, . . . , t
1.The
hyper-parameter
m
is the length of the history (or memory) and is typically chosen to
be much smaller than
d
.Wemodify
(4.68)
with an approximation which limits the his-
tory used to the last
m
iterations. For this we drop the last
t m
terms of
(4.68)
, and
in the first term
V
t1
V
t2
···V
0
B
init
V
0
···V
t2
V
t1
, we replace the middle multiplicands,
V
tm1
···V
0
B
init
V
0
···V
tm1
, with a symmetric positive definite matrix
B
(t)
init
.Thismod-
ification yields
Â
B
t
¥ B
t
, with the form,
Â
B
t
= V
t1
V
t2
···V
tm
B
(t)
init
V
tm
···V
t2
V
t1
+
(t1)
(t1)
s
(t1)
s
+
(t2)
V
t1
(t2)
s
(t2)
s
V
t1
.
.
.
+
(tm)
V
t1
V
t2
···V
tm
(tm1)
s
(tm1)
s
V
tm
···V
t2
V
t1
.
(4.69)
Note that
B
(t)
init
is selected for each iteration
t
and is usually set as a diagonal matrix to
reduce the storage and computational costs. A practically eective choice is
B
(t)
init
=
g
(t1)
s
(t1)
s
g
(t1)
s
g
(t1)
s
I.
With this choice for
B
(t)
init
, the expression
(4.69)
allows us to compute
Â
B
t
ÒC
(
(t)
) eciently. In
particular,
Â
B
t
ÒC
(
(t)
) can be computed using
O
(
m
) inner products between
d
-dimensional
vectors, and thus the per iteration com putational cost for the L-BFGS algorithm is
O
(
md
),
while requiring to store only the latest m pairs of ve ctors {g
(k)
s
,
(k)
s
}, k = t m, . . . , t 1,
162
i
i
i
i
i
i
i
i
4.6 Concepts of Second-Order Methods
which takes
O
(
md
) storage space. With
m
much smaller than
d
, the L-BFGS algorithm
requires far less storage than the BFGS algorithm. Finally we note that in terms of Algo-
rithm 4.3, the up date rule in Step 8 simply needs to push the vectors
{g
(t)
s
,
(t)
s
}
onto a queue
while popping the vectors
{g
(tm)
s
,
(tm)
s
}
out of that queue. The bulk of the algorithm is
in Step 5 with the ecient matrix-vector multiplication using the e xpressions
(4.69)
and
(4.67).
163
i
i
i
i
i
i
i
i
4 Optimization Algorithms
Notes and References
Mathematical optimization is a classical subject and some of the general texts on this subject
are [
37
] where linear optimization is covered, [
272
] where both linear and non-linear optimization
methods are covered, and [
35
] where the focus is on non-linear methods. Further texts focusing on
convexity, convex analysis, and convex optimization are [
27
], [
32
], and [
55
]. Many of the algorithms
in these texts describe descent direction methods, yet some examples of optimization algorithms that
are not descent direction methods include direct methods (also known as zero-order or black box)
such as coordinate and pattern search methods; see [
238
]or[
311
] for an overview. The Rosenbrock
function was developed in [
354
] within the context of multivariate function optimization and since
then it became popular for basic optimization test cases and illustrations.
As mentioned in the Chapter 2 notes and references, gradient descent is attributed to Cauchy
from 1847 with [
72
];see[
253
] for an historical account. Stochastic gradient descent was initially
introduced in 1951 with [
350
] in the context of optimizing the expectation of a random variable
over a parametric family of distributions. See [
229
], [
230
], [
231
], [
255
], and [
349
] for early references
using sto chastic gradient descent approaches. After the initial popularity of this approach for
general optimization, other numerical optimization methods were developed and stochastic gradient
descent’s popularity decreased. Years later, it became relevant again due to developments in deep
learning; see [
48
] and [
52
] for early references in the context of deep learning. Interesting works that
attempt to explain the suitability of stochastic gradient descent for deep learning are [
49
], [
50
], and
[
51
]. Relationships between the use of mini-batches and stochastic gradient descent are studied in
[202], [203], and [261]. Practical recommendations of gradient based methods can be found in [29].
The ADAM algorithm was introduced in [
233
]. See [
43
] and [
345
] for additional papers that study
its performance. Ideas of momentum appeared in optimization early on in [
335
]. RMSprop appeared
in lectures by Geo Hinton in a 2014 course and has since become popular. Adagrad appeared
earlier on in [
112
]. An interesting empirical study considering ADAM and related methods is [
82
].
See [238] and [356] for a detailed study of momentum based methods.
The basic idea of automatic dierentiation dates back to the 1950’s ; see [
25
] and [
313
]. The idea
of computing the partial derivatives using forward mode automatic dierentiation is attributed
to Wengert [
420
]. Even though ideas of backward mode automatic dierentiation first appe ared
around 1960 and used in control theory by the end of that decade. The 1980’s paper [
386
]is
generally attributed for presenting backward mode automatic dierentiation in the form we known
it now. Backward mode automatic dierentiation for deep neural networks is referred to as the
backpropagation algorithm, a concept that we cover in detail in Chapter 5. In fact, ideas for the
backpropagation algorithm were developed independently of some of the automatic dierentiation
ideas; see [
173
] for an historical overview as well as the notes and references at the end of Chapter 5
for more details.
In recent years, with the deep learning revolution, automatic dierentiation has gained major
p opularity and is a core com ponent of deep learning software frameworks such as TensorFlow [
1
],
PyTorch [
324
], and others. In parallel, the application of automatic dierentiation for a variety
of scientific applications has also been popularized with technologies such as Flux.jl [
201
] in Julia
and JAX [
57
] in Python. See [
24
] for a survey in the context of machine learning and [
147
]fora
more dated review. Additional dierentiable programming frameworks that implement automatic
dierentiation include CasADi, Autograd, AutoDi, JAX, Zygote.jl, Enzyme, and Chainer. Much of
the complexity of dierentiable programming software frameworks is centered around the use of
GPUs as well as the placement and management of data in GPU memory.
Nesterov momentum was intro duced in [
305
]. See [
362
], [
363
], and [
431
] for some comparisons of
Nesterov momentum with the standard momentum approach. The Nadam algorithm was introduced
in [
109
] to incorporate Nesterov momentum into ADAM and is now part of some deep learning
frameworks. The Adadelta method appeared in [
440
] prior to the appearance and growth in
p opularity of ADAM. The Adamax method appeared in the same paper as ADAM, [
233
], as an
extension. Line search techniques are central in classical first-order optimization theory. Inexact
line search techniques using Wolfe conditions first appeared in [
422
] and [
423
]. Backtracking line
search, also known as Armijo line search, first appeared in [16]. See for example chapter 3 of [311]
and chapter 4 of [
238
] for overviews of exact and inexact line search algorithms. Concepts of the
conjugate gradient method first appeared in [
177
] for s olving linear equations with symmetric positive
164
i
i
i
i
i
i
i
i
4.6 Concepts of Second-Order Methods
semidefinite matrices. Since then, the method was incorporated in many numerical algorithms; see
for example chapter 11 of [141] and chapter 5 of [311] for more details.
Despite the success of second-order methods in statistics, in general numerical optimization, and in
engineering, they are yet to make it into mainstream large scale deep learning. Perhaps one reason
for this is that when implemented without care, they are much slower than first-order methods for
high-dimensional models such as deep neural networks. Nevertheless, their ecient incorporation in
deep learning is still an active area of research. The origins of Newton’s method, also known as the
Newton-Raphson method, can be dated back to 17th century; see chapter 2 of [
237
] for an account
on the history of optimization methods. For a detailed overview see [47].
Even though the secant method can be thought of as an approximation to Newton’s method, from
an historical perspective, it is several decades older than Newton’s method; see for example [
321
]
for origins and an account of the evolution of the secant method. Quasi-Newton methods are more
mo dern with the first technique, the Davidon-Fletcher-Powell (DFP) method, appearing in 1959
as a technical report [
99
] and later in 1991 published as [
100
]. In the interim, the DFP method
was further modified in the early 1960’s with [
124
]. The BFGS method materialized in the 1960’s
and 1970’s, and the L-BFGS extension was introduced in 1980 with [
310
]. See also [
123
] for an
overview of these methods. See [
98
] and [
266
] for a discussion of convergence properties of BFGS
and L-BFGS respectively. Information about the weighted Frobenius norm used for constructing
BFGS updates is in chapter 6 of [
311
]. A related second order method is the Levenberg–Marquardt
algorithm; see chapter 18 of [
56
] for an introduction. Trust region methods are believed to have
originated from [
254
]; see [
439
] for a survey. Finally we suggest chapter 5 of [
311
] as a review of
quasi-Newton methods.
165