Mathematical Engineering
of Deep Learning
Benoit Liquet, Sarat Moka and Yoni Nazarathy
March 3, 2026
4 Optimization Algorithms
The field of optimization is an important sub-field of applied mathematics. In the context
of deep learning it is just as important. Any form of training process for a deep learning
model requires optimization to minimize the loss. The decision variables are the learned
model parameters, and the data is typically considered fixed from the point of view of
optimization. In this chapter we explore general optimization methods and results, focusing
on the essential tools for optimization in the process of training deep learning models.
In Section 4.1 we formulate optimization problems in a general setup, discuss their forms,
and review local extrema, global extrema, and convexity. In Section 4.2 we specialize to
optimization of learned parameters for deep learning. We discuss the nature of such problems,
and common techniques including stochastic gradient descent, tracking performance measures,
and early stopping. In Section 4.3 we discuss various forms of first-order methods which
extend the basic gradient descent algorithm. The focus is on the ADAM optimizer,
which has become very popular for deep learning. In Section 4.4 we introduce
automatic differentiation, a computational paradigm for efficient evaluation of derivatives.
Here we present both forward and backward mode automatic differentiation in a general
context, and later in the next chapter we specialize to deep learning. We continue with
Section 4.5 where additional first-order methods are presented beyond gradient descent
and ADAM. In Section 4.6 we introduce and present an overview of second-order methods.
These powerful methods are sometimes used in deep learning, and when applicable they can
perform very well. Note that on a first reading of the book, one may focus on Sections 4.1,
4.2, 4.3, and 4.4 before continuing to understand the deep learning models of Chapter ??.
4.1 Formulation of Optimization
Optimization algorithms, such as gradient descent described in Algorithm ?? of Chapter ??,
primarily aim to minimize an objective. We have already seen in previous chapters that
“learning” essentially means “optimization”. Hence, to do well in deep learning, it is important
to understand the implementation of optimization algorithms as they play a crucial role in
minimizing the training loss.
The General Setup
Consider a multivariate real-valued function C : ℝ^d → ℝ. An optimization problem can be
written as

    minimize_θ C(θ)   subject to θ ∈ Θ,    (4.1)
where Θ ⊆ ℝ^d is called the feasible set or the constraint set. In the context of learning, Θ
represents the parameter space and the objective function C(θ) is the loss function.

In other words, the aim of the optimization problem is to find a point θ ∈ Θ where C(θ) is
minimum. Such a point, denoted θ*, is called a solution or minimizer of the optimization
problem (4.1). That is,

    C(θ*) ≤ C(θ),  for all θ ∈ Θ.    (4.2)

There can be multiple solutions to this problem and the set of all the solutions θ* is usually
denoted by

    argmin_{θ ∈ Θ} C(θ).
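For a finite feasible set, the definitions of a minimizer and of argmin can be illustrated directly by brute-force enumeration. The following Python sketch is our own illustration (the objective and the grid are arbitrary choices, not from the text): it evaluates C(θ) = θ₁² + θ₂² over a grid Θ and collects the set of minimizers.

```python
import numpy as np

def C(theta):
    """Objective C(theta) = theta_1^2 + theta_2^2 (the convex bowl of Figure 4.1 (a))."""
    return theta[0] ** 2 + theta[1] ** 2

# A finite feasible set Theta: a coarse grid over the square [-1, 1]^2.
grid = np.linspace(-1.0, 1.0, 21)
Theta = [np.array([a, b]) for a in grid for b in grid]

# argmin over the finite set: collect all points attaining the minimal value.
values = np.array([C(theta) for theta in Theta])
minimizers = [Theta[i] for i in np.flatnonzero(values == values.min())]
```

Here the set of minimizers contains the single point (0, 0), matching (4.2) applied over this grid.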
The formulation (4.1) is general, in the sense that any optimization problem can be written
in this form. In particular, the maximization problem

    maximize_θ C(θ)   subject to θ ∈ Θ

can be stated in the form (4.1) by replacing C(θ) in (4.1) with −C(θ). We say that the
optimization problem is unconstrained if Θ = ℝ^d; otherwise, it is a constrained optimization
problem.
Our focus is primarily on unconstrained optimization and hence the “subject to” component
of (4.1) may be omitted. This is the common case in the context of deep learning, and in
particular it is the case for the optimization problems that we already encountered in the
context of linear regression (Section ??), logistic regression (sections ?? and ??), multinomial
regression (Section ??), and most cases of autoencoders (Section ??).
General optimization theory and practice comes in many shapes and forms where different
methods can be used for different sub-types of the general problem (4.1). In the context
of deep learning we typically make use of multivariate calculus¹ and assume (or construct)
the objective C(·) to have desirable properties, such as continuity, differentiability (almost
everywhere), and other smoothness properties. With this type of optimization, the gradient
∇C and Hessian ∇²C are typically used. The former is the key component of first-order
methods as described in sections 4.3 and 4.5. The latter sometimes plays a role in second-order
methods as described in Section 4.6. Throughout this chapter, when we make use of
the gradient or Hessian, we are implicitly assuming that these objects exist and are well
defined for the problem at hand.
Local and Global Minima
A solution of an unconstrained minimization problem is also known as a global minimum
because of (4.2). However, there can be other points that are not solutions to the optimization
problem but exhibit similar properties within their vicinity. Such points are known as local
minima. In particular, a point θ ∈ ℝ^d is called a local minimum of a continuous function C
if there is an r > 0 such that C(θ) ≤ C(ϕ) for all ϕ ∈ B(θ, r), where B(θ, r) is an open ball
with center θ and radius r > 0 defined by

    B(θ, r) = {ϕ ∈ ℝ^d : ∥θ − ϕ∥ < r}.
¹ Refer to Appendix ?? for a review.
In other words, if θ is a local minimum of C, we cannot decrease the value of C by moving
an infinitesimal distance from θ in any direction; see Figure 4.1 for illustrations of global
and local minima.
Suppose that the objective function C is differentiable at a local minimum θ. Then, it is
well known that the gradient ∇C(θ) = 0. This is a necessary condition for a point to be a
local minimum, but not sufficient. For example, the condition ∇C(θ) = 0 also holds if θ is a
local maximum of C, that is, if θ is a local minimum of −C. Further, the condition ∇C(θ) = 0
can hold even at a point θ that is neither a local minimum nor a local maximum; such
a point is called a saddle point, where the objective function slopes up in one direction and
slopes down in another direction (for smooth functions we define it below using the Hessian
matrix). See Figure 4.1 for examples.
Figure 4.1: Two dimensional smooth functions. (a) The function C(θ) = θ₁² + θ₂² is convex with
a unique global minimum. (b) A well-known function called the peaks function is an example of a
non-convex function with several local minima, local maxima, and saddle points. Around each local
minimum, the function is locally convex. On the figure, two saddle points are marked with red dots.
Convexity and Saddle Points
There are classes of optimization problems for which a local minimum is also guaranteed to
be a global minimum. One such popular class is the family of convex problems. While most
deep learning based optimization problems are not convex, understanding convexity helps to
better understand optimization concepts.
We can establish stronger necessary conditions for a point to be a local minimum via
convexity. A set Θ_C ⊆ ℝ^d is said to be a convex set if for any θ, ϕ ∈ Θ_C and λ ∈ [0, 1],

    λθ + (1 − λ)ϕ ∈ Θ_C.

That is, for a convex set, the line segment connecting any two points in the set remains in
the set. An example of a convex set is an open ball B(θ, r) for any θ ∈ ℝ^d and r > 0.
We are now ready to define convex functions. Suppose that Θ_C ⊆ ℝ^d is a convex set. Then,
a function C : ℝ^d → ℝ is called a convex function on Θ_C if for every θ, ϕ ∈ Θ_C and λ ∈ [0, 1],
we have

    C(λθ + (1 − λ)ϕ) ≤ λC(θ) + (1 − λ)C(ϕ).    (4.3)

Furthermore, we say that the function C is strictly convex on Θ_C if the inequality in (4.3)
holds with strict inequality for all λ ∈ (0, 1) and for every choice of θ, ϕ ∈ Θ_C with θ ≠ ϕ.
If the function C is convex on the entire space ℝ^d, then all the local minima are global
minima. Furthermore, if C is strictly convex on ℝ^d, then there is a unique global minimum.
See Figure 4.1 (a) for an example of a convex function.
If we further assume that C is twice differentiable, then we can characterize convexity and strict
convexity in terms of the Hessian matrix. In particular, C is convex on Θ_C if and only if the
Hessian ∇²C(θ) is positive semidefinite² at every interior point of Θ_C. Furthermore, C is
strictly convex on Θ_C if the Hessian ∇²C(θ) is positive definite at every interior
point of Θ_C (the converse does not hold in general; consider C(θ) = θ⁴ at the origin).
Note that if a point θ is a local minimum of a twice differentiable function C, then there
exists some r > 0 such that C is convex on the open ball B(θ, r), and hence we sometimes
say that C is locally convex around θ. See Figure 4.1 (b) for an example of a function with
several local minima with local convexity around each minimum point.
For twice differentiable functions, it is easy to define saddle points. A point θ with ∇C(θ) = 0
is called a saddle point if the Hessian ∇²C(θ) at θ is neither positive semidefinite nor negative
semidefinite.
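The Hessian-based classification of critical points can be turned into a small numerical check. The sketch below is our own illustration (the helper name, tolerance, and example Hessians are arbitrary choices): it classifies a critical point by inspecting the eigenvalues of ∇²C(θ).

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-10):
    """Classify a critical point (where the gradient vanishes) by the
    eigenvalues of the (symmetric) Hessian, following the text's definitions."""
    eig = np.linalg.eigvalsh(hessian)
    if np.all(eig > tol):
        return "local minimum"       # positive definite
    if np.all(eig < -tol):
        return "local maximum"       # negative definite
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"        # neither PSD nor NSD
    return "inconclusive"            # semidefinite boundary case

# Hessians at theta = 0 for C(theta) = theta_1^2 + theta_2^2 and for
# C(theta) = theta_1^2 - theta_2^2 (illustrative examples).
H_min = np.array([[2.0, 0.0], [0.0, 2.0]])
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])
```

For H_saddle the eigenvalues have mixed signs, so the origin of θ₁² − θ₂² is a saddle point in the sense defined above.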
Objective Functions in Deep Learning
In the context of deep learning, the typical form of the objective function is

    C(θ) := C(θ ; D) = (1/n) Σ_{i=1}^n C_i(θ),    (4.4)
where n is the number of data samples in a given dataset D. We have already seen this
form in (??) where C_i(θ) is the loss computed for the i-th data sample. Specific forms for
this i-th data sample loss include the square loss, cross entropy loss, and other forms that
have already appeared in chapters ?? and ??. In this chapter the data samples play a more
minor role, and hence from now onwards we suppress D from the notation and use C(θ).
The optimization problem that corresponds to the objective function (4.4) is often called
the finite sum problem. For such problems, it is relatively easy to study the properties
of C(θ) by studying the corresponding properties of each C_i(θ), its gradient ∇C_i(θ), and
Hessian ∇²C_i(θ). Most importantly, the form (4.4) plays an important role in finding the
descent direction for optimization methods such as stochastic gradient descent, described
in Section 4.2. When each C_i(θ) is twice differentiable, the gradient and Hessian of the
² See (??) in Appendix ?? for the definition of positive semidefinite and positive definite matrices.
objective C(θ) are respectively given by

    ∇C(θ) = (1/n) Σ_{i=1}^n ∇C_i(θ),  and  ∇²C(θ) = (1/n) Σ_{i=1}^n ∇²C_i(θ).    (4.5)
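The averaging structure in (4.5) is easy to verify numerically. In the sketch below, a hedged illustration of our own (the quadratic per-sample losses C_i(θ) = ∥θ − u_i∥² and the random data are arbitrary choices), the average of the per-sample gradients is checked against the closed-form gradient of the total loss C.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
U = rng.normal(size=(n, d))  # per-sample "targets": C_i(theta) = ||theta - u_i||^2

def grad_Ci(theta, i):
    return 2.0 * (theta - U[i])            # gradient of the i-th sample loss

def grad_C(theta):
    # (4.5): the gradient of the finite sum is the average of per-sample gradients.
    return np.mean([grad_Ci(theta, i) for i in range(n)], axis=0)

theta = rng.normal(size=d)
direct = 2.0 * (theta - U.mean(axis=0))    # gradient of C computed in closed form
```

For this choice of C_i, the average in (4.5) collapses to 2(θ − ū) with ū the mean of the u_i, and the two computations agree to machine precision.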
A local minimum of C(·) at θ can be characterized via ∇²C(θ) being positive semidefinite
in addition to ∇C(θ) = 0. It is useful to note that ∇²C(θ) is positive semidefinite at θ if
each ∇²C_i(θ) is positive semidefinite at θ, because for any ϕ ∈ ℝ^d,

    ϕ⊤ ∇²C(θ) ϕ = (1/n) Σ_{i=1}^n ϕ⊤ ∇²C_i(θ) ϕ.    (4.6)
Convexity of Certain Shallow Neural Networks
While most deep learning optimization problems are not convex, the loss functions of the
logistic regression and multinomial regression models of Chapter ?? are convex. The same
is true for the linear regression model of Chapter ??. For completeness, we now establish
convexity properties of these simple models. The details of this subsection may be skipped
on a first reading.
We begin with the linear regression model of Section ?? with the standard quadratic loss. In
this case the associated optimization problem is (??) and we may compute the Hessian for
the objective C(θ) = ∥y − Xθ∥² to be

    ∇²C(θ) = 2X⊤X,

where we use the notation of the design matrix X from (??). The Hessian is thus a Gram
matrix and this implies it is positive semidefinite.³ Hence the optimization problem is convex.
Further, if X has linearly independent columns then ∇²C(θ) is positive definite and the
problem is strictly convex and has a unique optimal point.
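The claim that the Hessian 2X⊤X is a positive semidefinite Gram matrix, and positive definite when X has linearly independent columns, can be checked numerically. A minimal sketch with a random design matrix of our own choosing (a Gaussian matrix has linearly independent columns with probability one):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 4
X = rng.normal(size=(n, d))     # a random design matrix (columns independent a.s.)

H = 2.0 * X.T @ X               # Hessian of C(theta) = ||y - X theta||^2

# A Gram matrix is PSD: phi^T (2 X^T X) phi = 2 ||X phi||^2 >= 0, and here
# the eigenvalues are in fact strictly positive (positive definite).
eigs = np.linalg.eigvalsh(H)
```

All eigenvalues come out strictly positive, confirming strict convexity of the quadratic loss for this design matrix.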
We now continue to logistic regression of sections ?? and ??. An expression for the gradient
is in (??) and we may follow up with an expression for the Hessian. Specifically, the d × d
Hessian matrix of each C_i(θ) is

    ∇²C_i(θ) = σ_Sig(b + w⊤x^(i)) (1 − σ_Sig(b + w⊤x^(i))) [1  x^(i)⊤]⊤ [1  x^(i)⊤],    (4.7)

which is evidently a product of a strictly positive scalar and a Gram matrix. Thus it is always
positive semidefinite, following the same argument used for linear regression with quadratic
loss.⁴
Now using the argument at (4.6), the matrix ∇²C(θ) is also positive semidefinite and
thus the total loss function C(θ) is convex. Further, in cases where the vectors (1, x^(i)) for
i = 1, . . . , n span ℝ^d (this can only hold when d ≤ n), we cannot find a non-zero vector
ϕ ∈ ℝ^d orthogonal to all the vectors (1, x^(i)). Thus, for any vector ϕ ≠ 0, there exists at
least one i such that ϕ⊤ ∇²C_i(θ) ϕ > 0, and hence ∇²C(θ) is positive definite, making the
total loss function C(θ) strictly convex.
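To illustrate (4.7) and the rank-one observation of the footnote, the following sketch (the data, variable names, and parameter values are our own illustrative choices) builds the per-sample Hessian and confirms that it is positive semidefinite of rank one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian_Ci(b, w, x):
    """Per-sample logistic-regression Hessian as in (4.7): a positive scalar
    sigma(1 - sigma) times the rank-one Gram matrix v v^T with v = (1, x)."""
    v = np.concatenate(([1.0], x))
    s = sigmoid(b + w @ x)
    return s * (1.0 - s) * np.outer(v, v)

rng = np.random.default_rng(2)
x = rng.normal(size=3)
H = hessian_Ci(b=0.5, w=rng.normal(size=3), x=x)
```

Since a single ∇²C_i(θ) has rank one, positive definiteness of the total Hessian can only come from summing over samples whose vectors (1, x^(i)) span ℝ^d, as argued above.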
³ For any matrix A, the associated Gram matrix is A⊤A, and thus ϕ⊤A⊤Aϕ = ∥Aϕ∥² ≥ 0 for any vector
ϕ. Further, with linearly independent columns, Aϕ ≠ 0 for any ϕ ≠ 0 and thus ∥Aϕ∥² > 0, implying that A⊤A is
positive definite.
⁴ Note however that ∇²C_i(θ) is never (strictly) positive definite because ∇²C_i(θ) is a rank one matrix.
We now move on to multinomial regression, as well as linear regression with other loss functions
introduced in Section ??. The optimization problems of these models also enjoy convexity
properties, yet expressions of the Hessian are not easy to work with. We thus use other
means to argue for convexity.
For multinomial regression, recall C_i(θ) from (??), which we can represent as

    C_i(θ) = log( Σ_{j=1}^K e^{b_j + w^(j)⊤ x^(i)} ) − (b_k + w^(k)⊤ x^(i)),    (4.8)

when y^(i) = k. Recall here that θ includes both bias terms (scalars) b_1, . . . , b_K and weight
vectors w^(1), . . . , w^(K); see (??).
As a building block we first use the log-sum-exp function, h : ℝ^K → ℝ, of the form

    h(v) = log Σ_{j=1}^K e^{v_j}.
We can show that h(v) is convex by explicitly calculating its Hessian expression and seeing
that it is positive semidefinite⁵ for all v ∈ ℝ^K. Further, using first principles of the definition
of convexity, (4.3), we can show that for any convex function h : ℝ^K → ℝ, h(θ⊤u_1, . . . , θ⊤u_K)
is convex in θ ∈ ℝ^d for any fixed u_1, . . . , u_K ∈ ℝ^d. Since θ is the vector consisting of all the
weights and the biases, by using multiple zeros in each u_j ∈ ℝ^d, we can construct u_j from
the p-dimensional vector x^(i) such that

    v_j = θ⊤u_j = b_j + w^(j)⊤ x^(i).

As a consequence, the first term of (4.8) is convex in θ. Further, the second term is also
convex as it is an affine function of θ. Hence (4.8) is a sum of convex functions and is thus
convex.
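Convexity of the log-sum-exp function can also be spot-checked directly against the definition (4.3). The sketch below is our own illustration, not a proof: it samples random pairs of points and mixing weights and verifies the convexity inequality, using the standard max-subtraction trick for numerical stability.

```python
import numpy as np

def logsumexp(v):
    # Numerically stable log-sum-exp: subtract the max before exponentiating.
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

# Spot-check the convexity inequality (4.3) for h on random pairs of points.
rng = np.random.default_rng(3)
K = 4
convex_ok = True
for _ in range(100):
    v, w = rng.normal(size=K), rng.normal(size=K)
    lam = rng.uniform()
    lhs = logsumexp(lam * v + (1 - lam) * w)
    rhs = lam * logsumexp(v) + (1 - lam) * logsumexp(w)
    convex_ok &= lhs <= rhs + 1e-12
```

Every sampled pair satisfies h(λv + (1 − λ)w) ≤ λh(v) + (1 − λ)h(w), consistent with the Hessian argument of the footnote.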
We now again consider linear regression, but this time with the absolute error loss (??) and
the Huber loss (??). In both of these cases, like the standard quadratic loss, we have

    C_i(θ) = g(y^(i) − θ⊤x^(i)),

where g : ℝ → ℝ is a convex function; see Figure ?? (a) in Chapter ??. Here again we use
the fact that the composition of a convex function and an affine function is convex, and this
establishes convexity of C_i(θ) and thus of C(θ) for all these loss function alternatives.
Before we depart from this discussion of convexity, we suggest that the reader consider the
plots of Figure ?? of Chapter ??. In that figure, it is evident that logistic regression with “the
wrong” loss function, namely squared loss in this case, may yield a non-convex problem. We
highlight this here to emphasize that the convexity properties discussed above are more of
an exception than the rule. Indeed, most deep learning models of this book have non-convex
objectives.
⁵ The gradient of the log-sum-exp function is the softmax function, for which we present the Jacobian in
(??), and the Jacobian of the gradient of a function is equal to the Hessian of that function. It also turns out
that the matrix in (??) is the covariance matrix of a categorical distribution (multinomial distribution with
one trial), implying that it is positive semidefinite.
General Approach of Descent Direction Methods
We have seen that the loss functions of logistic regression and multinomial regression are
convex, and thus they have global minima. Unfortunately, even for these shallow neural
networks, despite being convex, there is no known analytical solution for the minimizer.
Further, in more general deep neural networks, we do not have such convexity properties
and the objective functions typically exhibit numerous local minima. Thus, in general, an
analytical solution for the optimization problems in deep learning is out of reach, and we
have to rely on more advanced numerical optimization methods.

All the popular optimization methods used in deep learning can be put under one general
framework called the descent direction method. It is an iterative method that attempts to
reduce the objective in each iteration. This might seem obvious; however, the reader should
keep in mind that in general optimization methods beyond those covered in this book, steps
may sometimes increase the objective before further reduction. In contrast, with the descent
direction method, while increases may occur in certain iterations, the overall goal is to have
a reduction of the objective in each iteration. This method is presented in Algorithm 4.1.
Algorithm 4.1: The descent direction method
Input: Dataset D = {(x^(1), y^(1)), . . . , (x^(n), y^(n))},
       objective function C(·) = C(·; D), and
       initial parameter vector θ_init
Output: Approximately optimal θ
1: θ ← θ_init
2: repeat
3:     Determine the descent step θ_s
4:     θ ← θ + θ_s
5: until termination condition is satisfied
6: return θ
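Algorithm 4.1 can be sketched in a few lines of Python. The rendering below is our own (the function names, the learning rate, and the absolute-improvement termination rule are illustrative choices): the descent step is a pluggable function, and gradient descent arises as the special case θ_s = −α∇C(θ).

```python
import numpy as np

def descent_direction_method(C, descent_step, theta_init, max_iter=10_000, eps=1e-8):
    """A sketch of Algorithm 4.1: repeatedly apply a user-supplied descent step
    until the absolute improvement falls below eps (one possible termination rule)."""
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(max_iter):
        step = descent_step(theta)
        improvement = C(theta) - C(theta + step)
        theta = theta + step
        if abs(improvement) < eps:
            break
    return theta

# Gradient descent as a special case: theta_s = -alpha * grad C(theta),
# demonstrated on the convex objective C(theta) = theta_1^2 + theta_2^2.
C = lambda th: th[0] ** 2 + th[1] ** 2
grad_C = lambda th: 2.0 * th
gd_step = lambda th: -0.1 * grad_C(th)

theta_opt = descent_direction_method(C, gd_step, theta_init=[1.0, -1.0])
```

On this bowl-shaped objective the iterates contract geometrically towards the global minimum at the origin.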
As is evident, Algorithm 4.1 relies on the specific way in which we compute the descent step
and the termination condition. For each iteration t = 0, 1, 2, . . . of Algorithm 4.1, denote
the values of the parameter vector θ and the descent step θ_s by θ^(t) and θ_s^(t), respectively.
With this notation, the sequence of points θ^(1), θ^(2), . . . evolves via

    θ^(t+1) = θ^(t) + θ_s^(t),  with θ^(0) = θ_init,

until θ^(t) satisfies the specified termination condition. Note that the magnitude of the descent
step ∥θ_s^(t)∥ denotes the step size in the t-th iteration.
The descent step θ_s^(t) at the t-th iteration is determined by information available in or until
iteration t, where the exact manner in which it is computed depends on the variant of
the algorithm used. One specific example that we have already seen is the gradient descent
method, Algorithm ?? of Section ??, which is a special case of Algorithm 4.1. With gradient
descent, the descent step is set at θ_s = −α∇C(θ), where α is the learning rate, and since
α > 0, at each step the gradient descent method moves along the steepest descent direction.⁶
⁶ Note that some authors use the term steepest descent to denote a more general class of algorithms of
which gradient descent can be taken as a special case. However, in this book “steepest descent” and “gradient
descent” are considered synonymous.
Other variants encountered in the sequel are stochastic gradient descent (SGD) presented in
Section 4.2, ADAM presented in Section 4.3, second-order methods presented in Section 4.6,
and other variants.
When the goal is to minimize the objective function C, ideally we would like to terminate
the algorithm only after reaching a minimum point. This convergence is difficult to guarantee
in general, as the updates of the algorithm can slow down or oscillate in the vicinity of the
minimum. Therefore, in practice there are several termination conditions that allow some
tolerance so that the algorithm can output a point in the vicinity of the minimum. We now
list a few traditionally popular termination criteria. However, note that the next section
presents optimization in the context of deep learning, where other criteria are often used.
Most termination conditions use a small tolerance value, ε > 0. One standard termination
condition tracks absolute improvement, where we terminate when the change in the function
value is smaller than the tolerance. Namely,

    C(θ^(t)) − C(θ^(t+1)) < ε.
An alternative is to use the gradient magnitude or the step size, where we terminate
respectively if

    ∥∇C(θ^(t+1))∥ < ε,  or  ∥θ_s^(t)∥ < ε.
When the region around the (local) minimum is shallow, it is sometimes useful to use the
relative improvement criterion, where we terminate based on the relative change in the
function value. Specifically, we terminate if

    |C(θ^(t)) − C(θ^(t+1))| / |C(θ^(t))| < ε.
In addition to these criteria, which track the state of the optimization procedure, an alternative
is to simply run for a fixed number of iterations t = 0, . . . , T, or similarly to limit the running
(clock) time of the algorithm to a desired predefined value.
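The termination criteria above can be written as small predicates that a training loop consults after each update. The sketch below is our own (the function names are arbitrary, and the demo objective C_demo(θ) = ∥θ∥² is an illustrative choice):

```python
import numpy as np

# The three termination criteria described above, as predicates of the
# current iterate theta_t and the next iterate theta_next.
def absolute_improvement(C, theta_t, theta_next, eps):
    return C(theta_t) - C(theta_next) < eps

def gradient_magnitude(grad_C, theta_next, eps):
    return np.linalg.norm(grad_C(theta_next)) < eps

def relative_improvement(C, theta_t, theta_next, eps):
    return abs(C(theta_t) - C(theta_next)) / abs(C(theta_t)) < eps

# Illustrative objective for trying the predicates out: C(theta) = ||theta||^2.
C_demo = lambda th: float(np.sum(np.asarray(th) ** 2))
grad_demo = lambda th: 2.0 * np.asarray(th)
```

In practice one would pick a single criterion (or a disjunction of several) and pass it to Step 5 of Algorithm 4.1.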
Note that the termination condition at Step 5 of Algorithm 4.1 can differ between
methods. The output point obtained by the algorithm may not necessarily minimize
the objective function C(·); rather, it depends on the termination condition. For instance,
as we see in Section 4.2, the final θ returned by Algorithm 4.1 could be chosen based on
validation set accuracy, and thus not be a minimizer of the objective function C(·).
4.2 Optimization in the Context of Deep Learning
In the previous section we saw that certain optimization problems are tractable in the sense
that a local minimum is also the global minimum. Specifically, the loss functions of linear
regression, logistic regression, and multinomial regression are convex. However, in most
deep learning models, starting with the models that we cover in the next chapter, the loss
functions are not as “blessed”. For such deep learning models, the number of parameters is
huge and the problems are non-convex, often having a large number of local minima.
Nevertheless, basic optimization techniques are very useful for deep learning with the
compromise of not finding the global minimum of the loss, but rather finding a point close
to some local minimum with satisfactory performance. That is, the obtained point has
performance which is in general not far off from the performance at the global minimum.
Indeed, in deep learning, we use optimization algorithms to try to minimize the training
loss function C(θ), while our ultimate goal is finding θ that has good performance on
unseen data; see Section ?? for a discussion of performance on unseen data. That is, the
goal is to actually minimize expectations or averages of performance functions⁷ P(θ), such
as the validation set error rate. We then hope that this minimization resembles performance
on unseen data. Hence, as the reader studies the optimization methods of this chapter, they
should keep in mind that minimization of the loss C(θ) is often a proxy for minimization
of P(θ).
Challenges Posed by Basic Gradient Descent
Almost all cases where optimization is used for deep learning involve some variant of gradient
descent; see Algorithm ?? of Section ?? for an introduction to gradient descent. However,
basic gradient descent poses a few important challenges that can make the algorithm
impractical in the deep learning context.

The primary challenge is that in each iteration, the gradient descent algorithm uses the
entire dataset to compute the gradient ∇C(θ). In the context of deep learning, the sample
size n of the training data can be large or the dimension d of the parameter θ can be huge,
or both. As a consequence, the operation of computing the gradient using the entire dataset
for every update can be very slow, or even intractable in terms of memory requirements.
Further, as already mentioned above, loss functions of deep neural networks tend to be
highly non-convex, with multiple local minima. For such functions, the a priori selection
of a good learning rate α is very difficult. A low learning rate can lead to increased
computational cost as a result of slow convergence of the algorithm, and it can lead to the
algorithm getting stuck at local minima or flat sections of the loss landscape, eliminating
the chance to explore other local minima. On the other hand, a large learning rate can result
in the objective function values and decision variables fluctuating in a haphazard manner.
Even for simple convex models, as we saw in Figure ?? of Section ??, the choice of the
learning rate affects learning performance. As a second elementary example with d = 2,
assume hypothetically that the loss landscape takes the form of the following Rosenbrock
function,

    C(θ) = (1 − θ₁)² + 100(θ₂ − θ₁²)²,    (4.9)

which is non-convex yet has a unique global minimum at θ = (1, 1).
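To experiment with the behaviour that the upcoming figure illustrates, the Rosenbrock objective and its gradient are easy to code directly. The sketch below is our own (the iteration count and the smaller learning rate are deliberate choices to keep the iteration stable; the exponential decay factor mirrors the schedule discussed next):

```python
import numpy as np

def rosenbrock(theta):
    # C(theta) = (1 - theta_1)^2 + 100 (theta_2 - theta_1^2)^2, cf. (4.9)
    return (1 - theta[0]) ** 2 + 100 * (theta[1] - theta[0] ** 2) ** 2

def rosenbrock_grad(theta):
    d1 = -2 * (1 - theta[0]) - 400 * theta[0] * (theta[1] - theta[0] ** 2)
    d2 = 200 * (theta[1] - theta[0] ** 2)
    return np.array([d1, d2])

def gradient_descent(alpha, decay=1.0, iters=5000, theta_init=(0.2, 0.2)):
    theta = np.array(theta_init, dtype=float)
    for t in range(iters):
        # alpha^(t) = alpha * decay^t: a predefined learning rate schedule.
        theta -= alpha * (decay ** t) * rosenbrock_grad(theta)
    return theta

theta_fixed = gradient_descent(alpha=0.001)               # fixed learning rate
theta_decay = gradient_descent(alpha=0.001, decay=0.99)   # decaying learning rate
```

With the fixed small α the iterates crawl slowly along the curved valley of (4.9), and with the decaying schedule the effective step size vanishes and the trajectory stalls early, reflecting the difficulty of tuning α discussed in the text.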
Figure 4.2 illustrates the performance of gradient descent on (4.9). Here we also introduce
the option of varying the learning rate as optimization iterations progress, based on a
predefined learning rate schedule in this case. Specifically, at each iteration t, instead of using
the fixed learning rate α we use α^(t), which is indexed by iteration t and parameterized by
the original α. This strategy is sometimes used in practice, yet as we see in this example, it
does not work well for the parameters that we choose, and in none of the four trajectories
starting at an initial point θ_init do we find the minimum. Note that in this specific example
it is not hard to fine-tune the learning rate α and/or the exponential decay parameter of
⁷ Note that in Section ?? we denoted the performance measure function as a function of ŷ and y for each
observation in D, whereas here we are taking it as a function of the parameters θ.
α^(t), which is set at 0.99. However, this is a simple, hypothetical, d = 2 dimensional loss
landscape, whereas in actual problems with d > 2, tuning α^(t) to achieve efficient learning
(with basic gradient descent) is often very difficult or impossible.
Figure 4.2: Attempting minimization of a Rosenbrock function with a minimum at θ = (1, 1)
via basic gradient descent for two values of α (0.005 and 0.0025) with 10⁵ iterations and
θ_init = (0.2, 0.2). (a) Fixed learning rate: α^(t) = α. (b) Exponentially decaying learning rate:
α^(t) = α · 0.99^t.
Instead of adjusting the learning rate according to a predefined schedule, other variants
try to adjust the learning rate according to other criteria, such as the objective function
crossing some threshold values. Further, in some implementations, α^(t) is optimized in each
iteration so that the function C(θ) decreases maximally along the chosen descent direction; see
for example the line search methods described in Section 4.5. However, such strategies are again
not fool-proof on their own, or are often not practical for deep learning.
Furthermore, even if a magical (“optimal”) learning rate α can be suggested for each iteration,
the basic gradient descent method applies the same learning rate to all parameter updates. It
turns out that it is better to learn the parameters θ₁, . . . , θ_d at different rates in an adaptive
manner, because larger rates on rarely changing parameter coordinates may result in faster
convergence of the algorithm. To handle such cases, it is common practice to use the ADAM
algorithm. This variant of gradient descent, together with its building blocks, is the focus
of Section 4.3.
Prior to exploring ADAM and related gradient descent adaptations, let us return to the key
issue mentioned above, which is the computational complexity of obtaining exact gradients
∇C(θ). This issue is typically handled by computing approximate gradients, either via
stochastic gradient descent, which we describe now, or more practically with the use of mini-batches
described in the sequel. Importantly, both of these approaches yield fast computable
approximate gradients. These approaches also yield parameter trajectories that manage to
escape local minima, flat regions, or saddle points by introducing randomness or noise into
the optimization process.
Stochastic Gradient Descent
Recall that the loss function C(θ) for deep neural networks is typically a finite sum of the
form (4.4), with C_i(θ) being the loss computed for the i-th data sample. The method of
stochastic gradient descent exploits this structure by computing a noisy gradient using only
one randomly selected training sample. That is, it evaluates the gradient for a single C_i(θ)
instead of the gradient for all of C(θ).
More precisely, the algorithm operates similarly to Algorithm ?? (or variations of it), yet in
the t-th iteration of stochastic gradient descent, an index variable I_t is randomly selected
from the set {1, . . . , n} and the gradient ∇C_{I_t}(θ) is computed only for the I_t-th data sample
in place of Step 3 of Algorithm ??. The update rule for the decision variable is then

    θ^(t+1) = θ^(t) − α ∇C_{I_t}(θ^(t)).    (4.10)
This stochastic algorithm exhibits some interesting properties. Most importantly, the
execution of each iteration can be much faster than that of basic gradient descent because
only one sample is used at a time (the gain is of order n). Further, if the index variable I_t is
selected uniformly over {1, . . . , n}, the parameter update (4.10) is unbiased in the sense that
the expected descent step in each iteration t is the same as that of gradient descent. That
is, in each iteration we have

    E[∇C_{I_t}(θ)] = (1/n) Σ_{i=1}^n ∇C_i(θ) = ∇C(θ),  for a fixed θ,    (4.11)

and hence the stochastic step (4.10) is “on average correct”.⁸
While being unbiased, the stochastic nature of (4.10) introduces fluctuations in the descent
direction. Such noisy trajectories may initially appear to be a drawback, yet one advantage
is that they enable the algorithm to escape flat regions, saddle points, and local minima.
Thus, while the fluctuations make it difficult to guarantee convergence, in the context of
deep learning their effect is often considered desirable in view of highly non-convex loss
landscapes.
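Both the update (4.10) and the unbiasedness property (4.11) can be demonstrated on a toy finite-sum problem. In the sketch below, our own illustration (the quadratic per-sample losses, the seed, and the learning rate are arbitrary choices), a single uniformly sampled index drives each update, and the average of per-sample gradients is compared with the full gradient at a fixed point:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 2
U = rng.normal(size=(n, d))             # per-sample minimizers u_i

def grad_Ci(theta, i):
    return 2.0 * (theta - U[i])         # gradient of C_i(theta) = ||theta - u_i||^2

def full_grad(theta):
    return 2.0 * (theta - U.mean(axis=0))

# A run of SGD as in (4.10), with a uniformly sampled index per iteration.
theta = np.zeros(d)
alpha = 0.05
for t in range(2000):
    i = rng.integers(n)
    theta = theta - alpha * grad_Ci(theta, i)

# Empirical check of the unbiasedness property (4.11) at a fixed point theta0:
# averaging the per-sample gradients over all i recovers the full gradient.
theta0 = np.array([1.0, -1.0])
avg = np.mean([grad_Ci(theta0, i) for i in range(n)], axis=0)
```

The iterates do not converge to a single point: they keep fluctuating around the minimizer of C (here, the mean of the u_i), which foreshadows the behaviour shown for the example (4.12) below.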
Importantly, stochastic gradient descent is generally able to direct the parameters θ towards
the region where minima of C_i(θ) are located. To get a feel for this attribute of stochastic
gradient descent, we consider another hypothetical example where, for i = 1, . . . , n,

    C_i(θ) = a_i(θ₁ − u_{i,1})² + (θ₂ − u_{i,2})²,    (4.12)

for some constant a_i > 0 and u_i = (u_{i,1}, u_{i,2}) ∈ ℝ². Here u_i is the unique minimizer of C_i(θ).
These individual loss functions for each observation are convex, and hence the total loss
function C(θ) of (4.4) is also convex.⁹ Let S be the convex hull of u_1, . . . , u_n, that is, S is
the smallest convex set that contains {u_1, . . . , u_n}. It follows that the global minimum
of C also lies in S.
⁸ The reader might be tempted to conclude that E[θ^(t)_SGD] = θ^(t)_GD for all time t, where the subscript SGD
is for stochastic gradient descent starting at the same initial condition as GD (“gradient descent”). However,
in general this is not correct when the gradient ∇C(θ) is non-linear in θ.
⁹ This example does not resemble a general case with multiple local minima since it is convex. Nevertheless,
it is useful for understanding some of the behaviour of stochastic gradient descent.
In Figure 4.3 we plot the contours of such a function for n = 5, where in Figure 4.3 (a) the
points u_1, . . . , u_5 are distinct and in (b) these points are identical. The convex hull is marked
by the region bounded in green. The figure also plots the evolution of gradient descent and
stochastic gradient descent in each of the cases. As is evident, while the stochastic gradient
descent trajectory is noisier, outside of the convex hull S its trajectory is similar to that of
gradient descent.
Interestingly, in Figure 4.3 (a), once the trajectory hits S it mostly stays within S, but with a lot of fluctuations. The reason for this behavior is that at any point θ not in S, the descent directions for both gradient descent and stochastic gradient descent are towards the set S, as it contains the minima of C(θ) and every C_i(θ). To make this argument concrete, note that the descent step in the t-th iteration of stochastic gradient descent is

    −α∇C_{I_t}(θ^(t)) = −α Σ_{i=1}^{n} 1{I_t = i} ∇C_i(θ^(t)).
Since every C_i(θ) has its minimum in the set S, if θ^(t) ∉ S, then every −∇C_i(θ^(t)), the steepest descent direction of C_i(θ) at θ^(t), guides the algorithm towards the set S. Therefore, irrespective of the choice of I_t, the update moves θ^(t+1) towards S.
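To make this tangible, the following Python sketch runs stochastic gradient descent on a hypothetical instance of (4.12); the constants a_i, the minimizers u_i, the step size, and the initial point are all made up for illustration. Starting far from the convex hull of the u_i, the iterate is pulled towards it and then fluctuates nearby.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of (4.12) with n = 5: per-observation losses
# C_i(theta) = a_i (theta_1 - u_{i,1})^2 + (theta_2 - u_{i,2})^2,
# where the minimizers u_i and constants a_i > 0 are made up here.
n = 5
u = rng.uniform(1.0, 4.0, size=(n, 2))   # minimizers u_1, ..., u_n
a = rng.uniform(0.5, 2.0, size=n)        # constants a_i > 0

def grad_Ci(theta, i):
    """Gradient of C_i at theta."""
    return np.array([2 * a[i] * (theta[0] - u[i, 0]),
                     2 * (theta[1] - u[i, 1])])

alpha = 0.02
theta = np.array([-2.0, -2.0])           # initial point far from the hull
for t in range(1000):
    i = rng.integers(n)                  # I_t uniform on {0, ..., n-1}
    theta = theta - alpha * grad_Ci(theta, i)

# After many iterations the iterate fluctuates near the minimizers,
# i.e. within (or very close to) the convex hull of u_1, ..., u_n.
print(theta)
```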
Figure 4.3: A hypothetical example where C(θ) is composed of individual C_i(θ) as in (4.12). Comparison of gradient descent (GD) and stochastic gradient descent (SGD). (a) A case where the minimizers u_1, . . . , u_n of C_i(θ) are different and are each marked by green dots in the plot. (b) A case where the minimizers u_1, . . . , u_n are the same.
Working with Mini-batches and the Concept of Epochs
On one extreme, computing the full gradient for basic gradient descent is computationally
challenging but the obtained gradients are noiseless. On the other extreme, stochastic
gradient descent reduces the computational complexity of each update but the updates are
noisy. One middle ground approach is to work with mini-batches. Almost any practical deep
learning training uses some variant of this approach.
A mini-batch is simply a small subset of the data points, of size n_b. Mini-batch sizes are typically quite small in comparison to the size of the dataset and are often chosen based on computational capabilities and constraints. In practice, a mini-batch size n_b is often selected so that the computation of gradients associated with n_b observations can fit in GPU memory. This generally makes computation more efficient in comparison to stochastic gradient descent, which cannot take full advantage of parallel computing or GPUs since the gradient evaluation for a single observation index I_t is not easy to parallelize.
When using mini-batches, for each iteration t, a mini-batch with indices I_t ⊂ {1, . . . , n} is used to approximate the gradient via,

    (1/n_b) Σ_{i∈I_t} ∇C_i(θ^(t)).        (4.13)
This estimated gradient is then used as part of gradient descent or one of its variants (such as ADAM which we present in the sequel); for example it is used in place of Step 3 of Algorithm ??. One common practical approach with mini-batches is to shuffle the training data a priori and then use the shuffled indices sequentially. With this process there are n/n_b mini-batches¹⁰ in the training dataset and each mini-batch is a distinct random subset of the indices. In such a case, when using some variant of gradient descent, every pass over the full dataset takes n/n_b iterations, with one iteration per mini-batch. Such a pass over all the data via n/n_b iterations is typically called an epoch. The training process then involves multiple epochs. It is quite common to diagnose and track the training process in terms of epochs. Note that one may also randomly shuffle the data (assign new mini-batches) after every epoch to reduce the correlation between the epochs and reduce bias in the optimization process.
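The shuffling and epoch structure just described can be sketched as follows; the sizes n = 1,000 and n_b = 10 are arbitrary, and the gradient step itself is elided.

```python
import numpy as np

rng = np.random.default_rng(1)

n, n_b, num_epochs = 1000, 10, 3         # hypothetical dataset and batch sizes
seen = []

for epoch in range(num_epochs):
    perm = rng.permutation(n)            # re-shuffle the indices every epoch
    for start in range(0, n, n_b):
        batch = perm[start:start + n_b]  # indices I_t of this mini-batch
        # ... here one would average grad C_i over i in batch, as in (4.13),
        # and take a descent step with the result ...
        seen.append(batch)

# Each epoch consists of n / n_b = 100 iterations, and together the
# mini-batches of one epoch visit every observation exactly once.
print(len(seen))   # 300
```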
Similarly to (4.11), when using the mini-batch approach the estimated gradient is unbiased in the sense that

    E[(1/n_b) Σ_{j=1}^{n_b} ∇C_{I_{t,j}}(θ)] = (1/n_b) Σ_{j=1}^{n_b} (1/n) Σ_{i=1}^{n} ∇C_i(θ) = ∇C(θ),   for a fixed θ.
Here I_{t,j} denotes the j-th element of I_t and the first equality holds because shuffling makes each I_{t,j} uniform over {1, 2, . . . , n}. To get a feel for the performance of learning with mini-batches, a mathematical simplification is to consider a variant where I_t = (I_{t,1}, . . . , I_{t,n_b}) is a collection of randomly selected indices on {1, . . . , n} with replacement.¹¹ As a consequence, each I_{t,j} is not only uniform over {1, 2, . . . , n}, but also independent of all the rest. This manner of computing gradients is an approximation to the above mentioned mini-batches with shuffling of the data after every epoch.
Importantly, in this setup, we notice that the noise of the gradients decreases as the mini-batch size n_b increases. To see this, recall that for each t, the index variables I_{t,1}, . . . , I_{t,n_b} are chosen independently. Thus, the variance of the gradient estimate is

    Var[(1/n_b) Σ_{j=1}^{n_b} ∇C_{I_{t,j}}(θ^(t))] = (1/n_b) Var[∇C_{I_{t,1}}(θ^(t))].
¹⁰We assume here that n_b divides n. If this is not the case then there are ⌈n/n_b⌉ mini-batches, with the last one having fewer than n_b samples.
¹¹Some authors refer to this as mini-batch gradient descent.
Since Var[∇C_{I_{t,1}}(θ^(t))] is the variance of the gradient for an arbitrary data sample, we see that mini-batch gradient descent has n_b times smaller variance than that of stochastic gradient descent.
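This 1/n_b variance reduction is easy to verify empirically. The sketch below uses made-up scalar "gradients" and draws mini-batch indices uniformly with replacement, as in the simplified variant above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy per-observation "gradients": n fixed scalar values of dC_i/dtheta.
n, reps = 50, 100_000
g = rng.normal(size=n)

# reps independent mini-batch gradient estimates for n_b = 1 and n_b = 16,
# with indices I_{t,1}, ..., I_{t,n_b} drawn with replacement.
est_1 = g[rng.integers(n, size=(reps, 1))].mean(axis=1)
est_16 = g[rng.integers(n, size=(reps, 16))].mean(axis=1)

# The n_b = 16 estimator should have about 16 times smaller variance.
ratio = est_1.var() / est_16.var()
print(ratio)
```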
Loss Minimization is a Proxy for Optimal Performance
Recall the discussion in Section ?? where we saw that the ultimate focus of learning is to optimize the expected performance on unseen data. This is typically quantified in terms of some error metric which we wish to minimize. In this section we simply refer to the estimated expected performance function as P(θ), keeping in mind that it is typically an average over the validation set. When training, it is often more tractable to minimize the loss C(θ); however, the true goal is minimization of P(θ).
In deep learning, the model choice and training process typically involve choosing a model with a loss function C(θ), such that loss minimization is computationally easy to carry out and yields low P(θ). In that sense, the loss function C(θ) is often much more amenable to derivative computations and gradient based learning, in comparison to trying to use the performance function P(θ) directly. Further, by tracking a performance function¹² on a validation dataset not used for training, we are often able to make sure that overtraining (also called overfitting) is avoided. This interplay of the loss function C(θ) and the performance function P(θ) implies that deep learning optimization is not traditional optimization where one seeks θ that truly optimizes the loss, but is rather a means of using optimization of C(θ) as a surrogate for good performance on P(θ).
We now explore this setting of deep learning optimization in two ways. In one direction we discuss a notion of stopping criteria that differs from the termination conditions presented at the end of Section 4.1. In another direction we highlight that the actual model choice affects the loss function C(θ), and some loss functions are more amenable to optimization in comparison to others.
First, in terms of stopping criteria, as already mentioned above, for training a deep learning model, standard practice is to use mini-batches and epochs. This means that some form of the general Algorithm 4.1 is executed where multiple iterations are grouped into an epoch. As training progresses (this is sometimes a process of hours, days, or more), we track the evolution of the loss,¹³ C(θ^(t)), as well as one or more performance functions P(θ^(t)), typically focusing on iterations t that are at the end of an epoch. In that sense, a common stopping criterion is to stop at the epoch where P(θ^(t)) is small when evaluated over the validation set.
A popular strategy for this approach is called early stopping. With this strategy, training often continues for multiple epochs beyond the epoch where P(θ^(t)) is minimized, since at that epoch it is still not evident that the minimum was reached. However, as the training loop progresses, the model parameters for the best epoch are stored so that once training is complete the "best model" in terms of P(θ) on the validation set can be used. A typical trajectory of the loss and performance metric is displayed in Figure 4.4. With such a typical trajectory, increasing the number of epochs of training is somewhat similar to the increase of model complexity discussed in Section ??; see Figure ?? of that section. Running the model beyond the minimum point of P(θ) is called over-training, similar to overfitting discussed in
¹²In practice we may often track multiple performance functions.
¹³In this context it is often called the training loss as it is the loss over the training set.
Section ??. The early stopping strategy is then a method that uses validation performance to try and avoid such over-training. Importantly, the actual minimum point, say θ*, of C(·) is not a desirable point for this purpose since it is typically a point of over-training and P(θ*) is sub-optimal in terms of the performance function.
Figure 4.4: Training on a small subset of the MNIST dataset. The validation error rate is tracked
together with the training loss. The early stopping strategy uses the model from epoch 76 where
the validation performance is best.
As further details for Figure 4.4, we note that it is based on a deep fully connected network (see Chapter ??) with 5 layers and hundreds of thousands of parameters. We train the model using only 1,000 MNIST images and use 500 additional images for the validation set. The performance function used on the validation set is one minus the accuracy, and as is evident, an accuracy of about 90% is obtained. Mini-batch sizes are taken at n_b = 10. As we can see, the model is trained for 200 epochs but retrospectively, the early stopping strategy pulls the parameters θ associated with epoch 76. Note that our chosen model is far from a realistic model which one would use for such a low-data scenario, yet we present this example here to illustrate the idea of early stopping.
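The bookkeeping behind early stopping can be sketched as follows. Here train_one_epoch and validation_error are hypothetical stand-ins for a real training pass and a real performance function P(θ); the toy versions at the bottom merely mimic a validation error that dips and then rises again.

```python
import copy

def early_stopping_loop(theta, train_one_epoch, validation_error,
                        num_epochs=200):
    """Train for num_epochs, but remember the best validation model."""
    best_theta, best_err, best_epoch = copy.deepcopy(theta), float("inf"), -1
    for epoch in range(num_epochs):
        theta = train_one_epoch(theta)       # one pass over all mini-batches
        err = validation_error(theta)        # P(theta) on the validation set
        if err < best_err:                   # store the best model so far
            best_theta = copy.deepcopy(theta)
            best_err, best_epoch = err, epoch
    return best_theta, best_epoch            # retrospectively "best" model

# Toy stand-ins: "training" just counts epochs, and the hypothetical
# validation error is U-shaped with its minimum reached at epoch 76.
step = lambda th: th + 1
dip = lambda th: (th - 77) ** 2
best, when = early_stopping_loop(0, step, dip)
print(when)   # 76
```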
The second way in which deep learning optimization differs from general optimization is that
the model’s loss function itself is often specifically engineered for efficient optimization. In
classical applications of optimization appearing in applied mathematics, operations research,
and other fields, one often needs to optimize a given objective without the ability to modify
the objective. In contrast, deep learning models are versatile enough so that the actual
model, and subsequently the loss function, can be chosen with the goal of facilitating efficient
optimization. Discussions of this nature appear in the sequel where for example the choice of
activation functions within a deep neural network significantly affects the nature of learning;
see for example the discussion of vanishing gradients in Chapter ??.
On this matter, we have already seen in Figure ?? of Section ?? how the cross entropy loss for logistic regression presents a much better loss landscape for optimization in comparison with the squared loss. In the case of logistic regression, the cross entropy loss even implies having a convex loss function, whereas that property is lost with the squared loss function. Further, in more complex models using cross entropy is also beneficial in terms of the loss landscape even if it does not guarantee convexity per se. We now present one such illustrative example.
Figure 4.5: The loss landscape of a very simple two-layer neural network. (a) Using the squared loss C_i(θ) = (y^(i) − ŷ^(i))². (b) Using the binary cross entropy loss C_i(θ) = CE_binary(y^(i), ŷ^(i)).
Similarly to the synthetic data used for the logistic regression example yielding Figure ??, let us use the same synthetic data on the model,

    f(x) = σ_Sig(w_2 σ_Sig(w_1 x)) = 1 / (1 + e^{−w_2 / (1 + e^{−w_1 x})}).
In the context of deep neural networks presented in Chapter ??, this is a two-layer network without bias terms, a single neuron per layer, and sigmoid activation functions. It is not a realistic model one would use in practice, yet for our expository purposes it has the benefit of having only two parameters, and hence we may visualize the loss surface as a function of w_1 and w_2.
Interestingly, as we see in Figure 4.5, using the cross entropy loss presents a much smoother loss landscape in comparison to the squared error loss. Hence using gradient based optimization with cross entropy would be much more efficient and would make the algorithm less likely to get stuck in local minima or saddle points. Further, in much more realistic models with thousands or millions of parameters, while not possible to fully visualize, using cross entropy often yields more sensible loss landscapes as well. Hence in summary we see that when handling optimization in the context of deep learning, there are often other considerations at play, beyond standard optimization.
4.3 Adaptive Optimization with ADAM
Essentially, all deep learning optimization techniques can be cast as some form of the descent
direction method outlined in Algorithm 4.1. Basic gradient descent (Algorithm
??
) is one
specific form of this descent direction method, and as we outlined in the previous section, it
suffers from multiple drawbacks. While some of these drawbacks are alleviated when using
noisy gradients via stochastic gradient descent or mini-batches, there is still much more
room for improvement.
In this section we present one of the most popular improvements over basic gradient descent, the ADAM algorithm, which is short for adaptive moment estimation. This algorithm has become the default practical choice for training deep learning models. Like basic gradient descent, ADAM is considered a first-order method since it uses gradients (first derivatives) to determine the descent step used in Algorithm 4.1. For this, ADAM combines the ideas of momentum and adaptive learning rates per coordinate into a simple mechanism for computing the descent step. Hence to understand ADAM, we study these two concepts, keeping in mind that the associated algorithms can also be viewed as independent methods in their own right. Other variations of these concepts, together with additional first-order techniques that are not employed in ADAM, are further summarized in Section 4.5.
Adaptive Optimization and Exponential Smoothing
A basic theme of ADAM and many other first-order methods has to do with adapting to the local nature of the loss landscape as iterations progress. For simple d = 2 examples as visualized in Figure 4.2 and Figure 4.5, we can quite easily get a taste for the nature of the loss landscape, and it is generally not difficult to choose descent steps that yield good overall performance. However, for practical examples with larger values of d, at any iteration t of the general Algorithm 4.1, the only information at our disposal is the history of θ^(0), . . . , θ^(t−1) and the associated gradients. It thus makes sense to try and "adapt" the descent step based on this history, perhaps with more emphasis on iterations from the near-history. One may loosely call such an approach adaptive optimization.
One simple principle used in adaptive optimization is exponential smoothing. Beyond optimization, exponential smoothing is popular in time series analysis and many other branches of data science and applied mathematics. In general, exponential smoothing operates on a sequence of vectors u^(0), u^(1), u^(2), . . . . In the context of this section, the sequence may be a sequence of gradients or a sequence of the squares of gradients.
The exponentially smoothed sequence is denoted ū^(0), ū^(1), . . . and is computed via,

    ū^(t+1) = β ū^(t) + (1 − β) u^(t),   with ū^(0) = 0,        (4.14)
and β ∈ [0, 1). Thus, for t ≥ 0, each smoothed vector ū^(t+1) is a convex combination of the previous smoothed vector and the most recent (non-smoothed) vector u^(t). When β = 0, no smoothing takes place,¹⁴ and when β is a high value near 1, the new vector u^(t) only has a small effect on the new smoothed vector ū^(t+1).
It is simple to use the update formula (4.14) to represent the smoothed vector ū^(t) as a convex combination of the complete history. Specifically, by iterating (4.14) we have,

    ū^(t+1) = (1 − β) Σ_{τ=0}^{t} β^{t−τ} u^(τ),   for t = 0, 1, 2, . . . ,        (4.15)
and thus each of u^(0), . . . , u^(t) contributes to ū^(t+1), with exponentially decaying weights. As we see now, within the context of adaptive optimization, such straightforward exponential smoothing can go a long way.
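As a quick illustration, the recursion (4.14) and its closed form (4.15) can be checked against each other on an arbitrary made-up sequence of vectors:

```python
import numpy as np

def exp_smooth(us, beta):
    """Return the smoothed vectors u_bar^(1), u_bar^(2), ... from (4.14)."""
    u_bar = np.zeros_like(us[0])          # initial condition u_bar^(0) = 0
    out = []
    for u in us:
        u_bar = beta * u_bar + (1 - beta) * u
        out.append(u_bar)
    return out

rng = np.random.default_rng(3)
us = [rng.normal(size=2) for _ in range(10)]   # made-up input vectors
beta = 0.9
smoothed = exp_smooth(us, beta)

# Closed form (4.15): u_bar^(t+1) = (1 - beta) sum_tau beta^(t - tau) u^(tau)
t = 9
direct = (1 - beta) * sum(beta ** (t - tau) * us[tau] for tau in range(t + 1))
print(np.allclose(smoothed[t], direct))   # True
```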
¹⁴Note that with our convention of indices, when β = 0 the exponentially smoothed sequence is actually a time shift of the original sequence.
Momentum
Recall that the descent step of basic gradient descent is
α
times the negative gradient. This
implies that in flat regions of the loss landscape, gradient descent essentially stops. Further,
at saddle points or local minima, gradient descent stops completely. Such drawbacks may be
alleviated with the use of momentum.
To get an intuitive feel for the use of momentum in optimization, consider an analogy of rolling a ball downhill on the loss landscape. It gains momentum as it rolls downhill, becoming increasingly faster until it reaches the bottom of a valley. Clearly, the ball keeps memory of the past forces in the form of acceleration. This motivates us to use exponential smoothing on the gradients to obtain an extra step called the momentum update. It is then used as follows,
    v^(t+1) = β v^(t) + (1 − β) ∇C(θ^(t))   (Momentum Update),        (4.16)
    θ^(t+1) = θ^(t) − α v^(t+1)   (Parameter Update),        (4.17)

for scalar momentum parameter β ∈ [0, 1) and learning rate α > 0, starting with v^(0) = 0.
Similar to the general exponential smoothing of (4.14), the momentum update¹⁵ is an exponential smoothing of the gradient vectors. When β = 0, we have the basic gradient descent method, and for larger β, information from previous gradients plays a more significant role. In practice, for deep neural networks, the gradient ∇C(θ^(t)) is replaced with a noisy gradient based on stochastic gradient descent or mini-batches.
The vector v^(t) in (4.16) is called the momentum¹⁶ at the t-th update. Using (4.15), we have

    v^(t+1) = (1 − β) Σ_{τ=0}^{t} β^{t−τ} ∇C(θ^(τ)),   for t = 0, 1, 2, . . . .        (4.18)
That is, the momentum accumulates all the past (weighted) gradients, providing acceleration to the parameter θ updates on downward surfaces. Therefore, for β > 0, the next step taken via (4.17) is not necessarily in the steepest descent direction; instead the direction is dictated by the (exponentially smoothed) average of all the gradients up to the current iteration. Figure 4.6 compares the performance of the gradient descent method with momentum for different values of β on the Rosenbrock function (4.9). We observe that for large β, the momentum method accelerates as it takes downward steps.
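A minimal sketch of the momentum updates (4.16)-(4.17) is given below, mirroring the learning rate scaling α = 0.001/(1 − β) used in Figure 4.6. It assumes the standard form (1 − θ_1)² + 100(θ_2 − θ_1²)² for the Rosenbrock function (4.9); the initial point and iteration count are our own choices.

```python
import numpy as np

def rosenbrock_grad(th):
    """Gradient of (1 - th1)^2 + 100 (th2 - th1^2)^2."""
    t1, t2 = th
    return np.array([-2 * (1 - t1) - 400 * t1 * (t2 - t1 ** 2),
                     200 * (t2 - t1 ** 2)])

def momentum_descent(beta, iters=10_000):
    alpha = 0.001 / (1 - beta)           # scaling as in Figure 4.6
    theta = np.array([-0.2, 0.8])        # arbitrary initial point
    v = np.zeros(2)                      # v^(0) = 0
    for _ in range(iters):
        v = beta * v + (1 - beta) * rosenbrock_grad(theta)   # (4.16)
        theta = theta - alpha * v                            # (4.17)
    return theta

# With beta = 0.9 the iterate typically ends up much closer to the
# optimum (1, 1) than plain gradient descent (beta = 0) does.
print(momentum_descent(0.0), momentum_descent(0.9))
```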
Adaptive Learning Rates per Coordinate
Observe that the updates of the gradient descent method, with or without momentum, use the same learning rate α for all components of θ. However, it is often better to learn the parameters θ_1, . . . , θ_d with potentially a specific learning rate for each coordinate θ_i.
¹⁵One may alternatively parameterize the momentum update (4.16) as v^(t+1) = β v^(t) + ∇C(θ^(t)). This alternative parameterization has the benefit of allowing one to adjust the momentum parameter β without having significant effects on the step size exhibited in (4.17). Our choice of the parameterization as in (4.16) is for consistency with ADAM presented below.
¹⁶In physics, momentum is defined as the product of the mass of an object and its velocity vector. The object's momentum points in the direction of the object's movement. In contrast, here the momentum vector v^(t) points in the direction opposite to the step taken.
Figure 4.6: Application of momentum to the Rosenbrock function (4.9) for three different values of β with learning rate α = 0.001/(1 − β) and a fixed number of iterations. Note that β = 0 is basic gradient descent.
To get a feel for this issue, let us assume a situation where specific parameters are associated with specific features in the data.¹⁷ For instance, in applications of deep neural networks for natural language processing (see also Chapter ??), there are typically words that are less frequent than others. For example, in most texts on trees the word chaulmoogra most likely appears less frequently than the word coconut, even though both are names of tree species. Now assume that there are associated parameters θ_chaulmoogra and θ_coconut, and consider how these would be updated during learning.
What is likely to occur is that each mini-batch would have on average far fewer occurrences of chaulmoogra in comparison to coconut. However, if we update all the parameters at the same learning rate, then θ_coconut would be updated more than θ_chaulmoogra. Hence with a fixed learning rate α, the θ_coconut parameter is likely to converge quickly, while the θ_chaulmoogra parameter will be slow to respond.
A general approach for overcoming this issue is to scale each coordinate of the descent step with a different factor. That is, instead of considering a basic gradient descent step, θ_s = −α∇C(θ), consider descent steps of the form,

    θ_s = −α r ⊙ ∇C(θ),        (4.19)
¹⁷In some simple models such as linear regression, logistic regression, or multinomial regression this is exactly the case for categorical one-hot encoded variables. In more advanced models there is a more complicated implicit mapping between parameters and features.
where r is some vector of positive entries which recalibrates the descent steps and ⊙ is the element-wise product of two vectors.¹⁸ That is, for a parameter such as θ_chaulmoogra we would expect the coordinate r_chaulmoogra to be high in comparison to the coordinate r_coconut which is associated with θ_coconut. Such recalibration of the descent steps would result in large steps for θ_chaulmoogra whenever chaulmoogra appears in a mini-batch, and smaller steps for θ_coconut.
However, in general, finding a good fixed vector r for (4.19) is difficult, especially when the number of parameters is huge and the actual relationship between parameters and features is not clear. Instead, we use adaptive methods that adjust the descent step during the optimization process.
We now introduce two such approaches called adaptive subgradient (or Adagrad) and root mean square propagation (or RMSprop). In both of these approaches the update of coordinate i in iteration t can be represented as,

    θ_i^(t+1) = θ_i^(t) − (α / √(s_i^(t+1) + ε)) ∂C(θ^(t))/∂θ_i,        (4.20)
where s_i^(0), s_i^(1), s_i^(2), . . . is a sequence of non-negative values that are updated adaptively and ε is taken to be a small value of order 1 × 10⁻⁸ to avoid division by zero. That is, with this representation, the descent step at time t represented in terms of (4.19) has r_i = (√(s_i^(t+1) + ε))⁻¹. Generally we aim for s_i to be some smoothed representation of the square of the derivative ∂C(θ^(t))/∂θ_i. Hence, r_i is roughly the inverse of the magnitude of the derivative, and r_i is low when the changes for coordinate i are steep and vice versa.
Vector-wise, the parameter update (4.20) can be represented as

    θ^(t+1) = θ^(t) − (α / √(s^(t+1) + ε)) ⊙ ∇C(θ^(t)),        (4.21)

where s^(t) = (s_1^(t), . . . , s_d^(t)), ε is considered a vector, and the addition, division, and square-root operations are all element-wise operations.
Using the representation (4.20) or (4.21), let us now define how the sequence {s_i^(t)} is computed both for Adagrad and RMSprop. Specifically,

    s_i^(t+1) = Σ_{τ=0}^{t} (∂C(θ^(τ))/∂θ_i)²,                 for Adagrad,
    s_i^(t+1) = γ s_i^(t) + (1 − γ) (∂C(θ^(t))/∂θ_i)²,         for RMSprop,        (4.22)
where the recursive relationship for RMSprop has the initial value s_i^(0) = 0. RMSprop is also parameterized by a decay parameter γ ∈ [0, 1), with typical values at γ = 0.999, implying that for RMSprop the sequence {s_i^(t)} is a relatively slow exponential smoothing of the square of the derivative.
¹⁸The element-wise product is also called the Hadamard product or Schur product.
Adagrad may appear simpler than RMSprop since it does not involve any recursive step or any hyper-parameter like γ; it is simply an accumulation of the squares of the derivatives. However, the crucial drawback of Adagrad is that for each i, the sequence {s_i^(t)} is non-decreasing. As a consequence, the effective learning rate decreases during training, often making it infinitesimally small before convergence. RMSprop was introduced in the context of deep learning after Adagrad, and overcomes this problem via exponential smoothing of the squares of the partial derivatives.
It is also useful to use (4.15) to see that for RMSprop, the explicit (all-time) representation of the vector s^(t+1) is,

    s^(t+1) = (1 − γ) Σ_{τ=0}^{t} γ^{t−τ} ∇C(θ^(τ)) ⊙ ∇C(θ^(τ)),        (4.23)
while for Adagrad a similar representation holds without the (1 − γ) and γ^{t−τ} factors in the formula.
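The contrast between the two accumulators in (4.22) can be seen on a made-up stream of scalar derivatives: Adagrad's s_i only grows, shrinking the effective learning rate α/√(s_i + ε) without bound, while RMSprop's s_i tracks a moving average of recent squared derivatives.

```python
import numpy as np

rng = np.random.default_rng(4)
grads = rng.normal(size=1000)        # made-up stream of dC/dtheta_i values

gamma = 0.9                          # RMSprop decay (illustrative value)
s_adagrad, s_rmsprop = 0.0, 0.0
for g in grads:
    s_adagrad += g ** 2                                    # Adagrad: running sum
    s_rmsprop = gamma * s_rmsprop + (1 - gamma) * g ** 2   # RMSprop: smoothing

# Adagrad's accumulator grows with t (roughly t * E[g^2] here), while
# RMSprop's stays on the scale of E[g^2], which is about 1 here.
print(s_adagrad, s_rmsprop)
```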
Bias Correction for Exponential Smoothing
We have seen that both momentum and RMSprop involve exponential smoothing of the gradients and squares of the gradients, respectively. In both of these cases, we set zero initial conditions at iteration t = 0. Specifically, for momentum we have v^(0) = 0 and for RMSprop we have s^(0) = 0. In the absence of any better initial conditions this is a sensible choice, yet such initial conditions may introduce some initial (short-term) bias. This bias would eventually "wash away" as t grows, but it can play a significant role for short term t. The effect of such bias is especially pronounced when the respective exponential smoothing parameters β or γ are close to 1.
To mitigate such temporal bias, a common technique is to use bias correction. To see this, first observe from (4.14) that since the initial value ū^(0) = 0, the first element of the exponential smoothing is

    ū^(1) = (1 − β) u^(0).

Since β is usually set near 1, ū^(1) is close to zero even when u^(0) is not close to 0. Consequently, ū^(2) stays close to zero even if u^(1) is away from 0, because

    ū^(2) = β ū^(1) + (1 − β) u^(1).

Thus, initial values of the exponential smoothing sequence ū^(0), ū^(1), . . . remain close to 0 even when the elements of the original sequence u^(0), u^(1), . . . are far from 0.
One way to handle such bias is to consider the special case where the input vectors in (4.15) are fixed at u^(t) = u. In such a case, via simple evaluation of a geometric sum we have,

    ū^(t+1) = (1 − β) Σ_{τ=0}^{t} β^{t−τ} u = (1 − β) Σ_{τ=0}^{t} β^{τ} u = (1 − β^{t+1}) u.
Hence if we were to divide the exponentially smoothed value ū^(t) by 1 − β^t, we would get constant vectors when u^(t) = u (constant). Further, in (more realistic) cases where the underlying input sequence u^(0), u^(1), u^(2), . . . is not constant but close to a constant, the bias corrected sequence {ū^(t)/(1 − β^t)} may still do a better job at mitigating initial effects. Clearly as t grows, β^t → 0, and the effect of this bias correction disappears.
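The effect of dividing by 1 − β^t is easy to see on a constant input sequence u^(t) = u, as in the derivation above:

```python
# Smoothing the constant sequence u^(t) = u = 3 with beta = 0.9: the raw
# smoothed values start near (1 - beta) u = 0.3, while the bias corrected
# values u_bar^(t) / (1 - beta^t) recover u exactly at every step.
beta, u = 0.9, 3.0
u_bar = 0.0                                     # initial condition (4.14)
raw, corrected = [], []
for t in range(5):
    u_bar = beta * u_bar + (1 - beta) * u       # smoothing recursion (4.14)
    raw.append(u_bar)
    corrected.append(u_bar / (1 - beta ** (t + 1)))   # bias correction

print(raw[0], corrected)   # 0.3 (far from u), then exactly 3.0 each time
```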
Now we may apply such bias correction to momentum or RMSprop, where we have the specific forms (4.18) or (4.23), respectively. In these cases, the bias corrected updates for v and s are,

    v̂^(t+1) = v^(t+1) / (1 − β^{t+1})   and   ŝ^(t+1) = s^(t+1) / (1 − γ^{t+1}),   t = 0, 1, 2, . . . .        (4.24)
Putting the Pieces Together: ADAM
Now that we understand ideas of momentum, RMSprop, and bias correction, we can piece
these together into a single algorithm, namely the adaptive moment estimation method, or
simply ADAM. Metaphorically, if the execution of the momentum method is a ball rolling
down a slope, the execution of ADAM can be seen as a heavy ball with friction rolling down
the slope.
The key update formula for ADAM is

    θ^(t+1) = θ^(t) − α (1 / √(ŝ^(t+1) + ε)) ⊙ v̂^(t+1),        (4.25)

where all vector operations (division, square root, and addition of ε) are interpreted element-wise. Here v̂^(t+1) and ŝ^(t+1) are bias corrected exponential smoothings of the gradient and the squared gradients as given in (4.24).
Algorithm 4.2: ADAM
Input: Dataset D = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}, objective function C(·) = C(·; D), and initial parameter vector θ_init
Output: Approximately optimal θ
1:  t ← 0 (Initialize iteration counter)
2:  θ ← θ_init, v ← 0, and s ← 0 (Initialize state vectors)
3:  repeat
4:      g ← ∇C(θ) (Compute gradient)
5:      v ← β v + (1 − β) g (Momentum update)
6:      s ← γ s + (1 − γ) (g ⊙ g) (Second moment update)
7:      v̂ ← v / (1 − β^{t+1}) (Bias correction)
8:      ŝ ← s / (1 − γ^{t+1}) (Bias correction)
9:      θ ← θ − α (1/√(ŝ + ε)) ⊙ v̂ (Update parameters)
10:     t ← t + 1
11: until termination condition is satisfied
12: return θ
With ADAM, α is still called the learning rate and is still the most important parameter which one needs to tune. The other parameters are β ∈ [0, 1) for the momentum exponential smoothing as used in (4.16) and γ ∈ [0, 1) for the RMSprop exponential smoothing as used in (4.22). The common defaults for these parameters are β = 0.9 and γ = 0.999.
ADAM is presented in Algorithm 4.2 where Step 4 is the gradient computation, steps 5 and
6 are exponential smoothing (momentum and RMSprop), steps 7 and 8 are bias corrections,
and finally Step 9 is the descent step.
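A direct transcription of Algorithm 4.2 (in its full-gradient form) is sketched below; the quadratic test objective and its target vector are made up for illustration.

```python
import numpy as np

def adam(grad, theta, alpha=0.01, beta=0.9, gamma=0.999, eps=1e-8,
         iters=5000):
    """Algorithm 4.2 with a fixed iteration count as termination condition."""
    v = np.zeros_like(theta)             # momentum state, v^(0) = 0
    s = np.zeros_like(theta)             # second-moment state, s^(0) = 0
    for t in range(iters):
        g = grad(theta)                              # Step 4: gradient
        v = beta * v + (1 - beta) * g                # Step 5: momentum
        s = gamma * s + (1 - gamma) * g * g          # Step 6: second moment
        v_hat = v / (1 - beta ** (t + 1))            # Step 7: bias correction
        s_hat = s / (1 - gamma ** (t + 1))           # Step 8: bias correction
        theta = theta - alpha * v_hat / np.sqrt(s_hat + eps)   # Step 9
    return theta

# Made-up convex test problem: C(theta) = ||theta - target||^2.
target = np.array([2.0, -1.0, 0.5])
grad = lambda th: 2 * (th - target)
theta = adam(grad, np.zeros(3))
print(theta)   # close to [2.0, -1.0, 0.5]
```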
Since the introduction of ADAM in 2014, this algorithm has become the most widely used algorithm (or "optimizer") in deep learning frameworks. In the next section we start to focus on what is typically the most costly computation within the algorithm, namely Step 4, where we evaluate the gradient. Further generalizations of ADAM and other variations of first-order methods are presented in Section 4.5.
4.4 Automatic Differentiation
The computation of gradient vectors is a crucial step in the ADAM algorithm or in any
other first-order optimization method. Practically, in deep learning, the most common
way for computing such gradients is called automatic differentiation and is embodied in
the backpropagation algorithm. In this section we introduce general concepts of automatic
differentiation and then in Section
??
we specialize to the backpropagation algorithm for
deep neural networks.
All popular methods for computing derivatives, gradients, Jacobians, and Hessians follow one of two approaches. In one approach, the algebraic expressions of the derivatives are computed first and then the derivatives for a particular given point are evaluated. For this approach the expressions of the derivatives are either derived manually, which is generally a tedious process, or using computer based symbolic differentiation methods. For example, we have seen explicit gradient expressions in (??) for linear regression and (??) for logistic regression. However, for general deep learning models, explicit expressions do not have such a compact form and instead suffer from the problem of expression swell. Thus an alternative approach is to compute the numerical values of the derivatives at a given point directly. This approach includes standard methods of numerical differentiation as well as automatic differentiation, which is our focus here.
In this section, we first present an overview of numerical and symbolic differentiation. We
then outline key ideas of differentiable programming which is a programming paradigm
that encompasses automatic differentiation. We then present forward mode automatic
differentiation which is a stepping stone towards understanding backward mode automatic
differentiation. Finally, we present backward mode automatic differentiation.
Numerical and Symbolic Differentiation
Suppose we would like to compute the gradient of the objective function C(θ) at a given
parameter vector θ. Numerical differentiation methods use the definition of the partial
derivative, see (??) in Section ??, to compute the gradient ∇C(θ) approximately. In particular,
the most basic form of numerical differentiation approximates the partial derivative via,

∂C(θ)/∂θ_i ≈ ( C(θ + h e_i) − C(θ) ) / h,    (4.26)
for a small constant h > 0, where e_i is the i-th unit vector. Thus to obtain a numerical
estimate of ∇C(θ) we need to evaluate C(·) at θ, and at θ + h e_i for i = 1, …, d, and each
time use (4.26).
A classic problem with numerical differentiation is selection of the constant h. While
mathematically, smaller h provides a better approximation, numerically, small h yields
numerical instability due to round-off errors.[19] Further, in the context of deep learning
where θ is of very high dimension, the major drawback of numerical differentiation is that it
requires us to perform an order of d function evaluations to compute the gradient at a point.
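To make this concrete, here is a minimal plain-Python sketch of forward-difference gradient estimation in the spirit of (4.26); the quadratic test function and the choice h = 1e-6 are our own illustrative assumptions, not from the text.

```python
def numerical_gradient(C, theta, h=1e-6):
    """Approximate the gradient of C at theta via forward differences (4.26).

    Requires d + 1 evaluations of C for a d-dimensional theta."""
    base = C(theta)
    grad = []
    for i in range(len(theta)):
        shifted = list(theta)
        shifted[i] += h            # theta + h * e_i
        grad.append((C(shifted) - base) / h)
    return grad

# Illustrative objective C(theta) = theta_1^2 + 3*theta_2, exact gradient (2*theta_1, 3).
C = lambda th: th[0] ** 2 + 3 * th[1]
g = numerical_gradient(C, [1.0, 2.0])  # approximately [2.0, 3.0]
```

In double precision, shrinking h much below about 1e-8 typically makes such estimates worse, not better, illustrating the round-off issue mentioned above.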
The basic alternative to numerical differentiation is symbolic differentiation. With this
paradigm, instead of directly obtaining numerical values of derivatives, we represent ex-
pressions using computer algebra systems and obtain mathematical expressions for the
derivatives using automated algorithms based on the rules of calculus. At its core, symbolic
differentiation is a very useful tool. However, with deep learning, trying to rely on symbolic
differentiation solely is not practical. A key problem is expression swell where the exact
mathematical representation of partial derivatives of loss functions associated with deep
neural networks may often require excessive resources due to the complexity of the resulting
expressions.
Since in deep learning we are mainly concerned with numerical values of the derivatives
but not with their analytical expressions, the application of symbolic differentiation can
be unwieldy. However, instead of trying to find the expression of the objective function
directly, if it is decomposed into several elementary operations, then we can numerically
compute the gradients of the objective function by obtaining symbolic derivative expressions
of these elementary operations while keeping track of the intermediate numerical values in a
sequential manner. This idea of interleaving between symbolic expressions and numerical
evaluations at elementary levels is the basis of automatic differentiation.
Overview of Differentiable Programming
Let us now introduce basic terminology of automatic differentiation within the world of
differentiable programming. While for deep learning the central application is associated
with the multi-input scalar-output loss C : R^d → R, to get a general feel for differentiable
programming and automatic differentiation, let us first consider general functions,
g : R^d → R^m. That is, when m > 1, the outputs are vector valued.
Differentiable programming refers to a programming paradigm where numerical computer
programs, or code for computer functions, are transformed into other functions which
execute the derivatives, gradients, Jacobians, Hessians, or higher-order derivatives of the
original computer functions. For example, consider a case with d = 3 and m = 2 and some
hypothetical computer programming language with some programmed function g() which
appears in code as follows:
[19] There are other similar schemes that generally exhibit less numerical error than (4.26),
yet any numerical differentiation scheme requires tuning of h.
g(t1, t2, t3) = (t1 - 7*t2 + 5*t3, t2*t3)
We interpret this code as a function which operates on three input numbers θ_1, θ_2, and θ_3
and then returns two numbers of which the first is a linear combination of the three inputs,
θ_1 − 7θ_2 + 5θ_3, and the second is a product of two of the inputs, θ_2θ_3. From a programming
perspective it is a pure function in the sense that it does not depend on any other variables
and does not change any other variables in the system. Thus, the computer function g()
implements a mathematical function g : R^3 → R^2 with,
g(θ_1, θ_2, θ_3) = ( g_1(θ_1, θ_2, θ_3), g_2(θ_1, θ_2, θ_3) ),

where g_1 : R^3 → R and g_2 : R^3 → R.
With differentiable programming and automatic differentiation, we are interested in evaluation
of partial derivatives such as ∂g_i/∂θ_j for i = 1, 2 and j = 1, 2, 3, at specific values of θ_1,
θ_2, and θ_3. For this simple example these derivatives are obviously known analytically and
constitute the Jacobian matrix of g; see (??) in Appendix ??. In our case, the Jacobian is,
J_g(θ_1, θ_2, θ_3) = [ 1    −7    5
                       0    θ_3   θ_2 ],

where the i, j element of the Jacobian J_g is ∂g_i/∂θ_j.
In its most basic form, a differentiable programming system provides constructs such as
partial_derivative(), which gives us the ability to write code such as the following:
partial_derivative(target=g, t=[2.5, 2.1, 1.4], i=2, j=3)
The meaning of such code is a request for the computer to evaluate the numerical value of the
Jacobian of the target function g at θ = (2.5, 2.1, 1.4) for i = 2 and j = 3. That is, we expect
such code to compute ∂g_2/∂θ_3 and return 2.1 as a result in this case. In contrast to symbolic
or numerical differentiation, this computation is to be carried out through mechanisms of
automatic differentiation.
In addition to such general partial derivatives via partial_derivative(), we may also
expect differentiable programming frameworks to expose software constructs such as
gradient() for computing gradient vectors of functions g : R^d → R, jacobian() for
computing the Jacobian matrix of functions g : R^d → R^m, and hessian() for Hessian
matrices of functions g : R^d → R. In any case, the mechanism of computation of these
operations typically involves a transformation of the source code of the original functions,
or of some data structures that represent those functions, into new code or data structures
that encode how derivative evaluation is carried out. See the notes and references at the
end of the chapter for references to contemporary concrete software frameworks that
execute such operations.
Beyond partial_derivative(), gradient(), jacobian(), and hessian(), there are also
two other basic constructs in a differentiable programming framework. These are Jacobian
vector products, jvp(), and vector Jacobian products, vjp(). A Jacobian vector product is
based on θ ∈ R^d and v ∈ R^d and is the evaluation of J_g(θ)v, which results in a vector
in R^m. Similarly a vector Jacobian product is based on θ ∈ R^d and u ∈ R^m and is the
evaluation of uᵀ J_g(θ), which results in a row vector whose transpose is in R^d. Due to
the way in which automatic differentiation operates, jvp() and vjp() are in fact the basic
operations that one may expect from a differentiable programming framework, whereas the
other operations, partial_derivative(), etc., are all built on top of either jvp() or vjp().
Automatic differentiation generally works in two types of modes, namely forward mode,
which is the algorithm behind jvp(), and backward mode, also known as reverse mode,
which is the algorithm behind vjp(). In both cases the multivariate chain rule plays a key
role; see Section ?? in the appendix for a review of the chain rule. The general principle of
automatic differentiation is based on computing intermediate partial derivatives and
incorporating these computed values in an iterative computation. Forward mode automatic
differentiation is a straightforward application of the multivariable chain rule while backward
mode is slightly more involved. Note that the backpropagation algorithm for deep learning
is a special case of backward mode. The subsections below focus on the inner workings of
forward mode and backward mode automatic differentiation, but first let us see how jvp()
and vjp() are generally used in differentiable programming.
As stated above, the basic service provided by forward mode automatic differentiation is the
computation of Jacobian vector products. For instance, returning to the simple d = 3, m = 2
example above, we may invoke forward mode automatic differentiation via code such as,
jvp(target=g, t=[2.5, 2.1, 1.4], v=[0,1,0])
Here we choose θ = (2.5, 2.1, 1.4) as before and v to be the unit vector e_2. The result in
this case is [-7, 1.4]. Setting v as a unit vector[20] can be useful because in general when
using v = e_j, the output J_g(θ)e_j is the j-th column of the Jacobian matrix. Hence in this
example of jvp(), we obtain as output the effect of θ_2 on the two outputs g_1(θ) and g_2(θ),
which is the vector (∂g_1/∂θ_2, ∂g_2/∂θ_2) = (−7, 1.4).
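For this specific g, the Jacobian vector product can be written out by hand. The following plain-Python sketch (our own illustration, not a real framework API) mimics the jvp() call above.

```python
def jvp_g(theta, v):
    """Jacobian vector product J_g(theta) v for the example
    g(t1, t2, t3) = (t1 - 7*t2 + 5*t3, t2*t3)."""
    t1, t2, t3 = theta
    v1, v2, v3 = v
    # The rows of the Jacobian are (1, -7, 5) and (0, t3, t2).
    return [1 * v1 - 7 * v2 + 5 * v3, t3 * v2 + t2 * v3]

out = jvp_g([2.5, 2.1, 1.4], [0, 1, 0])  # v = e_2: second column of the Jacobian
```

Here the call with v = [0, 1, 0] returns [-7.0, 1.4], matching the jvp() output discussed above.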
This shows that to implement partial_derivative() for some ∂g_i/∂θ_j at θ, we can compute
J_g(θ)e_j (the j-th column of the Jacobian matrix) using jvp() and take the i-th index of
the output. Specifically a single application of jvp() on v = e_j provides us with the partial
derivatives ∂g_i/∂θ_j for all i = 1, …, m. Hence Jacobian vector products provided by forward
mode automatic differentiation are useful when the output dimension m is large and we wish
to only see the effect of a single, or a few, input variables θ_j on the output.
However, note that in the deep learning loss function application, where m = 1 and d
is large, Jacobian vector products are not an efficient means for evaluating the gradient.
Specifically, when we wish to evaluate gradient() using jvp() we need to invoke jvp() d
separate times, each time with v = e_j for j = 1, …, d. Thus using forward mode automatic
differentiation (Jacobian vector products) is not efficient for deep learning loss landscape
gradient evaluation.
The alternative to jvp() is to use vector Jacobian products as exposed via backward mode
automatic differentiation. With this approach we may invoke code such as,
vjp(target=g, t=[2.5, 2.1, 1.4], u=[1,0])
In this example we use the same θ as above and set u = e_1 ∈ R^2. Since the vector Jacobian
product evaluates uᵀ J_g(θ), the output is a row vector of dimension d = 3 which is the first
row of J_g(θ), namely ∇g_1(θ)ᵀ. More generally, applying vector Jacobian products on
u = e_i yields the transposed gradient ∇g_i(θ)ᵀ.
In the deep learning loss minimization scenario d is very large and m = 1. Thus the Jacobian
is simply the transpose of the gradient and is a long row vector. In such a case, computing
[20] We can also use more general (non unit) vectors and obtain directional derivatives, yet
in the remainder of our discussion the inputs to Jacobian vector products (v) and vector
Jacobian products (u) are unit vectors.
the vector Jacobian product with u = 1 (scalar) yields the gradient of the loss function
directly. Thus, since backward mode automatic differentiation implements vjp(), using it on
the loss function is directly suited for deep learning.
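Analogously to the jvp() sketch earlier, a hand-written vector Jacobian product for the same example g can be written as follows (again our own plain-Python illustration, not a framework implementation).

```python
def vjp_g(theta, u):
    """Vector Jacobian product u^T J_g(theta) for the example
    g(t1, t2, t3) = (t1 - 7*t2 + 5*t3, t2*t3)."""
    t1, t2, t3 = theta
    u1, u2 = u
    # Combine the Jacobian rows (1, -7, 5) and (0, t3, t2) with weights u1, u2.
    return [u1 * 1 + u2 * 0, u1 * (-7) + u2 * t3, u1 * 5 + u2 * t2]

row = vjp_g([2.5, 2.1, 1.4], [1, 0])  # u = e_1: first row of the Jacobian
```

Here the call with u = [1, 0] returns [1, -7, 5], the gradient of g_1 transposed; with u = [0, 1] it would instead return [0, 1.4, 2.1], the gradient of g_2 transposed.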
We now explore the inner workings of forward mode automatic differentiation followed by
backward mode automatic differentiation. In both cases we restrict to m = 1. Understanding
forward mode is a useful stepping stone to understanding backward mode. Note that forward
mode and backward mode involve internal implicit computations of Jacobian vector products
or vector Jacobian products, respectively.[21]
The Computational Graph and Forward Mode Automatic Differentiation
Computational graphs play a key role in the implementation of automatic differentiation. A
computational graph for a multivariate function C : R^d → R is a directed graph where the
nodes correspond to elementary operations involved in the mathematical expression of C(θ).
Inputs to the graph are the variables θ_1, …, θ_d and constants if required, and the output is
the function value C(θ).
Figure 4.7: The computational graph of (4.27) with nodes numbered 1 to 6. The primal a_i
of each node is the result of an elementary operation applied on its inputs.
As a simple example with d = 3, consider,

C(θ) = sin(θ_1 θ_2) + exp(θ_2 + θ_3) − 2θ_3.    (4.27)
The computational graph for this function, presented in Figure 4.7, is composed of nodes with
inputs and outputs. The output of each node is an elementary operation on the inputs to the
[21] It may also be useful to see the mathematical representations of Jacobian vector
products and vector Jacobian products through a composition of functions. For this we may
see (??), (??), and (??), in Appendix ??.
node. These elementary operations include summation, multiplication, sin(·), cos(·), exp(·),
etc. Nodes are numbered 1, 2, …, and we denote the output of node i with a_i = f_i(u_1, …, u_{ℓ_i})
when there are ℓ_i inputs to node i, denoted as u_1, …, u_{ℓ_i}. In this example there are 6 nodes,
and taking node 6 as an illustration, we have a_6 = f_6(a_3, a_4, a_5) where ℓ_6 = 3 and f_6(·) is
the summation of its input arguments. Note that a_i is sometimes called the primal of that
node in a given computation.
The evaluation at each node in the computational graph is executed at the instance in which
inputs to the node become available. This then implies the notion of iterations, where a node
is considered to be evaluated at a given iteration if the inputs to that node are all available
at that iteration but not previously. As an example, with θ = (π/6, 2, 5), the iterations for
(4.27) are summarized in Table 4.1.
Table 4.1: Evaluation of the value of C(θ) in (4.27) at an example θ via the computational graph
in Figure 4.7, computing the primals a_i for i = 1, …, 6.

                        General expressions of primals       Values of primals at specific θ
Input                   θ = (θ_1, θ_2, θ_3)                  θ = (π/6, 2, 5)
Iteration 1             a_1 = θ_1 θ_2                        a_1 = π/3
                        a_2 = θ_2 + θ_3                      a_2 = 7
                        a_3 = −2θ_3                          a_3 = −10
Iteration 2             a_4 = sin(a_1)                       a_4 = √3/2
                        a_5 = exp(a_2)                       a_5 = e^7
Iteration 3 (output)    a_6 = a_4 + a_5 + a_3 = C(θ)         a_6 = C(θ) = √3/2 + e^7 − 10
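The forward evaluation of Table 4.1 can be mimicked in a few lines of plain Python (our own sketch; the node operations follow Figure 4.7, with the third node computing −2θ_3).

```python
import math

def primals(theta):
    """Forward evaluation of the computational graph of (4.27)."""
    t1, t2, t3 = theta
    a1 = t1 * t2        # iteration 1
    a2 = t2 + t3
    a3 = -2 * t3
    a4 = math.sin(a1)   # iteration 2
    a5 = math.exp(a2)
    a6 = a4 + a5 + a3   # iteration 3: the output C(theta)
    return [a1, a2, a3, a4, a5, a6]

a = primals([math.pi / 6, 2, 5])  # a[5] is C(theta) = sqrt(3)/2 + e^7 - 10
```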
As mentioned above, forward mode automatic differentiation implements Jacobian vector
products. In our exposition here, since the target function C(·) is scalar valued (m = 1),
the Jacobian is the transpose of the gradient and, with v ∈ R^d, the Jacobian vector product,
J_C(θ)v, is a scalar.[22] Further, our exposition focuses on v = e_j, implying that the output
of forward mode automatic differentiation is ∂C(θ)/∂θ_j. Hence for our exposition here, j is
fixed.
The key idea of forward mode automatic differentiation is to maintain a record of intermediate
derivatives as computation progresses through the computational graph. For this we denote,

ȧ_i = ∂a_i/∂θ_j,    (4.28)
and call ȧ_i the tangent at node i, for fixed j. Now since a_i = f_i(u_1, …, u_{ℓ_i}), using the
multivariable chain rule (see Appendix ??), we have

ȧ_i = Σ_{k=1}^{ℓ_i} ( ∂f_i(u_1, …, u_{ℓ_i}) / ∂u_k ) u̇_k = ∇f_i(u_1, …, u_{ℓ_i})ᵀ u̇,    (4.29)
[22] It is in fact the directional derivative of C(·) in the direction v.
with u̇_k denoting the tangent of u_k and with u̇ denoting the vector of these tangents.
Returning to Figure 4.7 as an example, we have that ∇f_6(u_1, u_2, u_3) = (1, 1, 1)ᵀ and thus
ȧ_6 = ȧ_3 + ȧ_4 + ȧ_5. Similarly, in that figure, f_4′(u_1) = cos(u_1) and thus ȧ_4 = cos(a_1) ȧ_1.
Since (4.29) defines a recursive computation of the tangents, we also require the inputs to
the function, θ_1, …, θ_d, to have tangents. We set these as the elements of the vector v ∈ R^d
used in the Jacobian vector product. Specifically, here we focus on Jacobian vector products
with v = e_j since we are seeking the derivative with respect to a specific input θ_j. Hence we
set the tangent values for the inputs of the function as θ̇_j = 1 and θ̇_i = 0 for i ≠ j.
For forward mode automatic differentiation, we progress on the computational graph in
the forward direction from the input to the output, evaluating both the primal a_i and
the tangent ȧ_i at each node. As we progress, in each iteration, we consider the set of all
nodes i whose predecessors already have primal values and tangent values available, and for
each of these nodes we compute the primal values as was already shown in Table 4.1, and
importantly also compute the tangent values via (4.29). This computation may be broken
into iterations, in a similar way to the iterations of the function evaluation computed in
Table 4.1.
In particular, since f_i(·) denotes an elementary operation, such as the function exp(·) at
node 5 in Figure 4.7, we assume the availability of a symbolic expression for each
∂f_i(u_1, …, u_{ℓ_i})/∂u_k, or alternatively an ability to numerically compute it exactly.
Therefore, since we know the numerical primal values u_1, …, u_{ℓ_i} and the numerical
tangent values u̇_1, …, u̇_{ℓ_i}, we can easily compute ȧ_i using (4.29). Consequently, the
numerical value of ∂C(θ)/∂θ_j can be computed at the end of forward mode as it is the
tangent value of the output node of the computational graph.
Table 4.2 illustrates forward mode automatic differentiation using the example function C(θ)
in (4.27). The first column presents the general expressions and the other column computes
the partial derivative of C(θ) with respect to θ_2 at a specific point θ.
Table 4.2: Forward mode automatic differentiation for the example function (4.27) and j = 2,
i.e., we seek the derivative with respect to θ_2. The right column illustrates computation[23] of the
derivative ∂C(θ)/∂θ_2 at θ = (π/6, 2, 5), which is obtained via the final tangent value
ȧ_6 = π/12 + e^7.
              General expressions of primals and tangents        Values of primals and tangents at specific θ for j = 2
Start         θ = (θ_1, θ_2, θ_3), θ̇ = (θ̇_1, θ̇_2, θ̇_3)            θ = (π/6, 2, 5), θ̇ = (0, 1, 0)
Iteration 1   (a_1, ȧ_1) = (θ_1θ_2, θ_1θ̇_2 + θ̇_1θ_2)              (a_1, ȧ_1) = (π/3, π/6)
              (a_2, ȧ_2) = (θ_2 + θ_3, θ̇_2 + θ̇_3)                 (a_2, ȧ_2) = (7, 1)
              (a_3, ȧ_3) = (−2θ_3, −2θ̇_3)                         (a_3, ȧ_3) = (−10, 0)
Iteration 2   (a_4, ȧ_4) = (sin(a_1), cos(a_1) ȧ_1)               (a_4, ȧ_4) = (√3/2, π/12)
              (a_5, ȧ_5) = (exp(a_2), exp(a_2) ȧ_2)               (a_5, ȧ_5) = (e^7, e^7)
Iteration 3   (a_6, ȧ_6) = (a_3 + a_4 + a_5, ȧ_3 + ȧ_4 + ȧ_5)     (a_6, ȧ_6) = (√3/2 + e^7 − 10, π/12 + e^7)
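A plain-Python sketch of this forward mode pass for (4.27) follows (our own illustration, hard-coded for this graph; each node carries a (primal, tangent) pair exactly as in Table 4.2).

```python
import math

def forward_mode(theta, j):
    """Forward mode AD on the graph of (4.27); returns dC/dtheta_j (j is 0-indexed)."""
    dt = [0.0, 0.0, 0.0]
    dt[j] = 1.0                                # seed tangents: v = e_j
    t1, t2, t3 = theta
    d1, d2, d3 = dt
    a1, da1 = t1 * t2, t1 * d2 + d1 * t2       # product rule
    a2, da2 = t2 + t3, d2 + d3
    a3, da3 = -2 * t3, -2 * d3
    a4, da4 = math.sin(a1), math.cos(a1) * da1
    a5, da5 = math.exp(a2), math.exp(a2) * da2
    a6, da6 = a3 + a4 + a5, da3 + da4 + da5    # output node
    return da6

d2 = forward_mode([math.pi / 6, 2, 5], j=1)    # derivative w.r.t. theta_2
```

Running it with j = 0 and j = 2 recovers the remaining entries of the gradient, at the cost of one extra pass per coordinate.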
Backward Mode Automatic Differentiation
As mentioned above, backward mode automatic differentiation implements vector Jacobian
products. This allows us to use a single run of backward mode automatic differentiation
to compute the gradient vector for C : R^d → R. For deep learning this is a significant
improvement over forward mode automatic differentiation, which would require d executions
to obtain such a gradient.
For a given point θ, backward mode automatic differentiation is executed in two phases, often
called the forward pass and the backward pass. The forward pass is executed before the
backward pass. In the forward pass, the algorithm simply evaluates the computational
graph to compute the values of the intermediate primal variables a_i while recording the
dependencies of these variables in a bookkeeping manner, similar to the evaluation in
Table 4.1.
In the backward pass, all the derivatives of the objective function are computed by using
intermediate derivatives called adjoints. Specifically, for each intermediate variable a_i, the
adjoint is

ζ_i = ∂C/∂a_i.    (4.30)
The adjoint ζ_i captures the rate of change in the final output C with respect to the variable a_i.
Contrast this with the tangent ȧ_i from forward mode automatic differentiation, as in (4.28),
which captures the rate of change of a_i with respect to an input variable.
In the backward pass we populate the values of the adjoint variables ζ_i. Specifically, we
progress on the computational graph in the backward direction, starting from the output
node and propagating towards the input. For this, say that the output of node i is input to
ĩ nodes with indices i_1, …, i_ĩ. Then the only way that a_i can influence the output C is by
influencing a_{i_1}, …, a_{i_ĩ}. Now using the multivariable chain rule,

∂C/∂a_i = Σ_{k=1}^{ĩ} ( ∂a_{i_k}/∂a_i ) ( ∂C/∂a_{i_k} ),
and thus using the notation of (4.30),

ζ_i = Σ_{k=1}^{ĩ} ( ∂a_{i_k}/∂a_i ) ζ_{i_k}.    (4.31)
The relationship (4.31) is used in the iterations of the backward pass. In every iteration, the
adjoints ζ_{i_k} in the right hand side of (4.31) are available from previous iterations and are
used to compute ζ_i. In particular, at the start of the backward pass we initialize the adjoint
of the output node at 1. Further, recalling the notation used for forward mode automatic
differentiation, a_{i_k} = f_{i_k}(u_1, …, u_{ℓ_{i_k}}) where u_1, …, u_{ℓ_{i_k}} are the inputs to node i_k and f_{i_k}(·)
has readily computable derivatives with respect to its inputs. Hence, since the primals of
u_1, …, u_{ℓ_{i_k}} were already populated during the forward pass, we compute ∂a_{i_k}/∂a_i locally at
node i and use it in (4.31).
[23] To verify the computation recall the basic linearity and product rules of scalar
differentiation. Also, remember that the derivative of sin(·) is cos(·), that the derivative of
exp(·) is exp(·), and that cos(π/3) = 1/2.
Table 4.3: The backward pass of backward mode automatic differentiation for the example function
(4.27) to compute its gradient at θ = (π/6, 2, 5). The result is ∇C(θ) = (1, π/12 + e^7, e^7 − 2).
This is computed after the forward pass is executed; see Table 4.1.
              General expressions of adjoints                                              Values of adjoints at specific θ
Start         ζ_6 = ∂C(θ)/∂a_6                                                             ζ_6 = 1
Iteration 1   ζ_4 = (∂a_6/∂a_4) ζ_6 = 1 × ζ_6                                              ζ_4 = 1
              ζ_5 = (∂a_6/∂a_5) ζ_6 = 1 × ζ_6                                              ζ_5 = 1
              ζ_3 = (∂a_6/∂a_3) ζ_6 = 1 × ζ_6                                              ζ_3 = 1
Iteration 2   ζ_1 = (∂a_4/∂a_1) ζ_4 = cos(a_1) ζ_4                                         ζ_1 = cos(π/3) = 1/2
              ζ_2 = (∂a_5/∂a_2) ζ_5 = exp(a_2) ζ_5                                         ζ_2 = e^7
Iteration 3   ∂C(θ)/∂θ_1 = (∂a_1/∂θ_1) ζ_1 = θ_2 ζ_1                                       ∂C(θ)/∂θ_1 = 2 × 1/2 = 1
              ∂C(θ)/∂θ_2 = (∂a_1/∂θ_2) ζ_1 + (∂a_2/∂θ_2) ζ_2 = θ_1 ζ_1 + 1 × ζ_2           ∂C(θ)/∂θ_2 = π/6 × 1/2 + e^7 = π/12 + e^7
              ∂C(θ)/∂θ_3 = (∂a_2/∂θ_3) ζ_2 + (∂a_3/∂θ_3) ζ_3 = 1 × ζ_2 − 2 × ζ_3           ∂C(θ)/∂θ_3 = e^7 − 2
The elements of the gradient ∇C(θ) depend on the adjoints in a similar manner to (4.31).
Specifically,

∂C/∂θ_j = Σ_k ( ∂a_{j_k}/∂θ_j ) ζ_{j_k},    (4.32)

where the summation is over all nodes, indexed via j_k, that take θ_j as input in the
computational graph. Thus, in the final iteration of the backward pass, we use (4.32) to
obtain the gradient ∇C(θ).
Let us return to the example in (4.27) with the computational graph of Figure 4.7 evaluated
at θ = (π/6, 2, 5). Evaluation of ∇C(θ) via backward mode automatic differentiation first
executes a forward pass as previously illustrated in Table 4.1. This populates the primals
a_1, …, a_6. Then the backward pass progresses via repeated application of (4.31) and with
application of (4.32) in the final iteration. This backward pass computation is summarized
in Table 4.3.
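The two passes can be sketched in plain Python as follows (our own illustration of Tables 4.1 and 4.3, hard-coded for the graph of (4.27)).

```python
import math

def backward_mode(theta):
    """Backward mode AD for (4.27): one forward pass and one backward pass,
    returning the full gradient of C."""
    t1, t2, t3 = theta
    # Forward pass: populate the primals (Table 4.1).
    a1, a2, a3 = t1 * t2, t2 + t3, -2 * t3
    a4, a5 = math.sin(a1), math.exp(a2)
    a6 = a3 + a4 + a5
    # Backward pass: propagate adjoints via (4.31) (Table 4.3).
    z6 = 1.0                  # adjoint of the output node
    z4 = 1.0 * z6             # da6/da4 = 1
    z5 = 1.0 * z6             # da6/da5 = 1
    z3 = 1.0 * z6             # da6/da3 = 1
    z1 = math.cos(a1) * z4    # da4/da1 = cos(a1)
    z2 = math.exp(a2) * z5    # da5/da2 = exp(a2)
    # Final iteration, as in (4.32): gather contributions of each theta_j.
    return [t2 * z1, t1 * z1 + 1.0 * z2, 1.0 * z2 - 2.0 * z3]

grad = backward_mode([math.pi / 6, 2, 5])  # (1, pi/12 + e^7, e^7 - 2)
```

One forward and one backward sweep produce all three partial derivatives, in contrast to the three separate forward mode passes needed above.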
4.5 Additional Techniques for First-Order Methods
In Section 4.3 we have seen one of the most popular deep learning optimization algorithms,
ADAM, as well as key milestone techniques that have led to its development. Then in
Section 4.4 we explored ways to compute the gradient ∇C(θ). In this section, we explore
other first-order optimization ideas that are less popular in practical deep learning, but still
embody useful principles.
We begin the discussion by considering a modification of the momentum technique to a
variant called Nesterov momentum.[24] The idea of this method has also found its way into
an algorithm called Nadam (Nesterov ADAM). We continue by considering variants of
Adagrad and RMSProp for adaptive learning rates per coordinate. The variants we describe
are called Adadelta and Adamax. While as of today, these additional techniques are not
common in deep learning practice, they have made significant impact earlier on and may
be fruitful in the future as well. In a sense, momentum (covered in Section 4.3), ADAM,
Nesterov momentum, Nadam, Adagrad (covered in Section 4.3), Adadelta, and Adamax are
all variants and improvements of basic gradient descent. Thus, our exposition in this section
aims to complete the picture on how basic gradient descent may be improved.
We close this section with an overview of line search techniques. With such techniques,
once a descent direction is determined, we seek to move in that direction with a step size
which minimizes loss over the next step in that direction. There are several variants for this
approach, and we present an overview of these variants.
Nesterov Momentum and the Nadam Algorithm
We have seen that the momentum updates as in (4.16) and (4.17) accelerate like a ball
rolling downhill. An issue with this method is that the steps do not slow down after reaching
the bottom of a valley, and hence there is a tendency to overshoot the valley floor. Therefore,
if we know approximately where the ball will be after each step, we can slow down the ball
before the hill slopes up again.
The idea of Nesterov momentum tries to improve on standard momentum by using the
gradient at the predicted future position instead of the gradient of the current position. The
update equations are,
v^(t+1) = β v^(t) + (1 − β) ∇C( θ^(t) − αβ v^(t) ),    (4.33)

θ^(t+1) = θ^(t) − α v^(t+1),    (4.34)
for constants β ∈ [0, 1) and α > 0, with v^(1) = 0. Compare (4.33) and (4.34) with (4.16)
and (4.17). The difference here is that the gradient is computed at a predicted next point
θ^(t) − αβv^(t), which is a proxy for the (unseen) actual next point θ^(t+1) = θ^(t) − αv^(t+1)
when β ≈ 1.
Implementing Nesterov momentum via (4.33) and (4.34) requires evaluation of the gradient
at the predicted points instead of the actual points. This may incur overheads, especially
when incorporated as part of other algorithms. For example, if we also require ∇C(θ^(t)) at
each iteration for purposes such as RMSprop, then the gradient needs to be computed twice
instead of once per iteration; once for ∇C(θ^(t)) and once for ∇C(θ^(t) − αβv^(t)). For this
[24] Nesterov momentum is also sometimes called Nesterov acceleration or Nesterov
accelerated gradient.
reason, and also for the simplicity of evaluating gradients only at θ^(t), a look-ahead
momentum mechanism is sometimes used in place of (4.33) and (4.34).
To see how this mechanism works, first revisit the basic momentum update equations (4.16)
and (4.17) and observe that the parameter update can be represented as,

θ^(t+1) = θ^(t) − α( β v^(t) + (1 − β) ∇C(θ^(t)) ).    (4.35)
Observe that v^(t) is based on gradients at the points θ^(0), …, θ^(t−1), but not on the gradient
at θ^(t). Hence a way to incorporate this last gradient is with look-ahead momentum, where
we replace v^(t) of (4.35) by v^(t+1). This achieves behaviour similar to Nesterov momentum.
With such a replacement, we arrive at update equations of the form,

v^(t+1) = β v^(t) + (1 − β) ∇C(θ^(t)),    (4.36)

θ^(t+1) = θ^(t) − α( β v^(t+1) + (1 − β) ∇C(θ^(t)) ),    (4.37)

where the bracketed term β v^(t+1) + (1 − β) ∇C(θ^(t)) in (4.37) is the look-ahead momentum.
Using (4.36) and (4.37) aims to provide similar behaviour to the Nesterov momentum
equations (4.33) and (4.34), yet only requires evaluation of gradients at θ^(t) as opposed
to a shifted point as in (4.33). While look-ahead momentum is not equivalent to Nesterov
momentum, the gist of both methods is similar.
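As a concrete sketch, here is one Nesterov momentum iteration per (4.33) and (4.34) in plain Python, applied to a toy quadratic; the test function and the constants α = 0.1, β = 0.9 are our own illustrative choices.

```python
def nesterov_step(theta, v, grad, alpha=0.1, beta=0.9):
    """One Nesterov momentum update following (4.33) and (4.34)."""
    look = [t - alpha * beta * vi for t, vi in zip(theta, v)]  # predicted next point
    g = grad(look)                                             # gradient at the prediction
    v = [beta * vi + (1 - beta) * gi for vi, gi in zip(v, g)]
    theta = [t - alpha * vi for t, vi in zip(theta, v)]
    return theta, v

# Toy objective C(theta) = theta_1^2 + theta_2^2 with gradient 2*theta.
grad = lambda th: [2 * t for t in th]
theta, v = [1.0, -1.0], [0.0, 0.0]
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad)
# theta is now close to the minimizer (0, 0)
```

Swapping `grad(look)` for `grad(theta)` in the sketch recovers basic momentum, which makes the single difference between the two methods explicit.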
Now with update equations like (4.36) and (4.37), it is straightforward to adapt ADAM to
include behavior similar to Nesterov momentum. This is done using similar steps to those
used to construct ADAM using momentum, RMSprop, and bias correction in Section 4.3.
The difference here is that we use the look-ahead momentum equations (4.36) and (4.37).
This algorithm is called Nadam.
In short, Nadam can be viewed exactly as the ADAM procedure in Algorithm 4.2, yet with
line 7 replaced by,

v̂ ← ( β / (1 − β^(t+2)) ) v + ( (1 − β) / (1 − β^(t+1)) ) g.
In this case, v̂ is not just a bias corrected version of v as in ADAM, since it also incorporates
look-ahead momentum, which is the key extra feature that Nadam adds to ADAM.
Adadelta
Recall the computation of s^(t+1) for the Adagrad method as in (4.22). When Adagrad's
s^(t+1) is used in (4.21), it has the problem of monotonically decreasing effective learning rates.
This is one of the reasons that eventually RMSprop became more popular than Adagrad.
However, there are other alternatives that are also very popular. One such alternative is the
Adadelta method, which uses exponential smoothing of the squared gradients as in RMSprop,
but also uses another exponentially smoothed sequence of the squares of the descent steps.
A key motivation for Adadelta is the observation that the update equation (4.21) for
RMSprop or Adagrad uses a unit-less quantity as the descent step. Specifically, in (4.21) the
only unit-full quantity in the coefficient multiplying ∇C(θ^(t)) is the reciprocal of √(s^(t+1)).
Hence that coefficient has units which are the inverse of the gradient and these cancel out
the units of the gradient implying that the descent step is unit-less.
With Adadelta, the update equation (4.21) is modified so that the descent step maintains
the same units as the gradient, via

θ^(t+1) = θ^(t) − ( √(∆θ^(t) + ε) / √(s^(t+1) + ε) ) ∇C(θ^(t)),    (4.38)

where the entire term subtracted from θ^(t) is the descent step, denoted ∇̃C(θ^(t)).
In (4.38), ∆θ^(t) is adaptively adjusted yet has the same units as s^(t+1), making the
coefficient of the gradient unit-free. Compare (4.38) with (4.21) to observe that ∆θ^(t) + ε
replaces the learning rate α. This also means that Adadelta is “learning rate free”. Instead,
a potentially more robust parameter ρ ∈ [0, 1) is used, similarly to the γ parameter for
RMSprop. This parameter specifies how to exponentially smooth the squares of the descent
steps.
Using the descent step of (4.38), the update equation for ∆θ^(t) is,

∆θ^(t) = ρ ∆θ^(t−1) + (1 − ρ) ∇̃C(θ^(t−1)) ⊙ ∇̃C(θ^(t−1)),
starting with ∆θ^(0) = 0. Then at iteration t, updating θ^(t+1) via (4.38), we use both ∆θ^(t)
and s^(t+1), where the latter is updated via the RMSprop update equation as in (4.22).
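A single Adadelta iteration can be sketched as follows (plain Python, our own illustration; the choices ρ = 0.95 and ε = 1e-6 and the toy quadratic are assumed values, and all smoothing is done elementwise).

```python
def adadelta_step(theta, s, delta, grad, rho=0.95, eps=1e-6):
    """One Adadelta update per (4.38): note the absence of a learning rate.

    s smooths squared gradients as in RMSprop; delta smooths squared descent steps."""
    g = grad(theta)
    s = [rho * si + (1 - rho) * gi * gi for si, gi in zip(s, g)]
    step = [((di + eps) ** 0.5) / ((si + eps) ** 0.5) * gi
            for di, si, gi in zip(delta, s, g)]
    theta = [t - st for t, st in zip(theta, step)]
    delta = [rho * di + (1 - rho) * st * st for di, st in zip(delta, step)]
    return theta, s, delta

grad = lambda th: [2 * t for t in th]  # toy quadratic C(theta) = theta_1^2 + theta_2^2
theta, s, delta = adadelta_step([1.0, -1.0], [0.0, 0.0], [0.0, 0.0], grad)
```

Note the ordering: the squared-step accumulator `delta` from the previous iteration is used to form the current step, and only afterwards is it updated with the new step.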
Other Norms and Adamax
We have already seen that the quantities s_i^(t) used in RMSprop or Adadelta as in (4.22)
are useful for adaptive learning rates per coordinate. With these methods, we may loosely
interpret √(s_i^(t)) as an estimate of the standard (L_2) Euclidean norm of the sequence of
gradients for coordinate i, smoothed up to time t.
In principle, one may wish to use other L_p norms (see Appendix ??) in place of the Euclidean
norm. For this we may modify the RMSprop recursion as in (4.22) to the form,
s_i^(t+1) = γ^p s_i^(t) + (1 − γ^p) |g_i^(t)|^p,  with  g_i^(t) = ∂C(θ^(t))/∂θ_i,    (4.39)
and then interpret ( s_i^(t) )^(1/p) as an estimate of the smoothed L_p norm of the sequence
of gradients for coordinate i up to time t. Note that for convenience of the calculations below,
we reparameterize the exponential smoothing parameter to be γ^p in place of γ.
With a recursion such as (4.39) we can obtain an “ADAM like” algorithm that updates
parameters via an update rule such as,

θ^(t+1) = θ^(t) − α ( 1 / ( (ŝ^(t+1))^(1/p) + ε ) ) v̂^(t+1),    (4.40)
where v̂^(t+1) is some bias-corrected momentum value, and ŝ^(t+1) is a bias corrected value
obtained from (4.39); compare (4.40) to (4.25). This additional hyper-parameter p,
determining which L_p norm to use, can then potentially be tuned. However, for non-small p,
computations may often exhibit excessive numerical error due to overflow since we are raising
the gradient coordinate value to high powers and then taking a low power of it.
Adamax is a variant of this approach which, instead of using the L_p norm, uses the L_∞ norm. Specifically, in place of (s_i^(t))^{1/p} we use r_i^(t), which is recursively defined as,

    r_i^(t+1) = max[ γ r_i^(t), |g_i^(t)| ],    with    r_i^(0) = 0.        (4.41)
Then in place of (4.40) we use,

    θ^(t+1) = θ^(t) − α (1 / (r^(t+1) + ε)) ⊙ v̂^(t+1).        (4.42)
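A per-coordinate sketch of the Adamax updates (4.41) and (4.42) may look as follows. The hyper-parameter values, the bias correction convention for the momentum term, and the quadratic test objective are illustrative choices of ours, not a definitive implementation.

```python
def adamax_step(theta, grad, v, r, t, alpha=0.01, beta=0.9, gamma=0.999, eps=1e-8):
    v = beta * v + (1 - beta) * grad           # smoothed (momentum) gradient
    r = max(gamma * r, abs(grad))              # L-infinity accumulator, as in (4.41)
    v_hat = v / (1 - beta ** t)                # bias-corrected momentum
    theta = theta - alpha * v_hat / (r + eps)  # update, as in (4.42)
    return theta, v, r

# Minimize C(theta) = theta^2 (gradient 2 * theta), starting from theta = 5.
theta, v, r = 5.0, 0.0, 0.0
for t in range(1, 2001):
    theta, v, r = adamax_step(theta, 2 * theta, v, r, t)
```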
Note that the updates (4.41) and (4.42) are well justified as approximations of (4.39) and (4.40). To see this, recall that a sequence of L_p norms converges to the L_∞ norm, and thus,

    r_i^(t+1) = lim_{p→∞} (s_i^(t+1))^{1/p}
              = lim_{p→∞} (1 − γ^p)^{1/p} ( Σ_{τ=0}^{t} γ^{p(t−τ)} |g_i^(τ)|^p )^{1/p}
              = lim_{p→∞} ( Σ_{τ=0}^{t} γ^{p(t−τ)} |g_i^(τ)|^p )^{1/p}
              = max_{τ=0,...,t} γ^{(t−τ)} |g_i^(τ)|
              = max[ γ max_{τ=0,...,t−1} γ^{(t−1−τ)} |g_i^(τ)|, |g_i^(t)| ].

Here, moving from the first line to the second line, we made use of the general representation of exponential smoothing (4.15), and moving from the third line to the fourth line we use the fact that L_p vector norms converge to the L_∞ vector norm as p → ∞. This then justifies (4.41) as a limiting case.
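The limiting relationship can also be checked numerically: running the recursion (4.39) with a large but finite p alongside the max recursion (4.41) on a made-up scalar gradient sequence, the two quantities nearly coincide. (The gradient values and the choice p = 200 below are our own illustrative choices.)

```python
gamma, p = 0.9, 200.0
grads = [0.3, -1.2, 0.7, 0.05, -0.4]   # a made-up gradient sequence

s, r = 0.0, 0.0
for g in grads:
    s = gamma ** p * s + (1 - gamma ** p) * abs(g) ** p   # (4.39), parameter gamma^p
    r = max(gamma * r, abs(g))                            # (4.41)

lp_estimate = s ** (1.0 / p)   # nearly equals r when p is large
```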
Line Search
So far all the methods we considered were variants of gradient descent. When these are viewed as a special case of the general descent direction method of Algorithm 4.1, updates are of the form θ^(t+1) = θ^(t) + θ_s^(t). In the case of basic gradient descent we have descent steps θ_s^(t) = −α∇C(θ^(t)), yet in general we can represent the descent step as θ_s^(t) = α θ_d^(t), where we call θ_d^(t) a descent direction. With basic gradient descent, θ_d^(t) = −∇C(θ^(t)). In any case, for some prescribed descent direction, the descent step θ_s^(t) is determined by the choice of α.
Line search is an approach where we seek the best step size for a given descent direction in each iteration. Specifically, in iteration t, given a descent direction θ_d^(t), we determine the α for that iteration, denoted as α^(t), via,

    α^(t) = argmin_α C(θ^(t) + α θ_d^(t)).        (4.43)
That is, each iteration involves a one dimensional minimization problem determining α^(t), which is used for the update θ^(t+1) = θ^(t) + α^(t) θ_d^(t). Hence with line search, the next point θ^(t+1) is a minimizer of C(·) along the ray,

    {θ^(t) + α θ_d^(t) : α > 0}.        (4.44)
When we pick α^(t) to exactly optimize (4.43), the method is called exact line search. In very specific cases exact line search can be achieved explicitly with closed form formulas. However, in most cases, if we wish to carry out exact line search, we need to numerically solve the one dimensional optimization problem (4.43) for the best α. An alternative to exact line search is inexact line search, where we only probe the loss on the ray (4.44) for a few values of α. Details are in the sequel.
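For intuition, exact line search along a given ray can be carried out with any univariate minimizer. The sketch below uses a simple golden-section search on an illustrative two dimensional quadratic loss; the function, starting point, and bracketing interval are our own choices and are not from the book.

```python
import math

def golden_section(f, a, b, tol=1e-10):
    # Golden-section search for the minimizer of a unimodal f on [a, b].
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return 0.5 * (a + b)

# Exact line search for C(theta) = theta_1^2 + 10 * theta_2^2 at theta = (2, 1),
# along the steepest descent direction d = -grad C(theta) = (-4, -20).
C = lambda t1, t2: t1 ** 2 + 10 * t2 ** 2
theta, d = (2.0, 1.0), (-4.0, -20.0)
phi = lambda a: C(theta[0] + a * d[0], theta[1] + a * d[1])
alpha = golden_section(phi, 0.0, 1.0)   # for this quadratic, the optimum is 416/8032
```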
Figure 4.8: One iteration of line search applied to the Rosenbrock function of (4.9). (a) With the current point θ^(t) = (0.8, 0) and the descent direction θ_d^(t) = −∇C(θ^(t)), we optimize over a given ray. (b) The value of the loss function C(θ^(t) + α θ_d^(t)) along the ray, with the optimal α^(t) ≈ 0.00165.
Note that in practice, for deep learning, neither exact nor inexact line search is used directly. One reason is that loss function evaluations C(·) are computationally demanding, and are of a similar order to the computational cost of gradient evaluation. Hence practice has shown that one might as well update descent directions instead of optimizing the loss over a fixed ray (4.44). A second reason is that with deep learning we almost always have noisy gradients, due to stochastic gradient descent or mini-batches, and hence searching for an optimal α along a noisy direction can be practically less useful. Empirical evidence has shown that in such cases, line search techniques are seldom useful in their own right. Nevertheless, understanding line search is important as part of a general background for optimization, and it is also used in the quasi-Newton methods described in Section 4.6, which are sometimes used in deep learning.
As a basic illustration of line search, return to the Rosenbrock function of (4.9) with contours plotted in Figure 4.8 (a). In this plot we are at a point θ^(t) with the descent direction θ_d^(t) = −∇C(θ^(t)). The plot in Figure 4.8 (b) presents the one dimensional cross section along the ray (4.44) with that descent direction, where we see that the optimal (exact numerical line search) α is at α^(t).
It is interesting to analyze the case θ_d^(t) = −∇C(θ^(t)), which we call basic gradient descent with exact line search. In this case, trajectories of such optimization result in "zig-zagging". Specifically, every two consecutive descent steps are orthogonal. Namely,

    (θ_s^(t))ᵀ θ_s^(t+1) = 0,    or    (g^(t))ᵀ g^(t+1) = 0,

with g^(t) denoting ∇C(θ^(t)). This property is illustrated in Figure 4.9, where we consider gradient descent with exact line search applied to the Rosenbrock function (4.9). We start with initial condition θ^(0) = (0.4, 0.2), from which the first two steps take very long jumps. However, afterwards, once the search is in a narrow valley, the zig-zagging phenomenon is apparent and incurs a significant effective slow down on the rate of advance towards the optimal point.
Figure 4.9: Evolution of basic gradient descent with exact line search. We see the zig-zagging
property which slows down progress in narrow valleys.
To see the derivation of the zig-zagging property, observe that at the α which optimizes (4.43), we have,

    ∂C(θ^(t) − α g^(t)) / ∂α = 0,

or, written in terms of the basic definition of the derivative, we have,

    lim_{h→0} [ C(θ^(t) − (α + h) g^(t)) − C(θ^(t) − α g^(t)) ] / h = 0.        (4.45)
We now use the first-order Taylor series expansion as in (??) of C(·) around the point θ^(t) − α g^(t) for the point θ^(t) − α g^(t) − h g^(t). This is,

    C(θ^(t) − α g^(t) − h g^(t)) = C(θ^(t) − α g^(t)) − h (g^(t))ᵀ ∇C(θ^(t) − α g^(t)) + O(h²),

where O(h²) is a function such that O(h²)/h² goes to a constant as h → 0. Substituting in (4.45) we obtain,

    0 = lim_{h→0} [ −h (g^(t))ᵀ ∇C(θ^(t) − α g^(t)) + O(h²) ] / h
      = −(g^(t))ᵀ ∇C(θ^(t) − α g^(t))
      = −(g^(t))ᵀ g^(t+1),

where the last equality holds because g^(t+1) = ∇C(θ^(t) − α g^(t)).
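This orthogonality of consecutive gradients is easy to verify numerically. For a quadratic loss C(θ) = (1/2) θᵀAθ, exact line search along −g has the closed form α = gᵀg / (gᵀAg), and consecutive gradients then come out orthogonal up to floating point error. The matrix and starting point below are illustrative choices of ours.

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 20.0]])   # Hessian of the quadratic loss
theta = np.array([2.0, 1.0])
grads = []
for _ in range(5):
    g = A @ theta                          # gradient of C at theta
    alpha = (g @ g) / (g @ (A @ g))        # closed-form exact line search
    theta = theta - alpha * g
    grads.append(g)

inner_products = [grads[i] @ grads[i + 1] for i in range(len(grads) - 1)]
```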
We now visit a powerful numerical technique called the conjugate gradient method. This is a
topic which may receive special treatment in a general optimization text, yet here we only
present it in brief as an application of exact line search. Conjugate gradient is not typically
used directly for deep learning, yet since it is useful for approximately solving certain large
systems of linear equations, it sometimes finds its way as an aid to other deep learning
algorithms.
Suppose the objective function is of the following quadratic form,

    C(θ) = (1/2) θᵀ A θ − bᵀ θ,        (4.46)
where A is a d × d dimensional symmetric positive definite matrix, and b ∈ ℝ^d. We have the Hessian ∇²C(θ) = A for every θ, and since A is positive definite, C(·) is strictly convex and there is a unique global minimum. We further have ∇C(θ) = Aθ − b, and this means that the equation ∇C(θ) = 0, with the unique solution θ* = A⁻¹b, minimizes C(θ). Hence the system of equations,

    Aθ = b,    with A positive definite,        (4.47)

can be solved by solving the optimization problem of minimizing (4.46). Thus one may view the conjugate gradient method as an algorithm for (approximately) solving a linear system of equations such as (4.47). In particular, when the dimension d is excessively large, the conjugate gradient method can be efficient while standard techniques such as Gaussian elimination may fail. We have already seen one application of such equations in Chapter 2, with the normal equations (??) where A = XᵀX and b = Xᵀy.
When carrying out line search on an objective function as (4.46), the cross section above the ray (4.44) for any direction is a parabola, and hence exact line search has an explicit solution for (4.43). Conjugate gradient carries out iterations of such line search, where in each iteration the search direction θ_d^(t) is wisely chosen.
Specifically, at iteration t with current point θ^(t) and search direction θ_d^(t), the next point determined via (4.43) with (4.46) is,

    θ^(t+1) = θ^(t) + α^(t) θ_d^(t),    with    α^(t) = − (θ_d^(t))ᵀ (A θ^(t) − b) / ( (θ_d^(t))ᵀ A θ_d^(t) ).
Then the next search direction is determined as a linear combination of the next gradient and the current search direction. Specifically,

    θ_d^(t+1) = −∇C(θ^(t+1)) + β^(t) θ_d^(t),    with    β^(t) = ( ∇C(θ^(t+1)) )ᵀ A θ_d^(t) / ( (θ_d^(t))ᵀ A θ_d^(t) ).
This coefficient β^(t) is designed such that the current search direction θ_d^(t) and the next search direction θ_d^(t+1) are conjugate with respect to A, in the sense that,

    (θ_d^(t+1))ᵀ A θ_d^(t) = 0.

This conjugacy ensures desirable properties for the method, and in particular implies that a minimum is reached in d iterations. Analysis and further motivation of conjugate gradient are beyond our scope. Our presentation here is merely to show that conjugate gradient is an application of exact line search.²⁵
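The iterations just described can be sketched directly in code. The following is a plain textbook-style implementation for a small system Aθ = b, not an optimized solver; the example matrix and right hand side are our own.

```python
import numpy as np

def conjugate_gradient(A, b, theta):
    # Each iteration: exact line search along d, then update d so that it is
    # A-conjugate to the previous direction; terminates in len(b) iterations.
    d = -(A @ theta - b)                       # initial direction: -grad C
    for _ in range(len(b)):
        g = A @ theta - b                      # gradient at the current point
        alpha = -(d @ g) / (d @ (A @ d))       # exact line search along d
        theta = theta + alpha * d
        g_new = A @ theta - b
        beta = (g_new @ (A @ d)) / (d @ (A @ d))
        d = -g_new + beta * d                  # next (conjugate) direction
    return theta

A = np.array([[4.0, 1.0], [1.0, 3.0]])         # symmetric positive definite
b = np.array([1.0, 2.0])
theta = conjugate_gradient(A, b, np.zeros(2))  # solves A theta = b
```

For this 2 × 2 system the method reaches the exact solution (1/11, 7/11) after d = 2 iterations, illustrating the finite termination property mentioned above.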
Inexact Line Search
As discussed above, one may use exact line search to solve the univariate optimization (4.43) optimally in each iteration. An alternative is to only probe a few specific values of α^(t) according to some predefined set of rules and stop at the best probed value. This is called inexact line search, and it is especially attractive when there is no analytical solution of (4.43). Inexact line search may significantly reduce the computational cost per iteration in comparison to the repetitive application of a univariate numerical optimization technique.

A typical approach for inexact line search is backtracking (also known as backtracking line search). The idea is to start with some predetermined maximal α₀ > 0 and then decrease it multiplicatively until a stopping condition is met. Specifically, we evaluate C(·) on the ray (4.44) with candidate α values such as,

    α₀,  (2/3) α₀,  (2/3)² α₀,  (2/3)³ α₀,  . . . ,

where the decrease factor of 2/3 can be tuned to be any factor in (0, 1). We then stop at the instant at which the stopping condition is met. Note that this search takes place at every iteration t.
One very common stopping condition is known as the Armijo condition or the first Wolfe condition. Under this condition, at each iteration t, we decrease α values as above until it
²⁵ Note that in each iteration, the conjugate gradient traverses the steepest direction under the norm ∥u∥_A := √(uᵀ A u), u ∈ ℝ^d.
satisfies,

    C(θ^(t) + α θ_d^(t)) ≤ C(θ^(t)) + β α ∂_d C(θ^(t)),        (4.48)

where β ∈ [0, 1] is a parameter and ∂_d is shorthand notation for the directional derivative in the direction θ_d^(t); for directional derivatives recall (??) from Appendix ??. That is,

    ∂_d C(θ^(t)) = ∇_{θ_d^(t)} C(θ^(t)) = (θ_d^(t))ᵀ ∇C(θ^(t)).
To get an intuition for (4.48), suppose that β = 1; then the right hand side of (4.48) is equal to the first-order Taylor approximation of C(θ^(t) + α θ_d^(t)). Therefore, the decrease in the objective suggested by the first Wolfe condition is at least as good as the prediction by the first-order approximation. On the other extreme, if β = 0, we select α such that C(θ^(t) + α θ_d^(t)) ≤ C(θ^(t)), which is also acceptable as it guarantees that the next step at least does not increase the objective.
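A minimal sketch of backtracking under the first Wolfe condition (4.48) follows. The objective, starting point, and the particular parameter values (β = 0.3, decrease factor 2/3) are illustrative choices only.

```python
def backtracking_alpha(C, theta, d, dir_deriv, alpha=1.0, factor=2/3, beta=0.3):
    # Shrink alpha multiplicatively until the Armijo condition (4.48) holds.
    while C(theta + alpha * d) > C(theta) + beta * alpha * dir_deriv:
        alpha *= factor
    return alpha

# Univariate illustration: C(theta) = theta^2 at theta = 1, with d = -C'(1) = -2.
C = lambda th: th ** 2
theta, d = 1.0, -2.0
dir_deriv = d * (2 * theta)          # directional derivative d * C'(theta)
alpha = backtracking_alpha(C, theta, d, dir_deriv)
```

In this run α = 1 fails the condition and a single shrink to α = 2/3 succeeds.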
Beyond the condition (4.48), there are additional common stopping conditions that one may employ together with backtracking line search. One such condition, known as the second Wolfe condition, is,

    ∂_d C(θ^(t) + α θ_d^(t)) ≥ γ ∂_d C(θ^(t)),        (4.49)

where γ ∈ (0, 1). Another common condition, known as the strong Wolfe condition, is,

    |∂_d C(θ^(t) + α θ_d^(t))| ≤ γ |∂_d C(θ^(t))|.        (4.50)

This condition forces α^(t) to lie close to the solution of exact line search. The first Wolfe condition and the above strong Wolfe condition together are called the strong Wolfe conditions. These conditions are sometimes used with the quasi-Newton methods described below.
4.6 Concepts of Second-Order Methods
All the optimization methods we studied so far were based on gradient evaluation and implicitly use the first-order Taylor approximation of the objective function C(θ). In particular, the gradient of the objective function helps us determine the direction of the next step. However, by using curvature information captured via the second derivatives, or Hessian ∇²C(θ), algorithms can take steps that move farther and are more precise.

In this section, we present some well-known second-order Taylor approximation based optimization methods, simply referred to as second-order methods. These methods also fall within the general descent direction framework of Algorithm 4.1, yet their evaluation of the descent step θ_s^(t) either uses the Hessian explicitly, or employs some approximated estimate for Hessian vector products.
While today’s deep learning practice rarely involves second-order methods, these methods
have huge potential. Indeed, if employed efficiently, their application in general deep learning
training can yield significant performance improvement. Hence, understanding basic principles
of second-order techniques is useful for the mathematical engineering of deep learning.
This section starts by exploring simple concepts of second-order univariate optimization
where we introduce Newton’s method which is based on the first and second derivative of
the objective function. We also present the secant method which is only based on the first
derivative. We then move on to Newton’s method in the general multivariate case where the
Hessian matrix plays a key role. Indeed, for simple neural networks such as logistic regression
of Chapter ??, a method such as Newton is highly applicable.
For general deep learning with θ ∈ ℝ^d, where d is typically huge and the loss function is complicated, it is hopeless to use explicit expressions or automatic differentiation for the Hessian ∇²C(θ). In such a case, one may use quasi-Newton methods that only approximate the Hessian via gradients. Such application of second-order methods to deep learning
the Hessian via gradients. Such application of second-order methods to deep learning
is a contemporary active area of research which we are not able to cover fully. Instead,
our presentation builds up towards a popular method called the limited-memory BFGS
(Broyden–Fletcher–Goldfarb–Shanno) algorithm, or L-BFGS for short. With this we aim to
open up horizons for the reader to the multitude of optimization technique variations that
one may consider.
The Univariate Case
Newton's method is an iterative method that uses the quadratic approximation, also called the second-order Taylor approximation (??), of the objective function around the current point, and finds the next step by optimizing the quadratic function. For ease of understanding, first consider the univariate case where θ ∈ ℝ. Assume that the objective function C(θ) is twice differentiable. Suppose we are at θ^(t) in the t-th iteration and let Q^(t)(θ) be the quadratic approximation of the objective function C(θ) around θ^(t), given by

    C(θ) ≈ Q^(t)(θ) = C(θ^(t)) + (θ − θ^(t)) C′(θ^(t)) + ((θ − θ^(t))² / 2) C″(θ^(t)),

where C′(·) and C″(·) are the first and second derivatives respectively.
Note that Q^(t)(θ) depends on θ^(t). Further observe that C(θ) and Q^(t)(θ) take the same value when θ = θ^(t). To find the next step, θ^(t+1), we minimize Q^(t)(θ) by setting its derivative to zero and obtain,

    C′(θ^(t)) + (θ − θ^(t)) C″(θ^(t)) = 0,
with the corresponding solution (for the variable θ),

    θ^(t+1) = θ^(t) − C′(θ^(t)) / C″(θ^(t)),        (4.51)

whenever C″(θ^(t)) ≠ 0. Newton's method (for univariate function optimization) starts at some initial point θ^(0) and then iterates (4.51) until some specified stopping criterion is met. If converged to some point θ*, then C′(θ*) = 0, implying that the point is a local minimum, local maximum, or saddle point.²⁶
Any quadratic function is either strictly convex or strictly concave, depending on the leading coefficient. Therefore, if C(θ) is itself a strictly convex quadratic function, it has a unique global minimum, and since the third order derivative of C(θ) is zero, the quadratic approximation
²⁶ Note that when considered in the context of general root finding of the form F(θ) = 0, beyond the case of F(·) = C′(·), the method is sometimes called Newton–Raphson.
Q^(t)(θ) is identically equal to C(θ). In such a case, Newton's method finds the minimum in a single iteration of (4.51) for any initial point θ^(0). This property hints at the speed at which Newton's method can move towards the minimum, since any smooth function is well approximated by a quadratic function in the neighbourhood of a local minimum.
Figure 4.10: Illustration of updates in Newton's method. (a) The approximating quadratic function Q^(t)(θ) is concave. (b) The approximating quadratic function Q^(t)(θ) is convex.
While potentially very efficient, in its basic form Newton's method exhibits several instability issues. First observe that the method can end up in a local maximum even though the goal is minimization. This is made apparent in Figure 4.10, which illustrates single step updates with Newton's method on a function that has both a local maximum and a local minimum. In (a) a step is taken and it approaches (yet overshoots) the local maximum. In (b) a step is taken towards (and also overshoots) the local minimum. As is evident, the initial points θ^(t) for both (a) and (b) are near each other, yet the subtle difference in the sign of C″(θ^(t)) implies movement in completely different directions. In both cases, with another iteration the extremum point would have been nearly reached. That is, (a) would end up very near the local maximum, and similarly with another iteration (b) ends up very near the local minimum.
In general, the Newton update (4.51) is very sensitive to the second derivative, and if C″(θ^(t)) is very close to zero, then θ^(t+1) can be too far from θ^(t), and thus the quadratic approximation is not appropriate. That is, inflection points²⁷ on the θ ∈ ℝ axis cause havoc with Newton's method. Similarly, as we see in the sequel for the multivariate case, points θ ∈ ℝ^d where the Hessian ∇²C(θ) is nearly singular (eigenvalues near zero) are also problematic in the same way.
To gain insights into some of the issues that may occur with Newton's method, let us consider some examples. The sensitivity of the method can cause overshoot of the minimum. Consider for example,

    C(θ) = (θ − 3)² (θ + 3).        (4.52)
We have C′(θ) = 3(θ − 3)(θ + 1), C″(θ) = 6(θ − 1), and the function has a local minimum at θ* = 3 with C(θ*) = 0. The descent step resulting from (4.51) is thus,

    θ_s^(t) = − C′(θ^(t)) / C″(θ^(t)) = (1/2) (θ^(t) − 3)(θ^(t) + 1) / (1 − θ^(t)).
²⁷ An inflection point for a twice differentiable function C : ℝ → ℝ is a point θ where C″(θ) = 0.
Say we start at θ^(0) = 1.5, where C(θ^(0)) = 10.125. At this point the descent step is θ_s^(0) = 3.75, implying that θ^(1) = θ^(0) + θ_s^(0) = 5.25. This overshoot of the local minimum yields C(θ^(1)) ≈ 41.77, which is about a four-fold increase in the loss value. Note though that in this case (as the reader may readily verify via simple calculations), the iterations that follow (presented to 4 decimal points precision) are,

    θ^(2) = 3.5956,    θ^(3) = 3.0683,    θ^(4) = 3.0011,

and in fact the error for θ^(6) is already less than 10⁻¹³. Thus we see in this example that there is an initial significant overshoot which is later corrected and followed by very quick convergence.
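The iterates above are easy to reproduce with a few lines of code implementing (4.51) for this specific function:

```python
def newton_univariate(theta, n_steps):
    # Iterates (4.51) for C(theta) = (theta - 3)^2 (theta + 3), where
    # C'(theta) = 3(theta - 3)(theta + 1) and C''(theta) = 6(theta - 1).
    trajectory = [theta]
    for _ in range(n_steps):
        theta = theta - (3 * (theta - 3) * (theta + 1)) / (6 * (theta - 1))
        trajectory.append(theta)
    return trajectory

traj = newton_univariate(1.5, 6)   # traj[1] is the overshoot to 5.25
```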
Worse than an overshoot is a situation which results in moving away from the local minimum
of interest. We have already seen an example of this in Figure 4.10 (a) where starting to the
left of the inflection point yields convergence to a local maximum. Further, for the example
function (4.52), this will occur if θ^(0) < 1.
An additional phenomenon that may occur is oscillation. As an extreme example, consider,

    C(θ) = (1/4) θ⁴ − (5/2) θ²,    which implies    θ_s^(t) = θ^(t) ((θ^(t))² − 5) / (5 − 3 (θ^(t))²).

If θ^(t) = 1, then we get θ_s^(t) = −2 and thus θ^(t+1) = −1. Further, in the next step we get θ_s^(t+1) = 2, which brings us back to θ^(t+2) = 1. Such undesirable perfect oscillation is a singular phenomenon and is not likely to occur in practice. However, approximate oscillation may sometimes occur and persist for multiple iterations until dampening.
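This perfect oscillation is straightforward to reproduce; the iterates alternate exactly between −1 and 1:

```python
theta = 1.0
iterates = []
for _ in range(6):
    # Newton step for C(theta) = theta^4/4 - (5/2) theta^2:
    # C'(theta) = theta^3 - 5 theta, C''(theta) = 3 theta^2 - 5.
    theta = theta - (theta ** 3 - 5 * theta) / (3 * theta ** 2 - 5)
    iterates.append(theta)
# iterates is [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
```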
In practice, and especially when considering multi-dimensional generalizations, trust region
methods can mitigate problems such as overshoot, movement in the wrong direction, or
oscillations. These methods incorporate conditions which disallow steps of large step size
beyond some predefined or adaptive thresholds. Trust region methods also incorporate
first-order methods as a backup. We do not cover the details further.
A final concern with Newton's method is that it requires both the first derivative C′(·) and the second derivative C″(·). In some scenarios these derivatives are easy to obtain, yet in some other cases the univariate objective may be a complicated function and even the application of automatic differentiation techniques may be inadequate for evaluating C″(·). This motivates the introduction of the secant method, which is a modification of Newton's method where the second derivative in each iteration is approximated using the last two iterations. Namely, we use the approximation,
    C″(θ^(t)) ≈ ( C′(θ^(t)) − C′(θ^(t−1)) ) / ( θ^(t) − θ^(t−1) ),        (4.53)
which becomes precise when θ^(t) and θ^(t−1) are not too far from each other. With this approximation, updates of the secant method are given by,

    θ^(t+1) = θ^(t) − ( (θ^(t) − θ^(t−1)) / (C′(θ^(t)) − C′(θ^(t−1))) ) C′(θ^(t)),
where we initialize θ^(0) and θ^(1) to be close but not equal. In general, the secant method may require more iterations for convergence than Newton's method. The study of general numerical analysis and optimization theory contains an in-depth analysis of such convergence rates. This is a topic that we do not cover further.
The secant method also suffers from the problems associated with Newton's method mentioned above, including overshoot, moving in the wrong direction, and oscillation. Nevertheless, like Newton's method, the secant method is generally very powerful and arrives at a local minimum as long as the initial points are within the vicinity of the minimum.
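A sketch of the secant iteration, applied to the same example function C(θ) = (θ − 3)²(θ + 3) used above for Newton's method; the two starting points and the stopping guard are our own choices.

```python
def secant_minimize(dC, theta_prev, theta, n_steps):
    # Newton's update (4.51) with C'' replaced by the approximation (4.53);
    # dC is the first derivative C'.
    for _ in range(n_steps):
        denom = dC(theta) - dC(theta_prev)
        if denom == 0:
            break   # converged (or stalled): avoid dividing by zero
        theta_prev, theta = theta, theta - (theta - theta_prev) / denom * dC(theta)
    return theta

# C(theta) = (theta - 3)^2 (theta + 3), with C'(theta) = 3(theta - 3)(theta + 1);
# the local minimum is at theta = 3, and we start nearby at 2.4 and 2.5.
dC = lambda th: 3 * (th - 3) * (th + 1)
theta_min = secant_minimize(dC, 2.4, 2.5, 25)
```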
Importantly, note that the secant method can be viewed as a quasi-Newton method since it
approximates the Hessian (second derivative of a scalar function) with an expression that
only depends on the first derivative. As we now generalize to multivariate optimization
we should keep in mind that in the context of deep learning, almost any second-order
method that one may wish to use should be a quasi-Newton method based on first-order
derivatives (gradients), as opposed to explicit or automatically differentiated Hessians (second
derivatives). Nevertheless, let us first study Newton’s method for multivariate functions.
The Multivariate Case and Hessians
For a twice differentiable multivariate objective function C : ℝ^d → ℝ, the update of Newton's method is very similar to the univariate case of (4.51). Specifically, the gradient and the Hessian play the roles of the first and second-order derivatives, respectively. We can represent this update as,

    θ^(t+1) = θ^(t) − H_t⁻¹ ∇C(θ^(t)),        (4.54)
where H_t = ∇²C(θ^(t)) is the Hessian of the objective function at θ^(t), and where we assume H_t is non-singular; compare (4.54) with (4.51).
To see the development of (4.54), as in the univariate case, consider the quadratic approximation Q^(t)(θ) of C(θ) around θ^(t), given by

    C(θ) ≈ Q^(t)(θ) = C(θ^(t)) + (θ − θ^(t))ᵀ ∇C(θ^(t)) + (1/2) (θ − θ^(t))ᵀ H_t (θ − θ^(t)).        (4.55)
Then, similar to the univariate case, the next point θ^(t+1) can be determined as the θ such that the gradient of Q^(t)(θ) is zero. Observe that this gradient is,

    ∇Q^(t)(θ) = ∇C(θ^(t)) + H_t (θ − θ^(t)),

and by equating ∇Q^(t)(θ) to 0 and representing θ − θ^(t) as the descent step θ_s^(t), we obtain,

    H_t θ_s^(t) = −∇C(θ^(t)).        (4.56)
This linear system of equations for θ_s^(t) can be represented in terms of the elegant update equation (4.54) with descent step θ_s^(t) = −H_t⁻¹ ∇C(θ^(t)), in case H_t is non-singular. However, in practice, we typically try to solve (4.56) using a suitable linear solver.²⁸ Sometimes
²⁸ Suitability of an algorithm depends on the size of d and any special structure in H_t that we can expect. For example, for small d one may use LU decomposition (Gaussian elimination) or QR factorization based methods, yet for large d conjugate gradient or other Krylov subspace methods may be suitable.
approximate solutions for (4.56) are also an option. For example, one may use the conjugate gradient method when d is large and run it for a small number of iterations to only approximately solve (4.56).
In summary, Newton's algorithm for multivariate C(·) starts with some θ^(0) ∈ ℝ^d. Then at each iteration t in the algorithm with current point θ^(t), we (in principle) evaluate the gradient and Hessian at that point. We then solve or approximately solve (4.56) to obtain θ_s^(t) for the update θ^(t+1) = θ^(t) + θ_s^(t).
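A sketch of this loop follows, with one multivariate Newton iteration solving (4.56) via a generic dense linear solver. The objective and its closed-form gradient and Hessian are illustrative choices of ours.

```python
import numpy as np

def newton_step(grad, hess, theta):
    # Solve (4.56) for the descent step, then update the parameters.
    step = np.linalg.solve(hess(theta), -grad(theta))
    return theta + step

# Illustrative objective C(theta) = (theta_1 - 1)^4 + (theta_1 - 1)^2 + theta_2^2,
# which has its unique minimum at (1, 0).
grad = lambda th: np.array([4 * (th[0] - 1) ** 3 + 2 * (th[0] - 1), 2 * th[1]])
hess = lambda th: np.array([[12 * (th[0] - 1) ** 2 + 2, 0.0], [0.0, 2.0]])

theta = np.array([3.0, 2.0])
for _ in range(20):
    theta = newton_step(grad, hess, theta)
```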
In principle, in each iteration, solving (4.56) requires the Hessian matrix H_t = ∇²C(θ^(t)), either via a closed form formula as for example in the case of logistic regression (4.7), or more generally via automatic differentiation. However, with large d, computing the Hessian or Hessian vector products via automatic differentiation is generally not feasible. Hence, in general, Newton's method on its own is not suitable for deep learning. Nevertheless, quasi-Newton methods such as the L-BFGS method, presented below, or various adaptations can be employed.
We also mention that all of the potential problems discussed above for the univariate case can appear in the multivariate case. Overshoot, moving in the wrong direction, or oscillation can occur for reasons similar to the univariate case. Specifically, at points θ^(t) that are far from the local minimum, the quadratic approximation Q^(t)(θ) may be far from C(θ). Further, there are problems that may occur due to a singular, near-singular, or ill conditioned Hessian matrix H_t. In this case the descent step solution to (4.56) may be very noisy.²⁹ Nevertheless, Newton's method is very powerful, and often, when improved via trust region methods and similar techniques, it substantially outperforms first-order methods.
To build intuition for why Newton's method often outperforms first-order methods, let us first explore a relationship between the two approaches. If we use the constant diagonal matrix (1/α)I in place of the Hessian H_t, then Newton's method reduces to the basic gradient descent method with update (??), where the descent step is θ_s^(t) = −α∇C(θ). To see this, return to the quadratic approximation (4.55) with H_t replaced by (1/α)I, to obtain,
    Q̃^(t)(θ) = C(θ^(t)) + (θ − θ^(t))ᵀ ∇C(θ^(t)) + (1/(2α)) (θ − θ^(t))ᵀ (θ − θ^(t)),        (4.57)

and thus,

    ∇Q̃^(t)(θ) = ∇C(θ^(t)) + (1/α) (θ − θ^(t)).
Equating this gradient expression to 0 yields the update (??). Hence, we notice that the inverse of H_t in Newton's method plays the role of the learning rate α of basic gradient descent. As a consequence, with this basic form of Newton's method, there is no need to calibrate the learning rate, as we do in most gradient descent approaches.³⁰
Finally, let us get intuition for why Newton's method generally converges to a (local) minimum θ* faster than gradient descent when starting at a point close to θ*. For this, return to Figure ?? which compares two loss landscapes, one of which is circular and one of which is elliptical. While that figure is created via a quadratic form where Newton's method would
²⁹ In practice one often monitors the level of ill-conditioning via the condition number of H_t, or similar measures.
³⁰ Note that Newton's method can also be cast as a steepest descent method like basic gradient descent, but with descent direction chosen with respect to the Hessian norm, ∥u∥_{H_t} = √(uᵀ H_t u), u ∈ ℝ^d.
reach the minimum in a single iteration, in more realistic cases when functions are not quadratic forms, they can still be locally well approximated by quadratic forms in the vicinity of θ*, and hence considering Figure ?? as a general plot is valid.
It is known that with gradient descent, highly elliptical loss landscapes are difficult for optimization since descent steps may be trapped in valleys, and thus, as argued in Section ?? in view of Figure ??, one often tries to carry out variable transformations to alleviate such problems. In a sense, the gradient descent quadratic approximation Q̃^(t)(·) in (4.57) assumes that the loss landscape is already perfectly circular, since it uses a constant diagonal matrix in place of the Hessian matrix. This is the best that basic gradient descent can offer. However, when using Newton's method, the correct local quadratic approximation Q^(t)(·) in (4.55) already adapts to the curvature and potential true elliptic nature of the loss landscape. With such an approximation Q^(t)(·), minimization over quadratic loss landscapes occurs in a single iteration, and since locally all smooth functions are well approximated by quadratic loss landscapes, when θ^(t) is in the vicinity of θ*, convergence is very quick.
Quasi-Newton Methods
In deep learning, the dimension of θ is very large and it is thus difficult to solve the linear equations (4.56) as required in each iteration of Newton's method. Even in dimensions where these equations are solvable, there is still the basic problem of computing the Hessian matrix, a feat which is typically intractable for mid-size problems and beyond. A popular alternative is offered by the use of quasi-Newton methods, where either the Hessian or the inverse of the Hessian matrix is approximated with each iteration. While the study of quasi-Newton methods is mature, and these are used in many areas of applied mathematics and engineering, the development of best practices for application in deep learning is still an active area of research. Here we simply outline the key ideas of such methods.
The central idea in the type of quasi-Newton methods that we cover is based on an evolving sequence of d × d matrices, B₀, B₁, B₂, . . .. At each iteration, some update rule specifies how B_t is updated based on B_{t−1} and other quantities from the current and previous iteration. By the design of the update rule, each such B_t is kept positive definite (and hence also non-singular), and in our presentation, B_t⁻¹ is ideally a good approximation³¹ for the Hessian matrix H_t = ∇²C(θ^(t)) associated with the current point θ^(t) of iteration t.
With such a B_t matrix available, we now use the quadratic approximation,

    Q^(t)(θ) = C(θ^(t)) + (θ − θ^(t))ᵀ ∇C(θ^(t)) + (1/2) (θ − θ^(t))ᵀ B_t⁻¹ (θ − θ^(t)),        (4.58)
which is similar to Newton's method quadratic approximation (4.55), with the difference that now B_t⁻¹ plays the role of the Hessian. Similarly to the case in Newton's method, Q^(t)(θ) of (4.58) can be minimized, where now the unique minimizer is θ^(t) − B_t ∇C(θ^(t)). Hence, using such a quadratic approximation, we have a descent direction,

    θ_d^(t) = −B_t ∇C(θ^(t)).        (4.59)
With quasi-Newton methods, we take a step in the direction specified by $\theta_d^{(t)}$, but we also
scale the step with a scalar $\alpha^{(t)} > 0$ which controls the step size. Hence the key update rule
$^{31}$ We note that alternative presentations may use $B_t$ as an approximation for the Hessian itself.
4.6 Concepts of Second-Order Methods
for quasi-Newton methods can be stated as,
$$\theta^{(t+1)} = \theta^{(t)} - \alpha^{(t)} B_t \nabla C(\theta^{(t)}), \qquad (4.60)$$
which is similar to (4.54). The determination of $\alpha^{(t)}$ is typically obtained via a line search
technique such as inexact line search using the second Wolfe condition (4.49).
Algorithm 4.3: Quasi-Newton
Input: Dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$, objective function $C(\cdot) = C(\cdot\,;\mathcal{D})$, and initial parameter vector $\theta_{\text{init}}$
Output: Approximately optimal $\theta$
1: $\theta \leftarrow \theta_{\text{init}}$
2: $B \leftarrow B_{\text{init}}$
3: repeat
4: $\quad g \leftarrow \nabla C(\theta)$ (Compute gradient)
5: $\quad \theta_d \leftarrow -Bg$ (Determine descent direction)
6: $\quad$ Determine $\alpha$ using $\theta$ and $\theta_d$ (Line search)
7: $\quad \theta \leftarrow \theta + \alpha\theta_d$ (Update parameters)
8: $\quad$ Update $B$ (Update approximation of $\nabla^2 C(\theta)^{-1}$)
9: until termination condition is satisfied
10: return $\theta$
The typical quasi-Newton method is presented in Algorithm 4.3. As in any gradient based
algorithm, computation of the gradient is needed, and this is specified in Step 4. The
computation of the descent direction (4.59) is in Step 5. In Step 6 we use line search, and
Step 7 updates the current parameter based on (4.60). Finally, Step 8 is the update rule,
specific to the type of quasi-Newton method used. In the sequel we present BFGS and
L-BFGS, each with their own specification of this step.
As this method falls within the realm of a descent direction approach as in Algorithm 4.1,
ideally we wish to set $\theta_{\text{init}}$ in Step 1 at a point close to a minimum, just like all the other
descent direction approaches in this chapter. As for the initialization matrix $B_{\text{init}}$ in Step 2,
ideally we would have a matrix whose inverse is the Hessian at $\theta_{\text{init}}$. However, since evaluation
of the Hessian matrix or its inverse is typically computationally costly or intractable, we
often initialize with $B_{\text{init}} = I$, the $d \times d$ identity matrix.
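Putting the steps together, Algorithm 4.3 can be sketched as follows. This is a minimal numpy illustration that we add for concreteness: the names are our own, a simple backtracking (Armijo-style) line search stands in for a Wolfe-condition search, and the Step 8 update rule is left as a pluggable `update_B` argument.

```python
import numpy as np

def quasi_newton(C, grad_C, theta_init, update_B, max_iter=100, tol=1e-8):
    """Skeleton of Algorithm 4.3 with a basic backtracking line search."""
    theta = np.asarray(theta_init, dtype=float)
    B = np.eye(len(theta))                        # Step 2: B_init = I
    for _ in range(max_iter):
        g = grad_C(theta)                         # Step 4: gradient
        if np.linalg.norm(g) < tol:               # termination condition
            break
        theta_d = -B @ g                          # Step 5: descent direction
        alpha = 1.0                               # Step 6: backtracking search
        while (C(theta + alpha * theta_d)
               > C(theta) + 1e-4 * alpha * (g @ theta_d)) and alpha > 1e-10:
            alpha *= 0.5
        theta_new = theta + alpha * theta_d       # Step 7: parameter update
        B = update_B(B, theta_new - theta, grad_C(theta_new) - g)  # Step 8
        theta = theta_new
    return theta
```

With `update_B` returning `B` unchanged, this reduces to gradient descent with backtracking; the BFGS and L-BFGS rules presented below slot into the same place.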
Now with the general outline of the quasi-Newton approach presented, one may seek good
update rules for Step 8. Before we see concrete formulas for these updates within the BFGS
and L-BFGS algorithms, let us touch on theoretical aspects that helped develop these
methods, and can be used for further analysis of their performance and properties.
Since the Hessian $H_t$ captures the curvature information of the objective function, we may
consider the latest two points $\theta^{(t-1)}$ and $\theta^{(t)}$ together with the gradients at these points,
$\nabla C(\theta^{(t-1)})$ and $\nabla C(\theta^{(t)})$, for the approximate equality,
$$H_t(\theta^{(t)} - \theta^{(t-1)}) \approx \nabla C(\theta^{(t)}) - \nabla C(\theta^{(t-1)}). \qquad (4.61)$$
4 Optimization Algorithms
This approximation follows from (??) of Appendix ??, and is a good approximation when
$\theta^{(t)}$ and $\theta^{(t-1)}$ are close. It generalizes (4.53) used in the univariate secant method. Note
that (4.61) holds with equality if $C(\theta)$ is a quadratic function. Further, with the notation,
$$\theta_s^{(t-1)} = \theta^{(t)} - \theta^{(t-1)}, \qquad \text{and} \qquad g_s^{(t-1)} = \nabla C(\theta^{(t)}) - \nabla C(\theta^{(t-1)}), \qquad (4.62)$$
we have that (4.61) is represented as $H_t \theta_s^{(t-1)} \approx g_s^{(t-1)}$. Now inspired by this approximation,
if we are able to select the $d \times d$ positive definite matrix $B_t$ such that $B_t^{-1}$ is an approximation
of $H_t$, then by considering the approximation as an equality,
$$\theta_s^{(t-1)} = B_t\, g_s^{(t-1)}. \qquad (4.63)$$
The equation (4.63) is known as the secant equation and it implies that $B_t$ maps the change of
gradients $g_s^{(t-1)}$ into the descent direction $\theta_s^{(t-1)}$. A condition related to the secant equation
is,
$$g_s^{(t-1)\top}\, \theta_s^{(t-1)} > 0, \qquad (4.64)$$
which is known as the curvature condition. Importantly, it can be shown that when the
curvature condition is satisfied, there is always a positive definite $B_t$ such that the secant
equation is satisfied.$^{32}$ In fact, there can be infinitely many solutions. In general, quasi-Newton
methods are designed to maintain the curvature condition (4.64) and to provide a
unique update rule for $B_t$.
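Before moving on, the secant relationship and the curvature condition are easy to check numerically. The following short sketch (our illustration, not from the text) verifies that for a quadratic objective the approximation (4.61) holds with equality, and that (4.64) holds whenever the Hessian is positive definite:

```python
import numpy as np

rng = np.random.default_rng(0)
# A positive definite Hessian for the quadratic C(theta) = 0.5 * theta^T H theta
M = rng.standard_normal((3, 3))
H = M @ M.T + 3 * np.eye(3)
grad = lambda theta: H @ theta               # gradient of the quadratic

theta_prev = rng.standard_normal(3)
theta_curr = rng.standard_normal(3)
theta_s = theta_curr - theta_prev            # step, as in (4.62)
g_s = grad(theta_curr) - grad(theta_prev)    # gradient change, as in (4.62)

# (4.61) holds with equality for a quadratic: H theta_s == g_s
assert np.allclose(H @ theta_s, g_s)
# Curvature condition (4.64): g_s^T theta_s = theta_s^T H theta_s > 0
assert g_s @ theta_s > 0
```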
The curvature condition (4.64) also allows us to gain insight into why the line search in
Step 6 of Algorithm 4.3 is needed. In simple or synthetic cases where the loss function $C(\cdot)$
is strictly convex, it can be shown that the curvature condition (4.64) is always satisfied.
However, loss functions in deep learning are typically highly non-convex, and thus there is
no guarantee of existence of $B_t$ such that the secant equation is satisfied, unless we impose
an additional condition on $\alpha^{(t)}$ of (4.60).
If in each iteration $t$ we determine $\alpha^{(t)}$ using inexact line search with the second Wolfe
condition (4.49), then the curvature condition (4.64) is guaranteed to hold.$^{33}$ To see this,
suppose that the second Wolfe condition is applied. Then from $\theta^{(t)} = \theta^{(t-1)} + \alpha^{(t-1)}\theta_d^{(t-1)}$,
we obtain
$$\nabla C(\theta^{(t)})^\top \theta_d^{(t-1)} \ge \gamma\, \nabla C(\theta^{(t-1)})^\top \theta_d^{(t-1)}, \qquad (4.65)$$
where we used the relationship between the gradient and the directional derivative shown in
(??) of Appendix ??. Now the inner product of the curvature condition (4.64) satisfies,
$$\begin{aligned}
g_s^{(t-1)\top}\, \theta_s^{(t-1)} &= \nabla C(\theta^{(t)})^\top \theta_s^{(t-1)} - \nabla C(\theta^{(t-1)})^\top \theta_s^{(t-1)} \\
&\ge \gamma\,\alpha^{(t-1)}\, \nabla C(\theta^{(t-1)})^\top \theta_d^{(t-1)} - \nabla C(\theta^{(t-1)})^\top \theta_s^{(t-1)} \\
&= (\gamma - 1)\,\alpha^{(t-1)}\, \nabla C(\theta^{(t-1)})^\top \theta_d^{(t-1)} \\
&= (1 - \gamma)\,\alpha^{(t-1)}\, \nabla C(\theta^{(t-1)})^\top B_{t-1}\, \nabla C(\theta^{(t-1)}) \\
&> 0.
\end{aligned}$$
$^{32}$ It can be shown that if $u, v \in \mathbb{R}^d$ satisfy $u^\top v > 0$, then there exists a symmetric positive definite matrix $B \in \mathbb{R}^{d \times d}$ such that $Bu = v$.
$^{33}$ This also holds for the strong Wolfe condition (4.50).
The first equality uses the definition of $g_s^{(t-1)}$ and the step resulting from (4.60),
$\theta_s^{(t-1)} = \alpha^{(t-1)}\theta_d^{(t-1)}$. The inequality that follows is due to (4.65). Finally, the last inequality
holds because $\gamma < 1$ and $B_{t-1}$ is assumed to be positive definite based on the update rule. We
thus see that the second Wolfe condition is a means to ensure that the curvature condition
is satisfied.
The BFGS and L-BFGS Update Rules
Let us now focus on the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm and its
variant, the limited-memory BFGS (L-BFGS) algorithm. In practice, some small deep
learning applications already make efficient use of L-BFGS. We mention that one common
approach in training is to first carry out training using standard gradient descent or a
variant, and then apply L-BFGS to “complete” the training process. Nevertheless, to date,
the application of L-BFGS and other variants of advanced quasi-Newton methods is still not
widespread in large scale deep learning.
We present the update rules for these two algorithms, which specify Step 8 of Algorithm 4.3,
keeping in mind that $B_0 = B_{\text{init}}$. Each quasi-Newton iteration of Algorithm 4.3 already bears
the computational cost of computing the gradient, the cost of several function evaluations
within the line search, and other costs associated with determining the descent direction and
updating the parameters. The computational cost of the update rule in Step 8 is generally
heavy. We contrast the computational cost of the BFGS update rule and that of the L-BFGS
update rule below.
The BFGS update rule is,
$$B_t = V_{t-1}^\top B_{t-1} V_{t-1} + \rho^{(t-1)}\, \theta_s^{(t-1)} \theta_s^{(t-1)\top}, \qquad (4.66)$$
where the positive scalar $\rho^{(t-1)}$ and the matrix $V_{t-1}$ are respectively given by,
$$\rho^{(t-1)} = \frac{1}{g_s^{(t-1)\top}\, \theta_s^{(t-1)}} \qquad \text{and} \qquad V_{t-1} = I - \rho^{(t-1)}\, g_s^{(t-1)} \theta_s^{(t-1)\top}. \qquad (4.67)$$
Observe the use of the rank-one matrices $\theta_s^{(t-1)} \theta_s^{(t-1)\top}$ and $g_s^{(t-1)} \theta_s^{(t-1)\top}$ based on (4.62). The
positivity of $\rho^{(t-1)}$ is maintained by the curvature condition (4.64). Importantly, it can
be shown that each matrix in the sequence $B_0, B_1, \ldots$ resulting from this update rule is
positive definite. We omit the details.
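In code, (4.66) and (4.67) are a few lines. The following numpy sketch is our illustration, with `s` and `y` denoting $\theta_s^{(t-1)}$ and $g_s^{(t-1)}$; it also checks that the updated matrix satisfies the secant equation and remains positive definite:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update (4.66)-(4.67) for the inverse-Hessian approximation B.

    Here s stands for the step theta_s^(t-1) and y for the gradient
    change g_s^(t-1); the curvature condition y @ s > 0 is assumed.
    """
    rho = 1.0 / (y @ s)                        # rho^(t-1) of (4.67)
    V = np.eye(len(s)) - rho * np.outer(y, s)  # V_{t-1} of (4.67)
    return V.T @ B @ V + rho * np.outer(s, s)  # (4.66)

# Sanity check: after one update the secant equation (4.63) holds exactly,
# and positive definiteness is preserved under the curvature condition.
s = np.array([1.0, 0.0, 1.0])
y = np.array([0.5, 0.2, 0.8])                  # y @ s = 1.3 > 0, so (4.64) holds
B = bfgs_update(np.eye(3), s, y)
assert np.allclose(B @ y, s)                   # secant equation (4.63)
assert np.all(np.linalg.eigvalsh(B) > 0)       # B stays positive definite
```

The secant property $B_t\, g_s^{(t-1)} = \theta_s^{(t-1)}$ holds by construction, since $V_{t-1}\, g_s^{(t-1)} = 0$.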
Note that due to the matrix multiplication $V_{t-1}^\top B_{t-1} V_{t-1}$, the computational cost of
updating $B_t$ using (4.66) is $O(d^2)$. This is a reasonable cost for small $d$, yet for larger $d$ as
in some deep learning applications, it renders BFGS too costly.
We mention that the development of the update rule in (4.66) and (4.67) can be cast as a
minimization problem where we seek the closest possible positive definite $B_t$ to the previous
$B_{t-1}$ under a specific matrix norm. The minimization is subject to the constraint that the
secant equation (4.63) holds. We omit the details of this derivation yet mention that the
matrix norm used for BFGS is a weighted Frobenius norm. See the notes and references at
the end of the chapter for more information.
The L-BFGS algorithm overcomes the computational cost of BFGS. Its update rule does not
maintain the matrix $B_t$ in memory directly, but is rather based on keeping a limited
history (or limited memory) of the constituent vectors $\theta_s^{(t)}$ and $g_s^{(t)}$. Observe that in principle,
one can unroll the BFGS update rules (4.66) and (4.67) to obtain,
$$\begin{aligned}
B_t ={}& V_{t-1}^\top V_{t-2}^\top \cdots V_0^\top\, B_{\text{init}}\, V_0 \cdots V_{t-2} V_{t-1} \\
&+ \rho^{(t-1)}\, \theta_s^{(t-1)} \theta_s^{(t-1)\top} \\
&+ \rho^{(t-2)}\, V_{t-1}^\top\, \theta_s^{(t-2)} \theta_s^{(t-2)\top}\, V_{t-1} \\
&\;\;\vdots \\
&+ \rho^{(0)}\, V_{t-1}^\top V_{t-2}^\top \cdots V_1^\top\, \theta_s^{(0)} \theta_s^{(0)\top}\, V_1 \cdots V_{t-2} V_{t-1}.
\end{aligned} \qquad (4.68)$$
Further observe that this representation of $B_t$ is a function of the sequence $\{\theta_s^{(k)}, g_s^{(k)}\}$ for
$k = 0, 1, \ldots, t-1$, and of $B_{\text{init}}$. Hence in principle, assuming that $B_{\text{init}}$ is diagonal, if we wish to
compute the matrix-vector product $B_t \nabla C(\theta^{(t)})$, we can do so with all multiplications being
vector-vector multiplications. This is due to the fact that each $V_k$ matrix is composed of the
outer product $g_s^{(k)} \theta_s^{(k)\top}$. Yet obviously, for non-small $t$ such multiplications become
inefficient as the number of terms in the expression (4.68) grows.
The L-BFGS idea is to use an approximate version $\widetilde{B}_t$ of $B_t$ that can be stored indirectly
by storing only the latest $m$ pairs of vectors $\{g_s^{(k)}, \theta_s^{(k)}\}$, for $k = t-m, \ldots, t-1$. The
hyper-parameter $m$ is the length of the history (or memory) and is typically chosen to
be much smaller than $d$. We modify (4.68) with an approximation which limits the history
used to the last $m$ iterations. For this we drop the last $t-m$ terms of (4.68), and
in the first term $V_{t-1}^\top V_{t-2}^\top \cdots V_0^\top\, B_{\text{init}}\, V_0 \cdots V_{t-2} V_{t-1}$, we replace the middle multiplicands,
$V_{t-m-1}^\top \cdots V_0^\top\, B_{\text{init}}\, V_0 \cdots V_{t-m-1}$, with a symmetric positive definite matrix $B_{\text{init}}^{(t)}$. This
modification yields $\widetilde{B}_t \approx B_t$, with the form,
$$\begin{aligned}
\widetilde{B}_t ={}& V_{t-1}^\top V_{t-2}^\top \cdots V_{t-m}^\top\, B_{\text{init}}^{(t)}\, V_{t-m} \cdots V_{t-2} V_{t-1} \\
&+ \rho^{(t-1)}\, \theta_s^{(t-1)} \theta_s^{(t-1)\top} \\
&+ \rho^{(t-2)}\, V_{t-1}^\top\, \theta_s^{(t-2)} \theta_s^{(t-2)\top}\, V_{t-1} \\
&\;\;\vdots \\
&+ \rho^{(t-m)}\, V_{t-1}^\top V_{t-2}^\top \cdots V_{t-m+1}^\top\, \theta_s^{(t-m)} \theta_s^{(t-m)\top}\, V_{t-m+1} \cdots V_{t-2} V_{t-1}.
\end{aligned} \qquad (4.69)$$
Note that $B_{\text{init}}^{(t)}$ is selected for each iteration $t$ and is usually set as a diagonal matrix to
reduce the storage and computational costs. A practically effective choice is
$$B_{\text{init}}^{(t)} = \frac{g_s^{(t-1)\top}\, \theta_s^{(t-1)}}{g_s^{(t-1)\top}\, g_s^{(t-1)}}\, I.$$
With this choice for $B_{\text{init}}^{(t)}$, the expression (4.69) allows us to compute $\widetilde{B}_t \nabla C(\theta^{(t)})$ efficiently. In
particular, $\widetilde{B}_t \nabla C(\theta^{(t)})$ can be computed using $O(m)$ inner products between $d$-dimensional
vectors, and thus the per-iteration computational cost for the L-BFGS algorithm is $O(md)$,
while requiring to store only the latest $m$ pairs of vectors $\{g_s^{(k)}, \theta_s^{(k)}\}$, $k = t-m, \ldots, t-1$,
which takes $O(md)$ storage space. With $m$ much smaller than $d$, the L-BFGS algorithm
requires far less storage than the BFGS algorithm. Finally, we note that in terms of Algorithm 4.3,
the update rule in Step 8 simply needs to push the vectors $\{g_s^{(t)}, \theta_s^{(t)}\}$ onto a queue
while popping the vectors $\{g_s^{(t-m)}, \theta_s^{(t-m)}\}$ out of that queue. The bulk of the algorithm is
in Step 5 with the efficient matrix-vector multiplication using the expressions (4.69) and
(4.67).
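In practice, the product $\widetilde{B}_t \nabla C(\theta^{(t)})$ is computed without ever forming (4.69) explicitly, via what is known as the two-loop recursion. The following numpy sketch is our illustration of this standard computation; `s_hist` and `y_hist` hold the latest $m$ pairs $\theta_s^{(k)}$ and $g_s^{(k)}$ (oldest first), and $B_{\text{init}}^{(t)}$ is the scaled identity described above.

```python
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist):
    """Compute the product of (4.69) with grad via the two-loop recursion.

    s_hist and y_hist are lists of the latest m vectors theta_s^(k)
    and g_s^(k) (oldest first); the cost is O(m d) inner products and
    no d x d matrix is ever formed.
    """
    q = np.array(grad, dtype=float)
    alphas = []                                   # stored newest-first
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        a = (s @ q) / (y @ s)
        q -= a * y
        alphas.append(a)
    s_m, y_m = s_hist[-1], y_hist[-1]             # newest pair
    r = ((y_m @ s_m) / (y_m @ y_m)) * q           # apply B_init^(t) = gamma I
    for (s, y), a in zip(zip(s_hist, y_hist), reversed(alphas)):
        b = (y @ r) / (y @ s)
        r += (a - b) * s
    return r
```

The result agrees with applying the BFGS updates (4.66) and (4.67), starting from $B_{\text{init}}^{(t)}$, to the stored pairs in order. For the queue of Step 8, something like a `collections.deque(maxlen=m)` does the push-and-pop in one operation.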
Notes and References
Mathematical optimization is a classical subject and some of the general texts on this subject are
[9] where linear optimization is covered, [47] where both linear and non-linear optimization methods
are covered, and [8] where the focus is on non-linear methods. Further texts focusing on convexity,
convex analysis, and convex optimization are [5], [7], and [17]. Many of the algorithms in these
texts describe descent direction methods, yet some examples of optimization algorithms that are
not descent direction methods include direct methods (also known as zero-order or black box methods) such
as coordinate and pattern search methods; see [41] or [50] for an overview. The Rosenbrock function
was developed in [58] within the context of multivariate function optimization and has since
become popular for basic optimization test cases and illustrations.
As mentioned in the Chapter ?? notes and references, gradient descent is attributed to Cauchy
from 1847 with [20]; see [42] for a historical account. Stochastic gradient descent was initially
introduced in 1951 with [57] in the context of optimizing the expectation of a random variable
over a parametric family of distributions. See [36], [37], [38], [44], and [56] for early references
using stochastic gradient descent approaches. After the initial popularity of this approach for
general optimization, other numerical optimization methods were developed and stochastic gradient
descent's popularity decreased. Years later, it became relevant again due to developments in deep
learning; see [12] and [16] for early references in the context of deep learning. Interesting works that
attempt to explain the suitability of stochastic gradient descent for deep learning are [13], [14], and
[15]. Relationships between the use of mini-batches and stochastic gradient descent are studied in
[34], [35], and [45]. Practical recommendations for gradient based methods can be found in [6].
The ADAM algorithm was introduced in [39]. See [10] and [55] for additional papers that study its
performance. Ideas of momentum appeared in optimization early on in [54]. RMSprop appeared in
lectures by Geoff Hinton in a 2014 course and has since become popular. Adagrad appeared earlier
in [26]. An interesting empirical study considering ADAM and related methods is [21]. See [41]
and [59] for a detailed study of momentum based methods.
The basic idea of automatic differentiation dates back to the 1950s; see [4] and [51]. The idea
of computing the partial derivatives using forward mode automatic differentiation is attributed
to Wengert [63]. Even though ideas of backward mode automatic differentiation first appeared
around 1960 and were used in control theory by the end of that decade, the 1980s paper [62] is
generally credited with presenting backward mode automatic differentiation in the form we know
it now. Backward mode automatic differentiation for deep neural networks is referred to as the
backpropagation algorithm, a concept that we cover in detail in Chapter ??. In fact, ideas for the
backpropagation algorithm were developed independently of some of the automatic differentiation
ideas; see [31] for a historical overview as well as the notes and references at the end of Chapter ??
for more details.
In recent years, with the deep learning revolution, automatic differentiation has gained major
popularity and is a core component of deep learning software frameworks such as TensorFlow [1],
PyTorch [53], and others. In parallel, the application of automatic differentiation for a variety of
scientific applications has also been popularized with technologies such as Flux.jl [33] in Julia and
JAX [19] in Python. See [3] for a survey in the context of machine learning and [30] for a more dated
review. Additional differentiable programming frameworks that implement automatic differentiation
include CasADi, Autograd, AutoDiff, JAX, Zygote.jl, Enzyme, and Chainer. Much of the complexity
of differentiable programming software frameworks is centered around the use of GPUs as well as
the placement and management of data in GPU memory.
Nesterov momentum was introduced in [48]. See [60], [61], and [66] for some comparisons of Nesterov
momentum with the standard momentum approach. The Nadam algorithm was introduced in [25]
to incorporate Nesterov momentum into ADAM and is now part of some deep learning frameworks.
The Adadelta method appeared in [68] prior to the appearance and growth in popularity of ADAM.
The Adamax method appeared in the same paper as ADAM, [39], as an extension. Line search
techniques are central in classical first-order optimization theory. Inexact line search techniques
using Wolfe conditions first appeared in [64] and [65]. Backtracking line search, also known as
Armijo line search, first appeared in [2]. See for example chapter 3 of [50] and chapter 4 of [41] for
overviews of exact and inexact line search algorithms. Concepts of the conjugate gradient method
first appeared in [32] for solving linear equations with symmetric positive definite matrices.
Since then, the method was incorporated in many numerical algorithms; see for example chapter 11
of [29] and chapter 5 of [50] for more details.
Despite the success of second-order methods in statistics, in general numerical optimization, and in
engineering, they are yet to make it into mainstream large scale deep learning. Perhaps one reason
for this is that when implemented without care, they are much slower than first-order methods for
high-dimensional models such as deep neural networks. Nevertheless, their efficient incorporation in
deep learning is still an active area of research. The origins of Newton's method, also known as the
Newton–Raphson method, can be dated back to the 17th century; see chapter 2 of [40] for an account
of the history of optimization methods. For a detailed overview see [11].
Even though the secant method can be thought of as an approximation to Newton's method, from
a historical perspective it is several decades older than Newton's method; see for example [52] for
origins and an account of the evolution of the secant method. Quasi-Newton methods are more
modern, with the first technique, the Davidon–Fletcher–Powell (DFP) method, appearing in 1959 as
a technical report [23] and later published in 1991 as [24]. In the interim, the DFP method was
further modified in the early 1960s with [28]. The BFGS method materialized in the 1960s and
1970s, and the L-BFGS extension was introduced in 1980 with [49]. See also [27] for an overview of
these methods. See [22] and [46] for a discussion of convergence properties of BFGS and L-BFGS
respectively. Information about the weighted Frobenius norm used for constructing BFGS updates
is in chapter 6 of [50]. A related second-order method is the Levenberg–Marquardt algorithm; see
chapter 18 of [18] for an introduction. Trust region methods are believed to have originated from
[43]; see [67] for a survey. Finally, we suggest chapter 5 of [50] as a review of quasi-Newton methods.
Bibliography
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
[2] L. Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 1966.
[3] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research, 2018.
[4] L. M. Beda, L. N. Korolev, N. V. Sukkikh, and T. S. Frolova. Programs for automatic differentiation for the machine BESM (in Russian). Technical report, Institute for Precise Mechanics and Computation Techniques, Academy of Science, Moscow, USSR, 1959.
[5] A. Ben-Tal and A. Nemirovski. Lectures on modern convex optimization. SIAM, Philadelphia, PA; MPS, Philadelphia, PA, 2001.
[6] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade. 2012.
[7] D. Bertsekas, A. Nedić, and A. E. Ozdaglar. Convex analysis and optimization. Athena Scientific, 2003.
[8] D. P. Bertsekas. Nonlinear programming. Athena Scientific, third edition, 2016.
[9] D. Bertsimas and J. N. Tsitsiklis. Introduction to linear optimization. Athena Scientific, 1997.
[10] S. Bock and Weiß. A proof of local convergence for the Adam optimizer. In 2019 International Joint Conference on Neural Networks (IJCNN), 2019.
[11] J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. Numerical optimization: Theoretical and practical aspects. Springer Science & Business Media, 2006.
[12] L. Bottou. Online algorithms and stochastic approximations. Online Learning in Neural Networks, 1998.
[13] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010: 19th International Conference on Computational Statistics, 2010.
[14] L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade. 2012.
[15] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 2018.
[16] L. Bottou and Y. LeCun. Large scale online learning. In Advances in Neural Information Processing Systems, 2003.
[17] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[18] S. Boyd and L. Vandenberghe. Introduction to applied linear algebra: Vectors, matrices, and least squares. Cambridge University Press, 2018.
[19] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: Composable transformations of Python+NumPy programs. http://github.com/google/jax, 2018.
[20] A. Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris, 1847.
[21] D. Choi, C. J. Shallue, Z. Nado, J. Lee, C. J. Maddison, and G. E. Dahl. On empirical comparisons of optimizers for deep learning. arXiv:1910.05446, 2019.
[22] Y. H. Dai. Convergence properties of the BFGS algorithm. SIAM Journal on Optimization, 2002.
[23] W. C. Davidon. Variable metric method for minimization. Technical report, Argonne National Lab., Lemont, Ill., 1959.
[24] W. C. Davidon. Variable metric method for minimization. SIAM Journal on Optimization, 1991.
[25] T. Dozat. Incorporating Nesterov momentum into Adam. International Conference on Learning Representations (ICLR) Workshop, 2016.
[26] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.
[27] R. Fletcher. Practical methods of optimization. John Wiley & Sons, 2013.
[28] R. Fletcher and M. J. D. Powell. A rapidly convergent descent method for minimization. The Computer Journal, 1963.
[29] G. H. Golub and C. F. Van Loan. Matrix computations. JHU Press, 2013.
[30] A. Griewank and A. Walther. Evaluating derivatives: Principles and techniques of algorithmic differentiation. SIAM, 2008.
[31] R. Hecht-Nielsen. Theory of the backpropagation neural network. In Neural Networks for Perception. 1992.
[32] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 1952.
[33] M. Innes. Flux: Elegant machine learning with Julia. Journal of Open Source Software, 2018.
[34] S. Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. Advances in Neural Information Processing Systems, 2017.
[35] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
[36] J. Kiefer. Sequential minimax search for a maximum. Proceedings of the American Mathematical Society, 1953.
[37] J. Kiefer. Optimum experimental designs. Journal of the Royal Statistical Society: Series B (Methodological), 1959.
[38] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 1952.
[39] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
[40] S. Kiranyaz, T. Ince, and M. Gabbouj. Optimization techniques: An overview. Multidimensional Particle Swarm Optimization for Machine Learning and Pattern Recognition, 2014.
[41] M. J. Kochenderfer and T. A. Wheeler. Algorithms for optimization. MIT Press, 2019.
[42] C. Lemaréchal. Cauchy and the gradient method. Doc. Math. Extra, 2012.
[43] K. Levenberg. A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics, 1944.
[44] H. Levitt. Transformed up-down methods in psychoacoustics. The Journal of the Acoustical Society of America, 1971.
[45] T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi. Don't use large mini-batches, use local SGD. In International Conference on Learning Representations, 2020.
[46] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large-scale optimization. Mathematical Programming, 1989.
[47] D. G. Luenberger and Y. Ye. Linear and nonlinear programming. Springer, Cham, fifth edition, 2021.
[48] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate $O(1/k^2)$. Dokl. Akad. Nauk SSSR, 1983.
[49] J. Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 1980.
[50] J. Nocedal and S. J. Wright. Numerical optimization. Springer, 1999.
[51] J. F. Nolan. Analytical differentiation on a digital computer. PhD thesis, Massachusetts Institute of Technology, 1953.
[52] J. M. Papakonstantinou and R. A. Tapia. Origin and evolution of the secant method in one dimension. The American Mathematical Monthly, 2013.
[53] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
[54] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964.
[55] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. arXiv:1904.09237, 2019.
[56] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952.
[57] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.
[58] H. H. Rosenbrock. An automatic method for finding the greatest or least value of a function. The Computer Journal, 1960.
[59] S. Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747, 2016.
[60] S. Sarao Mannelli and P. Urbani. Analytical study of momentum-based acceleration methods in paradigmatic high-dimensional non-convex problems. Advances in Neural Information Processing Systems, 2021.
[61] M. Sarigül and M. Avci. Performance comparison of different momentum techniques on deep reinforcement learning. Journal of Information and Telecommunication, 2018.
[62] B. Speelpenning. Compiling fast partial derivatives of functions given by algorithms. University of Illinois at Urbana-Champaign, 1980.
[63] R. E. Wengert. A simple automatic derivative evaluation program. Communications of the ACM, 1964.
[64] P. Wolfe. Convergence conditions for ascent methods. SIAM Review, 1969.
[65] P. Wolfe. Convergence conditions for ascent methods. II: Some corrections. SIAM Review, 1971.
[66] T. Yang, Q. Lin, and Z. Li. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv:1604.03257, 2016.
[67] Y. Yuan. Recent advances in trust region algorithms. Mathematical Programming, 2015.
[68] M. D. Zeiler. Adadelta: An adaptive learning rate method. arXiv:1212.5701, 2012.