
4 Optimization Algorithms
Notes and References
Mathematical optimization is a classical subject and some of the general texts on this subject are [9], where linear optimization is covered, [47], where both linear and non-linear optimization methods are covered, and [8], where the focus is on non-linear methods. Further texts focusing on convexity, convex analysis, and convex optimization are [5], [7], and [17]. Many of the algorithms in these texts describe descent direction methods, yet some examples of optimization algorithms that are not descent direction methods include direct methods (also known as zero-order or black box methods) such as coordinate and pattern search methods; see [41] or [50] for an overview. The Rosenbrock function was developed in [58] in the context of multivariate function optimization and has since become popular for basic optimization test cases and illustrations.
As mentioned in the Chapter ?? notes and references, gradient descent is attributed to Cauchy from 1847 with [20]; see [42] for a historical account. Stochastic gradient descent was initially introduced in 1951 with [57] in the context of optimizing the expectation of a random variable over a parametric family of distributions. See [36], [37], [38], [44], and [56] for early references using stochastic gradient descent approaches. After the initial popularity of this approach for general optimization, other numerical optimization methods were developed and stochastic gradient descent's popularity decreased. Years later, it became relevant again due to developments in deep learning; see [12] and [16] for early references in the context of deep learning. Interesting works that attempt to explain the suitability of stochastic gradient descent for deep learning are [13], [14], and [15]. Relationships between the use of mini-batches and stochastic gradient descent are studied in [34], [35], and [45]. Practical recommendations for gradient-based methods can be found in [6].
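The mini-batch stochastic gradient descent idea discussed above can be sketched minimally as follows; the function names, step size, and toy objective here are illustrative assumptions and are not taken from any of the cited references.

```python
import numpy as np

def sgd(grad, theta0, data, lr=0.1, epochs=50, batch_size=2, seed=0):
    """Minimal mini-batch SGD: theta <- theta - lr * grad(theta, batch),
    cycling through a reshuffled dataset each epoch."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(data)
    for _ in range(epochs):
        idx = rng.permutation(n)  # fresh random batch order per epoch
        for start in range(0, n, batch_size):
            batch = data[idx[start:start + batch_size]]
            theta = theta - lr * grad(theta, batch)
    return theta

# Toy example: minimize E[(theta - x)^2] over samples x,
# whose mini-batch gradient is 2 * (theta - mean(batch)).
data = np.array([1.0, 2.0, 3.0, 4.0])
grad = lambda theta, batch: 2.0 * (theta - batch.mean())
theta = sgd(grad, theta0=0.0, data=data)  # iterates hover near the sample mean 2.5
```

With a fixed step size the iterates do not converge exactly but oscillate near the minimizer, which is the behavior the early stochastic approximation references above address with decreasing step sizes.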
The ADAM algorithm was introduced in [39]. See [10] and [55] for additional papers that study its performance. Ideas of momentum appeared early on in optimization in [54]. RMSprop appeared in lectures by Geoff Hinton in a 2014 course and has since become popular. Adagrad appeared earlier, in [26]. An interesting empirical study considering ADAM and related methods is [21]. See [41] and [59] for a detailed study of momentum-based methods.
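An ADAM-style update combines momentum (a first moment estimate) with a per-coordinate scaling (a second moment estimate). The sketch below is a minimal illustration under commonly used default hyperparameter values; the function name and the toy objective are assumptions, not taken from the cited papers.

```python
import numpy as np

def adam(grad, theta0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Sketch of an ADAM-style update with bias-corrected moment estimates."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)  # first moment: exponential average of gradients
    v = np.zeros_like(theta)  # second moment: exponential average of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction for zero initialization
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy example: minimize f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta = adam(lambda th: 2.0 * (th - 3.0), theta0=np.array([0.0]))
```

Setting beta1 = 0 recovers an RMSprop-like method, which is one way to see how the algorithms cited above relate to each other.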
The basic idea of automatic differentiation dates back to the 1950s; see [4] and [51]. The idea of computing partial derivatives using forward mode automatic differentiation is attributed to Wengert [63]. Ideas of backward mode automatic differentiation first appeared around 1960 and were used in control theory by the end of that decade, yet the 1980s paper [62] is generally credited with presenting backward mode automatic differentiation in the form we know it now. Backward mode automatic differentiation for deep neural networks is referred to as the backpropagation algorithm, a concept that we cover in detail in Chapter ??. In fact, ideas for the backpropagation algorithm were developed independently of some of the automatic differentiation ideas; see [31] for a historical overview, as well as the notes and references at the end of Chapter ?? for more details.
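Forward mode automatic differentiation can be illustrated with dual numbers, where a derivative component is propagated alongside each value. The minimal class below is an illustrative sketch, implementing only the operations the example needs.

```python
import math

class Dual:
    """Dual number val + dot * eps with eps^2 = 0; dot carries the derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):  # product rule encoded in the dot component
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def sin(x):  # chain rule: (sin u)' = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# Compute d/dx [x * sin(x) + 3x] at x = 2 in a single forward pass.
x = Dual(2.0, 1.0)  # seed the input derivative with 1
y = x * sin(x) + 3 * x
# y.val is the function value; y.dot is sin(2) + 2*cos(2) + 3
```

Backward mode instead records the computation and propagates derivatives from outputs to inputs, which is more efficient when there are many inputs and few outputs, as in neural network training.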
In recent years, with the deep learning revolution, automatic differentiation has gained major popularity and is a core component of deep learning software frameworks such as TensorFlow [1], PyTorch [53], and others. In parallel, the application of automatic differentiation to a variety of scientific applications has also been popularized with technologies such as Flux.jl [33] in Julia and JAX [19] in Python. See [3] for a survey in the context of machine learning and [30] for an earlier review. Additional differentiable programming frameworks that implement automatic differentiation include CasADi, Autograd, AutoDiff, JAX, Zygote.jl, Enzyme, and Chainer. Much of the complexity of differentiable programming software frameworks is centered around the use of GPUs as well as the placement and management of data in GPU memory.
Nesterov momentum was introduced in [48]. See [60], [61], and [66] for some comparisons of Nesterov momentum with the standard momentum approach. The Nadam algorithm was introduced in [25] to incorporate Nesterov momentum into ADAM and is now part of some deep learning frameworks. The Adadelta method appeared in [68] prior to the appearance and growth in popularity of ADAM. The Adamax method appeared in the same paper as ADAM, [39], as an extension. Line search techniques are central in classical first-order optimization theory. Inexact line search techniques using Wolfe conditions first appeared in [64] and [65]. Backtracking line search, also known as Armijo line search, first appeared in [2]. See, for example, Chapter 3 of [50] and Chapter 4 of [41] for overviews of exact and inexact line search algorithms. Concepts of the conjugate gradient method first appeared in [32] for solving linear equations with symmetric positive definite matrices.