See the latest book content here.

3 Optimization Algorithms

In this chapter we focus on general approach to optimization for multivariate functions. In the previous chapter, we have seen three different variants of gradient descent methods, namely, batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. One of these methods is chosen depending on the amount of data and a trade-off between the accuracy of the parameter estimation and the amount time it takes to perform the estimation. We have noted earlier that the mini-batch gradient descent strikes a balance between the other two methods and hence commonly used in practice. However, this method comes with few challenges that need to be addressed. We also focus on these issues in this chapter.

Learning outcomes from this chapter:

  • Several first-order optimization methods

  • Basic second-order optimization methods

This chapter closely follows chapters 4 and 5 of (Kochenderfer and Wheeler 2019). There are many interesting textbooks on optimization, including (Boyd and Vandenberghe 2004), (Sundaram 1996), and (Nesterov 2004),

3.1 Preliminaries

In this section, we establish some preliminaries that are necessary for understanding the methods discussed later.

3.1.1 Directional Derivative

  • Consider an n-dimensional multivariate function f:ΘR over a feasible set ΘRn.

  • The gradient of f at θ, when exists, is given by f(θ)=[f(θ)x1,,f(θ)xn], which is a vector capturing the local slope of the function f at θ.

  • The directional derivative sf(θ) of f at θ in the direction of s=[s1,,sn] is defined by sf(θ)=limh0f(θ+hs)f(θ)h.

  • Note that f(θ)xi is the directional derivative at θ in the direction of the vector ei consists of 1 at the i-th location and zeros everywhere else. This simply follows from the observation that eif(θ)=limh0f(x1,,xi+h,,xn)f(θ)h.

  • Thus, sif(θ)xi is the directional derivative at θ in the direction of the vector siei.

  • Together, we can conclude that the directional derivative at θ in the direction s is

(3.1)sf(θ)=sf(θ)=i=1nsif(θ)xi,

where denotes the dot product between vectors.

  • Alternatively, we can obtain the above relation between the gradient and directional derivative by observing from the first-order approximation f(θ+hs)=f(θ)+f(θ)(hs)+O(h2) that
    f(θ+hs)f(θ)h=sf(θ)+O(h). Now take the limit h0 on both the sides.

Exercise: Show that the directional derivative sf(θ) is maximum in the direction of the gradient. That is, for all unit length vectors s, the choice s=f(θ)/f(θ) maximizes sf(θ).

Hint: The Cauchy–Schwarz inequality for vectors.


3.1.2 Python Basics

SymPy is a Python library for symbolic mathematics. We illustrate the applications of this library with some examples.

Finding derivatives: Suppose that we want to find the derivative f/t of the following univariate function: f(t)=exp(t2/2)sin(πt). For this, SymPy package can be used as follows:

import sympy 
import numpy as np

t = sympy.Symbol('t', real=True)
f = sympy.exp(-t**2 / 2) * sympy.sin(sympy.pi*t)

dfdt = f.diff(t) # Taking derivative with respect to t

print("Output:\n f'(t) =", dfdt, "\n f'(t) =", sympy.latex(sympy.simplify(dfdt)))
## Output:
##  f'(t) = -t*exp(-t**2/2)*sin(pi*t) + pi*exp(-t**2/2)*cos(pi*t) 
##  f'(t) = \left(- t \sin{\left(\pi t \right)} + \pi \cos{\left(\pi t \right)}\right) e^{- \frac{t^{2}}{2}}

The second output is a latex expression for

f(t)=(tsin(πt)+πcos(πt))et22.

Finding gradient: Now consider the following multivariate function:

f(t1,t2)=exp(t12/2)sin(πt2).

t1, t2 = sympy.symbols('t1, t2', real=True)

f = sympy.exp(-t1**2 / 2) * sympy.sin(sympy.pi*t2)
theta = (t1, t2)
grad = [f.diff(t) for t in theta] 
print("Gradient of f: \n", grad)
## Gradient of f: 
##  [-t1*exp(-t1**2/2)*sin(pi*t2), pi*exp(-t1**2/2)*cos(pi*t2)]

Evaluating function values: Evaluating function value at a point using SymPy. Again suppose that

f(t1,t2)=exp(t12/2)sin(πt2),

and we want to evaluate the value of f at (t1,t2)=(1,3/2). The following code does it using subs and evalf:

t1, t2 = sympy.symbols('t1, t2', real=True)

f = sympy.exp(-t1**2 / 2) * sympy.sin(sympy.pi*t2)

print("value of f(1, 3/2):", (f.subs([(t1,1), (t2,3/2)])).evalf())
## value of f(1, 3/2): -0.606530659712633

Alternatively, we can use lambdify as follows:

t1, t2 = sympy.symbols('t1, t2', real=True)

f = sympy.exp(-t1**2 / 2) * sympy.sin(sympy.pi*t2)
f = sympy.lambdify((t1, t2), f)

print("value of f(1, 3/2):", f(1,3/2))
## value of f(1, 3/2): -0.6065306597126334

Note: Once we have the gradient of a function, it is easy to compute its directional derivative using (3.1).

3d plot of functions: Finally, suppose that we would like make a 3d plot of Rosebrock function:

f(t1,t2)=(at1)2+b(t2t12)2.

The following Python code plots f for a=1 and b=100 over [2,2]×[1,3].

import matplotlib.pyplot as plt

a = 1.0
b = 100
t1, t2 = sympy.symbols('t1, t2', real=True)
f = (a - t1)**2 + b*(t2 - t1**2)**2
f = sympy.lambdify((t1, t2), f)

def fun(x, y):
    return f(x,y)

t1_range = np.arange(-2.0,2, 0.05)
t2_range = np.arange(-1.0,3.0, 0.05)

X, Y = np.meshgrid(t1_range, t2_range)

Z = fun(X, Y)


fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                cmap='Blues', edgecolor='none')
ax.set_title('surface');

ax.set_xlabel(r'$\theta_1$')
ax.set_ylabel(r'$\theta_2$')
ax.set_zlabel(r'$f(\theta)$')
plt.show()

The corresponding contour plot is given by

CP = plt.contour(X, Y, Z, levels=np.exp(np.arange(-3, 7, 0.5)), cmap='Blues')
plt.clabel(CP, inline=1, fontsize=10)
plt.xlabel(r'$t_1$')
plt.ylabel(r'$t_2$')
plt.show()

3.1.3 Local Minimum, Saddle Point, and Convexity

In this section, we look at several examples of 2-dimensional functions to understand the notions of local minimum, saddle point, and convexity.

Local Minimum: For a multivariate function f(θ), the necessary condition for a point θ to be at a local minumum is that the gradient f(θ)=0 and the Hessian 2f(θ)=[2fθ122fθ1θ22fθ1θn2fθnθ12fθnθ22fθn2] is positive semidefinite , i.e., θ^(2f(θ)θ^)0 for all θ^.

The following plot of a peaks function has multiple local minima (also mmultiple local maxima).



Convexity: A multivariate function f(θ) is convex over a region ΘRn if the Hessian is positive semidefinite for all θΘ. Furthermore, f is strictly convex if the Hessian is poistive definite, i.e., θ^(2f(θ)θ^)>0 for all θ^0.

Convexity gurantees that all local minima are global minima. Strict convexity gurantees that the function has a unique global minimum. For an illustration, see the following plot of f(θ)=θ12+θ22.


Saddle Point: A point θ is a saddle point of f if the gradient f(θ)=0, but θ is not a local extrememum of f. Necessary condition for a point θ to be a saddle point is that the gradient f(θ)=0 and the Hessian 2f(θ) is indefinite (neither positive semidefinite nor negative definite).

For f(θ)=θ12θ22 plotted below, (0,0) is a saddle point.



3.1.4 Taylor Expansion

Let f be a univariate smooth function. Taylor expansion of f about a point a is given by

f(θ)=f(a)+(θa)f(a)+(θa)22!f(a)+(θa)33!f(a)+(3.2)=i=0(θa)ii!dif(a)dθi.

The mth order Taylor approximation of f is given by ignoring all the terms of order more than m in the Taylor expansion. That is,

f(θ)i=0m(θa)ii!dif(a)dθi

is the mth order Taylor approximation of f about a.

Note that for mth order approximations, we only need to compute the first m derivatives of the function.

Similarly, for a multivariate function f, the Taylor expansion about a point a is given by

f(θ)=f(a)+(θa)f(a)+12(θa)(2f(a)(θa))+.

Then, by ignoring the terms of order more than m, we get mth order approximation of f about a.

3.2 General Framework of Local Descent Methods

In this section, we introduce a general framework of local descent methods.

  • Given an n-dimensional multivariate function f:ΘR over a feasible set ΘRn, the general approach of optimization is to incrementally take steps on Θ based on a local model so that the function value f(θ) is decreased. That is we want to solve the following optimization problem:

minθΘf(θ).

  • We can think of f being the cost function of a neural network where θ represents an unknown vector of parameters taking values on a known set Θ, and the goal to find the value of θ that minimizes the cost.

  • The optimization methods that follow the common approach of the following pseudocode are called descent direction methods.


Algorithm : General approach of descent direction methods


  1. θθ(1)   (Start with an initial design point θ(1)Θ)
  2. repeat
  3.   Determine the descent direction d
  4.   Determine the step size of learning rate α
  5. θθ+αd    (The next design point)
  6. until θ satisfies a termination condition
  7. return θ,f(θ)

Following figure illustrates working of a descent direction method on a Rosenbrock function.

  • Different methods follow different approaches in finding the descent direction d and step size or learning rate α. Similarly the termination condition in Step 6 can change from method to method.

  • For each k1, denote the values of θ, d and α in the k-th iteration of the algorithm by θ(k), d(k) and α(k), respectively.

  • The descent direction d(k+1) of the next iteration is determined by the local information such as gradient and/or Hessian of the function f at θ(k). Some methods normalize the descent direction so that d(k)=1 for all k.

  • The phrase step size is generally refers to the magnitude of the overall step, that is, step size in k-th iteration is equal to α(k)d(k). When d(k)=1, the learning rate α is same as the step size.

  • In some algorithms, the step size is optimized so that it decreases the function f maximally. However, it may come at an extra computational cost.

  • In conclusion, different methods use different approaches to find α and d.

3.3 Finding Step Size

In this section, we assume that the descent direction d is given to us. Section 3.5 on first-order methods will deal with how to select d. Directional derivative plays a key role in finding the step size in each iteration.

3.4 Termination Conditions

Here, we discuss four commonly used termination conditions.

  • Maximum number of iterations: Under this condition, the algorithm is terminated either when the number of iterations exceeds a pre-selected number kmax, or the total running time of the algorithm exceeds a pre-selected time tmax. This is useful when there is a strict running time constraint, because for all the other termination conditions mentioned baove, it is hard to say hong the algorithm takes to terminate.

    The plot in Section~3.5.6 compares the performance of different gradient methods with the same kmax. Evidently, it will be clear that the same choice of kmax does not work for all the methods.

  • Absolute Improvement: Algorithm is terminated if the change in the function value is smaller than a pre-selected threshold ϵa over subsequent steps. That is, the termination condition is, |f(θ(k))f(θ(k+1))|<ϵa.

  • Gradient Magnitude: Algorithm is terminated if the magnitude f(θ(k+1)) of the derivative is smaller than a pre-selected threshold ϵg, that is, f(θ(k+1))<ϵg.

  • Relative Improvement: Algorithm is terminated if the relative change |f(θ(k))f(θ(k+1))||f(θ(k))| is smaller than a pre-selected threshold ϵr over subsequent steps. That is, the termination condition is
    |f(θ(k))f(θ(k+1))|<ϵr|f(θ(k))|. Relative improvement condition is particularly useful when the region around the (local) minimum is shallow.


Example: For instance, in the following figure, we see surface plots for

f1(θ)=θ12+θ22,Blue Surface

and f2(θ)=0.2θ12+0.2θ22.Red Surface

Both of these functions have global minimum at (0,0). However, f2 is shallower around (0,0) than f1.

Suppose that a descent direction algorithm takes a step of size 0.2 in each iteraction. We see that the relative improvement condition works well for both the functions.

Absolute Improvement: First assume that the algorithm uses the absolute improvement condition with ϵa=0.1. Then for f1, θ(k) should be within the blue cicle (in the figure below) before the termination, while for f2, the algorithm can terminate when θ(k) takes a value within the red circle.


Gradient Magnitude: Assume that the gardient magnitude condition with ϵg=0.1 in the algorithm. Then the algorithm terminates for f1 when θ(k+1) is within the blue circle shown in the following figure, while for f2 it terminates when θ(k+1) is within the red circle.

Relative Improvement: Finally assume that the algorithm uses the relative improvement condition with ϵr=0.1 for the termination. Then for both the functions, the algorithm terminates only when θ(k) is within the circle centered at the origion with radius 0.097. This is because the relative improvement is the same for both the functions, i.e.,

|f1(θ(k))f1(θ(k+1))||f1(θ(k))|=|f2(θ(k))f2(θ(k+1))||f2(θ(k))|,

for any θ(k) and θ(k+1).


3.5 First-Order Methods

Finally, we focus on the most important aspect of the local descent methods: finding the descent direction d. As mentioned earlier, first-order methods use the gradient of the function for selecting appropriate descent direction d in each iteration.

3.5.1 Gradient Descent

  • An obvious choice for the descent direction is in the direction of steepest descent.

  • Steepest descent direction guarantees an improvement, provided that the function is smooth, the step size is sufficiently small, and we are not at a minimum point (i.e., non-zero gradient).

  • Again let θ(k) be the design point in the k-th iteration. Define,

g(k)=f(θ(k)).

  • The steepest direction in the k-th iteration is the direction opposite the gradient g(k) (hence the name gradient descent).

  • The descent direction d(k) is typically the normalized steepest descent, that is, d(k)=g(k)g(k), where the negative sign implies that the descent direction is opposite the gradient.


Observation: If the step size α(k) is selected using exact line search: α(k)=argminαf(θ(k)+αd(k)), then the gradient descent can result in zig-zagging. To see this, observe that f(θ(k)+α(k)d(k))=f(θ(k)+αd(k))α=0. Consequently, the directional derivative d(k)f(θ(k)+α(k)d(k))=0, because 0=f(θ(k)+α(k)d(k))=limh0f(θ(k)+(α(k)+h)d(k))f(θ(k)+α(k)d(k))h=limh0f(θ(k)+α(k)d(k))(hd(k))+O(h2)h=f(θ(k)+α(k)d(k))d(k)=d(k)f(θ(k)+α(k)d(k)). Since d(k+1)=f(θ(k)+α(k)d(k))f(θ(k)+α(k)d(k)), we have d(k+1)d(k)=0. That is, the descent direction in each iteration is perpendicular to the descent direction of the previous iteration. Therefore, we obtain jagged search paths.



Example: Consider Booth’s function, a 2-dimensional quadratic function: f(θ)=(θ1+2θ27)2+(2θ1+θ25)2. This has a global minimum at (1,3). The following code implements gradient-descent method with exact line search. We start with θ(1)=[9,8].

def f(t):
    return (t[0] + 2*t[1] - 7)**2 + (2*t[0] + t[1] -5)**2

def grad(t):
    return np.array([10*t[0] + 8*t[1] - 34, 8*t[0] + 10*t[1] - 38])

def line_search_exact(t, d):
    nume = (t[0] + 2*t[1] - 7)*(d[0] + 2*d[1]) + (2*t[0] + t[1] - 5)*(2*d[0] + d[1])
    deno = (d[0] + 2*d[1])**2 + (2*d[0] + d[1])**2
    return -nume/deno

def GradientDescent_exact(f, grad, t, niter):
    g = grad(t)
    norm = np.linalg.norm(g)
    d = -g/norm
    t_seq = [t]
    d_seq = [d]

    for _ in range(niter):
        alpha = line_search_exact(t, d)
        
        # Theta update
        t = t + alpha*d
        t_seq.append(t)
        
        g = grad(t)
        norm = np.linalg.norm(g)
        d = -g/norm
        d_seq.append(d)
        
    print('Final point of GD:', t)
    t_seq = np.array(t_seq)
    d_seq = np.array(d_seq)
    return t_seq, d_seq
    
t_init = np.array([-9.0, 8.0]) 

t1_range = np.arange(-10.0, 10.0, 0.05)
t2_range = np.arange(-10.0, 10.0, 0.05)
A = np.meshgrid(t1_range, t2_range)
Z = f(A)

fig, ax = plt.subplots()  
ax.contour(t1_range, t2_range, Z, levels=np.exp(np.arange(-10, 6, 0.5)), cmap='Blues')
t_seq_gd, d_seq_gd = GradientDescent_exact(f, grad, t_init, niter=10)
ax.plot(t_seq_gd[:, 0], t_seq_gd[:, 1], '-r', label=r'Exact $\alpha$')

plt.xlabel(r'$\theta_1$')
plt.ylabel(r'$\theta_2$')
plt.legend()
plt.show()

Run the following lines to print the sequence of directions d(1),d(2),, the sequence of parameters θ(1),θ(2),, and the gradient vectors g(1),g(2),.

for i in range(len(d_seq_gd)):
  print('iteration:', i+1)
  print('\t d = ', d_seq_gd[i])
  print('\t t = ', t_seq_gd[i])
  print('\t Gradient =', grad(t_seq_gd[i]))


Example: Since, the gradient descent method follows the steepest descent direction, ideally speaking it should behave like water flowing from θ(1) and eventually reaching the local minimum. This happens when the step size is very small as illustrated in the following figure. They key drawback of small step size is that the number computations can drastically increase compared to the earlier exact line search version. For instance, in the following code we used 10 iterations for the exact line search and 10,000 for the small step size of 0.001.

def GradientDescent_fixed(f, grad, t, alpha,  niter):
    g = grad(t)
    norm = np.linalg.norm(g)
    d = -g/norm
    t_seq = [t]
    d_seq = [d]

    for _ in range(niter):
        # Theta update
        t = t + alpha*d
        t_seq.append(t)
        
        g = grad(t)
        norm = np.linalg.norm(g)
        d = -g/norm
        d_seq.append(d)
        
    print('Final point of GD:', t)
    t_seq = np.array(t_seq)
    d_seq = np.array(d_seq)
    return t_seq, d_seq

alpha = 0.001
t_init = np.array([-9.0, 8.0]) 

t1_range = np.arange(-10.0, 10.0, 0.05)
t2_range = np.arange(-10.0, 10.0, 0.05)
A = np.meshgrid(t1_range, t2_range)
Z = f(A)

fig, ax = plt.subplots()  
ax.contour(t1_range, t2_range, Z, levels=np.exp(np.arange(-10, 6, 0.5)), cmap='Blues')
ax.plot(t_seq_gd[:, 0], t_seq_gd[:, 1], '-r', label=r'Exact $\alpha$')

t_seq_gd_fixed, d_seq_gd_fixed = GradientDescent_fixed(f, grad, t_init, alpha, niter=10000)
ax.plot(t_seq_gd_fixed[:, 0], t_seq_gd_fixed[:, 1], '-b', label=r'Fixed $\alpha = 0.001$')

plt.xlabel(r'$\theta_1$')
plt.ylabel(r'$\theta_2$')
plt.legend()
plt.show()



Exercise: For the above Booth’s function, prove that the global minimum is achieved at θ=(1,3) by solving f(θ)=0, and find the value of f(θ).


3.5.2 Cojugate Gradient

  • Performance of the gradient descent method can be poor in narrow valleys. The conjugate descent method addresses this issue.

  • The Conjugate gradient method works well for optimizing quadratic functions: f(θ)=12θ(Aθ)+bθ+c, where A is a symmetric positive definite matrix. Then the optimization problem minθf(θ) has a unique solution.

  • The descent directions of the algorithm are chosen so that they are mutually conjugate with respect to A, that is, d(i)(Ad(j))=0,for allij.

  • The algorithm uses line search in each iteration to find the step factor α. It is an easy problem for quadratic functions. Because to solve minαf(θ+αd), we can easily solve
    f(θ+αd)α=0. Observe that
    f(θ+αd)α=dA(θ+αd)+db=d(Aθ+b)+αdAd. Equating this to zero results in α=d(Aθ+b)dAd.

  • The algorithm starts in the steepest descent direction: d(1)=g(1)=f(θ(1)). So, θ(2)=θ(1)+α(1)d(1).

  • In the subsequent iterations d(k+1)=g(k+1)+β(k)d(k)=f(θ(k+1))+β(k)d(k), where the scalar parameter β(k) is selected so that d(k) and d(k+1) are mutually conjugate with respect to A, as shown below: d(k+1)Ad(k)=0(g(k+1)+β(k)d(k))(Ad(k))=0g(k+1)(Ad(k))+β(k)(d(k))(Ad(k))=0β(k)=g(k+1)(Ad(k))(d(k)(Ad(k)).


Example: Again consider the Booth’s function from the previous example:

f(θ)=(θ1+2θ27)2+(2θ1+θ25)2.

The following code implements the conjugate gradient method.


def ConjGradientDescent_exact(f, grad, t, A, b, niter):
    g = grad(t)
    d = -g
    t_seq = [t]
    d_seq = [d]
    
    for _ in range(niter):
        alpha = line_search_exact(t, d)
        
        # Theta update
        t = t + alpha*d
        
        g = grad(t)
        Ad = A@d
        beta = np.dot(g, Ad)/(np.dot(d, Ad))
        d = -g + beta*d
        if np.linalg.norm(d) < 10e-16:
            t_seq = np.array(t_seq)
            d_seq = np.array(d_seq)
            return t_seq, d_seq
        t_seq.append(t)        
        d_seq.append(d)
        
    print('Final point of GD:', t)
    t_seq = np.array(t_seq)
    d_seq = np.array(d_seq)
    return t_seq, d_seq

A = np.array([[10, 8], [8, 10]])
b = -np.array([34, 38])
    
t1_range = np.arange(-10.0, 10.0, 0.05)
t2_range = np.arange(-10.0, 10.0, 0.05)
XY = np.meshgrid(t1_range, t2_range)
Z = f(XY)
plt.contour(t1_range, t2_range, Z, levels=np.exp(np.arange(-3, 7, 0.5)), cmap='Blues')
t_seq_gd, _ = GradientDescent_exact(f, grad, t_init, niter=10)
plt.plot(t_seq_gd[:, 0], t_seq_gd[:, 1], '-b', label='GD')
t_seq_cgd, _ = ConjGradientDescent_exact(f, grad, t_init, A, b, niter=10)
plt.plot(t_seq_cgd[:, 0], t_seq_cgd[:, 1], '-r', label='CGD')
plt.xlabel(r'$\theta_1$')
plt.ylabel(r'$\theta_2$')
plt.legend()
plt.show()

The above figure illustrates that the conjugate gradient method can have faster convergence compared to the gradient descent method.


  • The conjugate method can also be applied to non-quadratic functions as well because smooth continuous functions behave like quadratic functions close to the minimum. However, unfortunately, it is hard to find A that best approximates f locally. Observe that A is needed for computing β values. The following choices for β(k) seems to work well in practice.

    Fletcher-Reeves: β(k)=g(k)g(k)g(k1)g(k1).

    Polak-Ribiere: β(k)=g(k)(g(k)g(k1))g(k1)g(k1).


Exercise: Modify and run the python code for conjugate gradient with the above two choices of β update.


3.5.3 Challenges Posed by the Above Gradient Descent Methods

Gradient descent methods offer a few important challenges that meed to be addressed to make them useful in the neural network set-up.

  • Selection of the learning rate can be difficult. Small values of learning rate can result in slow convergence and large values can result in the loss function to fluctuate around the minimum, or even diverge. The following figure shows applications of the gradient descent method for Rosenbrock function given in (3.4) with two choices of α(1), where the learning rates are updated as α(k+1)=(0.9)k1α(1).

  • Some gradient descent methods try to adjust the learning rate during training by reducing the rate according to a pre-defined schedule or when the change in cost function between epochs falls below a threshold. These schedules and thresholds are unable to adapt to the characteristics of the dataset as they are defined in advance.

  • Applying the same learning rate to all parameter updates might not be a good idea as larger updates on rarely occurring features may result in faster convergences.

  • Gradient descent methods take time to traverse plateaus surrounding saddle points as the gradient in this region is close to zero in all directions.

    For instance, once again consider the Rosenbrock function given in (3.4). Note that for this function, the global minimum is achieved at (1,1). The following figure shows that the gradient descent method is slow along the valley. This is because the gradient over a nearly flat region has small magnitude, and thus the gradient descent method can require many iterations to converge to a local minimum (same as the global minimum for the Rosenbrock function). The this experiment, the initial point θ(1)=(1.25,0.5), and the learning rate is fixed at α=0.1.

From now onwards, we focus on a sequence of modified gradient descent methods for addressing the above issues.

3.5.4 Momentum

  • As mentioned above, the gradient methods take a long time to traverse over almost flat surfaces. Incorporating momentum accelerates the algorithm in the relevant directional and dampens oscillations.

  • The momentum updates at the k-the iteration are
    v(k+1)=βv(k)αg(k)θ(k+1)=θ(k)+v(k+1), for a scalar parameter β. When β=0, we have the standard gradient descent method.

  • Interpretation: Think of a ball rolling down a nearly flat surface. Due to the gravity, the ball accumulates momentum as it rolls down, becoming faster and faster with time.

3.5.5 Nesterov Momentum

  • One problem with the momentum is that the steps do not slow down enough near the local minimum at the bottom of the valley. Hence, it has a tendency to overshoot the valley floor.

  • For the ball rolling scenario, we would want to the ball to be smart enough know where it is going to be in the next step so that it can slow down before the hill slopes up again.

  • Nesterov momentum captures this notion by using the gradient at the projected future position. The update is given by, v(k+1)=βv(k)αf(θ(k)+βv(k))θ(k+1)=θ(k)+v(k+1),


Example: We once again consider the Rosenbrock function given by (3.4). The following code compares the basic gradient descent, momentum, and Nesterov momentum methods. We start with θ(1)=(1.25,0.5) and α(1)=0.2. The learning rate decreases as α(k)=α(1)γk1 where γ=0.9. For momentum and Nesterov momentum methods, we have taken β=0.8.


f = lambda t: (1 - t[0])**2 + 100*(t[1] - t[0]**2)**2

def grad(t):
    dx = -2*(1 - t[0]) - 4*100*t[0]*(t[1] - (t[0]**2))
    dy = 2*100*(t[1] - (t[0]**2))
    return np.array([dx, dy])

def GradientDescent(f, grad, t, alpha, gamma, niter=100000):
    g = grad(t)
    norm = np.linalg.norm(g)
    d = -g/norm
    t_seq = [t]
    d_seq = [d]

    for _ in range(niter):
        alpha *= gamma
        t = t + alpha*d
        t_seq.append(t)
        
        g = grad(t)
        norm = np.linalg.norm(g)
        d = -g/norm
        d_seq.append(d)
        
    t_seq = np.array(t_seq)
    d_seq = np.array(d_seq)
    return t_seq, d_seq

def MomentumGradientDescent(f, grad, t, alpha, gamma, beta, niter=100000):
    v = np.zeros(t.shape[0])
    t_seq = [t]
    v_seq = [v]

    for _ in range(niter):
        g = grad(t)
        norm = np.linalg.norm(g)
        v = beta*v - alpha*g/norm
        v_seq.append(v)

        alpha *= gamma
        t = t + v
        t_seq.append(t)

    t_seq = np.array(t_seq)
    v_seq = np.array(v_seq)
    return t_seq, v_seq  

def NesterovMomentumGradientDescent(f, grad, t, alpha, gamma, beta, niter=100000):
    g = grad(t)
    norm = np.linalg.norm(g)
    v = np.zeros(t.shape[0])
    t_seq = [t]
    v_seq = [v]
    

    for _ in range(niter):
        g = grad(t + beta*v)
        norm = np.linalg.norm(g)
        v = beta*v - alpha*g/norm
        v_seq.append(v)
        
        alpha *= gamma
        t = t + v
        t_seq.append(t)
        
    t_seq = np.array(t_seq)
    v_seq = np.array(v_seq)
    return t_seq, v_seq 

# Initialization
alpha_init = 0.2 
gamma = 0.9
beta = 0.8
t_init = np.array([-1.25, 0.5]) 
niter = 100000 

t1_range = np.arange(-1.5, 1.75, 0.01)
t2_range = np.arange(-0.5, 1.5, 0.01)

A = np.meshgrid(t1_range, t2_range)
Z = f(A)
plt.contour(t1_range, t2_range, Z, levels=np.exp(np.arange(-10, 6, 0.5)), cmap='Blues')
t_seq_gd, d_seq_gd = GradientDescent(f, grad, t_init, alpha_init, gamma, niter)
plt.plot(t_seq_gd[:, 0], t_seq_gd[:, 1], '-b', label='GD')
t_seq_mgd, d_seq_mgd = MomentumGradientDescent(f, grad, t_init, alpha_init, gamma, beta, niter)
plt.plot(t_seq_mgd[:, 0], t_seq_mgd[:, 1], '-r', label='MGD')
t_seq_nmgd, d_seq_nmgd = NesterovMomentumGradientDescent(f, grad, t_init, alpha_init, gamma, beta, niter)
plt.plot(t_seq_nmgd[:, 0], t_seq_nmgd[:, 1], '-g', label='NMGD')
plt.plot(1, 1, 'oy')
plt.xlabel(r'$\theta_1$')
plt.ylabel(r'$\theta_2$')
plt.legend()
plt.show()


3.5.6 Adagrad

  • Observe that both momentum and Nesterov momentum methods update all components of θ with the same learning rate.

  • The adaptive subgradient, or simply adgrad, adapts a learning rate for each component of θ so that the components with frequent updates have low learning rates and the components with infrequent updates have high learning rate. As a consequence, it increases the influence of components of θ with infrequent updates.

  • The update of each component in θ is as follows. θi(k+1)=θi(k)αϵ+si(k)gi(k), where si(k)=j=1k(gi(j))2, gi(k) is the i-th component of g(k), ϵ is a small value of order 1×108 to avoid division by zero, and α is the learning rate parameter.

  • The adagrad method is less sensitive to the parameter α, which is typically set to be 0.01.

  • Drawback: The key drawback of the adagrad is that each component si is strictly non-deceasing. As a consequence the accumulated sum keeps increasing and hence the effective learning rate decreases during training, often making the effective learning rate infinitesimally small before convergence.

  • The first-order methods discussed below overcome this drawback.


Example: The following python function implements the adagrad method for the earlier example on Rosenbrock function. The corresponding plot compares the other methods with the adagrad method. The parameters chosen are α=0.01 and ϵ=108.

def Adagrad(f, grad, t, alpha = 0.01, epsilon=10e-8, niter=100000):

    t_seq = [t]
    s = np.zeros(t.shape[0])
    
    #while norm > 10**(-8):
    for _ in range(niter):
        g = grad(t)
        norm = np.linalg.norm(g)
        g = g/norm
        
        s += np.multiply(g, g) # element-wise multiplication
        d = -np.divide(g, epsilon + np.sqrt(s)) # element-wise division
        t = t + alpha*d
        t_seq.append(t)
        
    t_seq = np.array(t_seq)
    return t_seq


3.5.7 RMSprop

Here, RMS stands for root mean square.

  • RMSprop method overcomes the problem of monotonically decreasing learning rate by maintaining a decaying average of squared gradients. This average vector S^ is updated according to s(k+1)=γss(k)+(1γs)(g(k)g(k)), where γs is the decay factor, typically taken to be 0.9, and denotes the element-wise product between vectors. Thus, the update of each element of S is given by
    (3.5)si(k+1)=γssi(k)+(1γs)(gi(k))2.

If the initial value S^(1)=0, then for each i, si(k+1)=(1γs)j=1kγskj(gi(j))2.

  • Then the component-wise update of θ is given by θi(k+1)=θi(k)αϵ+si(k+1)gi(k), where the learning parameter is typically set to be 0.001.

  • The expression Si(k+1) is called decaying root mean square of the time series gi, and is denoted by RMS(gi). Then, θi(k+1)=θi(k)αϵ+RMS(gi)gi(k).


Exercise: Implement the RMSProp method for the Rosenbrock function considered earlier. Take α=0.01, γ=0.9, and ϵ=108. Compare the performance with the adagrad method with α=0.01 and γ=0.9.


3.5.8 Adadelta

  • Adadelta method is an extension of Agagrad method and it overcomes the problem of monotonically of the learning parameter.

  • The authors of RMSprop noticed that the units in gradient descent, stochastic gradient descent, momentum, Adagrad methods do not match. To fix this they use an exponentially decaying average of the squared updates as follows.

  • In addition to (3.5), let T^ be the average vector updated according to T^i(k+1)=γtT^i(k)+(1γt)(Δθi(k+1))2, for each component of Δθ, where Δθ(k)=θ(k)θ(k1), where γt is also decay parameter close to 1 and T^(1)=0.

  • Then, in each iteration update the decaying average of the parameter vector by θi(k+1)=θi(k)ϵ+RMS(Δθi)ϵ+RMS(gi)gi(k), where RMS(Δθi)=T^i(k).

  • The above expression eliminates the learning parameter completely.

3.5.9 Adam

  • The adaptive moment estimation, or simply adam, is a another method that adapts learning rates to each parameter θi.

  • In addition to using the exponentially decaying average of squared gradients (like RMSprop and Adadelta), the adam method also uses exponentially decaying average of gradients (like the momentum method).

Recall that the momentum method can be seen as a ball rolling down a slope. The adam method can be seen as a heavy ball with friction rolling down the slope.

  • The corresponding updates in each iteration of the adam method are: v(k+1)=γvv(k)+(1γv)g(k),s(k+1)=γss(k)+(1γs)(g(k)g(k)).

  • v(k) and s(k+1) are estimates of the moment and the second moment of the gradient, respectively (hence the name of the method).

  • Since the initial values v(1)=s(1)=0, the estimates v(k) and s(k) are biased towards zero, particularly when decay rates are small (i.e., γv and γs are close to 1).

  • Since γs and γv are close to 1, for small k, v(k) and s(k+1) are biased towards 0. To eliminate this bias, the corrected (or unbiased) updates are (1)v^(k+1)=v(k+1)(1γvk),(2)s^(k+1)=s(k+1)(1γsk)

  • Finally, the parameter vector update is given by

θ(k+1)=θ(k)αv^(k+1)ϵ+s^(k+1), where ϵ is again on the order of 1×108 to avoid division by zero.

The authors propose that a good default choice to be γv=0.9, γs=0.999, and α=0.001.

3.5.10 Hypergradient Descent

  • The learning rate decides how sensitive the descent method to the gradient, and it has a substantial influence on the performance of the algorithm.

  • The accelerated descent methods presented above are either extremely sensitive to the learning rate or go to great extent to adapt the learning rate during execution.

  • The goal of the hypergradient methods is to use derivative of the learning rate parameter for improving the optimizer performance. This method applies gradient descent on the learning rate parameter.

  • The update of the learning parameter consists of the partial derivative of the objective function f with respect to the learning parameter: α(k+1)=α(k)μf(θ(k))α, where μ is the hypergradient learning rate.

  • For gradient descent method, f(θ(k))α=(g(k))α(θ(k1)αg(k1))=g(k)g(k1)

  • Similarly, hypergradient descent idea can be applied to other gradient-based methods.

  • The key advantage of the hypergradient method is that it reduces the sensitivity of the algorithm to the hyperparameter.

3.6 Second-Order Methods

In the previous section on first-order methods, we used the gradient of the objective function f in the first-order approximation. In this section, we present some well-known methods that use second order approximations of f.

3.6.1 Newton’s Method

  • As we have seen in the first-order methods, objective function value and its gradient help us in finding the direction of the next step. However, this information does not imply how far we need to travel.

  • On the other hand, Newton’s method uses a quadratic approximation of the objective function and finds the next step size by minimizing the quadratic function.

  • Consider the univariate case. In the kth iteration, let q(x) be the quadratic approximation of the objective function f given by f(θ)q(θ)=f(θ(k))+(θθ(k))f(θ(k))+(θθ(k))22f(θ(k)).

  • Then the update is given by setting the derivative of q(x) to zero: θq(θ)=f(θ(k))+(θθ(k))f(θ(k))=0, then the solution is, θ(k+1)=θ(k)f(θ(k))f(θ(k)). For instance, suppose that f(θ)=(θ1)2(θ+3) and θ(k)=3. Then, θ(k+1)=1.6.

  • For multivariate objective functions, update of Newton’s method is very similar to the univariate case: gradient replaces the derivative and Hessian replaces the second-order derivative. The quadratic approximation is given by f(θ)q(θ)=f(θ(k))+(θθ(k))g(k)+12(θθ(k))(H(k)(θ(k))(θθ(k))), where f(θ(k)) and H(k) are, respectively, the gradient and Hessian of f at θ(k). Then, the update can be obtained by taking gradient of q(θ): q(θ)=g(k)+H(k)(θθ(k))=0. Therefore, the update is given by θ(k+1)=θ(k)(H(k))1g(k) given that H(k) is non-singular (i.e., inverse exists).

  • Under certain conditions, the Newton’s method can exhibit a qudratic convergence, that is, there exists a constant c such that θ(k+1)θθ(k)θ2c, for all k1, where θ is the limit of the sequence θ(1),θ(2),. In other words, the error in each iteration is proportional to the squared error of the previous iteration.

3.6.2 Instability of the Newton’s method

  • Update of θ involves decision by the second derivative f. So, the update is undefined if f(θ(k))=0.

  • It can be instable when the second derivative f(θ(k)) is very close to zero, because then θ(k+1) can be too far from θ(k) to make the quadratic approximation invalid.

  • The following figures show situations where Newton’s method fail to work:

    Overshoot: Consider f(θ)=(θ3)2(t+3) and θ(k)=1.5. Then, θ(k+1)=5.25. It is clearly a overshoot.



Oscillation: Consider f(θ)=θ4/4+5θ2/2. If θ(k)=1, then we get θ(k+1)=1. That gives, θ(k+2)=1. So, the Newton’s methods continue to oscillate between 1 and 1.

3.6.3 Secant Method

  • Consider the univariate case. Implementation of Newton’s method requires both the first derivative f and the second derivative f. In many scenarios, we can have access to f, but not f.

  • The secant method is a modification of Newton’s method where the second derivative in each iteration is approximated by the last two iterations: f(θ(k))f(θ(k))f(θ(k1))θ(k)θ(k1). That is, the updates in Secant method are given by θ(k+1)=θ(k)θ(k)θ(k1)f(θ(k))f(θ(k1))f(θ(k)).

  • This method may take more iterations for convergence than Newton’s method and it also suffers from the problems associated with Newton’s method.

3.6.4 Quasi-Newton Methods

  • Consider multivariate cases where computing the inverse of Hessian is difficult or impossible.

  • As in the case of the secant method, quasi-Newton methods modifies the Newton’s method by approximating the inverse of Hessian.

  • The general update form the quasi-Newton methods is θ(k+1)=θ(k)α(k)Q(k)g(k), where α(k) is scalar step factor and Q(k) is an approximation of (H(k))1 at θ(k). That is, the descent direction d is given by d=Q(k)g(k).

  • Quasi-Newton methods typically starts with Q(1) being the identity matrix.

  • To simplify the notation, define γ(k+1)=g(k+1)g(k), and recall that Δθ(k+1)=θ(k+1)θ(k)

  • We now discuss three quasi-Newton methods.

  • Davidon-Fletcher-Powell (DFP): This method uses the following update on Q:

(3.6)Q(k+1)=Q(k)Q(k)γ(k)(γ(k))Q(k)(γ(k))Q(k)γ(k)+Δθ(k)(Δθ(k))(Δθ(k))γ(k),

where denotes the transpose operation.

  • The update of Q in (3.6) has the following properties:

    1. If Q(1) is symmetric and positive definite, Q(k) is symmetric and positive definite in every iteration k;

    2. If f(θ) has a quadratic form, that is, f(θ)=12θAθ+bθ+c, then Q=A1. Thus, for this case, DFP and conjugate gradient methods have the same rate of convergence;

    3. Storing and updating Q can be computationally demanding for high-dimensional problems, compared to conjugate gradient method;


Exercise: Prove properties (i) and (ii).


  • Broyden-Fletcher-Goldfarb-Shanno (BFGS): This method is an alternative to DFP method, and the update is as follows: Q(k+1)=Q(k)(Δθ(k)(γ(k))Q(k)+Q(k)γ(k)(Δθ(k))(Δθ(k))γ(k))+(1+(γ(k))Q(k)γ(k)(Δθ(k))γ(k))Δθ(k)(Δθ(k))(Δθ(k))γ(k). The BFGS is known to work better than the DFP method with approximate line search. However, storing n×n matrix is challenging for higher dimensional problems.

  • Limited-memory BFGS (L-BFGS): To remedy the above issue with BFGS, limited memory BFGS (L-BFGS) approximates BFGS by storing the last m of γ and Δθ, rather than the storing the entire Hessian matrix.

    In each iteration, the procedure has two steps to find the descent direction when current design point is θ.
    First, backward procedure: Take q(m)=f(θ), and define for each i=m1,,1 backward, q(i)=q(i+1)(Δθ(i+1))q(i+1)(γ(i+1))Δθ(i+1)γ(i+1). Then forward procedure: Take z(0)=γ(m)Δθ(m)q(m)(γ(m))γ(m), and define for i=1,,m, z(i)=z(i1)+Δθ(i1)((Δθ(i1))q(i1)(γ(i1))Δθ(i1)(γ(i1))z(i1)(γ(i1))Δθ(i1)). Finally, the descent direction is given by d=z(m).


Exercise: Consider the Rosenbrock function mentioned earlier. Impliment and compare DFP, BFGS, and L_BFGS (with m=3).


Page built: 2021-03-04 using R version 4.0.3 (2020-10-10)