Mathematical Engineering
of Deep Learning
Benoit Liquet, Sarat Moka and Yoni Nazarathy
March 3, 2026
5 Feedforward Deep Networks
We now enter the heart of deep learning by presenting key concepts for feedforward deep neural networks, also known as general fully connected neural networks. These models are a powerful extension of the shallow neural networks presented previously and are the central entities of deep learning. The key mathematical objects used for such networks are highlighted in this chapter. These include layers with neurons, activation function alternatives, and the backpropagation algorithm for gradient evaluation. A universal approximation theorem is stated, and further demonstrations highlight the benefits of exploiting deep neural networks for machine learning tasks. Key steps for training a deep neural network are presented, including weight initialization and batch normalization. We then focus on key strategies for handling overfitting, namely dropout and the addition of regularization terms.
In Section 5.1 we present details of the general fully connected architecture. In Section 5.2 we explore the expressive power of neural networks to get a feel for why the general fully connected architecture is such a useful machine learning model. Towards that end, we explore a few theoretical underpinnings for the power of deep learning. In Section 5.3 we introduce activation functions beyond the sigmoid and softmax functions which were already presented in Chapter ??. In Section 5.4 we study the backpropagation algorithm used for gradient evaluation. This algorithm, which is a form of backward mode automatic differentiation, is at the core of training deep learning models. In Section 5.5 we study various methods for weight initialization. In Section 5.6 we introduce the idea of batch normalization. In Section 5.7 we explore methods for mitigating overfitting. These include the addition of regularization terms and the concept of dropout.
5.1 The General Fully Connected Architecture
We refer to the generic deep learning model as the general fully connected architecture and
also mention that such a model can be called a fully connected network, a feedforward
network, or a dense network where each of these terms may also be augmented with the
phrases “deep”, “neural”, and “general”. Another name for such a model is a multi-layer
perceptron (MLP) or a multi-layer dense network.
This architecture involves multiple layers, each with non-linear transformations, and is an extension of the shallow neural networks presented in Chapter ??. Schematic illustrations of this architecture are presented in Figure 5.1. In that figure, each circle may be called a neuron or unit and each vertical set of neurons is a layer. A network with a single hidden layer is in (a) and a more general multi-layer network is in (b). Compare these illustrations with Figure ?? which has no hidden layers and a single output (logistic regression), or Figure ?? which also has no hidden layers and has multiple outputs (multinomial regression). Returning to Figure 5.1, the leftmost layer is called the input layer and it holds the input features vector $x$. Each element of this layer is sometimes called an input neuron even though
Figure 5.1: Fully connected feedforward neural networks. (a) A network with a single hidden layer.
(b) A deep neural network with multiple hidden layers.
it does not involve any computation. The rightmost layer, or output layer, contains the output neurons (just one in this example). The layers in the middle are called hidden layers, since
the neurons in these layers are neither inputs nor outputs. When using the network in
production for prediction (regression or classification), we do not directly observe what goes
on in the hidden layers, but rather observe network outputs resulting from inputs. However
in certain cases, the values of neurons in hidden layers are also called “features” or more
precisely, extracted features (also known as computed features or derived features) since they
may encode some intermediate summary of the input data.
The design of the input and output layers in a network is often straightforward. There are
as many neurons in the input layer as the number of features. As for the output layer, the
number of output neurons depends on the application. In certain cases the output can be a
scalar, determining either a probability in binary classification, or a response in a regression
problem. In other cases, the output may be a vector, such as for example in multi-class
classification where there will typically be as many output neurons as the number of possible
labels (classes) and the output is a probability vector. In contrast to the input and output
layers, when selecting the number and size of the hidden layers, there is much room for
variation of model choice.
For example, assume we wish to create a model for classification of a type of plant based on $p = 120$ measured indicators (features). Assume further that our classifier supports 30 different types of plants. Then in this case we may use a model where the input layer has 120 input neurons and the output layer has 30 output neurons which may be interpreted as probabilities, similar to the multinomial regression model of Chapter ??. With this, there still remains a choice of how many hidden layers to use, and how many neurons to use for each of those hidden layers. These choices determine the number of trained parameters, the model expressivity, and the ease or difficulty of training the model.
Finally note that in our terminology, when counting layers in a multi-layer network, we
count hidden layers as well as the output layer, but we do not count an input layer. Hence
with our terminology, Figure 5.1 (a) has 2 layers, and (b) has 4 layers.
A Model Based on Function Composition
The goal of a feedforward network is to approximate some function $f : \mathbb{R}^p \to \mathbb{R}^q$. A feedforward network model defines a mapping $f_\theta : \mathbb{R}^p \to \mathbb{R}^q$ and learns the value of the parameters $\theta$ that ideally result in $f_\theta(x) \approx f(x)$.

The function $f_\theta$ is recursively composed via a chain of functions:

$$ f_\theta(x) = f^{[L]}_{\theta^{[L]}}\Big(f^{[L-1]}_{\theta^{[L-1]}}\big(\ldots\big(f^{[1]}_{\theta^{[1]}}(x)\big)\ldots\big)\Big), \tag{5.1} $$
where $f^{[\ell]}_{\theta^{[\ell]}} : \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}$ is associated with the $\ell$-th layer and depends on parameters $\theta^{[\ell]} \in \Theta^{[\ell]}$, where $\Theta^{[\ell]}$ is the parameter space for the $\ell$-th layer. The depth of the network is $L$. We have that $N_0 = p$ (the number of features) and $N_L = q$ (the number of output variables). Note that in the case of networks used for classification we typically have $q = K$, the number of classes. The number of neurons in the network is $\sum_{\ell=1}^{L} N_\ell$.
Affine Transformations Followed by Activations
In deep learning, the function $f^{[\ell]}_{\theta^{[\ell]}}$ is generally defined by an affine transformation followed by an activation function. Activation functions are the means of introducing non-linearity into the model. The output of layer $\ell$ is represented by the vector $a^{[\ell]}$ and the intermediate result of the affine transformation is represented by the vector $z^{[\ell]}$ (see also Figure ?? in Chapter ??). We typically denote the output of the model via $\hat{y}$ and hence, $\hat{y} = a^{[L]} = f_\theta(x)$.

The action of $f^{[\ell]}_{\theta^{[\ell]}}$ can be schematically represented as follows,

$$ a^{[\ell-1]} \;\longmapsto\; \underbrace{z^{[\ell]} := W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}}_{\text{Affine transformation}} \;\longmapsto\; \underbrace{a^{[\ell]} := S^{[\ell]}\big(z^{[\ell]}\big)}_{\text{Activation}}, \tag{5.2} $$
where $a^{[0]} = x$. Hence the parameters of the $\ell$-th layer, $\theta^{[\ell]}$, are given by the $N_\ell \times N_{\ell-1}$ weight matrix $W^{[\ell]} = \big(w^{[\ell]}_{i,j}\big)$ and the $N_\ell$ dimensional bias vector $b^{[\ell]} = \big(b^{[\ell]}_i\big)$. Thus the parameter space of the layer is $\Theta^{[\ell]} = \mathbb{R}^{N_\ell \times N_{\ell-1}} \times \mathbb{R}^{N_\ell}$.
The activation function $S^{[\ell]} : \mathbb{R}^{N_\ell} \to \mathbb{R}^{N_\ell}$ is a non-linear multivalued function. For $\ell = 1, \ldots, L-1$ it is generally of the form

$$ S^{[\ell]}(z) = \Big[\, \sigma^{[\ell]}(z_1) \;\; \ldots \;\; \sigma^{[\ell]}(z_{N_\ell}) \,\Big], \tag{5.3} $$

where $\sigma^{[\ell]} : \mathbb{R} \to \mathbb{R}$ is typically an activation function common to all hidden layers. For the output layer, $\ell = L$, it is often of a different form depending on the task at hand.

In the popular case of multi-class classification, a softmax function as in (??) is used, or more specifically for binary classification, a sigmoid function as in (??) is typically used; see Chapter ?? for background. Thus, in such a classification framework, the output of the network is a vector of probability values determining class membership. In order to get a class label prediction one can convert the predicted probability scores into a class label using a chosen threshold, or more simply maximum a posteriori probability, as described in (??).
The Forward Pass
The forward pass equation, (5.1), of a deep neural network can be expanded out as follows,

$$
\begin{aligned}
\text{Layer } 1: \quad & z^{[1]} = W^{[1]} \underbrace{x}_{\text{input}} + b^{[1]}, & a^{[1]} &= S^{[1]}\big(z^{[1]}\big) \\
\text{Layer } 2: \quad & z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, & a^{[2]} &= S^{[2]}\big(z^{[2]}\big) \\
& \qquad \vdots \\
\text{Layer } L: \quad & z^{[L]} = W^{[L]} a^{[L-1]} + b^{[L]}, & \underbrace{\hat{y}}_{\text{output}} &= a^{[L]} = S^{[L]}\big(z^{[L]}\big).
\end{aligned} \tag{5.4}
$$
Thus, for a given input $x$, when computing $f_\theta(x)$, we sequentially execute the affine transformations and activation functions from layer 1 to layer $L$. The computational cost of such a forward pass is of the order of the total number of weights. To see this, observe that at the $\ell$-th layer, the costs to compute $z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}$ and $a^{[\ell]} = S^{[\ell]}\big(z^{[\ell]}\big)$ are of the order $N_\ell \times N_{\ell-1} + N_\ell$ and $N_\ell$, respectively. Hence, the total computational cost is of the order $\sum_{\ell=1}^{L} N_\ell (N_{\ell-1} + 2) \approx \sum_{\ell=1}^{L} N_\ell N_{\ell-1}$, which is the total number of weights.
An Example with Concrete Dimensions
As an illustration, let us return to the networks depicted in Figure 5.1. Consider the network in (a) with one hidden layer. Here $N_0 = p = 4$, $N_1 = 5$, and $N_2 = q = 1$. Hence the dimension of $W^{[1]}$ is $5 \times 4$, the dimension of $b^{[1]}$ is 5, the dimension of $W^{[2]}$ is $1 \times 5$, and the dimension
of $b^{[2]}$ is 1. Putting these elements together we obtain,

$$ f_\theta(x) = \underbrace{S^{[2]}\Big(\underbrace{W^{[2]} \underbrace{S^{[1]}\big(W^{[1]} x + b^{[1]}\big)}_{a^{[1]}} + b^{[2]}}_{z^{[2]}}\Big)}_{a^{[2]}}, \tag{5.5} $$

where the number of parameters in $\theta$ is $5 \times 4 + 5 + 1 \times 5 + 1 = 31$. Similarly, the deeper network on the right of Figure 5.1 is represented via,

$$ f_\theta(x) = S^{[4]}\Big(W^{[4]} S^{[3]}\big(W^{[3]} S^{[2]}\big(W^{[2]} S^{[1]}\big(W^{[1]} x + b^{[1]}\big) + b^{[2]}\big) + b^{[3]}\big) + b^{[4]}\Big). \tag{5.6} $$
We may work out that the number of parameters is,

$$ \underbrace{4 \times 4 + 4}_{\text{Hidden layer 1}} \;+\; \underbrace{3 \times 4 + 3}_{\text{Hidden layer 2}} \;+\; \underbrace{5 \times 3 + 5}_{\text{Hidden layer 3}} \;+\; \underbrace{1 \times 5 + 1}_{\text{Output layer}} \;=\; 61. $$
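These counts follow mechanically from the layer dimensions: layer $\ell$ contributes an $N_\ell \times N_{\ell-1}$ weight matrix plus an $N_\ell$ bias vector. A small helper (a sketch; `num_params` is our own name, not from the text) reproduces both counts:

```python
def num_params(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes = [N_0, N_1, ..., N_L]; each layer contributes an
    (N_ell x N_{ell-1}) weight matrix plus an N_ell bias vector.
    """
    return sum(n_out * n_in + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Network (a) of Figure 5.1: 4 inputs, one hidden layer of 5, one output.
assert num_params([4, 5, 1]) == 31
# Network (b): hidden layers of sizes 4, 3, 5.
assert num_params([4, 4, 3, 5, 1]) == 61
```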
The Scalar Based View of the Model
It is also instructive to consider the scalar view of the system. The $i$-th neuron of layer $\ell$, with $i = 1, \ldots, N_\ell$, is typically composed of both $z^{[\ell]}_i$ and $a^{[\ell]}_i$. The transition from layer $\ell - 1$ to layer $\ell$ takes the output of layer $\ell - 1$, an $N_{\ell-1}$ dimensional vector, and operates on it as follows,

$$
\text{Affine transformation:} \quad
z^{[\ell]}_1 = w^{[\ell]}_{(1)} a^{[\ell-1]} + b^{[\ell]}_1, \quad
z^{[\ell]}_2 = w^{[\ell]}_{(2)} a^{[\ell-1]} + b^{[\ell]}_2, \quad
\ldots, \quad
z^{[\ell]}_{N_\ell} = w^{[\ell]}_{(N_\ell)} a^{[\ell-1]} + b^{[\ell]}_{N_\ell},
$$

$$
\text{Activation step:} \quad
a^{[\ell]}_1 = \sigma\big(z^{[\ell]}_1\big), \quad
a^{[\ell]}_2 = \sigma\big(z^{[\ell]}_2\big), \quad
\ldots, \quad
a^{[\ell]}_{N_\ell} = \sigma\big(z^{[\ell]}_{N_\ell}\big), \tag{5.7}
$$

where,

$$ w^{[\ell]}_{(j)} = \Big[\, w^{[\ell]}_{j,1} \;\; \ldots \;\; w^{[\ell]}_{j,N_{\ell-1}} \,\Big], \qquad \text{for } j = 1, \ldots, N_\ell, $$

is the $j$-th row of the weight matrix $W^{[\ell]}$, and $b^{[\ell]}_j$ is the $j$-th element of the bias vector $b^{[\ell]}$. Hence the parameters associated with neuron $j$ in layer $\ell$ are $w^{[\ell]}_{(j)}$ and $b^{[\ell]}_j$.
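The scalar view and the matrix view are, of course, the same computation. A quick check (illustrative only; the variable names are ours) confirms that computing each $z^{[\ell]}_j$ from its row $w^{[\ell]}_{(j)}$ agrees with the matrix product:

```python
import numpy as np

rng = np.random.default_rng(1)
N_prev, N = 3, 4                       # layer sizes N_{ell-1} and N_ell
W = rng.normal(size=(N, N_prev))       # weight matrix W^{[ell]}
b = rng.normal(size=N)                 # bias vector b^{[ell]}
a_prev = rng.normal(size=N_prev)       # output of the previous layer

# Matrix form of the affine transformation.
z_vec = W @ a_prev + b

# Scalar form: neuron j uses the j-th row w_(j) of W and the bias b_j.
z_scalar = np.array([W[j, :] @ a_prev + b[j] for j in range(N)])

assert np.allclose(z_vec, z_scalar)
```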
Vectorizing Across Multiple Samples
So far we have defined our neural network using only one input feature vector $x$ to generate a prediction $\hat{y}$. Namely, $x \to a^{[L]} = \hat{y}$. Let us now consider mini-batches as introduced in Section ??. Here we use $n_b$ training samples $x^{(1)}, \ldots, x^{(n_b)}$, where $n_b$ is the size of the mini-batch and $x^{(i)} \in \mathbb{R}^p$. We can use the
feedforward pass equation to get for each training sample $x^{(i)}$ a prediction,

$$ x^{(i)} \to a^{[L](i)} = \hat{y}^{(i)}, \qquad i = 1, \ldots, n_b. $$

One can clearly iterate using a loop for getting all the predictions. However, a matrix representation is often useful for describing the prediction of the whole mini-batch together. This is particularly useful in GPU implementations when the mini-batch size, $n_b$, is appropriately chosen to fit GPU memory. Let us first define the matrix $X$ in which every column is a feature vector for one training sample:

$$ X = \begin{bmatrix} | & | & & | \\ x^{(1)} & x^{(2)} & \ldots & x^{(n_b)} \\ | & | & & | \end{bmatrix}. \tag{5.8} $$
Then, for each $\ell$, we define the matrix $Z^{[\ell]}$ with columns $z^{[\ell](1)}, \ldots, z^{[\ell](n_b)}$ as,

$$ Z^{[\ell]} = \begin{bmatrix} | & | & & | \\ z^{[\ell](1)} & z^{[\ell](2)} & \ldots & z^{[\ell](n_b)} \\ | & | & & | \end{bmatrix}. $$

The activation matrix $A^{[\ell]}$ for each layer is defined similarly via,

$$ A^{[\ell]} = \begin{bmatrix} | & | & & | \\ a^{[\ell](1)} & a^{[\ell](2)} & \ldots & a^{[\ell](n_b)} \\ | & | & & | \end{bmatrix}, $$

where for example the element in the first row and second column of the matrix $A^{[\ell]}$ is the activation of the first hidden unit of layer $\ell$ for the second training example. Based on this matrix notation and the forward pass representation (5.4), we get,

$$
\begin{aligned}
Z^{[1]} &= W^{[1]} X + B^{[1]}, & A^{[1]} &= \sigma^{[1]}\big(Z^{[1]}\big) \\
Z^{[2]} &= W^{[2]} A^{[1]} + B^{[2]}, & A^{[2]} &= \sigma^{[2]}\big(Z^{[2]}\big) \\
& \qquad \vdots \\
Z^{[L]} &= W^{[L]} A^{[L-1]} + B^{[L]}, & \underbrace{A^{[L]}}_{[\hat{y}^{(1)}, \ldots, \hat{y}^{(n_b)}]} &= S^{[L]}\big(Z^{[L]}\big)
\end{aligned} \tag{5.9}
$$
where the scalar activation functions $\sigma^{[\ell]}(\cdot)$, as in (5.3), are applied independently to each element of the matrix $Z^{[\ell]}$ for $\ell = 1, \ldots, L-1$, while $S^{[L]}(\cdot)$ is applied independently to each column of $Z^{[L]}$. For this representation, the matrix $B^{[\ell]}$ for each layer $\ell$ is based on the bias vector $b^{[\ell]}$ and is constructed via,

$$ B^{[\ell]} = \begin{bmatrix} | & | & & | \\ b^{[\ell]} & b^{[\ell]} & \ldots & b^{[\ell]} \\ | & | & & | \end{bmatrix}. \tag{5.10} $$
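In array libraries the matrix $B^{[\ell]}$ of (5.10) need not be materialized: adding the bias as a column vector broadcasts it across the $n_b$ columns. A minimal sketch of the batched pass (5.9), with names of our own choosing and tanh used for every layer purely for illustration:

```python
import numpy as np

def forward_batch(X, weights, biases, act):
    """Forward pass over a mini-batch stored column-wise, as in (5.9).

    X has shape (p, n_b): column i is sample x^{(i)}. Adding the bias
    as a (N_ell, 1) column broadcasts it over columns, playing the
    role of B^{[ell]} in (5.10) without building that matrix.
    """
    A = X
    for W, b in zip(weights, biases):
        Z = W @ A + b[:, None]
        A = act(Z)
    return A

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 8))                     # p = 4, n_b = 8
Ws = [rng.normal(size=(5, 4)), rng.normal(size=(1, 5))]
bs = [rng.normal(size=5), rng.normal(size=1)]

Y_batch = forward_batch(X, Ws, bs, np.tanh)
# Column i of the batched result equals the single-sample forward pass.
y0 = np.tanh(Ws[1] @ np.tanh(Ws[0] @ X[:, 0] + bs[0]) + bs[1])
assert np.allclose(Y_batch[:, 0], y0)
```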
An Overview of Model Training
Training of feedforward deep neural networks follows the same iterative optimization based learning paradigm presented in Section ?? of Chapter ??. This paradigm was also applied to simple neural networks in Chapter ??. Specifically, a standardized training dataset $\mathcal{D} = \big\{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\big\}$ is used to find network parameters $\theta$ (weights and biases) that minimize a loss function $C(\theta; \mathcal{D})$. As illustrated in previous chapters, common loss functions include the cross entropy loss (binary or categorical) in the context of classification or the square loss function in the context of regression.
In practice, gradient based optimization generalizing the gradient descent of Algorithm ?? is the typical technique for training. Further, as discussed in detail in Chapter ??, a very common variant is ADAM, presented in Algorithm ??. In any case, such learning algorithms require the evaluation of gradients, e.g., Step 3 of Algorithm ?? or Step 4 of Algorithm ??. For deep learning models, such gradient evaluation is carried out via the backpropagation algorithm which we study in detail in Section 5.4 below.
Another important aspect of iterative optimization involves parameter initialization. This is Step 1 of Algorithm ?? or Step 2 of Algorithm ??. In the context of deep learning this is called weight initialization, a topic we discuss in Section 5.5.
Finally, other aspects of training deep neural networks include batch normalization presented
in Section 5.6 and dropout presented in Section 5.7, as well as other methods of regularization
also presented in that section.
5.2 The Expressive Power of Neural Networks
Deep neural network models are extremely expressive and versatile. Research about their
strength dates all the way back to the early days of artificial intelligence and continues in
current times. We now explore the expressivity of such models where we simply touch the
tip of the iceberg, attempting to illustrate why the model is sensible and versatile.
Neural networks are known for being able to approximate arbitrarily complex functions. Our exposition in this section aims to illustrate this while also presenting intuition about the benefits of deep models. We begin with a simple constructive example for scalar real valued functions, and then progress to an overview of key results and intuition which highlight why these models are so powerful.
Simple Function Approximation
We now see one possible way to approximate scalar functions via a neural network. Say we have a function $f : [-l, l] \to \mathbb{R}$ and we are given the values of the function, $v_i = f(u_i)$, at points $u_1, \ldots, u_r$ with $-l \le u_1 < u_2 < \ldots < u_r \le l$. See for example Figure 5.2, focusing on this arbitrary example,

$$ f(x) = \cos\Big(2 \cos\Big(\frac{x}{2}\Big)\Big) + \frac{1}{4}(x - 1)^2 + \frac{x}{2} + 1, \qquad \text{for } x \in [0, 4]. \tag{5.11} $$
Approximation functions obtained for $f(\cdot)$ of (5.11) are also illustrated in Figure 5.2, where the case of $r = 10$ is in (b) and the case of $r = 30$ is in (c). Our approximation of $f(\cdot)$ uses a feedforward neural network $f_\theta(\cdot)$ as illustrated in (a). In this case we apply a construction of a piecewise continuous affine function $f_\theta : \mathbb{R} \to \mathbb{R}$ with breakpoints at $u_1, \ldots, u_r$ and $f_\theta(u_i) = f(u_i)$ for $i = 1, \ldots, r$. For completeness, we now present the details.
Our constructed network has one hidden layer ($L = 2$), and model dimensions $N_0 = 1$, $N_1 = r - 1$, and $N_2 = 1$. In this case, the approximation is obtained by setting the scalar activation functions as in (5.3) to be $\sigma^{[1]}(u) = \max(u, 0)$ and $\sigma^{[2]}(u) = u$, where the former is called the ReLU function, see also (5.17), and the latter is simply the identity function. We set the first $N_1 \times 1$ dimensional weight matrix, $W^{[1]}$, to have all entries 1; the first $N_1$ dimensional bias vector to have entries $b^{[1]}_i = -u_i$; the second $1 \times N_1$ dimensional weight matrix is set with entries $w^{[2]}_{1,i} = s_i - s_{i-1}$ for $i = 1, \ldots, r - 1$, with $s_0 = 0$ and

$$ s_i = \frac{v_{i+1} - v_i}{u_{i+1} - u_i}; $$

and finally the second scalar bias $b^{[2]}$ is set to be $v_1$.
In summary, the construction uses a linear combination of shifted ReLU functions (see also Figure 5.6), where each shifted ReLU (shifted by $u_i$) is constructed to match the interval $[u_i, u_{i+1}]$ with a slope $s_i$. As the construction moves from left to right, slopes are cancelled out via subtraction of the $s_{i-1}$ term in $w^{[2]}_{1,i}$.
With such a construction we see that essentially any continuous real valued function on a bounded domain can be approximated via a neural network.¹ It is evident that as $r \to \infty$ the approximation becomes exact, and this can be made rigorous for any continuous target function $f(\cdot)$ on a bounded interval.
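The construction above can be checked numerically. The sketch below (our own code, following the weights described in the text) builds the ReLU network for the target (5.11) and verifies that it interpolates $f$ at every breakpoint $u_i$:

```python
import numpy as np

def f(x):
    # The target function (5.11).
    return np.cos(2 * np.cos(x / 2)) + 0.25 * (x - 1) ** 2 + x / 2 + 1

# Sample r points on [0, 4] and build the construction from the text:
# W^{[1]} all ones, b^{[1]}_i = -u_i, second-layer weights s_i - s_{i-1}
# (with s_0 = 0), and second bias v_1.
r = 10
u = np.linspace(0.0, 4.0, r)
v = f(u)
s = np.diff(v) / np.diff(u)                 # slopes s_1, ..., s_{r-1}
w2 = np.diff(np.concatenate(([0.0], s)))    # s_i - s_{i-1}

def f_theta(x):
    hidden = np.maximum(x - u[:-1], 0.0)    # r - 1 shifted ReLU units
    return w2 @ hidden + v[0]

# The network matches f exactly at every breakpoint u_i.
errs = [abs(f_theta(ui) - f(ui)) for ui in u]
```

Between breakpoints the network is affine, so increasing `r` tightens the approximation everywhere on the interval.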
A General Approximation Result
The expressive power of feedforward neural networks generalizes to functions that have $p$ inputs and $q$ outputs. In fact, as the result below shows, similarly to the construction above, a single hidden layer ($L = 2$) with identity activations in the second layer suffices for this.
Theorem 5.1. Consider a continuous function $f : K \to \mathbb{R}^q$ where $K \subseteq \mathbb{R}^p$ is a compact set. Then for any non-polynomial activation function $\sigma^{[1]}(\cdot)$ and any $\varepsilon > 0$, there exists an $N_1$ and parameters $W^{[1]} \in \mathbb{R}^{N_1 \times p}$, $b^{[1]} \in \mathbb{R}^{N_1}$, and $W^{[2]} \in \mathbb{R}^{q \times N_1}$, such that the function

$$ f_\theta(x) = W^{[2]} S^{[1]}\big(W^{[1]} x + b^{[1]}\big), \qquad \text{with } S^{[1]} \text{ as in (5.3)}, $$

satisfies $\|f_\theta(x) - f(x)\| < \varepsilon$ for all $x \in K$.
Hence this theorem states that essentially all functions can be approximated to arbitrary precision dictated via $\varepsilon$. Practically, for complicated functions $f(\cdot)$ and small $\varepsilon$ one may need a large $N_1$. Yet, the theorem states that it is always possible. A reference to a proof is provided in the notes and references section at the end of this chapter. The constructive example in Figure 5.2 above (for $p = 1$ and $q = 1$) may hint at the validity of the result. Note also that the tanh, sigmoid, and ReLU activation functions described in the next section are some of the valid activation functions for this result.
¹ This construction is one of many options one could use.
Figure 5.2: Approximation of an arbitrary real valued function on a bounded domain via a piecewise continuous function obtained by a single hidden layer feedforward model with ReLU activation functions. (a) A neural network with one hidden layer that constructs such an approximation. (b) Approximation based on $r = 10$ sampled points of the function. (c) Approximation based on $r = 30$ sampled points of the function.
The Strength of a Hidden Layer
As we saw in Chapter ??, shallow neural networks such as logistic regression or softmax regression can be used to create classifiers with linear decision boundaries. Further, as in Section ??, for cases where more general decision boundaries are needed, one may attempt to create additional transformed features while still using these basic models. However, the expressiveness of models with a single hidden layer (or more), as introduced in the current chapter, can yield a versatile alternative to the shallow networks of Chapter ??.
Figure 5.3: Binary classification example with $x \in \mathbb{R}^2$, moving from a shallow neural network to a model with a hidden layer and then increasing the number of neurons. (a) Sigmoid model ($L = 1$). (b) One hidden layer ($L = 2$) with $N_1 = 4$ neurons. (c) One hidden layer ($L = 2$) with $N_1 = 10$ neurons.
Consider Figure 5.3 (a) for a classification task based on two inputs $x \in \mathbb{R}^2$ using logistic regression. By adding a single hidden layer with 4 neurons (a sigmoid activation function is used for all units), we can move beyond linear boundaries to obtain Figure 5.3 (b). Then, by increasing the number of neurons in the hidden layer from 4 to 10 units, the model further refines the decision boundary as in Figure 5.3 (c).
We thus see that obtaining non-linear decision boundaries is possible not only via feature engineering as discussed in Section ?? and illustrated there in Figure ??, but can also be done via neural networks as hinted by Theorem 5.1 and illustrated here in Figure 5.3.
Stylized Functions via Simple Models
To further appreciate the expressive power of feedforward neural networks we also consider
specific stylized functions. For example, historically, in the study of neural networks much
effort has gone into characterizing the set of logical functions, sometimes called gates, that
may be described via certain classes of models. In our exploration we focus on a different
type of specific task, multiplication of two inputs, and demonstrate a simple constructive
network that approximates this task. We thus construct what may be called a multiplication
gate.
A simple construction of a single hidden layer network with $p = 2$ and $q = 1$ allows us to create a function $f_\theta(\cdot)$, parametrized by $\lambda > 0$, such that for input $x = (x_1, x_2)$, the function approximately implements multiplication of the inputs,

$$ f_\theta(x_1, x_2) \approx x_1 x_2. \tag{5.12} $$
Importantly, the approximation error vanishes as $\lambda \to 0$, and achieving (5.12) to arbitrary accuracy requires only $N_1 = 4$ neurons in the single hidden layer. Similarly to the model in Theorem 5.1 and to the simple function approximation example of Figure 5.2, the activation function of the output layer is the identity. There are no bias terms, and the weight matrices are,

$$ W^{[1]} = \begin{bmatrix} \lambda & \lambda \\ -\lambda & -\lambda \\ \lambda & -\lambda \\ -\lambda & \lambda \end{bmatrix}, \qquad \text{and} \qquad W^{[2]} = \begin{bmatrix} \mu & \mu & -\mu & -\mu \end{bmatrix}, \tag{5.13} $$

with $\mu = \big(4 \lambda^2 \ddot{\sigma}(0)\big)^{-1}$. Here $\ddot{\sigma}(0)$ represents the second derivative of the scalar activation function of the hidden layer ($\ell = 1$) at 0. Hence the model assumes $\sigma^{[1]}(\cdot)$ is twice differentiable (at 0) with a non-zero second derivative at zero. A schematic representation of $f_\theta(\cdot)$ is in Figure 5.4.
Figure 5.4: A simple neural network model with a single hidden layer composed of four neurons. This network approximates multiplication, or a multiplication gate, in the sense that $\hat{y} \approx x_1 x_2$.
We may verify that the model $f_\theta(\cdot)$ evaluates to,

$$ f_\theta(x_1, x_2) = \frac{\sigma\big(\lambda(x_1 + x_2)\big) + \sigma\big(-\lambda(x_1 + x_2)\big) - \sigma\big(\lambda(x_1 - x_2)\big) - \sigma\big(-\lambda(x_1 - x_2)\big)}{4 \lambda^2 \ddot{\sigma}(0)}, \tag{5.14} $$
where we use $\sigma(\cdot)$ to denote $\sigma^{[1]}(\cdot)$. We may now use a Taylor expansion (see Theorem ?? in Appendix ??) of $\sigma(\cdot)$ around the origin,

$$ \sigma(u) = \sigma(0) + \dot{\sigma}(0)\, u + \ddot{\sigma}(0)\, \frac{u^2}{2} + O\big(u^3\big), \tag{5.15} $$

with $O(h^k)$ denoting a function such that $O(h^k)/h^k$ goes to a constant as $h \to 0$. We can now use (5.14) and (5.15) to represent the model as,
$$ f_\theta(x_1, x_2) = x_1 x_2 \Big(1 + O\big(\lambda(x_1^2 + x_2^2)\big)\Big). $$

Hence as $\lambda \to 0$ the desired goal (5.12) becomes exact. Note that this continuous multiplication gate is mostly a theoretical construct, and for many popular activation functions the condition $\ddot{\sigma}(0) \neq 0$ does not hold. Nevertheless, this problem can be overcome by introducing biases to shift the origin and hence use a different input into the activation. Below we use this construction to further argue about the power of deep learning models.
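As a quick numerical sanity check, we can instantiate the construction with an activation that satisfies $\ddot{\sigma}(0) \neq 0$. The softplus function $\sigma(u) = \ln(1 + e^u)$ works since $\ddot{\sigma}(0) = 1/4$; this choice and the function names are ours, for illustration only:

```python
import math

def softplus(u):
    return math.log1p(math.exp(u))

def mult_gate(x1, x2, lam=1e-2):
    """The four-unit multiplication gate (5.13)-(5.14) with softplus.

    For softplus the second derivative at 0 is 1/4, so
    mu = 1 / (4 * lam**2 * (1/4)) = 1 / lam**2.
    """
    sigma_dd0 = 0.25
    num = (softplus(lam * (x1 + x2)) + softplus(-lam * (x1 + x2))
           - softplus(lam * (x1 - x2)) - softplus(-lam * (x1 - x2)))
    return num / (4 * lam**2 * sigma_dd0)

# With a small lambda the output is close to the exact product.
approx = mult_gate(1.5, -2.0)   # close to -3.0
```

Shrinking `lam` further reduces the error, matching the $\lambda \to 0$ limit above.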
Feature Focus with Neural Networks
We now contrast the usage of feedforward neural networks with the more classic practice of feature engineering, where one would consider the data and add additional features by transforming existing features. See also Section ?? where feature engineering is illustrated for simple neural networks. As an example we consider a case where there are originally $p$ features $x_1, \ldots, x_p$ and we wish to construct $\tilde{p} = p(p+1)/2$ features based on all possible pairwise interactions² (multiplications) $x_i x_j$ for $i, j = 1, \ldots, p$. For instance if $p = 1{,}000$ then we arrive at $\tilde{p} \approx 500{,}000$. Clearly for non-small $p$ we quickly arrive at a huge number of transformed features $\tilde{p}$.
Let us contrast the usage of two alternatives. On the one hand, consider a linear model acting on the transformed features $\tilde{x}$, where for simplicity we ignore the bias (intercept). On the other hand, consider a neural network with a single hidden layer acting on the original features $x$. In the linear model, we have $\tilde{f}_\theta(\tilde{x}) = \tilde{w}^\top \tilde{x}$, where the learned weight vector $\tilde{w}$ has $\tilde{p}$ parameters. In the single hidden layer neural network, there are $p$ inputs, $q = 1$ output, and $N_1$ units in the hidden layer. Thus the number of parameters is $N_1 \times p + N_1 + N_1 + 1$.
It is often the case that not all interactions (product features) are relevant. As an example, let us assume that only a fraction $\alpha$ of the interactions are relevant. We now argue that the neural network model with $N_1$ hidden units sufficiently large, but not necessarily huge, can in principle capture these interactions. To do so, revisit the multiplication example presented above, where 4 hidden units were needed to approximate an interaction (multiplication). While the construction using weight parameters as in (5.13) is artificial, the example hints at the fact that with $N_1 \approx 4 \alpha p$ hidden units we may be able to capture the key interactions. In such a framework, the basic linear model still requires the full set of parameters. With this, compare the number of parameters,

$$ \underbrace{\tfrac{1}{2}\, p(p+1)}_{\text{Linear model}} \qquad \text{vs.} \qquad \underbrace{4 \alpha p(p+2) + 1}_{\text{Neural network}}. $$
² Typically in statistics, interactions refer to terms such as $x_i x_j$ for $i \neq j$. However, in this example we also allow for terms such as $x_i^2$.
Observe that $p^2$ is the dominant term in both models, but for $\alpha < 1/8$ and large $p$, the neural network model has fewer parameters. Keep in mind that often $\alpha$ is very small while $p$ may grow. Returning to the numerical case of $p = 1{,}000$, if $\alpha = 0.02$ (20 significant interactions) then the linear model has an order of $500{,}000$ parameters while the neural network only has an order of $80{,}000$ parameters and is thus parsimonious in comparison to the linear model.
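The two parameter counts above follow from the formulas just compared; a short sketch (function names are ours) makes the comparison concrete for $p = 1{,}000$ and $\alpha = 0.02$:

```python
def linear_model_params(p):
    """Linear model on all p(p+1)/2 pairwise products (no intercept)."""
    return p * (p + 1) // 2

def hidden_layer_params(p, alpha):
    """Single hidden layer network with N_1 = 4*alpha*p hidden units:
    N_1*p weights + N_1 biases + N_1 output weights + 1 output bias."""
    n1 = round(4 * alpha * p)
    return n1 * p + n1 + n1 + 1

p, alpha = 1000, 0.02
lin = linear_model_params(p)          # 500,500 parameters
net = hidden_layer_params(p, alpha)   # about 80,000 parameters
```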
Improvement in Expressivity by an Increase of Depth
Despite the fact that Theorem 5.1 states that almost any function can be approximated using a neural network model with a single hidden layer, practice and research have shown that to gain high expressive power, this model might require a very large number of units ($N_1$ needs to be very large). Hence gaining significant expressive power may require a very large number of parameters. The power of deep learning then arises via repeated composition of non-linear activation functions, that is, via an increase of depth (an increase of $L$).
Note first that if the identity activation function is used in each hidden layer, then the network reduces to a shallow neural network,

$$ f_\theta(x) = S^{[L]}\big(\widetilde{W} x + \tilde{b}\big), $$

where,³

$$ \widetilde{W} = W^{[L]} W^{[L-1]} \cdots W^{[1]}, \qquad \text{and} \qquad \tilde{b} = \sum_{\ell=1}^{L} \Bigg( \prod_{\tilde{\ell} = \ell+1}^{L} W^{[\tilde{\ell}]} \Bigg) b^{[\ell]}. $$
In the case where the identity function is also used for the output layer, the model reduces to a linear (affine) model. Thus, we gain nothing by going deeper and adding multiple layers with identity activations. The expressivity of the neural network comes from the composition of non-linear activation functions. The repeated composition of such functions has significant expressive power and can reduce the number of units needed in each layer in comparison to a network with a single hidden layer. A consequence is that the parameter space is reduced as well.
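The collapse of identity-activation layers into a single affine map can be verified directly. The sketch below (our own illustrative code) runs a three-layer pass with identity activations and checks it against the collapsed $\widetilde{W}$ and $\tilde{b}$:

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = [4, 6, 5, 3]  # N_0, N_1, N_2, N_3 for an L = 3 layer network
Ws = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
bs = [rng.normal(size=sizes[i + 1]) for i in range(3)]
x = rng.normal(size=4)

# Sequential forward pass with identity activations in every layer.
a = x
for W, b in zip(Ws, bs):
    a = W @ a + b

# Collapsed form: W_tilde = W3 W2 W1 and
# b_tilde = W3 W2 b1 + W3 b2 + b3, as in the formula above.
W_tilde = Ws[2] @ Ws[1] @ Ws[0]
b_tilde = Ws[2] @ Ws[1] @ bs[0] + Ws[2] @ bs[1] + bs[2]

assert np.allclose(a, W_tilde @ x + b_tilde)
```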
Let us consider an artificial example to demonstrate why exploiting the depth of the network is crucial for modelling complex relationships between input and output. We revisit the previous example involving models using interactions of order 2 (i.e., $x_i x_j$). Let us consider a higher complexity model by exploiting potential high-order interactions, namely products of $r$ inputs, similarly to the discussion in Section ??.
Consider a fully connected network with $p$ inputs, $q = 1$ output, and $L \ge 2$ layers of the same size ($N^{[\ell]} = N$ in each layer). We re-use the idea appearing in (5.13) that with $N \approx 4 \alpha p$ hidden units we may be able to capture the $\alpha p$ relevant interactions of order $r = 2$. Then by moving forward in the network, the subsequent addition of a layer with $N \approx 4 \alpha p$ hidden units will capture interactions of order $r = 2^2$, and so on, until we capture interactions of order $r = 2^L$ at the output layer. Hence to achieve interactions of order $r$ we may require $L \ge \log_2 r$, e.g., $L = \lceil \log_2 r \rceil$. Such a network is depicted in Figure 5.5, where the number of
³ Note that in the expression $\prod_{\tilde{\ell}=\ell+1}^{L} W^{[\tilde{\ell}]}$ we assume that the product multiplies the matrices in left to right order $W^{[L]} W^{[L-1]} W^{[L-2]} \cdots$. Further, for $\ell = L$ the product is taken as the identity matrix.
parameters is,

$$ \underbrace{N \times p + N}_{\text{First hidden layer}} \;+\; \underbrace{(L-2) \times (N^2 + N)}_{\text{Inner hidden layers}} \;+\; \underbrace{N + 1}_{\text{Output layer}} \;\approx\; L N^2. $$
For example, assume we wish to have a model for $p = 1{,}000$ features that supports about 20 meaningful interactions of order $r = 500$. Hence we can consider $\alpha = 0.02$. With a model not involving hidden layers (e.g., a linear model or a logistic model), as shown in Section ??, we cannot specialize for an order of 20 interactions and thus we require a full model of order $p^r / r! \approx 10^{365}$ parameters. In contrast, with a deep model we require about $L = 10 \ge \log_2 500$ layers and $N = 4 \alpha p = 80$ neurons per layer, and thus the total number of parameters is at the order of $L N^2 = 57{,}600$. This deep construction, which can capture a desired set of meaningful interactions, is clearly more feasible and efficient than the shallow construction of astronomical size.
Figure 5.5: A deep learning model with $L = 10$ hidden layers may express many meaningful interactions for the inputs.
5.3 Activation Function Alternatives
As presented in (5.2), each layer $\ell$ of the network incorporates an activation function, which is generally a non-linear transformation of $z^{[\ell]}$ to arrive at $a^{[\ell]}$. For most layers of the network, the activation function $S^{[\ell]}(\cdot)$ is composed of a sequence of identical scalar valued activation functions as in (5.3). That is, the $\ell$-th layer incorporates a scalar activation function $\sigma^{[\ell]} : \mathbb{R} \to \mathbb{R}$ which is applied to each of the $N_\ell$ coordinates of $z^{[\ell]}$ separately. In some models, all the scalar activation functions across all layers are of the same form, while in other models different layers sometimes incorporate different forms of scalar activation functions.
Scalar Activations and their Derivatives
When it comes to the choice of the scalar activation functions, there are different heuristic considerations that depend on model expressivity and learning ability. These considerations are often not based on theory, yet practice and experience over the years have shown that some scalar activation functions perform much better than others. In Section 5.5 we outline how such choices interface with optimization, and the associated vanishing gradient problem is discussed in Section 5.4. Here we simply outline the key activation functions.
At the onset of the development of deep learning, in the late 1950s, the step scalar activation function was used. That is,
\[
\sigma_{\text{Step}}(u) = \begin{cases} -1 & u < 0, \\ +1 & u \ge 0. \end{cases} \tag{5.16}
\]
However, $\sigma_{\text{Step}}$ is not used in modern neural network models, primarily because its derivative $\dot\sigma_{\text{Step}}(u) = 0$ for all $u \neq 0$. Indeed the derivative of activation functions is important since it is used in the backpropagation algorithm (see the next section) to compute the gradient of the loss function with respect to the model parameters.
The sigmoid activation function (also known as the logistic function), used heavily in Chapter 3, see (??), is a much more popular choice. Note also that its derivative $\dot\sigma_{\text{Sig}}$ can be expressed in terms of the function $\sigma_{\text{Sig}}$ itself,
\[
\sigma_{\text{Sig}}(u) = \frac{e^u}{1 + e^u} = \frac{1}{1 + e^{-u}}, \quad \text{with} \quad \dot\sigma_{\text{Sig}}(u) = \sigma_{\text{Sig}}(u)\bigl(1 - \sigma_{\text{Sig}}(u)\bigr).
\]
A similar popular function is the hyperbolic tangent, denoted tanh,
\[
\sigma_{\text{Tanh}}(u) = \frac{e^u - e^{-u}}{e^u + e^{-u}}, \quad \text{with} \quad \dot\sigma_{\text{Tanh}}(u) = 1 - \sigma_{\text{Tanh}}(u)^2.
\]
Both $\sigma_{\text{Sig}}$ and $\sigma_{\text{Tanh}}$ share similar qualitative properties with $\sigma_{\text{Step}}$. They are non-decreasing and bounded. As $u \to \infty$ both functions converge to unity just like $\sigma_{\text{Step}}$, and as $u \to -\infty$ the sigmoid function converges to 0 while the tanh function converges to $-1$ like $\sigma_{\text{Step}}$. This minor quantitative difference is not significant as it may generally be compensated by learned weights and biases or a shifting and scaling of the functions, yet in practice when the output $\hat{y}$ is a probability in $[0, 1]$ using $\sigma_{\text{Sig}}$ is much more common. See Figure 5.6 for plots of several activation functions.
Figure 5.6: Several common scalar activation functions. (a) The step and tanh functions have a range of $(-1, 1)$ while the sigmoid function has a range of $(0, 1)$. We also plot a sigmoid scaled to the range $(-1, 1)$ so it can be compared to tanh. (b) The ReLU activation function and the leaky ReLU variant with different leaky ReLU parameters.
In earlier applications of deep learning, it was a matter of empirical research, practice, and heuristics to choose between models or layers that use $\sigma_{\text{Sig}}$, $\sigma_{\text{Tanh}}$, or similar forms. However, in more recent years, a completely different type of scalar activation function became popular,
namely the rectified linear unit⁴ or ReLU,
\[
\sigma_{\text{ReLU}}(u) = \max(0, u) = \begin{cases} 0 & u < 0, \\ u & u \ge 0, \end{cases} \quad \text{with} \quad \dot\sigma_{\text{ReLU}}(u) = \mathbb{1}\{u \ge 0\} = \begin{cases} 0 & u < 0, \\ 1 & u \ge 0. \end{cases} \tag{5.17}
\]
Note that while $\sigma_{\text{ReLU}}(u)$ is not differentiable at $u = 0$ we still arbitrarily define the derivative at $u = 0$ to be the same as the derivative for $u > 0$.
As we describe in Section 5.5, when parameters are initialized properly, the unboundedness of $\sigma_{\text{ReLU}}$ often presents an advantage in training over bounded activation functions such as $\sigma_{\text{Sig}}$ since it overcomes a training problem known as the vanishing gradient problem. However, in certain cases it also introduces a problem called dying ReLU since the derivative is 0 for negative inputs. While in practice this is often not considered a major problem, to handle dying ReLU one may use a related activation function, leaky ReLU, parameterized by a fixed small $\varepsilon > 0$ (e.g., $\varepsilon = 0.01$) and defined via,
\[
\sigma_{\text{LeakyReLU}}(u) = \max(0, u) + \min(0, \varepsilon u) = \begin{cases} \varepsilon u & u < 0, \\ u & u \ge 0, \end{cases} \quad \text{with} \quad \dot\sigma_{\text{LeakyReLU}}(u) = \mathbb{1}\{u \ge 0\} + \varepsilon \mathbb{1}\{u < 0\} = \begin{cases} \varepsilon & u < 0, \\ 1 & u \ge 0. \end{cases}
\]
Observe that when $\varepsilon = 0$, this is just the ReLU activation function. Another variant called PReLU (parametric ReLU) considers the leaky ReLU parameter $\varepsilon$ as a learned parameter. That is, the gradient based optimization for the parameters of the network also includes improvement steps of $\varepsilon$, incorporating it as part of the parameters for the $\ell$-th layer, $\theta^{[\ell]}$.
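The scalar activations above and their derivatives translate directly to code. The sketch below (our own NumPy illustration, not from the source) implements each pair and, as a sanity check, verifies the analytic derivatives of sigmoid and tanh against centered finite differences.

```python
import numpy as np

def sigmoid(u):      return 1.0 / (1.0 + np.exp(-u))
def sigmoid_dot(u):  s = sigmoid(u); return s * (1.0 - s)   # sig(u)(1 - sig(u))
def tanh_dot(u):     return 1.0 - np.tanh(u)**2             # 1 - tanh(u)^2
def relu(u):         return np.maximum(0.0, u)
def relu_dot(u):     return np.where(u >= 0.0, 1.0, 0.0)
def leaky_relu(u, eps=0.01):     return np.where(u < 0.0, eps * u, u)
def leaky_relu_dot(u, eps=0.01): return np.where(u < 0.0, eps, 1.0)

# Sanity check: analytic derivatives vs. centered finite differences
u = np.linspace(-3.0, 3.0, 101)
h = 1e-6
assert np.allclose(sigmoid_dot(u), (sigmoid(u + h) - sigmoid(u - h)) / (2 * h), atol=1e-8)
assert np.allclose(tanh_dot(u), (np.tanh(u + h) - np.tanh(u - h)) / (2 * h), atol=1e-8)
```

The same finite-difference check would fail for ReLU only at the single point $u = 0$, where the derivative is set by convention.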
Note that feedforward models that only use piecewise affine scalar activation functions such as the step, ReLU, leaky ReLU, or PReLU have some mathematical appeal. When only such activations are used in a deep learning model, the complete model can be described as a piecewise affine function of the input. That is, the trained model implicitly partitions the input space $\mathbb{R}^p$ into polytopes (regions defined by intersections of half spaces), and for each polytope, when the input $x$ is in that polytope, the response of the model is a fixed affine transformation of $x$. This property is simply a consequence of the fact that a composition of piecewise affine functions remains a piecewise affine function. There are no immediate practical applications for this property, but it further hints at the expressive power of neural network models, complementing the description in Section 5.2.
Also note that many other forms of scalar activation functions have been introduced and experimented with. These include the arctan, softsign, softplus, ELU, SELU, and swish among others. Figure 5.6 presents the key activation functions $\sigma_{\text{Step}}$, $\sigma_{\text{Sig}}$, $\sigma_{\text{Tanh}}$, $\sigma_{\text{ReLU}}$, and $\sigma_{\text{LeakyReLU}}$ (with $\varepsilon = 0.01$), along with a few of these alternatives.
⁴Note that the source of the name is from electrical engineering where a rectifier is a device that converts alternating current to direct current.
Non-Scalar Activations and their Derivatives
Some layers also use non-scalar activation functions. That is, $S^{[\ell]} : \mathbb{R}^{N_\ell} \to \mathbb{R}^{N_\ell}$ is a vector to vector function that cannot be decomposed as in (5.3). The most common example of this is the softmax activation function, typically used for classification in the last layer $\ell = L$. We now denote $N_L = K$ since we often deal with classification of $K$ classes as in Section ??. In such a case,
\[
a^{[L]} = S_{\text{Softmax}}(z^{[L]}). \tag{5.18}
\]
The $\mathbb{R}^K \to \mathbb{R}^K$ softmax activation function is defined as,
\[
S_{\text{Softmax}}(z) = \frac{1}{\sum_{i=1}^K e^{z_i}} \begin{bmatrix} e^{z_1} & \cdots & e^{z_K} \end{bmatrix}^\top,
\]
which was also defined in (??) of Section ??. Note that other examples of non-scalar activations include max pooling layers, introduced in Section ?? of the next chapter.
As will become evident in the next section, denoting the training loss by $C$, the gradient $\partial C / \partial z^{[L]}$ needs to be evaluated as part of the backpropagation algorithm. One approach for this evaluation is
\[
\frac{\partial C}{\partial z^{[L]}} = \underbrace{\frac{\partial a^{[L]}}{\partial z^{[L]}}}_{N_L \times N_L} \underbrace{\frac{\partial C}{\partial a^{[L]}}}_{N_L \times 1},
\]
where we use the multivariate chain rule (see also Appendix ??) and keep in mind that $a^{[L]}$ is a vector, $z^{[L]}$ is a vector, and $C$ is a scalar. With this approach, when the last layer is a softmax as in (5.18), one may essentially need to evaluate $\partial a^{[L]} / \partial z^{[L]}$ which is the transpose of the Jacobian of $S_{\text{Softmax}}(\cdot)$. The elements of this Jacobian can be represented using the (scalar) quotient rule for differentiation, and are,
\[
[J_{S_{\text{Softmax}}}(z)]_{ij} = \frac{\partial}{\partial z_j}\, \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}} = \begin{cases} [S_{\text{Softmax}}(z)]_i \bigl(1 - [S_{\text{Softmax}}(z)]_i\bigr), & i = j, \\[4pt] -[S_{\text{Softmax}}(z)]_i \, [S_{\text{Softmax}}(z)]_j, & i \neq j. \end{cases} \tag{5.19}
\]
However, one rarely uses (5.19) since in the typical case where the loss function is the cross entropy loss (see also Section ??), we have a direct expression for $\partial C / \partial z^{[L]}$. To see this, suppose the label $y$ equals $k$, an element from $\{1, \ldots, K\}$. In this case, as was shown in (??) from Section ??, the loss for a specific observation is,
\[
C = -\log\,[S_{\text{Softmax}}(z^{[L]})]_k = \log \sum_{i=1}^K e^{z_i^{[L]}} - z_k^{[L]}.
\]
Now to obtain $\partial C / \partial z^{[L]}$ we compute the derivative with respect to $z_j^{[L]}$ for every $j = 1, \ldots, K$. This yields,
\[
\frac{\partial C}{\partial z_j^{[L]}} = \frac{e^{z_j^{[L]}}}{\sum_{i=1}^K e^{z_i^{[L]}}} - \mathbb{1}\{j = k\} = [S_{\text{Softmax}}(z^{[L]})]_j - \mathbb{1}\{j = k\}.
\]
Hence in this case the direct expression is,
\[
\frac{\partial C}{\partial z^{[L]}} = S_{\text{Softmax}}(z^{[L]}) - e_k, \tag{5.20}
\]
where $e_k$ is the $K$-dimensional vector with 1 in the $k$-th position and 0 in all other coordinates. Note that in practice when using deep learning frameworks, it is often recommended to use (5.20) directly in such a case.
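The two routes to $\partial C/\partial z^{[L]}$ agree, and this is easy to confirm numerically. The sketch below (ours) builds the softmax Jacobian as in (5.19), multiplies it by $\partial C/\partial a^{[L]}$ for the cross entropy loss, and compares with the direct expression (5.20).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
K, k = 5, 2                              # K classes, true label k
z = rng.normal(size=K)
s = softmax(z)

# Jacobian of softmax: diagonal s_i(1 - s_i), off-diagonal -s_i s_j, as in (5.19)
J = np.diag(s) - np.outer(s, s)

# dC/da for the cross entropy loss C = -log a_k
dC_da = np.zeros(K)
dC_da[k] = -1.0 / s[k]

indirect = J @ dC_da                     # chain rule route via (5.19)
direct = s.copy(); direct[k] -= 1.0      # direct route (5.20): softmax(z) - e_k
assert np.allclose(indirect, direct)
```

The direct route avoids forming the $K \times K$ Jacobian altogether, which is why frameworks prefer it.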
5.4 The Backpropagation Algorithm
Now that we understand the model, we focus on gradient computation so as to facilitate learning using variants of gradient descent as covered in Chapter ??. A key algorithm is the backpropagation algorithm which implements backward mode automatic differentiation, overviewed in Section ?? of Chapter ??. We now build the related backpropagation algorithm of deep learning. We start with a general recursive model and then specialize to feedforward neural networks where the parameters are weights and biases.
Backpropagation for the General Recursive Model
It is instructive to first consider a general recursive feedforward model as appearing in (5.1). For such a model, the recursive step is of the form $a^{[\ell]} = f^{[\ell]}_{\theta^{[\ell]}}(a^{[\ell-1]})$. However, in this section it is convenient to use notation that treats the function $f^{[\ell]}$ separately as a function of $a^{[\ell-1]}$ and of the parameter $\theta^{[\ell]}$:
\[
f^{[\ell]}(\,\cdot\, ; \theta^{[\ell]}) : \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}, \quad \text{and} \quad f^{[\ell]}(a^{[\ell-1]} ; \,\cdot\,) : \Theta^{[\ell]} \to \mathbb{R}^{N_\ell}.
\]
Using this notation, the recursive step is,
\[
a^{[\ell]} = f^{[\ell]}(a^{[\ell-1]} ; \theta^{[\ell]}), \quad \text{for } \ell = 1, \ldots, L, \tag{5.21}
\]
where $a^{[0]} = x$ and $\hat{y} = a^{[L]}$. Given a single data sample $(x, y)$ we assume there is a loss function which depends on the given parameters $\theta$, on the label value $y$, and on the output of the model, $a^{[L]}$. We denote this loss via $C(a^{[L]}, y \,;\, \theta)$.
Our purpose is to optimize the loss with respect to $\theta = (\theta^{[1]}, \ldots, \theta^{[L]})$. For this, we require the gradient with respect to $\theta$ and we denote its components via,
\[
g_\theta^{[\ell]} := \frac{\partial C(a^{[L]}, y \,;\, \theta)}{\partial \theta^{[\ell]}}. \tag{5.22}
\]
A key aspect of the automatic differentiation setup is that we are presented with computable expressions, or code, for evaluation of certain derivatives. In our context these are the derivative of the loss $C$ with respect to $a^{[L]}$, and the derivatives of $f^{[\ell]}(\,\cdot\, ; \,\cdot\,)$ with respect to the input arguments (both layer input and parameters). These known derivatives are denoted via,
\[
\dot{C}(u) := \frac{\partial C(u, y \,;\, \theta)}{\partial u}, \qquad \dot{f}_a^{[\ell]}(u) := \frac{\partial f^{[\ell]}(u \,;\, \theta^{[\ell]})}{\partial u}, \qquad \dot{f}_\theta^{[\ell]}(u) := \frac{\partial f^{[\ell]}(a^{[\ell-1]} \,;\, u)}{\partial u}. \tag{5.23}
\]
Note that the shape of these expressions varies based on the domain and co-domain of $C(\cdot)$ and $f(\,\cdot\, ; \,\cdot\,)$. The derivative $\dot{C}(u)$ is typically a gradient and thus vector valued with length $N_L$. The derivative $\dot{f}_a^{[\ell]}(u)$ is typically an $N_{\ell-1} \times N_\ell$ matrix obtained via a transpose of a Jacobian and thus matrix valued. This is because the input to the layer is $N_{\ell-1}$ dimensional and the output argument is $N_\ell$ dimensional. Finally, the derivative $\dot{f}_\theta^{[\ell]}(u)$ may take on
various shapes depending on the form of $\theta$. See Appendix ?? for a review of matrix derivatives.

En route to computing the desired gradients (5.22) we require intermediate gradients of the loss with respect to the activation values $a^{[1]}, \ldots, a^{[L]}$. Keeping in mind that $a^{[L]} = \hat{y}$ is a function of these activation values, the intermediate gradients⁵ are denoted,
\[
\zeta^{[\ell]} := \frac{\partial C(a^{[L]}, y \,;\, \theta)}{\partial a^{[\ell]}}, \quad \ell = 1, \ldots, L. \tag{5.24}
\]

Figure 5.7: The variables and flow of information in the backpropagation algorithm for the general recursive model.
Now based on the multivariate chain rule, the recursive step (5.21), and the definitions above, we observe,
\[
\zeta^{[\ell]} = \frac{\partial a^{[\ell+1]}}{\partial a^{[\ell]}} \frac{\partial C}{\partial a^{[\ell+1]}} = \dot{f}_a^{[\ell+1]}(a^{[\ell]}) \, \zeta^{[\ell+1]}, \qquad g_\theta^{[\ell]} = \frac{\partial a^{[\ell]}}{\partial \theta^{[\ell]}} \frac{\partial C}{\partial a^{[\ell]}} = \dot{f}_\theta^{[\ell]}(\theta^{[\ell]}) \, \zeta^{[\ell]}. \tag{5.25}
\]
Note that in the application of the chain rule above, we use the notation specified in Appendix ??. Hence once the activation values $a^{[1]}, \ldots, a^{[L]}$ are populated via forward propagation of (5.21), backward computation can be carried out via,
\[
\zeta^{[\ell]} = \begin{cases} \dot{C}(a^{[L]}), & \ell = L, \\ \dot{f}_a^{[\ell+1]}(a^{[\ell]}) \, \zeta^{[\ell+1]}, & \ell = L-1, \ldots, 1, \end{cases} \tag{5.26}
\]
and at each step the gradient can be obtained via $g_\theta^{[\ell]} = \dot{f}_\theta^{[\ell]}(\theta^{[\ell]}) \, \zeta^{[\ell]}$.
⁵These are also called adjoints. See (??) in Chapter ??.
This process is summarized in Algorithm 5.1. See also Figure 5.7 for an illustration of the
flow of information.
Algorithm 5.1: Backpropagation for the general recursive model

Input: Dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$, objective function $C(\cdot) = C(\cdot \,;\, \mathcal{D})$, and parameter values $\theta = (\theta^{[1]}, \ldots, \theta^{[L]})$
Output: gradients of the loss $g_\theta^{[1]}, \ldots, g_\theta^{[L]}$

1. Compute $a^{[\ell]}$ for $\ell = 1, \ldots, L$ using (5.21) (Forward pass)
2. Compute $\zeta^{[L]} = \dot{C}(a^{[L]})$
3. Compute $g_\theta^{[L]} = \dot{f}_\theta^{[L]}(\theta^{[L]}) \, \zeta^{[L]}$
4. for $\ell = L-1, \ldots, 1$ do
5.   Compute $\zeta^{[\ell]} = \dot{f}_a^{[\ell+1]}(a^{[\ell]}) \, \zeta^{[\ell+1]}$
6.   Compute $g_\theta^{[\ell]} = \dot{f}_\theta^{[\ell]}(\theta^{[\ell]}) \, \zeta^{[\ell]}$
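Algorithm 5.1 can be written compactly in code. The sketch below (a minimal illustration of ours, not the authors' implementation) represents each layer by the function $f^{[\ell]}$ together with two vector-Jacobian products that apply $\dot f_a^{[\ell]}$ and $\dot f_\theta^{[\ell]}$ to a given $\zeta$, which is how backward mode automatic differentiation is typically organized in practice.

```python
import numpy as np

def backprop_general(x, y, layers, loss_dot):
    """Algorithm 5.1. Each layer is a dict with:
         'f':      (a_prev, theta) -> a
         'vjp_a':  (a_prev, theta, zeta) -> f_a-dot applied to zeta
         'vjp_th': (a_prev, theta, zeta) -> f_theta-dot applied to zeta."""
    a = [x]
    for layer, theta in layers:                 # forward pass (step 1)
        a.append(layer['f'](a[-1], theta))
    L = len(layers)
    zeta = loss_dot(a[L], y)                    # step 2
    grads = [None] * L
    for l in range(L, 0, -1):                   # steps 3-6
        layer, theta = layers[l - 1]
        grads[l - 1] = layer['vjp_th'](a[l - 1], theta, zeta)
        zeta = layer['vjp_a'](a[l - 1], theta, zeta)
    return grads

# A layer f(a ; W) = tanh(W a) and its two vector-Jacobian products
tanh_layer = {
    'f':      lambda a, W: np.tanh(W @ a),
    'vjp_a':  lambda a, W, z: W.T @ ((1 - np.tanh(W @ a)**2) * z),
    'vjp_th': lambda a, W, z: np.outer((1 - np.tanh(W @ a)**2) * z, a),
}

rng = np.random.default_rng(1)
layers = [(tanh_layer, rng.normal(size=(3, 2))),
          (tanh_layer, rng.normal(size=(1, 3)))]
x, y = rng.normal(size=2), np.array([0.5])
grads = backprop_general(x, y, layers, loss_dot=lambda a, y: a - y)  # C = ||a - y||^2 / 2
```

A finite-difference perturbation of any single weight reproduces the corresponding entry of `grads`, which is a standard way to validate such an implementation.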
An Unrolled Example
To get a better feel for Algorithm 5.1, the operation of backpropagation, and the associated notation, we consider a simple hypothetical example as illustrated in Figure 5.8. This is a feedforward neural network with $L = 4$, arbitrary input dimension $p$, output dimension $q = 1$, and a single neuron on each hidden layer. That is, $N_0 = p$ and $N_1, N_2, N_3, N_4 = 1$. Assume the loss function is
\[
C(a^{[L]}, y \,;\, \theta) = \frac{1}{2}\bigl(a^{[L]} - y\bigr)^2,
\]
and assume a model structure,
\[
a^{[\ell]} = f^{[\ell]}(a^{[\ell-1]} ; \theta^{[\ell]}) = \sigma^{[\ell]}\bigl(\theta^{[\ell]\top} a^{[\ell-1]}\bigr) \quad \text{for } \ell = 1, 2, 3, 4, \tag{5.27}
\]
similarly to the neural network structure of Section 5.1 where affine transformations are followed by activation functions, yet without bias terms. Since $N_0 = p$, we have that $\Theta^{[1]} = \mathbb{R}^p$, and since $N_1, \ldots, N_4$ are all unity, we have that $\Theta^{[2]}$, $\Theta^{[3]}$, and $\Theta^{[4]}$ are each $\mathbb{R}$ and the transpose in (5.27) is not needed for $\ell = 2, 3, 4$.
Figure 5.8: A simple hypothetical example with $L = 4$ and scalar hidden units: $x \in \mathbb{R}^p$ is transformed via $a^{[1]} = \sigma^{[1]}(\theta^{[1]\top} a^{[0]})$, $a^{[2]} = \sigma^{[2]}(\theta^{[2]} a^{[1]})$, $a^{[3]} = \sigma^{[3]}(\theta^{[3]} a^{[2]})$, and $\hat{y} = a^{[4]} = \theta^{[4]} a^{[3]}$.
In this hypothetical illustrative example, we use various activation functions. Namely,
\[
\sigma^{[1]}(\cdot) = \sigma_{\text{ReLU}}(\cdot), \qquad \sigma^{[2]}(\cdot) = \sigma_{\text{Tanh}}(\cdot), \qquad \sigma^{[3]}(\cdot) = \sigma_{\text{Sig}}(\cdot),
\]
and for the last step, we use the identity function, i.e., $\sigma^{[4]}(u) = u$. With the model specified we now have computable expressions for known derivatives as in (5.23), see also Section 5.3,
\[
\begin{aligned}
\dot{C}(u) &= u - y, \\
\dot{f}_\theta^{[4]}(u) &= a^{[3]}, & \dot{f}_a^{[4]}(u) &= \theta^{[4]}, \\
\dot{f}_\theta^{[3]}(u) &= a^{[2]} \underbrace{\sigma_{\text{Sig}}(u a^{[2]})\bigl(1 - \sigma_{\text{Sig}}(u a^{[2]})\bigr)}_{\dot\sigma_{\text{Sig}}(u a^{[2]})}, & \dot{f}_a^{[3]}(u) &= \theta^{[3]} \dot\sigma_{\text{Sig}}(\theta^{[3]} u), \\
\dot{f}_\theta^{[2]}(u) &= a^{[1]} \underbrace{\bigl(1 - \sigma_{\text{Tanh}}(u a^{[1]})^2\bigr)}_{\dot\sigma_{\text{Tanh}}(u a^{[1]})}, & \dot{f}_a^{[2]}(u) &= \theta^{[2]} \dot\sigma_{\text{Tanh}}(\theta^{[2]} u), \\
\dot{f}_\theta^{[1]}(u) &= a^{[0]} \underbrace{\mathbb{1}\{u^\top a^{[0]} \ge 0\}}_{\dot\sigma_{\text{ReLU}}(u^\top a^{[0]})}.
\end{aligned} \tag{5.28}
\]
Note that in this example we choose to treat $\dot{f}_\theta^{[1]}(u)$ as a $p$ dimensional vector, while the other functions in this example are scalar valued both in their domain and co-domain. Using these computable expressions for the derivatives, Algorithm 5.1 can be used to compute the gradient of the loss function with respect to $\theta$. For illustration we unroll the algorithm where we use the notation $u \leftarrow v$ to indicate assignment of $v$ to the variable $u$.
Step 1: We compute forward propagation,
\[
a^{[1]} \leftarrow \sigma_{\text{ReLU}}(\theta^{[1]\top} x), \quad a^{[2]} \leftarrow \sigma_{\text{Tanh}}(\theta^{[2]} a^{[1]}), \quad a^{[3]} \leftarrow \sigma_{\text{Sig}}(\theta^{[3]} a^{[2]}), \quad \hat{y} = a^{[4]} \leftarrow \theta^{[4]} a^{[3]}.
\]
Steps 2 and 3:
\[
\zeta^{[4]} \leftarrow a^{[4]} - y, \qquad g_\theta^{[4]} \leftarrow a^{[3]} \zeta^{[4]}.
\]
Then the loop in steps 4–6 executes for three iterations for $\ell = 3, 2, 1$:

Iteration $\ell = 3$ (steps 5 and 6):
\[
\zeta^{[3]} \leftarrow \theta^{[4]} \zeta^{[4]}, \qquad g_\theta^{[3]} \leftarrow a^{[2]} \dot\sigma_{\text{Sig}}(\theta^{[3]} a^{[2]})\, \zeta^{[3]}.
\]
Iteration $\ell = 2$ (steps 5 and 6):
\[
\zeta^{[2]} \leftarrow \theta^{[3]} \dot\sigma_{\text{Sig}}(\theta^{[3]} a^{[2]})\, \zeta^{[3]}, \qquad g_\theta^{[2]} \leftarrow a^{[1]} \dot\sigma_{\text{Tanh}}(\theta^{[2]} a^{[1]})\, \zeta^{[2]}.
\]
Iteration $\ell = 1$ (steps 5 and 6):
\[
\zeta^{[1]} \leftarrow \theta^{[2]} \dot\sigma_{\text{Tanh}}(\theta^{[2]} a^{[1]})\, \zeta^{[2]}, \qquad g_\theta^{[1]} \leftarrow x \, \dot\sigma_{\text{ReLU}}(\theta^{[1]\top} x)\, \zeta^{[1]}.
\]
By expanding out the resulting expressions and using the chain rule, we may verify that each of the intermediate outputs $g_\theta^{[4]}$, $g_\theta^{[3]}$, $g_\theta^{[2]}$, and $g_\theta^{[1]}$ yields the correct gradient expression.
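The unrolled updates can also be checked numerically. In the sketch below (our illustration; the values of $x$, $y$, and $\theta^{[1]}, \ldots, \theta^{[4]}$ are arbitrary choices, not from the source) we run the forward pass and the backward assignments exactly as written, and then compare $g_\theta^{[2]}$ against a central finite difference of the loss.

```python
import numpy as np

sig = lambda u: 1 / (1 + np.exp(-u))
sig_dot = lambda u: sig(u) * (1 - sig(u))
tanh_dot = lambda u: 1 - np.tanh(u)**2

x = np.array([0.2, -0.4, 0.1]); y = 0.7              # p = 3, arbitrary values
th1 = np.array([0.5, -0.3, 0.8]); th2, th3, th4 = 0.9, -1.1, 0.6

def forward(th1, th2, th3, th4):
    a1 = max(0.0, th1 @ x)                           # sigma_ReLU
    a2 = np.tanh(th2 * a1)                           # sigma_Tanh
    a3 = sig(th3 * a2)                               # sigma_Sig
    return a1, a2, a3, th4 * a3                      # identity output

a1, a2, a3, a4 = forward(th1, th2, th3, th4)
z4 = a4 - y;                      g4 = a3 * z4       # steps 2 and 3
z3 = th4 * z4;                    g3 = a2 * sig_dot(th3 * a2) * z3
z2 = th3 * sig_dot(th3 * a2) * z3; g2 = a1 * tanh_dot(th2 * a1) * z2
z1 = th2 * tanh_dot(th2 * a1) * z2
g1 = x * (1.0 if th1 @ x >= 0 else 0.0) * z1

# Finite-difference check on theta[2]
h = 1e-6
lossp = 0.5 * (forward(th1, th2 + h, th3, th4)[3] - y)**2
lossm = 0.5 * (forward(th1, th2 - h, th3, th4)[3] - y)**2
assert abs((lossp - lossm) / (2 * h) - g2) < 1e-8
```

The same check succeeds for each of $g_\theta^{[1]}, \ldots, g_\theta^{[4]}$, mirroring the chain rule expansion described above.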
Accounting for $\delta^{[\ell]}$ Instead of $\zeta^{[\ell]}$

Algorithm 5.1 summarizes backpropagation since it deals with gradient evaluation for the general recursive model (5.21). However, it does not make any use of the more specific structure of (5.2) which has an affine transformation followed by a non-linear activation function. It turns out that in the context of deep learning models as in (5.2), one can simplify the algorithm by keeping track of an alternative set of intermediate derivative values, $\delta^{[\ell]} \in \mathbb{R}^{N_\ell}$, instead of $\zeta^{[\ell]}$. These are,
\[
\delta^{[\ell]} := \frac{\partial C(a^{[L]}, y \,;\, \theta)}{\partial z^{[\ell]}}, \quad \ell = 1, \ldots, L,
\]
where we observe that in contrast to the $\zeta^{[\ell]}$ values from above, these values are derivatives of the loss with respect to the $z^{[\ell]}$ values instead of with respect to the $a^{[\ell]}$ values. Usage of $\delta^{[\ell]}$ is standard in backpropagation for deep learning and serves as the basis for the main backpropagation algorithm that we present below. With this notation, the key recursive relationship for backward computation is,
\[
\delta^{[\ell]} = \frac{\partial a^{[\ell]}}{\partial z^{[\ell]}} \frac{\partial z^{[\ell+1]}}{\partial a^{[\ell]}} \frac{\partial C}{\partial z^{[\ell+1]}} = \frac{\partial a^{[\ell]}}{\partial z^{[\ell]}} \frac{\partial z^{[\ell+1]}}{\partial a^{[\ell]}} \delta^{[\ell+1]}, \quad \ell = L-1, \ldots, 1, \tag{5.29}
\]
which in comparison to the left hand side of (5.25) breaks up the step $a^{[\ell]} \to a^{[\ell+1]}$ into two steps: $a^{[\ell]} \to z^{[\ell+1]}$ followed by $z^{[\ell+1]} \to a^{[\ell+1]}$. Further, for the final layer, $\ell = L$, we have,
\[
\delta^{[L]} = \frac{\partial a^{[L]}}{\partial z^{[L]}} \frac{\partial C}{\partial a^{[L]}}. \tag{5.30}
\]
Backpropagation for Fully Connected Networks
We now expand and adapt Algorithm 5.1 using the key recursive relationships for $\delta^{[\ell]}$, (5.29) and (5.30). Consider first the component $\partial a^{[\ell]} / \partial z^{[\ell]}$ associated with the (vector) activation function $S^{[\ell]}(\cdot)$. In a typical layer $\ell$, the vector activation function is composed of identical scalar activation functions as in (5.3) and hence the transposed Jacobian (see also (??) in Appendix ??) is,
\[
\frac{\partial a^{[\ell]}}{\partial z^{[\ell]}} = J_{S^{[\ell]}}(z^{[\ell]})^\top = \text{Diag}\bigl(\dot\sigma^{[\ell]}(z^{[\ell]})\bigr). \tag{5.31}
\]
Here $\text{Diag}(\cdot)$ transforms a vector into a diagonal matrix and $\dot\sigma^{[\ell]}(\cdot)$ is the derivative of the scalar activation which is interpreted as operating element wise on the input vector. Examples of scalar activation functions and their derivatives are in Section 5.3.
In other cases, $S^{[\ell]}(\cdot)$ is not separable as in (5.3) and the transposed Jacobian of $S^{[\ell]}$ does not have a simple diagonal form as in (5.31). One such potential Jacobian is computed for the softmax in (5.19). However, the most common application of softmax is in the last layer $\ell = L$ together with cross entropy loss. In such a case, using (5.20) directly in place of (5.30) is more efficient and practical.
Continuing now with the recursive relationship (5.29), we consider the component $\partial z^{[\ell+1]} / \partial a^{[\ell]}$. Here since $z^{[\ell+1]} = W^{[\ell+1]} a^{[\ell]} + b^{[\ell+1]}$, we have
\[
\frac{\partial z^{[\ell+1]}}{\partial a^{[\ell]}} = W^{[\ell+1]\top}.
\]
Hence, similarly to (5.26), putting the pieces together we have the recursive relationship,
\[
\delta^{[\ell]} = \begin{cases} \dfrac{\partial C}{\partial z^{[L]}}, & \ell = L, \\[6pt] \text{Diag}\bigl(\dot\sigma^{[\ell]}(z^{[\ell]})\bigr) W^{[\ell+1]\top} \delta^{[\ell+1]}, & \ell = L-1, \ldots, 1. \end{cases} \tag{5.32}
\]
As discussed above, the case $\ell = L$ can be computed using the two components of (5.30) separately or can be computed according to (5.20) in the case of softmax combined with cross entropy loss.
Now with a recursive relationship supporting backpropagation of the $\delta^{[\ell]}$ values, we are ready to deal with the desired derivative values of the parameters $\theta^{[\ell]} = (W^{[\ell]}, b^{[\ell]})$. Define,
\[
g_W^{[\ell]} = \frac{\partial C}{\partial W^{[\ell]}}, \quad \text{and} \quad g_b^{[\ell]} = \frac{\partial C}{\partial b^{[\ell]}}, \quad \ell = 1, \ldots, L.
\]
Here $g_W^{[\ell]}$ is an $N_\ell \times N_{\ell-1}$ matrix and $g_b^{[\ell]}$ is an $N_\ell$-vector. Paralleling the right hand equation in (5.25) (which involves $\zeta^{[\ell]}$ for the more general model), we seek equations to retrieve these target values (gradient components) in terms of $\delta^{[\ell]}$. This is done via,
\[
g_W^{[\ell]} = \frac{\partial C}{\partial z^{[\ell]}} \frac{\partial z^{[\ell]}}{\partial W^{[\ell]}} = \delta^{[\ell]} a^{[\ell-1]\top}, \quad \text{and} \quad g_b^{[\ell]} = \frac{\partial z^{[\ell]}}{\partial b^{[\ell]}} \frac{\partial C}{\partial z^{[\ell]}} = \delta^{[\ell]}, \tag{5.33}
\]
where the first expression results from (??) and the second expression from (??), both in Appendix ??.
All of these relationships are packaged in the backpropagation algorithm for fully connected
networks. See also Figure 5.9 for an illustration of the flow of information.
Algorithm 5.2: Backpropagation for fully connected networks

Input: Dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$, objective function $C(\cdot) = C(\cdot \,;\, \mathcal{D})$, and parameter values $\theta = (\theta^{[1]}, \ldots, \theta^{[L]})$
Output: derivatives of the loss $(g_W^{[1]}, g_b^{[1]}), \ldots, (g_W^{[L]}, g_b^{[L]})$

1. Compute $a^{[\ell]}$ and $z^{[\ell]}$ for $\ell = 1, \ldots, L$ (Forward pass)
2. Compute $\delta^{[L]} = \partial C / \partial z^{[L]}$
3. Set $g_W^{[L]} = \delta^{[L]} a^{[L-1]\top}$ and set $g_b^{[L]} = \delta^{[L]}$
4. for $\ell = L-1, \ldots, 1$ do
5.   Compute $\delta^{[\ell]} = \text{Diag}\bigl(\dot\sigma^{[\ell]}(z^{[\ell]})\bigr) W^{[\ell+1]\top} \delta^{[\ell+1]}$
6.   Set $g_W^{[\ell]} = \delta^{[\ell]} a^{[\ell-1]\top}$ and set $g_b^{[\ell]} = \delta^{[\ell]}$
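Algorithm 5.2 maps directly to a few lines of NumPy. The sketch below is our own instantiation (single sample, tanh hidden layers, identity output with squared error loss, so that $\delta^{[L]} = a^{[L]} - y$); it follows the algorithm's steps literally, storing $z^{[\ell]}$ and $a^{[\ell]}$ on the forward pass and the $\delta^{[\ell]}$ values on the backward pass.

```python
import numpy as np

def backprop_fc(x, y, Ws, bs):
    """Algorithm 5.2 for tanh hidden layers and an identity output layer,
       with squared error loss C = ||a[L] - y||^2 / 2 (so dC/dz[L] = a[L] - y)."""
    L = len(Ws)
    a, zs = [x], []
    for l in range(L):                            # step 1: forward pass
        z = Ws[l] @ a[-1] + bs[l]
        zs.append(z)
        a.append(z if l == L - 1 else np.tanh(z))
    delta = a[-1] - y                             # step 2: dC/dz[L]
    gW, gb = [None] * L, [None] * L
    gW[L - 1], gb[L - 1] = np.outer(delta, a[L - 1]), delta   # step 3
    for l in range(L - 2, -1, -1):                # steps 4-6
        delta = (1 - np.tanh(zs[l])**2) * (Ws[l + 1].T @ delta)
        gW[l], gb[l] = np.outer(delta, a[l]), delta
    return gW, gb

rng = np.random.default_rng(2)
sizes = [4, 5, 3, 1]                              # N_0, ..., N_L
Ws = [rng.normal(size=(m, n)) / np.sqrt(n) for n, m in zip(sizes, sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
x, y = rng.normal(size=4), np.array([1.0])
gW, gb = backprop_fc(x, y, Ws, bs)
```

Note that the product with $\text{Diag}(\dot\sigma^{[\ell]}(z^{[\ell]}))$ in step 5 is realized as element-wise multiplication, avoiding the explicit diagonal matrix.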
Backpropagation on a Whole Mini Batch
In Section ?? we discussed the concept of iterative optimization using mini-batches. With this approach, instead of considering all of the data points, we only use $n_b$ samples and use the gradient $\sum_i \nabla C_i / n_b$, summed over the samples $i$ in the mini-batch. See also equation (??) in Chapter ??.
Figure 5.9: The variables and flow of information in the backpropagation algorithm for fully connected networks.
In fact, earlier in this chapter, in (5.9), we introduced notation for executing the forward pass simultaneously on a whole mini-batch. The key here is to properly organize the activation values and the data values in matrices. This notation can be further extended to adapt the backpropagation algorithm for computation of the mini-batch gradient. Hence one may also specify a form of Algorithm 5.2, suitable for mini-batches.
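One way to realize this matrix organization (a sketch of ours, not the book's exact notation) is to stack the $n_b$ samples as columns of a matrix $X$; the forward pass then applies each affine map to all samples at once, with the bias vector broadcast across columns.

```python
import numpy as np

def forward_batch(X, Ws, bs):
    """Forward pass on a whole mini-batch: X is p x n_b (samples as columns).
       Broadcasting via b[:, None] adds b to every column of W @ A."""
    A = X
    for W, b in zip(Ws, bs):
        A = np.tanh(W @ A + b[:, None])   # tanh layers, for illustration
    return A

rng = np.random.default_rng(3)
p, n_b = 4, 8
Ws = [rng.normal(size=(6, 4)), rng.normal(size=(2, 6))]
bs = [rng.normal(size=6), rng.normal(size=2)]
X = rng.normal(size=(p, n_b))

A = forward_batch(X, Ws, bs)
# Each column of A matches a per-sample forward pass
col0 = np.tanh(Ws[1] @ np.tanh(Ws[0] @ X[:, 0] + bs[0]) + bs[1])
assert np.allclose(A[:, 0], col0)
```

The backward pass extends in the same way, with the $\delta^{[\ell]}$ vectors also stacked as columns and $g_W^{[\ell]}$ obtained by summing the per-sample outer products.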
While specific implementation details are not our focus, we mention that in most practical
deep learning frameworks, the common interface for gradient evaluation using backpropaga-
tion is based on input tensors, where each tensor is associated with a mini-batch. Then one
of the indices of the tensor indexes specific data samples within the mini-batch. For example
common formats for image data are based on 4 dimensional tensors, with a format referred
to as NHWC or NCHW. Here ‘N’ represents the index within the mini-batch, while ‘H’, ‘W’,
and ‘C’ are for height, width, and channels as appropriate for color image data. Note that
the concept of channels is taken from the context of convolutional neural networks which
are the topic of study in Chapter ??.
Vanishing and Exploding Gradients
The key backpropagation recursions are the forward step $a^{[\ell]} = S^{[\ell]}\bigl(W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}\bigr)$ as in (5.4) and the backward step $\delta^{[\ell]} = \text{Diag}\bigl(\dot\sigma^{[\ell]}(z^{[\ell]})\bigr) W^{[\ell+1]\top} \delta^{[\ell+1]}$ as in (5.32). From a practical perspective, these steps are sometimes subject to instability when the number of layers $L$ is large.
To see this, let us first simplify the situation by ignoring the activation functions and assuming that $W^{[\ell]}$ is of a fixed square dimension with the same weight matrix $W$, for $\ell = 1, \ldots, L-1$, and the last weight matrix is $W^{[L]}$. Further, ignore the bias terms $b^{[\ell]}$. In this simplified case,
\[
\hat{y} = a^{[L]} = W^{[L]} W^{[L-1]} W^{[L-2]} \cdots W^{[3]} W^{[2]} W^{[1]} x = W^{[L]} W^{L-1} x, \tag{5.34}
\]
where $W^{L-1}$ is the $(L-1)$-th power of $W$.
As is well known in linear algebra and systems theory, unless the maximal eigenvalues of $W$ are exactly of unit magnitude, as $L$ grows we have that $\hat{y}$ either vanishes (towards 0) or explodes (with values of increasing magnitude). As a further simplified illustration of this, if $W = wI$ (a constant multiple of the identity matrix), then $\hat{y} = W^{[L]} w^{L-1} x$, and for any $w \neq 1$, a vanishing $\hat{y}$ or exploding $\hat{y}$ phenomenon persists. This illustration shows that for non-small network depths (large $L$), instability issues may arise in the forward pass.
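The effect is easy to observe numerically. The sketch below (our illustration) applies $W = wI$ repeatedly as in the simplified case of (5.34), for $w$ slightly below and slightly above 1.

```python
import numpy as np

def repeated_forward(w, L, x):
    """Apply W = w*I a total of L-1 times to x, as in the simplified (5.34)."""
    W = w * np.eye(len(x))
    a = x
    for _ in range(L - 1):
        a = W @ a
    return a

x = np.ones(3)
small = np.linalg.norm(repeated_forward(0.9, 50, x))   # w < 1: vanishes like 0.9^49
large = np.linalg.norm(repeated_forward(1.1, 50, x))   # w > 1: explodes like 1.1^49
assert small < 1e-2 and large > 1e2
```

Even at a modest depth of 50, the two runs differ by several orders of magnitude, and the gap widens geometrically with $L$.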
The same type of instability problem can then also persist in the backward pass since the backward recursion $\delta^{[\ell]} = \text{Diag}\bigl(\dot\sigma^{[\ell]}(z^{[\ell]})\bigr) W^{[\ell+1]\top} \delta^{[\ell+1]}$ also includes repeated matrix multiplications, and if for simplicity we ignore the activation functions and again take a constant matrix $W$, then,
\[
\delta^{[\ell]} = \bigl(W^\top\bigr)^{L-\ell} \delta^{[L]}. \tag{5.35}
\]
Hence there is often a vanishing or exploding nature of $\delta^{[\ell]}$ for large $L$ and low values of $\ell$ (the first layers of the network). Now with (5.33) in mind, we see that in general, the gradient values $g_W^{[\ell]}$ and $g_b^{[\ell]}$ may get smaller and smaller (vanishing) or larger and larger (exploding) as we go backward with every layer during backpropagation. These are respectively called the vanishing gradient and exploding gradient problems.

In the worst case, vanishing gradients may completely stop the neural network from training, while exploding gradients may throw parameter values in arbitrary directions. This may result in oscillations around the minima or even overshooting the optimum again and again. Another impact of exploding gradients is that huge values of the gradients may cause numeric overflow resulting in incorrect computations or introductions of NaN floating point values (“not a number”).
Gradient descent improvements such as RMSProp, integrated in ADAM (see Section ??), can help normalize such variation in the gradients. Nevertheless, numerical instability can still persist. Further, with activation functions such as sigmoid or tanh, in cases of inputs far from 0 the gradient components of $\text{Diag}\bigl(\dot\sigma^{[\ell]}(z^{[\ell]})\bigr)$ may also vanish. Activation functions such as ReLU or leaky ReLU handle such problems, yet the overarching phenomenon associated with repeated matrix multiplication as exemplified in (5.34) and (5.35) still persists. A key strategy for mitigating such a problem is based on weight initialization as we discuss in the section below.

To handle exploding gradients, sometimes gradient clipping is employed. This approach adjusts the gradient so that it, or its individual components, are not too big in magnitude. One approach is to clip each coordinate of the gradient so that it does not exceed a pre-specified value in absolute value. This approach can obviously change the direction of the gradient since some coordinates may be clipped while others are not. Another approach is to scale the whole gradient by its norm so that its magnitude is at a fixed threshold. This maintains the direction of the gradient as originally computed.
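Both clipping variants are a few lines each. The sketch below is our own illustration (the threshold value is arbitrary): the first clips coordinate-wise and may rotate the gradient, while the second rescales by the norm and preserves the direction.

```python
import numpy as np

def clip_by_value(g, c):
    """Clip each coordinate of g to [-c, c]; may change g's direction."""
    return np.clip(g, -c, c)

def clip_by_norm(g, c):
    """Rescale g so its norm is at most c; preserves g's direction."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)

g = np.array([3.0, -4.0])            # norm 5
v = clip_by_value(g, 2.0)            # coordinate-wise: [2, -2]
n = clip_by_norm(g, 2.0)             # rescaled: g * (2/5), norm 2
assert np.allclose(v, [2.0, -2.0])
assert np.isclose(np.linalg.norm(n), 2.0)
assert np.allclose(n, g * 0.4)       # same direction as g
```

Note that `clip_by_value` has turned a gradient pointing in direction $(3, -4)$ into one pointing in direction $(1, -1)$, while `clip_by_norm` has not.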
5.5 Weight Initialization
While gradient based optimization generally works in advancing towards local minima, as highlighted above, vanishing gradients or exploding gradients may significantly hinder progress. In fact, starting with initial values that are either constant or 0 for the weights and bias parameters may throw the learning process off. Such constant initial parameters may impose symmetry on the activation values of the hidden units and in turn prohibit the model from exploiting its expressive power.

Random initialization enables us to break any potential symmetries and is almost always preferable. Below we outline specific principles of random parameter initialization, yet even if these are not applied, the most basic random initialization approach is to set all parameters of the weight matrices $W^{[1]}, \ldots, W^{[L]}$ as independent and identically distributed standard normal random variables and to set all the entries of the bias vectors $b^{[1]}, \ldots, b^{[L]}$ at 0.
In view of the potential vanishing gradient and exploding gradient problems highlighted above, there is room for smarter weight initialization techniques. Specifically, a general principle that is followed focuses on the activation values $a^{[1]}, \ldots, a^{[L]}$ associated with the initial weight and bias parameters. If we momentarily consider these activation values as random entries, an overarching goal is that the distribution of each entry of $a^{[\ell+1]}$ is approximately similar to that of each entry of $a^{[\ell]}$. In such a case, recursing the forward pass (5.4), a form of distributional stability can persist when the number of layers $L$ is large. With such an approach, if we choose the distribution of the initial weight matrix parameters judiciously, vanishing and exploding gradients may be mitigated at the onset of learning.
More concretely, the goal of equating the distribution of $a^{[\ell+1]}$ entries with $a^{[\ell]}$ entries is viewed via the first two moments of the distribution (mean and variance). Specifically, with activation function $\sigma(\cdot)$ we have in the $\ell$-th layer,
\[
a_i^{[\ell]} = \sigma\bigl(z_i^{[\ell]}\bigr), \quad \text{where} \quad z_i^{[\ell]} = \sum_{j=1}^{N_{\ell-1}} w_{i,j}^{[\ell]} a_j^{[\ell-1]} + b_i^{[\ell]}. \tag{5.36}
\]
To develop an initialization approach, we assume that all $a_j^{[\ell-1]}$ values for $j = 1, \ldots, N_{\ell-1}$ are identically distributed with mean 0, some variance $\breve{v}$, and are statistically independent. Further, we wish to initialize parameters randomly such that $a_i^{[\ell]}$ is also with a mean of 0 and the same variance $\breve{v}$. Our strategy for this is to initialize the bias $b^{[\ell]}$ with entries 0 and the weight parameters $w_{i,j}^{[\ell]}$ as normally distributed random variables with a mean of 0 and some specified variance.
It turns out that when the activation function, $\sigma(\cdot)$, is tanh, a sensible variance to use for $w_{i,j}^{[\ell]}$ is $1/N_{\ell-1}$. This form of initialization is called Xavier initialization. Further, when the activation function is ReLU, a sensible variance to use is $2/N_{\ell-1}$ and this form of initialization is called He initialization. Another commonly used heuristic is to use variance $2/(N_{\ell-1} + N_\ell)$ which is the harmonic mean of $\frac{1}{N_{\ell-1}}$ and $\frac{1}{N_\ell}$. In any case, all of these are heuristics, often implemented in practical deep learning frameworks.
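Each heuristic amounts to choosing the variance of a zero-mean normal draw. The sketch below (our own illustration; the rule names are just labels) draws a weight matrix for a layer with fan-in $N_{\ell-1}$ and fan-out $N_\ell$ under each rule and checks the empirical variance.

```python
import numpy as np

def init_weights(n_in, n_out, rule, rng):
    """Zero-mean normal weights with variance set by the chosen heuristic."""
    var = {"xavier": 1.0 / n_in,              # tanh layers
           "he": 2.0 / n_in,                  # ReLU layers
           "xavier_avg": 2.0 / (n_in + n_out)}[rule]
    return rng.normal(0.0, np.sqrt(var), size=(n_out, n_in))

rng = np.random.default_rng(0)
W = init_weights(400, 200, "he", rng)
# Empirical variance of the 80,000 draws should be near 2/400 = 0.005
assert abs(W.var() - 2.0 / 400) < 1e-3
```

Bias vectors are initialized at 0 in all three cases, as described above.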
Derivation of Xavier initialization
To get a feel for the considerations behind the $1/N_{\ell-1}$ variance rule of Xavier initialization we now present the derivation of this rule. The approach is to equate activation variances at some $\breve{v}$, based on (5.36). To do so, we use an approximation of $\sigma_{\text{Tanh}}(\cdot)$ as $\sigma(u) = u$. This is a first-order Taylor approximation of the tanh activation function. With this approximation and with 0 bias entries, (5.36) yields,
\[
\text{Var}\bigl(a_i^{[\ell]}\bigr) = \text{Var}\bigl(z_i^{[\ell]}\bigr) = \text{Var}\Bigl(\sum_{j=1}^{N_{\ell-1}} w_{i,j}^{[\ell]} a_j^{[\ell-1]}\Bigr) = \sum_{j=1}^{N_{\ell-1}} \text{Var}\bigl(w_{i,j}^{[\ell]} a_j^{[\ell-1]}\bigr). \tag{5.37}
\]
This is based on the assumption that all random variables, both activation values and weight parameters, are statistically independent. Now to evaluate the variance summands on the right hand side, we use the following property of the variance of a product of two independent random variables $X$ and $Y$,
\[
\text{Var}(XY) = \mathbb{E}[X]^2 \text{Var}(Y) + \text{Var}(X)\mathbb{E}[Y]^2 + \text{Var}(X)\text{Var}(Y).
\]
With this we obtain,
\[
\text{Var}\bigl(w_{i,j}^{[\ell]} a_j^{[\ell-1]}\bigr) = \mathbb{E}\bigl[w_{i,j}^{[\ell]}\bigr]^2 \text{Var}\bigl(a_j^{[\ell-1]}\bigr) + \text{Var}\bigl(w_{i,j}^{[\ell]}\bigr)\mathbb{E}\bigl[a_j^{[\ell-1]}\bigr]^2 + \text{Var}\bigl(w_{i,j}^{[\ell]}\bigr)\text{Var}\bigl(a_j^{[\ell-1]}\bigr).
\]
Now since we seek activation values with a mean of 0 and a variance of $\breve{v}$, and the weights are initialized with mean 0, the first two terms above vanish and we have $\text{Var}\bigl(w_{i,j}^{[\ell]} a_j^{[\ell-1]}\bigr) = \text{Var}\bigl(w_{i,j}^{[\ell]}\bigr)\, \breve{v}$. Now also assuming this variance $\breve{v}$ for activations of layer $\ell$, and combining in (5.37), we have,
\[
\breve{v} = \sum_{j=1}^{N_{\ell-1}} \text{Var}\bigl(w_{i,j}^{[\ell]}\bigr)\, \breve{v}.
\]
By further requiring that all weight initialization variance entries for layer $\ell$ are constant, say at $\text{Var}\bigl(w_{i,j}^{[\ell]}\bigr) = \breve{w}$, we have $\breve{v} = N_{\ell-1} \breve{w} \breve{v}$. Then, this shows that setting $\breve{w} = 1/N_{\ell-1}$ achieves the desired result.
Further Insight Regarding Vanishing or Exploding Values
The derivation of the Xavier initialization offers further insight about the nature of vanishing or exploding values. Assume that the input features vector $x$ is also random, with each entry having the same variance $\breve v_x$. Then, by recursing the forward pass (5.4), continuing to approximate the activation function as the identity, and using the variance calculations above, we have,
\[
\mathrm{Var}\big(a^{[L]}_j\big) = \bigg[\prod_{\ell=1}^{L} N_{\ell-1}\, \breve w\bigg]\, \breve v_x.
\]
This further highlights that setting $\breve w = 1/N_{\ell-1}$ yields stability of the variance of the outputs, which is especially important when the number of layers $L$ is large. Other choices where $N_{\ell-1}\, \breve w < 1$ across all layers $\ell$ may yield vanishing activations, and similarly if $N_{\ell-1}\, \breve w > 1$ we may observe exploding activations.
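The effect of this product can be checked numerically. The following is a minimal Python sketch (not from the book; the layer width, depth, and trial count are arbitrary hypothetical choices) that propagates unit-variance inputs through a deep linear network with zero-mean weights of variance $c/N_{\ell-1}$ per layer, so that $N_{\ell-1}\,\breve w = c$ at every layer:

```python
import math
import random

random.seed(1)

def output_variance(num_layers, width, c, trials=500):
    """Estimate Var(a_1^[L]) for a deep linear network (identity activation,
    zero biases, unit-variance inputs) with i.i.d. zero-mean weights of
    variance c / width, so that N_{l-1} * w_breve = c at every layer."""
    outs = []
    for _ in range(trials):
        a = [random.gauss(0.0, 1.0) for _ in range(width)]   # Var(x_j) = 1
        for _ in range(num_layers):
            a = [sum(random.gauss(0.0, math.sqrt(c / width)) * aj for aj in a)
                 for _ in range(width)]
        outs.append(a[0])
    m = sum(outs) / trials
    return sum((o - m) ** 2 for o in outs) / trials

L, N = 8, 10
v_xavier = output_variance(L, N, 1.0)   # c = 1 (Xavier): stays near 1
v_small = output_variance(L, N, 0.5)    # c < 1: roughly 0.5^8, vanishing
v_large = output_variance(L, N, 2.0)    # c > 1: roughly 2^8, exploding
print(v_xavier, v_small, v_large)
```

With $c = 1$ the estimated output variance stays near $\breve v_x = 1$, while $c < 1$ and $c > 1$ give variances near $c^L$, vanishing or exploding geometrically in the depth $L$.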
5.6 Batch Normalization
In Section ?? we discussed standardization of data. As seen in equations (??) and (??), this is simply a transformation of each feature of the input data based on subtraction of the mean and division by the standard deviation. There are multiple benefits in carrying out standardization, one of which is a reshaping of the loss landscape. As an illustration, in Section ?? we explored the effect that such standardization has on simple problems. It turns out that with much more complicated deep neural networks, standardization, also called normalization, may also be very beneficial. In this section we present an extended idea called batch normalization, where the outputs of intermediate hidden layers are also normalized in an adaptive manner.
The overarching idea of batch normalization is to normalize (or standardize) not just the input data but also individual neuron values within the intermediate hidden layers or final layer of the network. In a nutshell, returning to the display (5.2) and taking $j$ as an index of a neuron in layer $\ell$, we may wish to have either $z^{[\ell]}_j$ or $a^{[\ell]}_j$ exhibit near-normalized values over the input dataset. Such normalization of the neuron values then yields more consistent training and mitigates vanishing or exploding gradient problems. It also has a slight regularization effect which may prevent overfitting.
Our exposition here outlines normalization of the $z^{[\ell]}_j$ values, yet the reader should keep in mind that one may choose to do so for the $a^{[\ell]}_j$ values instead. When applying the batch normalization technique that we describe here, the output of the training process involves further parameters, some of which are trained via optimization (gradient descent), while others are based on running averages computed during the training process. We now present the details.
The Idea of Per Unit Normalization
The main idea of batch normalization is to consider neuron $j$ in layer $\ell$ and, instead of using $z^{[\ell]}_j$ as in (5.2), to use a transformed version $\tilde z^{[\ell]}_j$. Such a transformation takes place both at training time and when using the model in production, with subtle differences between the two cases as we describe below. The transformation aims to position the $\tilde z^{[\ell]}_j$ values so that they have approximately zero mean and unit standard deviation over the data. Further, the transformation involves a correction using trainable parameters.
The transformation requires estimates of the unit's mean and standard deviation so that the unit value can be normalized. During training, at a given training epoch and for a given mini-batch of size $n_b$, such estimates are obtained via,
\[
\hat\mu^{[\ell]}_j = \frac{1}{n_b} \sum_{i=1}^{n_b} z^{[\ell](i)}_j \qquad \text{and} \qquad \hat\sigma^{[\ell]}_j = \sqrt{\frac{1}{n_b} \sum_{i=1}^{n_b} \Big(z^{[\ell](i)}_j - \hat\mu^{[\ell]}_j\Big)^2}, \tag{5.38}
\]
where $z^{[\ell](i)}_j$ is the value at unit $j$, at layer $\ell$, and sample $i$ within the mini-batch, prior to carrying out normalization.
With $\hat\mu^{[\ell]}_j$ and $\hat\sigma^{[\ell]}_j$ available, we compute
\[
\bar z^{[\ell](i)}_j = \frac{z^{[\ell](i)}_j - \hat\mu^{[\ell]}_j}{\sqrt{\big(\hat\sigma^{[\ell]}_j\big)^2 + \varepsilon}}, \tag{5.39}
\]
where $\varepsilon > 0$ is a small fixed quantity that ensures that we do not divide by zero. At this point $\bar z^{[\ell](i)}_j$ has nearly zero mean and nearly unit standard deviation over the data samples $i$ in the mini-batch (nearly, rather than exactly, only due to $\varepsilon$).
Now finally, an additional transformation takes place in the form,
\[
\tilde z^{[\ell](i)}_j = \gamma^{[\ell]}_j\, \bar z^{[\ell](i)}_j + \beta^{[\ell]}_j, \tag{5.40}
\]
where $\gamma^{[\ell]}_j$ and $\beta^{[\ell]}_j$ are trainable parameters. A consequence of (5.39) and (5.40) is that $\tilde z^{[\ell](i)}_j$ has a standard deviation of approximately $\gamma^{[\ell]}_j$ and a mean of approximately $\beta^{[\ell]}_j$ over the data samples $i$ in the mini-batch. These parameters are respectively initialized at $1$ and $0$, and then as training progresses, $\gamma^{[\ell]}_j$ and $\beta^{[\ell]}_j$ are updated using the same learning mechanisms applied to the weights and biases of the network. Namely, they are updated using backpropagation and gradient based learning. Specific backpropagation details are presented at the end of this section.
As presented above, each unit or neuron has its own set of trainable parameters, $\gamma^{[\ell]}_j$ and $\beta^{[\ell]}_j$. However, in practice, multiple neurons in the same layer often share the same batch normalization parameters. This implies adjusting the mean and standard deviation estimates (5.38) to have summations over multiple neurons $j$. For example, in convolutional neural networks as presented in the next chapter, all neurons of the same channel typically have the same batch normalization applied to them.
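As a concrete illustration, here is a minimal Python sketch of steps (5.38)–(5.40) for a single unit over a mini-batch (the batch values, $\gamma$, and $\beta$ below are arbitrary hypothetical numbers):

```python
import math

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Batch-normalize the values z = [z^(1), ..., z^(n_b)] of a single unit
    over a mini-batch, following (5.38)-(5.40)."""
    n_b = len(z)
    mu = sum(z) / n_b                                        # (5.38): batch mean
    var = sum((zi - mu) ** 2 for zi in z) / n_b              # (5.38): batch variance
    z_bar = [(zi - mu) / math.sqrt(var + eps) for zi in z]   # (5.39)
    z_tilde = [gamma * zb + beta for zb in z_bar]            # (5.40)
    return z_tilde, z_bar, mu, var

z = [2.0, 4.0, 6.0, 8.0]
z_tilde, _, mu, var = batch_norm_forward(z, gamma=2.0, beta=1.0)
print(mu, var)                            # 5.0 5.0
print([round(v, 3) for v in z_tilde])     # mean approx. beta, std approx. gamma
```

Over the mini-batch, the transformed values have mean approximately $\beta = 1$ and standard deviation approximately $\gamma = 2$, as stated above.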
Batch Normalization in Production
When using the model in production we need to be able to apply the model to a single input data sample, in which case evaluation of $\hat\mu^{[\ell]}_j$ and $\hat\sigma^{[\ell]}_j$ as in (5.38) is impossible. Instead, averages from the training set are collected during training time and these are supplied and deployed with the model. Practically, since parameters are updated during a training run and this updating in turn affects the $z^{[\ell]}_j$ values, averages are often collected in parallel to training. A common technique is to use exponential smoothing as in (??) in Chapter ??, and apply it to the sequence of computed values $\hat\mu^{[\ell]}_j$ and $\hat\sigma^{[\ell]}_j$ between mini-batches during training. The result of this exponential smoothing, denoted here as $\bar{\hat\mu}^{[\ell]}_j$ and $\bar{\hat\sigma}^{[\ell]}_j$, is then deployed together with the model.
As a summary, in a deployed model, each unit ($j$ in layer $\ell$), or set of units, to which batch normalization is applied is deployed with the trained $\gamma^{[\ell]}_j$ and $\beta^{[\ell]}_j$ values, as well as with the exponentially smoothed estimates $\bar{\hat\mu}^{[\ell]}_j$ and $\bar{\hat\sigma}^{[\ell]}_j$. Then in production,
\[
\tilde z^{[\ell]}_j = \gamma^{[\ell]}_j\, \frac{z^{[\ell]}_j - \bar{\hat\mu}^{[\ell]}_j}{\sqrt{\big(\bar{\hat\sigma}^{[\ell]}_j\big)^2 + \varepsilon}} + \beta^{[\ell]}_j, \tag{5.41}
\]
is used in place of $z^{[\ell]}_j$.
Interestingly, returning to (5.2) and observing that the vector $z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}$ is an affine function of the previous activation vector, we see that when combined with (5.41) we are left with an affine transformation $\tilde z^{[\ell]} = \widetilde W^{[\ell]} a^{[\ell-1]} + \tilde b^{[\ell]}$, with a modified weight matrix $\widetilde W^{[\ell]}$ and bias vector $\tilde b^{[\ell]}$. Hence, at least in principle, deployment of batch normalization of this sort can be done without the production model needing to be aware of batch normalization at all, since we can encode it in the weight matrices and bias vectors.⁶
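A small Python sketch (with hypothetical numbers) makes this folding concrete: applying (5.41) after the affine map $z = Wa + b$ gives the same output as a single folded affine map with $\widetilde W$ and $\tilde b$:

```python
import math

def fold_batch_norm(W, b, gamma, beta, mu, sigma, eps=1e-5):
    """Fold per-unit batch-norm parameters into the affine layer, so that
    gamma_j * ((W a + b)_j - mu_j) / sqrt(sigma_j^2 + eps) + beta_j
    equals (W_tilde a + b_tilde)_j."""
    W_tilde, b_tilde = [], []
    for j, row in enumerate(W):
        s = gamma[j] / math.sqrt(sigma[j] ** 2 + eps)   # per-unit scale
        W_tilde.append([s * w for w in row])
        b_tilde.append(s * (b[j] - mu[j]) + beta[j])
    return W_tilde, b_tilde

# A tiny check on hypothetical numbers: both routes give the same output.
W = [[1.0, -2.0], [0.5, 3.0]]
b = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.3, -0.1]
mu, sigma = [0.4, 1.2], [2.0, 0.7]
a = [0.6, -1.1]

z = [sum(w * ai for w, ai in zip(row, a)) + bj for row, bj in zip(W, b)]
bn = [g * (zj - m) / math.sqrt(s ** 2 + 1e-5) + be
      for g, be, m, s, zj in zip(gamma, beta, mu, sigma, z)]
Wt, bt = fold_batch_norm(W, b, gamma, beta, mu, sigma)
folded = [sum(w * ai for w, ai in zip(row, a)) + bj for row, bj in zip(Wt, bt)]
print([round(x, 6) for x in bn] == [round(x, 6) for x in folded])  # True
```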
Backpropagation of Batch Normalization Parameters
We close this section with an exposition of how the batch normalization parameters $\gamma^{[\ell]}_j$ and $\beta^{[\ell]}_j$ can be updated as part of the backpropagation algorithm. To simplify the notation we omit the superscript $[\ell]$ and the subscript $j$, as the batch normalization parameters are either neuron-specific or are shared by a group of neurons. Our focus is thus on one pair $\gamma$ and $\beta$, for which we require $\partial C/\partial\gamma$ and $\partial C/\partial\beta$ respectively. We also require computing the intermediate derivative values $\partial C/\partial z^{(i)}$ for backpropagation to operate.
Figure 5.10: A schematic of the computational graph for batch normalization at an arbitrary layer. The goal is to compute the gradients of the loss with respect to $\gamma$, $\beta$, and each $z^{(i)}$.
At each iteration, the network estimates the mean $\hat\mu$ and the standard deviation $\hat\sigma$ corresponding to the current batch. In the forward pass, given inputs (outputs of the previous layer) $z^{(i)}$ for $i = 1, \ldots, n_b$, we calculate the outputs of the batch normalization procedure, $\tilde z^{(i)}$. Then, the backpropagation pass propagates the gradient of the loss function $C(\cdot)$ through this transformation and computes the gradients with respect to the parameters $\gamma$ and $\beta$. We then also compute the intermediate derivatives $\partial C/\partial z^{(i)}$.
A schematic of the computational graph associated with the batch normalization steps is presented in Figure 5.10. Backpropagation of the higher layers yields the gradient $\partial C/\partial\tilde z^{(i)}$. With this we want to compute $\partial C/\partial\gamma$, $\partial C/\partial\beta$, and $\partial C/\partial z^{(i)}$.
⁶ This consolidation of batch normalization into the weight matrix and bias vector is often not done in practice, especially due to the fact that $W$ is often a convolution matrix as described in the next chapter.
The gradients $\partial C/\partial\gamma$ and $\partial C/\partial\beta$ are simple. By applying the chain rule with (5.40), we get
\[
\frac{\partial C}{\partial \gamma} = \sum_{i=1}^{n_b} \frac{\partial C}{\partial \tilde z^{(i)}}\, \bar z^{(i)}, \qquad \text{and} \qquad \frac{\partial C}{\partial \beta} = \sum_{i=1}^{n_b} \frac{\partial C}{\partial \tilde z^{(i)}}. \tag{5.42}
\]
In order to compute $\partial C/\partial z^{(i)}$, we also need to evaluate $\partial C/\partial\hat\mu$, $\partial C/\partial\hat\sigma^2$, and $\partial C/\partial\bar z^{(i)}$ since,
\[
\frac{\partial C}{\partial z^{(i)}} = \frac{\partial C}{\partial \bar z^{(i)}}\, \frac{\partial \bar z^{(i)}}{\partial z^{(i)}} + \frac{\partial C}{\partial \hat\sigma^2}\, \frac{\partial \hat\sigma^2}{\partial z^{(i)}} + \frac{\partial C}{\partial \hat\mu}\, \frac{\partial \hat\mu}{\partial z^{(i)}}. \tag{5.43}
\]
With the multivariate chain rule and (5.39) and (5.38) we have,
\[
\frac{\partial C}{\partial \bar z^{(i)}} = \frac{\partial C}{\partial \tilde z^{(i)}}\, \gamma,
\]
\[
\frac{\partial C}{\partial \hat\sigma^2} = \sum_{i=1}^{n_b} \frac{\partial C}{\partial \bar z^{(i)}}\, \frac{\partial \bar z^{(i)}}{\partial \hat\sigma^2} = \sum_{i=1}^{n_b} \frac{\partial C}{\partial \bar z^{(i)}}\, \big(z^{(i)} - \hat\mu\big) \Big(-\frac{1}{2}\Big) \big(\hat\sigma^2 + \varepsilon\big)^{-3/2},
\]
\[
\frac{\partial C}{\partial \hat\mu} = \sum_{i=1}^{n_b} \frac{\partial C}{\partial \bar z^{(i)}}\, \frac{\partial \bar z^{(i)}}{\partial \hat\mu} + \frac{\partial C}{\partial \hat\sigma^2}\, \frac{\partial \hat\sigma^2}{\partial \hat\mu} = \sum_{i=1}^{n_b} \frac{\partial C}{\partial \bar z^{(i)}}\, \frac{-1}{\sqrt{\hat\sigma^2 + \varepsilon}} + \frac{\partial C}{\partial \hat\sigma^2}\, \frac{-2 \sum_{i=1}^{n_b} \big(z^{(i)} - \hat\mu\big)}{n_b}.
\]
These formulas present us with an explicit expression by transforming (5.43) to
\[
\frac{\partial C}{\partial z^{(i)}} = \frac{\partial C}{\partial \bar z^{(i)}}\, \frac{1}{\sqrt{\hat\sigma^2 + \varepsilon}} + \frac{\partial C}{\partial \hat\sigma^2}\, \frac{2\big(z^{(i)} - \hat\mu\big)}{n_b} + \frac{\partial C}{\partial \hat\mu}\, \frac{1}{n_b}. \tag{5.44}
\]
This representation of (5.44) can then be integrated with backpropagation through the neural network.
5.7 Mitigating Overfitting with Dropout and Regularization
In Section ?? we discussed the challenges and tradeoffs associated with overfitting and the need for generalization. On the one hand we seek a model that will make use of the available data and properly capture the dependence on the input features, while on the other hand we wish to avoid overfitting. As presented in Section ??, one general approach for mitigating overfitting is called regularization, where as in (??) we may augment the loss function $C(\cdot)$ with a regularization term $R_\lambda(\theta)$, and optimize $\min_\theta\, C(\theta; \mathcal{D}) + R_\lambda(\theta)$ in place of just minimizing the loss function. Such a practice using additive regularization is possible with deep neural networks as well. However, there is also an alternative popular approach called dropout which is specific to deep neural networks. We first describe the dropout approach and then highlight relationships between dropout, ensemble methods, and the addition of regularization terms.
Dropout
The idea of dropout is to randomly zero out certain neurons during the training process.
This allows training to focus on multiple random subsets of the parameters and yields a
form of regularization.
With dropout, at any backpropagation iteration (forward pass and backward pass) on a mini-batch, only some random subset of the neurons is active. Practically, neurons in layer $\ell$,
Figure 5.11: An illustration of dropout during training for a network with $L = 3$ layers, $p = 5$ input features, and $q = 1$ output. In each iteration during training, the network is transformed to the network on the right, where the dropped-out units are randomly selected.
for $\ell = 0, \ldots, L-1$, have a specified probability $p^{[\ell]}_{\text{keep}} \in (0, 1]$, where if $p^{[\ell]}_{\text{keep}} = 1$ dropout does not affect the layer, and otherwise each neuron $i$ of the layer is "dropped out" with probability $1 - p^{[\ell]}_{\text{keep}}$. This is simply a zeroing out of the neuron activation $a^{[\ell]}_i$, as we illustrate in Figure 5.11.
In the forward pass, when we get to the neurons of layer $\ell + 1$, all the neurons in layer $\ell$ that were zeroed out do not participate in the computation. Specifically, the update for a neuron $j$ in the next layer, assuming a scalar activation function $\sigma(\cdot)$, becomes,
\[
a^{[\ell+1]}_j = \sigma\bigg( b^{[\ell+1]}_j + \sum_{i \text{ kept}} w^{[\ell+1]}_{i,j}\, a^{[\ell]}_i \bigg). \tag{5.45}
\]
If $p^{[\ell]}_{\text{keep}} = 1$, all neurons are kept and the summation is over $i = 1, \ldots, N_\ell$ as in (5.7); otherwise the summation is over the random subset of kept neurons. The Bernoulli coin flips determining which neurons are kept and which are not are all carried out independently within the layer, across layers, and throughout the training iterations.
In the backward pass, the effect of dropping out a neuron is evident via (5.33). When neuron $i$ is dropped out in layer $\ell$, the weights $w^{[\ell+1]}_{i,j}$ for all neurons $j = 1, \ldots, N_{\ell+1}$ are updated based on the gradient $[g^{[\ell]}_W]_{i,j}$, which is set at $0$. With a pure gradient descent optimizer this means that the weights $w^{[\ell+1]}_{i,j}$ are not updated at all during the given iteration, whereas with a momentum based optimizer such as ADAM it means that the descent step for those weights has a smaller magnitude; see Section ?? for a review of such optimizers.
At the end of training, even with dropout implemented, the trained model still has a complete set of weight matrices without any zeroed out elements, similar to the case in which we do not have dropout. Hence, to account for the fact that neuron $i$ in layer $\ell$ only took part in a proportion $p^{[\ell]}_{\text{keep}}$ of the iterations during training, when using the model in production (test time), we would like to use a weight matrix $\widetilde W^{[\ell+1]} = p^{[\ell]}_{\text{keep}} W^{[\ell+1]}$ in place of $W^{[\ell+1]}$. The rationale here is to have the production forward pass similar to the training pass. Namely, such a production forward pass has,
\[
a^{[\ell+1]}_j = \sigma\bigg( b^{[\ell+1]}_j + \sum_{i=1}^{N_\ell} p^{[\ell]}_{\text{keep}}\, w^{[\ell+1]}_{i,j}\, a^{[\ell]}_i \bigg), \tag{5.46}
\]
and this serves as an approximation to (5.45). To see the basis of this approximation, treat the summands in (5.45), $w^{[\ell+1]}_{i,j} a^{[\ell]}_i$, as identically distributed random variables, say with expected value $\mu_j$. Now observe that the expected value of the summation (over a random number of elements) in (5.45) and the expected value of the summation in (5.46) are both $N_\ell\, p^{[\ell]}_{\text{keep}}\, \mu_j$.
In practice, the training–production pair (5.45)–(5.46) is not typically used per se. The more practical alternative, instead of remembering $p^{[\ell]}_{\text{keep}}$ and deploying it with the production model, is to modify the training forward pass to use the reciprocal of $p^{[\ell]}_{\text{keep}}$ as a scaling factor of the weights. Namely, the training forward pass is,
\[
a^{[\ell+1]}_j = \sigma\bigg( b^{[\ell+1]}_j + \sum_{i \text{ kept}} \frac{1}{p^{[\ell]}_{\text{keep}}}\, w^{[\ell+1]}_{i,j}\, a^{[\ell]}_i \bigg). \tag{5.47}
\]
This form allows us to use the resulting model normally in production without having to take dropout into consideration at all. Namely, the production forward pass is,
\[
a^{[\ell+1]}_j = \sigma\bigg( b^{[\ell+1]}_j + \sum_{i=1}^{N_\ell} w^{[\ell+1]}_{i,j}\, a^{[\ell]}_i \bigg). \tag{5.48}
\]
With this setup, the expected values of the summations in (5.47) and (5.48) agree, and further, the forward step (5.48) agrees with the standard model as in (5.2), (5.4), and (5.7).
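The agreement between (5.47) and (5.48) can be illustrated with a minimal Python sketch (a hypothetical single-output layer; the identity activation is used by default so that the expectation argument applies exactly):

```python
import random

random.seed(3)

def dropout_forward(a_prev, W, b, p_keep, sigma=lambda u: u, train=True):
    """Inverted-dropout forward step (5.47); with train=False it is the
    plain production step (5.48). Here W[j] holds the weights feeding
    unit j of the next layer."""
    if train:
        # Keep each unit with probability p_keep and rescale the kept ones.
        a_prev = [a / p_keep if random.random() < p_keep else 0.0 for a in a_prev]
    return [sigma(bj + sum(w * ai for w, ai in zip(row, a_prev)))
            for row, bj in zip(W, b)]

# Hypothetical small layer: the production pass (5.48) matches the
# training pass (5.47) on average over many dropout draws.
a, W, b, p = [1.0, 2.0, 3.0], [[0.2, -0.1, 0.4]], [0.05], 0.8
prod = dropout_forward(a, W, b, p, train=False)[0]
avg = sum(dropout_forward(a, W, b, p, train=True)[0]
          for _ in range(20000)) / 20000
print(prod, round(avg, 2))  # both close to 1.25
```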
In practice, this simple and easy to implement idea of dropout has improved the performance of deep neural networks in many empirically tested cases. It is now an integral part of deep learning training. We now explore the idea a bit further through the viewpoint of ensemble methods.
Viewing Dropout as an Ensemble Approximation
Dropout can be viewed as an approximation of an ensemble method, a general concept from machine learning. Let us first present an overview of ensemble methods, or ensemble learning, and then argue why dropout serves as an approximation.

In general, when we seek a model $\hat y = f_\theta(x)$, we may use the same dataset to train multiple models that all try to achieve the same task. We may then combine the models into an ensemble (model). The latter is usually more accurate than each of the individual models.
Let us illustrate this in the case of a scalar output model. We can choose to use $M$ models and denote each model via $\hat y^{\{i\}} = f^{\{i\}}_{\theta^{\{i\}}}(x)$ for $i = 1, \ldots, M$, where $\theta^{\{i\}}$ is taken here as the set of parameters of the $i$-th model. The ensemble model on an input $x$ is then the average,
\[
f_\theta(x) = \frac{1}{M} \sum_{i=1}^{M} f^{\{i\}}_{\theta^{\{i\}}}(x), \qquad \text{where} \quad \theta = \big(\theta^{\{1\}}, \ldots, \theta^{\{M\}}\big). \tag{5.49}
\]
Clearly $f_\theta(\cdot)$ is more computationally costly since it requires $M$ models instead of a single model. This implies storing $M$ parameter sets, training $M$ times instead of once, and evaluating $M$ models in (5.49) instead of one during production. Nevertheless, there are benefits.
To illustrate the strength of this technique, assume for simplicity that the models are homogenous in nature and only differ due to randomness in the training process and not the model choice or hyper-parameters. In this case, for some fixed unseen input $\tilde x$ we may treat the output of model $i$, denoted $\hat y^{\{i\}}_{\theta^{\{i\}}}(\tilde x)$, as a random variable that is identical in distribution to every other model output $\hat y^{\{j\}}_{\theta^{\{j\}}}(\tilde x)$, yet generally not independent. We further assume that any pair of model outputs is identically distributed to any other pair. In this case we may denote,
\[
\mathrm{E}\Big[\hat y^{\{i\}}_{\theta^{\{i\}}}(\tilde x)\Big] = \mu, \qquad \mathrm{Var}\Big(\hat y^{\{i\}}_{\theta^{\{i\}}}(\tilde x)\Big) = \sigma^2, \qquad \text{and} \qquad \mathrm{cor}\Big(\hat y^{\{i\}}_{\theta^{\{i\}}}(\tilde x),\, \hat y^{\{j\}}_{\theta^{\{j\}}}(\tilde x)\Big) = \rho,
\]
respectively, where $\mathrm{cor}(\cdot, \cdot)$ is the correlation between two models $i \neq j$ and is assumed to be the same for all $i$, $j$ pairs. Such an assumption on the correlation also imposes⁷ a lower bound on the correlation,
\[
-\frac{1}{M-1} \le \rho, \qquad \text{or} \qquad 0 \le \rho + \frac{1-\rho}{M}. \tag{5.50}
\]
We can now evaluate the mean and variance of the ensemble model (5.49). Namely,
\[
\mathrm{E}\big[f_\theta(\tilde x)\big] = \frac{1}{M}\, \mathrm{E}\bigg[\sum_{i=1}^{M} f^{\{i\}}_{\theta^{\{i\}}}(\tilde x)\bigg] = \mu, \tag{5.51}
\]
and further, noting that $\rho\sigma^2$ is the covariance between any two models, we obtain,⁸
\[
\mathrm{Var}\big(f_\theta(\tilde x)\big) = \frac{1}{M^2}\, \mathrm{Var}\bigg(\sum_{i=1}^{M} f^{\{i\}}_{\theta^{\{i\}}}(\tilde x)\bigg) = \frac{1}{M^2} \Big( M \sigma^2 + M(M-1)\rho\sigma^2 \Big) = \bigg( \rho + \frac{1-\rho}{M} \bigg) \sigma^2. \tag{5.52}
\]
With (5.50) it is confirmed that, as required, the variance expression in (5.52) is non-negative. Further, we see that as the number of models in the ensemble, $M$, grows, the variance of the ensemble model converges to $\rho\sigma^2$. Since in addition to (5.50), $\rho \le 1$ and practically $\rho < 1$, this limiting variance is less than $\sigma^2$. For example, if $\rho = 0.5$, as $M$ grows the variance of the estimator drops by $50\%$.
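The variance expression (5.52) is easy to tabulate; the sketch below evaluates it for a few hypothetical values of $M$ and $\rho$:

```python
def ensemble_variance(M, rho, sigma2):
    """The ensemble variance (rho + (1 - rho) / M) * sigma^2 from (5.52)."""
    return (rho + (1.0 - rho) / M) * sigma2

# With rho = 0.5 the variance drops towards rho * sigma^2 = 0.5 as M grows,
print(ensemble_variance(1, 0.5, 1.0))        # 1.0 (a single model)
print(ensemble_variance(10, 0.5, 1.0))       # 0.55
print(ensemble_variance(10 ** 6, 0.5, 1.0))  # essentially 0.5
# and at the lower bound rho = -1/(M-1) of (5.50) the variance vanishes.
print(ensemble_variance(5, -0.25, 1.0))      # 0.0
```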
Putting the computational costs aside, these properties of ensemble models make them very attractive, because the bias does not change as shown in (5.51), but the variance decreases;
⁷ This may be shown based on the constraint that the covariance matrix is a positive semi-definite matrix.
⁸ The variance of a sum of random variables is the sum of the elements of the covariance matrix.
recall also the general discussion of the bias–variance tradeoff in Section ??. Nevertheless, deep learning models are not easily amenable to ensemble models in the form of (5.49) because the number of parameters and computational cost (both for training and production) is too high. Training a single model may sometimes take days, and the computational cost of a single evaluation $f^{\{i\}}_{\theta^{\{i\}}}(\tilde x)$ is also non-negligible. This is where dropout comes in.
We may loosely view dropout as an ensemble of $M$ models, where $M$ is the number of training iterations. In a practical training scenario, $M$ can be on the order of hundreds, thousands, or tens of thousands, depending on the number of epochs during training and the size of the mini-batch in comparison to the number of training samples. For example, if there are $n = 10^5$ training samples and the mini-batch size is $n_b = 100$, then each epoch has $1000$ iterations, and if we execute, say, $1000$ epochs of training, then $M = 10^6$. Then, since the production model involves weights accumulated throughout all $10^6$ iterations, we may loosely view the production model as an ensemble model and we may expect its variance to be reduced from $\sigma^2$ to approximately $\rho\sigma^2$. Now clearly, each of the $M$ iterations did not execute training of the model fully but rather only trained for a single iteration. Hence this ensemble view of dropout is merely a heuristic description.
Addition of Regularization Terms and Weight Decay
In addition to dropout, as already introduced in Section ??, the addition of a regularization term is another key approach to prevent overfitting and improve generalization performance. Augmenting the loss with a regularization term $R_\lambda(\theta)$ restricts the flexibility of the model, and this restriction is sometimes needed to prevent overfitting. In the context of deep learning, and especially when ridge regression style regularization is applied, this practice is sometimes called weight decay when considering gradient based optimization. We now elaborate.
Take the original loss function $C(\theta)$ and augment it to be $\tilde C(\theta) = C(\theta) + R_\lambda(\theta)$. In our discussion here, let us focus on the ridge regression type regularization with parameter $\lambda > 0$, and,
\[
R_\lambda(\theta) = \frac{\lambda}{2}\, R(\theta), \qquad \text{with} \quad R(\theta) = \|\theta\|^2 = \theta_1^2 + \ldots + \theta_d^2.
\]
Here, for notational simplicity, we consider all the $d$ parameters of the model as scalars, $\theta_i$ for $i = 1, \ldots, d$ (not to be confused with the notation used above for the parameters of a model as part of an ensemble). Nevertheless, note that typically regularization is only applied to the weight matrix parameters and not to the bias vectors. Further, we may even restrict regularization to certain layers and not others.
Now assume we execute basic gradient descent steps as in (??). With a learning rate $\alpha > 0$, the update at iteration $t$ is,
\[
\theta^{(t+1)} = \theta^{(t)} - \alpha\, \nabla \tilde C\big(\theta^{(t)}\big).
\]
In our ridge regression style penalty case we have $\nabla \tilde C(\theta) = \nabla C(\theta) + \lambda\theta$, and hence the gradient descent update can be represented as
\[
\theta^{(t+1)} = (1 - \alpha\lambda)\, \theta^{(t)} - \alpha\, \nabla C\big(\theta^{(t)}\big). \tag{5.53}
\]
Now the interesting aspect of (5.53), assuming that $\alpha\lambda < 2$, is that it involves shrinkage, or weight decay, directly on the parameters in addition to gradient based learning. That is, independently of the value of the gradient $\nabla C(\theta^{(t)})$, in every iteration, (5.53) continues to decay the parameters, each time multiplying the previous parameter by a factor $1 - \alpha\lambda$.
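The equivalence between a gradient step on the penalized objective and the decay form (5.53) can be checked directly; in the Python sketch below (with a hypothetical one-parameter quadratic loss), both recursions produce the same trajectory:

```python
def grad_C(theta):
    # Gradient of a hypothetical loss C(theta) = (theta - 3)^2 / 2.
    return theta - 3.0

alpha, lam = 0.1, 0.5
theta_reg = theta_decay = 10.0
for _ in range(200):
    # Explicit ridge penalty: a gradient step on C(theta) + (lam/2) * theta^2.
    theta_reg = theta_reg - alpha * (grad_C(theta_reg) + lam * theta_reg)
    # Weight decay form (5.53): shrink by (1 - alpha*lam), then a plain step on C.
    theta_decay = (1 - alpha * lam) * theta_decay - alpha * grad_C(theta_decay)

# Both converge to 2.0, the minimizer of (theta - 3)^2 / 2 + 0.25 * theta^2.
print(round(theta_reg, 6), round(theta_decay, 6))
```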
This weight decay phenomenon can then be extended algorithmically to enforce regularization not directly via the addition of a regularization term, but rather simply by augmenting the gradient descent updates to include weight decay. For example, we may consider popular gradient based algorithms such as ADAM in Section ?? and the other algorithms in Section ??, and in each case add an additional step which incorporates multiplying the weights by a constant less than, but close to, unity.
Notes and References
The origins of deep learning date back to the same era during which the digital computer materialised. In fact, early ideas of artificial neural networks were first introduced in 1943 with [39]. Then, in the post WWII era, Frank Rosenblatt's perceptron was the first working implementation of a neural network model, [49]. The perceptron and follow up work drew excitement in the 1960's, yet with the 1969 paper [41], there was formal analysis that shone a negative light on the limited capabilities of single layer neural networks, and this eventually resulted in a decline of interest, which is sometimes termed the "AI winter" of 1974–1980. Before this period there were even implementations of deep learning architectures with [28], and with [27], where 8 layers were implemented. In 1967 such multi-layer perceptrons were even trained using stochastic gradient descent, [2].
Some attribute the end of the 1974–1980 AI winter to a few developments that drew attention and resulted in impressive results. Some include Hopfield networks, which are recurrent in nature (see Chapter ??), and also the formalism and implementation of the backpropagation algorithm in 1981, [57]. In fact, early ideas of backpropagation can be attributed to a PhD thesis in 1970, [36] by S. Linnainmaa. See also our notes and references on early developments of reverse mode automatic differentiation at the end of Chapter ??. In parallel to this revival of interest in artificial intelligence in the early 1980's there were many developments that led to today's contemporary convolutional neural networks. See the notes and references at the end of Chapter ??. Historical accounts of deep learning can be found in [51] and [52], as well as a website by A. Kurenkov.⁹ The book [18] was a key reference on neural networks, summarizing developments up to the turn of the twenty first century. The 2015 Nature paper by Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, [33], captures more contemporary developments.
Positive results about the universal approximation ability of neural networks, such as Theorem 5.1 presented in this chapter, appear in [10] for a class of sigmoid activations and in [22] for a larger class of non-polynomial functions. With such results, it became evident that with a single hidden layer, neural networks are very expressive. Still, the practical insight to add more hidden layers to increase expressivity arose with the work of Geoffrey Hinton et al. in 2006 [20], Yoshua Bengio et al. in 2006 [6], and other influential works such as [5], [8], and [32]. The big explosion was in 2012 with AlexNet in [29].
Our example of a multiplication gate as in Figure 5.4 comes from [34]. Our motivation to use this example, and the analysis we present around Figure 5.5 dealing with the expressivity of deeper networks, is motivated by a 2017 talk of Niklas Wahlström.¹⁰ Beyond our elementary presentation, many researchers have tried to provide theoretical justifications for why deep neural networks are so powerful. Some justifications of the power of deep learning are in [53] using information theory, and in [9], [12], [47], and [55], using other mathematical reasoning approaches. See also [40] and [46] for surveys of theoretical analysis of neural networks.
In terms of activation functions, the sigmoid function was the most popular function in early neural architectures, with the tanh function serving as an alternative. The popularity of ReLU grew with [42], especially after its useful application in the AlexNet convolutional neural network [29]. Note that ReLU was first used in 1969; see [13]. Other activation functions such as leaky ReLU were introduced in [38], as well as parameterized activation functions such as PReLU, studied in [19] in order to mitigate vanishing and exploding gradient issues. A general survey of activation functions is in [11].
The backpropagation algorithm can be attributed to [50], yet has earlier origins with general backward mode automatic differentiation, surveyed in [4] (see also the notes of Chapter ??). Our presentation in Algorithm 5.2 is specific to the precise form of feedforward networks that we considered, yet variants can be implemented. Importantly, with the advent of automatic differentiation frameworks such as TensorFlow [1], followed by PyTorch [44], Flux.jl [24], JAX [7], and others, the use of backpropagation as part of deep learning has become standard. Such software frameworks automatically implement backpropagation as a special case of backward mode automatic differentiation where the computational graph is constructed, often automatically based on code. Early deep learning work, up to about 2014, did not have such software frameworks available, and hence "hand coding" of backpropagation was more delicate and time consuming. It is fair to say that with the proliferation of deep learning software frameworks, innovations in research and industry accelerated greatly. We
⁹ https://www.skynettoday.com/overviews/neural-net-history.
¹⁰ See https://www.it.uu.se/katalog/nikwa778/talks/DL_EM2017_online.pdf.
also note that properly considering matrix analysis aspects of backpropagation often requires matrix calculus, for which useful references are [16] and [45].
Extensive discussion of vanishing and exploding gradients is in [43]. Related aspects of training, including weight initialization, are discussed in chapter 7 of [48]. The Xavier initialization technique was introduced in [15]. Related initialization techniques and their analysis are studied in [19]. The idea of batch normalization was initially introduced in [26] and analyzed in [3] and [25]. Since then, batch normalization has been extended in several ways, including instance normalization in [56]. A survey is in [23]. Our analysis of backpropagation of batch normalization parameters is from [26].
Dropout was initially introduced in [21] and [54]. Analysis of dropout as an ensemble approximation is in [17]. See also [35] for an overview of ensemble methods. Further study of dropout and its implications is in [14]. Regularization in feedforward networks was reviewed in [31] and analysis of weight decay is in [30] and [37].
Bibliography
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
[2] S. Amari. A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, 1967.
[3] S. Arora, Z. Li, and K. Lyu. Theoretical analysis of auto rate-tuning by batch normalization. arXiv:1812.03981, 2018.
[4] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research, 2018.
[5] Y. Bengio. Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2009.
[6] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 2006.
[7] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: Composable Transformations of Python+NumPy Programs. http://github.com/google/jax, 2018.
[8] D. C. Cireșan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 2010.
[9] N. Cohen, O. Sharir, and A. Shashua. On the Expressive Power of Deep Learning: A Tensor Analysis, 2016.
[10] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 1989.
[11] S. R. Dubey, S. K. Singh, and B. B. Chaudhuri. Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing, 2022.
[12] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, 2016.
[13] K. Fukushima. Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics, 1969.
[14] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.
[15] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
[16] G. H. Golub and C. F. Van Loan. Matrix Computations. JHU Press, 2013.
[17] K. Hara, D. Saitoh, and H. Shouno. Analysis of dropout learning regarded as ensemble learning. In Artificial Neural Networks and Machine Learning – ICANN 2016: 25th International Conference on Artificial Neural Networks, Barcelona, Spain, September 6-9, 2016, Proceedings, Part II 25, 2016.
[18] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 1998.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
[20] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
[21] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
[22] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 1991.
[23] L. Huang, J. Qin, Y. Zhou, F. Zhu, L. Liu, and L. Shao. Normalization techniques in training DNNs: Methodology, analysis and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[24] M. Innes. Flux: Elegant machine learning with Julia. Journal of Open Source Software, 2018.
[25] S. Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. Advances in Neural Information Processing Systems, 2017.
[26] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
[27] A. G. Ivakhnenko. Polynomial theory of complex systems. IEEE Transactions on Systems, Man, and Cybernetics, 1971.
[28] A. G. Ivakhnenko and V. G. Lapa. Cybernetic Predicting Devices. Purdue Univ Lafayette Ind School of Electrical Engineering, appearing in The Defense Technical Information Center, 1966.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 2012.
[30] A. Krogh and J. Hertz. A Simple Weight Decay Can Improve Generalization. In Advances in Neural Information Processing Systems, 1991.
[31] J. Kukačka, V. Golkov, and D. Cremers. Regularization for deep learning: A taxonomy. arXiv:1710.10686, 2017.
[32] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 2009.
[33] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
[34] H. W. Lin, M. Tegmark, and D. Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 2017.
[35] A. Lindholm, N. Wahlström, F. Lindsten, and T. B. Schön. Machine Learning - A First Course for Engineers and Scientists. Cambridge University Press, 2022.
[36] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970.
[37] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017.
[38] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning, 2013.
[39] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 1943.
[40] G. Menghani. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Computing Surveys, 2023.
[41] M. Minsky and S. A. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969.
[42] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, 2010.
[43] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 2013.
[44] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
[45] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook. Technical University of Denmark, 2012.
[46] T. Poggio, A. Banburski, and Q. Liao. Theoretical issues in deep networks. Proceedings of the National Academy of Sciences, 2020.
[47] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and When Can Deep but Not Shallow Networks Avoid the Curse of Dimensionality: a Review, 2017.
[48] S. J. D. Prince. Understanding Deep Learning. MIT Press, 2023.
[49] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
[50] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986.
[51] J. Schmidhuber. Annotated history of modern AI and deep learning. arXiv:2212.11279, 2022.
[52] T. J. Sejnowski. The Deep Learning Revolution. MIT Press, 2018.
[53] R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv:1703.00810, 2017.
[54] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014.
[55] M. Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, 2016.
[56] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022, 2016.
[57] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In System Modeling and Optimization: Proceedings of the 10th IFIP Conference, 1981, 2005.