Mathematical Engineering
of Deep Learning
Benoit Liquet, Sarat Moka and Yoni Nazarathy
March 3, 2026
5 Feedforward Deep Networks
We now enter the heart of deep learning by presenting key concepts for feedforward deep neural networks, also known as general fully connected neural networks. These models are a powerful extension of the shallow neural networks presented previously and are the central entities of deep learning. The key mathematical objects used for such networks are highlighted in this chapter. These include layers with neurons, activation function alternatives, and the backpropagation algorithm for gradient evaluation. A universal approximation theorem is stated, and further demonstrations highlight the benefits of exploiting deep neural networks for machine learning tasks. Key steps for training a deep neural network are presented, including weight initialization and batch normalization. We then focus on key strategies for handling overfitting, namely dropout and the addition of regularization terms.
In Section 5.1 we present details of the general fully connected architecture. In Section 5.2 we explore the expressive power of neural networks to get a feel for why the general fully connected architecture is such a useful machine learning model. Towards that end, we explore a few theoretical underpinnings for the power of deep learning. In Section 5.3 we introduce activation functions beyond the sigmoid and softmax functions which were already presented in Chapter ??. In Section 5.4 we study the backpropagation algorithm used for gradient evaluation. This algorithm, which is a form of backward mode automatic differentiation, is at the core of training deep learning models. In Section 5.5 we study various methods for weight initialization. In Section 5.6 we introduce the idea of batch normalization. In Section 5.7 we explore methods for mitigating overfitting. These include the addition of regularization terms and the concept of dropout.
5.1 The General Fully Connected Architecture
We refer to the generic deep learning model as the general fully connected architecture and
also mention that such a model can be called a fully connected network, a feedforward
network, or a dense network where each of these terms may also be augmented with the
phrases “deep”, “neural”, and “general”. Another name for such a model is a multi-layer
perceptron (MLP) or a multi-layer dense network.
This architecture involves multiple layers, each with non-linear transformations, and is an extension of the shallow neural networks presented in Chapter ??. Schematic illustrations of this architecture are presented in Figure 5.1. In that figure, each circle may be called a neuron or unit and each vertical set of neurons is a layer. A network with a single hidden layer is in (a) and a more general multi-layer network is in (b). Compare these illustrations with Figure ?? which has no hidden layers and a single output (logistic regression), or Figure ?? which also has no hidden layers and has multiple outputs (multinomial regression). Returning to Figure 5.1, the leftmost layer is called the input layer and it holds the input features vector $x$. Each element of this layer is sometimes called an input neuron even though
Figure 5.1: Fully connected feedforward neural networks. (a) A network with a single hidden layer.
(b) A deep neural network with multiple hidden layers.
it does not involve any computation. The rightmost layer, or output layer, contains the output neurons (just one in this example). The layers in the middle are called hidden layers, since
the neurons in these layers are neither inputs nor outputs. When using the network in
production for prediction (regression or classification), we do not directly observe what goes
on in the hidden layers, but rather observe network outputs resulting from inputs. However
in certain cases, the values of neurons in hidden layers are also called “features” or more
precisely, extracted features (also known as computed features or derived features) since they
may encode some intermediate summary of the input data.
The design of the input and output layers in a network is often straightforward. There are
as many neurons in the input layer as the number of features. As for the output layer, the
number of output neurons depends on the application. In certain cases the output can be a
scalar, determining either a probability in binary classification, or a response in a regression
problem. In other cases, the output may be a vector, such as for example in multi-class
classification where there will typically be as many output neurons as the number of possible
labels (classes) and the output is a probability vector. In contrast to the input and output
layers, when selecting the number and size of the hidden layers, there is much room for
variation of model choice.
For example, assume we wish to create a model for classification of a type of plant based on $p = 120$ measured indicators (features). Assume further that our classifier supports 30 different types of plants. Then in this case we may use a model where the input layer has 120 input neurons and the output layer has 30 output neurons which may be interpreted as probabilities, similar to the multinomial regression model of Chapter ??. With this, there still remains a choice of how many hidden layers to use, and how many neurons to use for each of those hidden layers. These choices determine the number of trained parameters, the model expressivity, and the ease or difficulty of training the model.
Finally note that in our terminology, when counting layers in a multi-layer network, we
count hidden layers as well as the output layer, but we do not count an input layer. Hence
with our terminology, Figure 5.1 (a) has 2 layers, and (b) has 4 layers.
A Model Based on Function Composition
The goal of a feedforward network is to approximate some function $f : \mathbb{R}^p \to \mathbb{R}^q$. A feedforward network model defines a mapping $f_\theta : \mathbb{R}^p \to \mathbb{R}^q$ and learns the value of the parameters $\theta$ that ideally result in $f_\theta(x) \approx f(x)$.

The function $f_\theta$ is recursively composed via a chain of functions:

$$ f_\theta(x) = f^{[L]}_{\theta^{[L]}}\Big(f^{[L-1]}_{\theta^{[L-1]}}\big(\ldots\big(f^{[1]}_{\theta^{[1]}}(x)\big)\ldots\big)\Big), \tag{5.1} $$
where $f^{[\ell]}_{\theta^{[\ell]}} : \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}$ is associated with the $\ell$-th layer and depends on parameters $\theta^{[\ell]} \in \Theta^{[\ell]}$, where $\Theta^{[\ell]}$ is the parameter space for the $\ell$-th layer. The depth of the network is $L$. We have that $N_0 = p$ (the number of features) and $N_L = q$ (the number of output variables). Note that in the case of networks used for classification we typically have $q = K$, the number of classes. The number of neurons in the network is $\sum_{\ell=1}^{L} N_\ell$.
Affine Transformations Followed by Activations
In deep learning, the function $f^{[\ell]}_{\theta^{[\ell]}}$ is generally defined by an affine transformation followed by an activation function. Activation functions are the means of introducing non-linearity into the model. The output of layer $\ell$ is represented by the vector $a^{[\ell]}$ and the intermediate result of the affine transformation is represented by the vector $z^{[\ell]}$ (see also Figure ?? in Chapter ??). We typically denote the output of the model via $\hat{y}$ and hence, $\hat{y} = a^{[L]} = f_\theta(x)$.

The action of $f^{[\ell]}_{\theta^{[\ell]}}$ can be schematically represented as follows,

$$ a^{[\ell-1]} \;\longmapsto\; \underbrace{z^{[\ell]} := W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}}_{\text{Affine transformation}} \;\longmapsto\; \underbrace{a^{[\ell]} := S^{[\ell]}\big(z^{[\ell]}\big)}_{\text{Activation}}, \tag{5.2} $$
where $a^{[0]} = x$. Hence the parameters of the $\ell$-th layer, $\theta^{[\ell]}$, are given by the $N_\ell \times N_{\ell-1}$ weight matrix $W^{[\ell]} = \big(w^{[\ell]}_{i,j}\big)$ and the $N_\ell$ dimensional bias vector $b^{[\ell]} = \big(b^{[\ell]}_i\big)$. Thus the parameter space of the layer is $\Theta^{[\ell]} = \mathbb{R}^{N_\ell \times N_{\ell-1}} \times \mathbb{R}^{N_\ell}$.
The activation function $S^{[\ell]} : \mathbb{R}^{N_\ell} \to \mathbb{R}^{N_\ell}$ is a non-linear multivalued function. For $\ell = 1, \ldots, L-1$ it is generally of the form

$$ S^{[\ell]}(z) = \Big[\, \sigma^{[\ell]}(z_1) \;\; \ldots \;\; \sigma^{[\ell]}(z_{N_\ell}) \,\Big], \tag{5.3} $$

where $\sigma^{[\ell]} : \mathbb{R} \to \mathbb{R}$ is typically an activation function common to all hidden layers. For the output layer, $\ell = L$, it is often of a different form depending on the task at hand.

In the popular case of multi-class classification, a softmax function as in (??) is used, or more specifically for binary classification, a sigmoid function as in (??) is typically used; see Chapter ?? for background. Thus, in such a classification framework, the output of the network is a vector of probability values determining class membership. In order to get a class label prediction one can convert the predicted probability scores into a class label using a chosen threshold, or more simply maximum a posteriori probability, as described in (??).
The Forward Pass
The forward pass equation, (5.1), of a deep neural network can be expanded out as follows,

$$
\begin{aligned}
\text{Layer } 1: \quad & z^{[1]} = W^{[1]} \underbrace{x}_{\text{input}} + b^{[1]}, & a^{[1]} &= S^{[1]}\big(z^{[1]}\big) \\
\text{Layer } 2: \quad & z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, & a^{[2]} &= S^{[2]}\big(z^{[2]}\big) \\
& \qquad \vdots \\
\text{Layer } L: \quad & z^{[L]} = W^{[L]} a^{[L-1]} + b^{[L]}, & \underbrace{\hat{y}}_{\text{output}} &= a^{[L]} = S^{[L]}\big(z^{[L]}\big).
\end{aligned} \tag{5.4}
$$
Thus, for a given input $x$, when computing $f_\theta(x)$, we sequentially execute the affine transformations and activation functions from layer 1 to layer $L$. The computational cost of such a forward pass is of the order of the total number of weights. To see this, observe that at the $\ell$-th layer, the costs to compute $z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}$ and $a^{[\ell]} = S^{[\ell]}\big(z^{[\ell]}\big)$ are of the order $N_\ell \times N_{\ell-1} + N_\ell$ and $N_\ell$, respectively. Hence, the total computational cost is of the order $\sum_{\ell=1}^{L} N_\ell (N_{\ell-1} + 2) \approx \sum_{\ell=1}^{L} N_\ell N_{\ell-1}$, which is the total number of weights.
An Example with Concrete Dimensions
As an illustration, let us return to the networks depicted in Figure 5.1. Consider the network in (a) with one hidden layer. Here $N_0 = p = 4$, $N_1 = 5$, and $N_2 = q = 1$. Hence the dimension of $W^{[1]}$ is $5 \times 4$, the dimension of $b^{[1]}$ is 5, the dimension of $W^{[2]}$ is $1 \times 5$, and the dimension
of $b^{[2]}$ is 1. Putting these elements together we obtain,

$$ f_\theta(x) = \underbrace{S^{[2]}\Big(\underbrace{W^{[2]} \underbrace{S^{[1]}\big(W^{[1]} x + b^{[1]}\big)}_{a^{[1]}} + b^{[2]}}_{z^{[2]}}\Big)}_{a^{[2]}}, \tag{5.5} $$

where the number of parameters in $\theta$ is $5 \times 4 + 5 + 1 \times 5 + 1 = 31$. Similarly, the deeper network on the right of Figure 5.1 is represented via,

$$ f_\theta(x) = S^{[4]}\Big(W^{[4]} S^{[3]}\big(W^{[3]} S^{[2]}\big(W^{[2]} S^{[1]}\big(W^{[1]} x + b^{[1]}\big) + b^{[2]}\big) + b^{[3]}\big) + b^{[4]}\Big). \tag{5.6} $$
We may work out that the number of parameters is,

$$ \underbrace{4 \times 4 + 4}_{\text{Hidden layer 1}} \;+\; \underbrace{3 \times 4 + 3}_{\text{Hidden layer 2}} \;+\; \underbrace{5 \times 3 + 5}_{\text{Hidden layer 3}} \;+\; \underbrace{1 \times 5 + 1}_{\text{Output layer}} \;=\; 61. $$
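These counts follow mechanically from the layer dimensions: layer $\ell$ contributes an $N_\ell \times N_{\ell-1}$ weight matrix plus an $N_\ell$ bias vector. A small helper (a sketch; `num_params` is our own name, not from the text) reproduces both counts:

```python
def num_params(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes = [N_0, N_1, ..., N_L]; each layer contributes an
    (N_ell x N_{ell-1}) weight matrix plus an N_ell bias vector.
    """
    return sum(n_out * n_in + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Network (a) of Figure 5.1: 4 inputs, one hidden layer of 5, one output.
assert num_params([4, 5, 1]) == 31
# Network (b): hidden layers of sizes 4, 3, 5.
assert num_params([4, 4, 3, 5, 1]) == 61
```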
The Scalar Based View of the Model
It is also instructive to consider the scalar view of the system. The $i$-th neuron of layer $\ell$, with $i = 1, \ldots, N_\ell$, is typically composed of both $z^{[\ell]}_i$ and $a^{[\ell]}_i$. The transition from layer $\ell - 1$ to layer $\ell$ takes the output of layer $\ell - 1$, an $N_{\ell-1}$ dimensional vector, and operates on it as follows,

$$
\text{Affine transformation:} \quad
z^{[\ell]}_1 = w^{[\ell]}_{(1)} a^{[\ell-1]} + b^{[\ell]}_1, \quad
z^{[\ell]}_2 = w^{[\ell]}_{(2)} a^{[\ell-1]} + b^{[\ell]}_2, \quad
\ldots, \quad
z^{[\ell]}_{N_\ell} = w^{[\ell]}_{(N_\ell)} a^{[\ell-1]} + b^{[\ell]}_{N_\ell},
$$

$$
\text{Activation step:} \quad
a^{[\ell]}_1 = \sigma\big(z^{[\ell]}_1\big), \quad
a^{[\ell]}_2 = \sigma\big(z^{[\ell]}_2\big), \quad
\ldots, \quad
a^{[\ell]}_{N_\ell} = \sigma\big(z^{[\ell]}_{N_\ell}\big), \tag{5.7}
$$

where,

$$ w^{[\ell]}_{(j)} = \Big[\, w^{[\ell]}_{j,1} \;\; \ldots \;\; w^{[\ell]}_{j,N_{\ell-1}} \,\Big], \qquad \text{for } j = 1, \ldots, N_\ell, $$

is the $j$-th row of the weight matrix $W^{[\ell]}$, and $b^{[\ell]}_j$ is the $j$-th element of the bias vector $b^{[\ell]}$. Hence the parameters associated with neuron $j$ in layer $\ell$ are $w^{[\ell]}_{(j)}$ and $b^{[\ell]}_j$.
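The scalar view and the matrix view are, of course, the same computation. A quick check (illustrative only; the variable names are ours) confirms that computing each $z^{[\ell]}_j$ from its row $w^{[\ell]}_{(j)}$ agrees with the matrix product:

```python
import numpy as np

rng = np.random.default_rng(1)
N_prev, N = 3, 4                       # layer sizes N_{ell-1} and N_ell
W = rng.normal(size=(N, N_prev))       # weight matrix W^{[ell]}
b = rng.normal(size=N)                 # bias vector b^{[ell]}
a_prev = rng.normal(size=N_prev)       # output of the previous layer

# Matrix form of the affine transformation.
z_vec = W @ a_prev + b

# Scalar form: neuron j uses the j-th row w_(j) of W and the bias b_j.
z_scalar = np.array([W[j, :] @ a_prev + b[j] for j in range(N)])

assert np.allclose(z_vec, z_scalar)
```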
Vectorizing Across Multiple Samples
So far we have defined our neural network using only one input feature vector $x$ to generate a prediction $\hat{y}$. Namely, $x \to a^{[L]} = \hat{y}$. Let us now consider mini-batches as introduced in Section ??. Here we use $n_b$ training samples $x^{(1)}, \ldots, x^{(n_b)}$, where $n_b$ is the size of the mini-batch and $x^{(i)} \in \mathbb{R}^p$. We can use the
feedforward pass equation to get for each training sample $x^{(i)}$ a prediction,

$$ x^{(i)} \to a^{[L](i)} = \hat{y}^{(i)}, \qquad i = 1, \ldots, n_b. $$

One can clearly iterate using a loop for getting all the predictions. However, a matrix representation is often useful for describing the prediction of the whole mini-batch together. This is particularly useful in GPU implementations when the mini-batch size, $n_b$, is appropriately chosen to fit GPU memory. Let us first define the matrix $X$ in which every column is a feature vector for one training sample:

$$ X = \begin{bmatrix} | & | & & | \\ x^{(1)} & x^{(2)} & \ldots & x^{(n_b)} \\ | & | & & | \end{bmatrix}. \tag{5.8} $$
Then, for each $\ell$, we define the matrix $Z^{[\ell]}$ with columns $z^{[\ell](1)}, \ldots, z^{[\ell](n_b)}$ as,

$$ Z^{[\ell]} = \begin{bmatrix} | & | & & | \\ z^{[\ell](1)} & z^{[\ell](2)} & \ldots & z^{[\ell](n_b)} \\ | & | & & | \end{bmatrix}. $$

The activation matrix $A^{[\ell]}$ for each layer is defined similarly via,

$$ A^{[\ell]} = \begin{bmatrix} | & | & & | \\ a^{[\ell](1)} & a^{[\ell](2)} & \ldots & a^{[\ell](n_b)} \\ | & | & & | \end{bmatrix}, $$

where for example the element in the first row and second column of the matrix $A^{[\ell]}$ is the activation of the first hidden unit of layer $\ell$ for the second training example. Based on this matrix notation and the forward pass representation (5.4), we get,

$$
\begin{aligned}
Z^{[1]} &= W^{[1]} X + B^{[1]}, & A^{[1]} &= \sigma^{[1]}\big(Z^{[1]}\big) \\
Z^{[2]} &= W^{[2]} A^{[1]} + B^{[2]}, & A^{[2]} &= \sigma^{[2]}\big(Z^{[2]}\big) \\
& \qquad \vdots \\
Z^{[L]} &= W^{[L]} A^{[L-1]} + B^{[L]}, & \underbrace{A^{[L]}}_{[\hat{y}^{(1)}, \ldots, \hat{y}^{(n_b)}]} &= S^{[L]}\big(Z^{[L]}\big)
\end{aligned} \tag{5.9}
$$
where the scalar activation functions $\sigma^{[\ell]}(\cdot)$, as in (5.3), are applied independently to each element of the matrix $Z^{[\ell]}$ for $\ell = 1, \ldots, L-1$, while $S^{[L]}(\cdot)$ is applied independently to each column of $Z^{[L]}$. For this representation, the matrix $B^{[\ell]}$ for each layer $\ell$ is based on the bias vector $b^{[\ell]}$ and is constructed via,

$$ B^{[\ell]} = \begin{bmatrix} | & | & & | \\ b^{[\ell]} & b^{[\ell]} & \ldots & b^{[\ell]} \\ | & | & & | \end{bmatrix}. \tag{5.10} $$
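In array libraries the matrix $B^{[\ell]}$ of (5.10) need not be materialized: adding the bias as a column vector broadcasts it across the $n_b$ columns. A minimal sketch of the batched pass (5.9), with names of our own choosing and tanh used for every layer purely for illustration:

```python
import numpy as np

def forward_batch(X, weights, biases, act):
    """Forward pass over a mini-batch stored column-wise, as in (5.9).

    X has shape (p, n_b): column i is sample x^{(i)}. Adding the bias
    as a (N_ell, 1) column broadcasts it over columns, playing the
    role of B^{[ell]} in (5.10) without building that matrix.
    """
    A = X
    for W, b in zip(weights, biases):
        Z = W @ A + b[:, None]
        A = act(Z)
    return A

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 8))                     # p = 4, n_b = 8
Ws = [rng.normal(size=(5, 4)), rng.normal(size=(1, 5))]
bs = [rng.normal(size=5), rng.normal(size=1)]

Y_batch = forward_batch(X, Ws, bs, np.tanh)
# Column i of the batched result equals the single-sample forward pass.
y0 = np.tanh(Ws[1] @ np.tanh(Ws[0] @ X[:, 0] + bs[0]) + bs[1])
assert np.allclose(Y_batch[:, 0], y0)
```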
An Overview of Model Training
Training of feedforward deep neural networks follows the same iterative optimization based learning paradigm presented in Section ?? of Chapter ??. This paradigm was also applied to simple neural networks in Chapter ??. Specifically, a standardized training dataset $\mathcal{D} = \big\{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\big\}$ is used to find network parameters $\theta$ (weights and biases) that minimize a loss function $C(\theta; \mathcal{D})$. As illustrated in previous chapters, common loss functions include the cross entropy loss (binary or categorical) in the context of classification or the square loss function in the context of regression.
In practice, gradient based optimization generalizing the gradient descent of Algorithm ?? is the typical technique for training. Further, as discussed in detail in Chapter ??, a very common variant is ADAM, presented in Algorithm ??. In any case, such learning algorithms require the evaluation of gradients, e.g., Step 3 of Algorithm ?? or Step 4 of Algorithm ??. For deep learning models, such gradient evaluation is carried out via the backpropagation algorithm which we study in detail in Section 5.4 below.
Another important aspect of iterative optimization involves parameter initialization. This is Step 1 of Algorithm ?? or Step 2 of Algorithm ??. In the context of deep learning this is called weight initialization, a topic we discuss in Section 5.5.
Finally, other aspects of training deep neural networks include batch normalization presented
in Section 5.6 and dropout presented in Section 5.7, as well as other methods of regularization
also presented in that section.
5.2 The Expressive Power of Neural Networks
Deep neural network models are extremely expressive and versatile. Research about their
strength dates all the way back to the early days of artificial intelligence and continues in
current times. We now explore the expressivity of such models where we simply touch the
tip of the iceberg, attempting to illustrate why the model is sensible and versatile.
Neural networks are known for being able to approximate arbitrarily complex functions. Our exposition in this section aims to illustrate this while also presenting intuition about the benefits of deep models. We begin with a simple constructive example for scalar real valued functions, and then progress to an overview of key results and intuition which highlight why these models are so powerful.
Simple Function Approximation
We now see one possible way to approximate scalar functions via a neural network. Say we have a function $f : [-l, l] \to \mathbb{R}$ and we are given the values of the function, $v_i = f(u_i)$, at points $u_1, \ldots, u_r$ with $-l \le u_1 < u_2 < \ldots < u_r \le l$. See for example Figure 5.2, focusing on this arbitrary example,

$$ f(x) = \cos\Big(2 \cos\Big(\frac{x}{2}\Big)\Big) + \frac{1}{4}(x - 1)^2 + \frac{x}{2} + 1, \qquad \text{for } x \in [0, 4]. \tag{5.11} $$
Approximation functions obtained for $f(\cdot)$ of (5.11) are also illustrated in Figure 5.2, where the case of $r = 10$ is in (b) and the case of $r = 30$ is in (c). Our approximation of $f(\cdot)$ uses a feedforward neural network $f_\theta(\cdot)$ as illustrated in (a). In this case we apply a construction of a piecewise continuous affine function $f_\theta : \mathbb{R} \to \mathbb{R}$ with breakpoints at $u_1, \ldots, u_r$ and $f_\theta(u_i) = f(u_i)$ for $i = 1, \ldots, r$. For completeness, we now present the details.
Our constructed network has one hidden layer ($L = 2$), and model dimensions $N_0 = 1$, $N_1 = r - 1$, and $N_2 = 1$. In this case, the approximation is obtained by setting the scalar activation functions as in (5.3) to be $\sigma^{[1]}(u) = \max(u, 0)$ and $\sigma^{[2]}(u) = u$, where the former is called the ReLU function, see also (5.17), and the latter is simply the identity function. We set the first $N_1 \times 1$ dimensional weight matrix, $W^{[1]}$, to have all entries 1; the first $N_1$ dimensional bias vector to have entries $b^{[1]}_i = -u_i$; the second $1 \times N_1$ dimensional weight matrix is set with entries $w^{[2]}_{1,i} = s_i - s_{i-1}$ for $i = 1, \ldots, r - 1$, with $s_0 = 0$ and

$$ s_i = \frac{v_{i+1} - v_i}{u_{i+1} - u_i}; $$

and finally the second scalar bias $b^{[2]}$ is set to be $v_1$.
In summary, the construction uses a linear combination of shifted ReLU functions (see also Figure 5.6), where each shifted ReLU (shifted by $u_i$) is constructed to match the interval $[u_i, u_{i+1}]$ with a slope $s_i$. As the construction moves from left to right, slopes are cancelled out via subtraction of the $s_{i-1}$ term in $w^{[2]}_{1,i}$.
With such a construction we see that essentially any continuous real valued function on a bounded domain can be approximated via a neural network.¹ It is evident that as $r \to \infty$ the approximation becomes exact, and this can be made rigorous for any continuous target function $f(\cdot)$ on a bounded interval.
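The construction above can be checked numerically. The sketch below (our own code, following the weights described in the text) builds the ReLU network for the target (5.11) and verifies that it interpolates $f$ at every breakpoint $u_i$:

```python
import numpy as np

def f(x):
    # The target function (5.11).
    return np.cos(2 * np.cos(x / 2)) + 0.25 * (x - 1) ** 2 + x / 2 + 1

# Sample r points on [0, 4] and build the construction from the text:
# W^{[1]} all ones, b^{[1]}_i = -u_i, second-layer weights s_i - s_{i-1}
# (with s_0 = 0), and second bias v_1.
r = 10
u = np.linspace(0.0, 4.0, r)
v = f(u)
s = np.diff(v) / np.diff(u)                 # slopes s_1, ..., s_{r-1}
w2 = np.diff(np.concatenate(([0.0], s)))    # s_i - s_{i-1}

def f_theta(x):
    hidden = np.maximum(x - u[:-1], 0.0)    # r - 1 shifted ReLU units
    return w2 @ hidden + v[0]

# The network matches f exactly at every breakpoint u_i.
errs = [abs(f_theta(ui) - f(ui)) for ui in u]
```

Between breakpoints the network is affine, so increasing `r` tightens the approximation everywhere on the interval.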
A General Approximation Result
The expressive power of feedforward neural networks generalizes to functions that have $p$ inputs and $q$ outputs. In fact, as the result below shows, similarly to the construction above, a single hidden layer ($L = 2$) with identity activations in the second layer suffices for this.
Theorem 5.1. Consider a continuous function $f : K \to \mathbb{R}^q$ where $K \subseteq \mathbb{R}^p$ is a compact set. Then for any non-polynomial activation function $\sigma^{[1]}(\cdot)$ and any $\varepsilon > 0$, there exists an $N_1$ and parameters $W^{[1]} \in \mathbb{R}^{N_1 \times p}$, $b^{[1]} \in \mathbb{R}^{N_1}$, and $W^{[2]} \in \mathbb{R}^{q \times N_1}$, such that the function

$$ f_\theta(x) = W^{[2]} S^{[1]}\big(W^{[1]} x + b^{[1]}\big), \qquad \text{with } S^{[1]} \text{ as in (5.3)}, $$

satisfies $\|f_\theta(x) - f(x)\| < \varepsilon$ for all $x \in K$.
Hence this theorem states that essentially all functions can be approximated to arbitrary precision dictated via $\varepsilon$. Practically, for complicated functions $f(\cdot)$ and small $\varepsilon$ one may need a large $N_1$. Yet, the theorem states that it is always possible. A reference to a proof is provided in the notes and references section at the end of this chapter. The constructive example in Figure 5.2 above (for $p = 1$ and $q = 1$) may hint at the validity of the result. Note also that the tanh, sigmoid, and ReLU activation functions described in the next section are some of the valid activation functions for this result.
¹ This construction is one of many options one could use.
Figure 5.2: Approximation of an arbitrary real valued function on a bounded domain via a piecewise continuous function obtained by a single hidden layer feedforward model with ReLU activation functions. (a) A neural network with one hidden layer that constructs such an approximation. (b) Approximation based on $r = 10$ sampled points of the function. (c) Approximation based on $r = 30$ sampled points of the function.
The Strength of a Hidden Layer
As we saw in Chapter ??, shallow neural networks such as logistic regression or softmax regression can be used to create classifiers with linear decision boundaries. Further, as in Section ??, for cases where more general decision boundaries are needed, one may attempt to create additional transformed features while still using these basic models. However, the expressiveness of models with a single hidden layer (or more), as introduced in the current chapter, can yield a versatile alternative to the shallow networks of Chapter ??.
Figure 5.3: Binary classification example with $x \in \mathbb{R}^2$, moving from a shallow neural network to a model with a hidden layer and then increasing the number of neurons. (a) Sigmoid model ($L = 1$). (b) One hidden layer ($L = 2$) with $N_1 = 4$ neurons. (c) One hidden layer ($L = 2$) with $N_1 = 10$ neurons.
Consider Figure 5.3 (a) for a classification task based on two inputs $x \in \mathbb{R}^2$ using logistic regression. By adding a single hidden layer with 4 neurons (a sigmoid activation function is used for all units), we can move beyond linear boundaries to obtain Figure 5.3 (b). Then, by increasing the number of neurons in the hidden layer from 4 to 10 units, the model further refines the decision boundary as in Figure 5.3 (c).
We thus see that obtaining non-linear decision boundaries is possible not only via feature engineering as discussed in Section ?? and illustrated there in Figure ??, but can also be done via neural networks as hinted by Theorem 5.1 and illustrated here in Figure 5.3.
Stylized Functions via Simple Models
To further appreciate the expressive power of feedforward neural networks we also consider
specific stylized functions. For example, historically, in the study of neural networks much
effort has gone into characterizing the set of logical functions, sometimes called gates, that
may be described via certain classes of models. In our exploration we focus on a different
type of specific task, multiplication of two inputs, and demonstrate a simple constructive
network that approximates this task. We thus construct what may be called a multiplication
gate.
A simple construction of a single hidden layer network with $p = 2$ and $q = 1$ allows us to create a function $f_\theta(\cdot)$, parametrized by $\lambda > 0$, such that for input $x = (x_1, x_2)$, the function approximately implements multiplication of the inputs,

$$ f_\theta(x_1, x_2) \approx x_1 x_2. \tag{5.12} $$
Importantly, the approximation error vanishes as $\lambda \to 0$, and achieving (5.12) to arbitrary accuracy requires only $N_1 = 4$ neurons in the single hidden layer. Similarly to the model in Theorem 5.1 and to the simple function approximation example of Figure 5.2, the activation function of the output layer is the identity. There are no bias terms, and the weight matrices are,

$$ W^{[1]} = \begin{bmatrix} \lambda & \lambda \\ -\lambda & -\lambda \\ \lambda & -\lambda \\ -\lambda & \lambda \end{bmatrix}, \qquad \text{and} \qquad W^{[2]} = \begin{bmatrix} \mu & \mu & -\mu & -\mu \end{bmatrix}, \tag{5.13} $$

with $\mu = \big(4 \lambda^2 \ddot{\sigma}(0)\big)^{-1}$. Here $\ddot{\sigma}(0)$ represents the second derivative of the scalar activation function of the hidden layer ($\ell = 1$) at 0. Hence the model assumes $\sigma^{[1]}(\cdot)$ is twice differentiable (at 0) with a non-zero second derivative at zero. A schematic representation of $f_\theta(\cdot)$ is in Figure 5.4.
Figure 5.4: A simple neural network model with a single hidden layer composed of four neurons. This network approximates multiplication, or a multiplication gate, in the sense that $\hat{y} \approx x_1 x_2$.
We may verify that the model $f_\theta(\cdot)$ evaluates to,

$$ f_\theta(x_1, x_2) = \frac{\sigma\big(\lambda(x_1 + x_2)\big) + \sigma\big(-\lambda(x_1 + x_2)\big) - \sigma\big(\lambda(x_1 - x_2)\big) - \sigma\big(-\lambda(x_1 - x_2)\big)}{4 \lambda^2 \ddot{\sigma}(0)}, \tag{5.14} $$
where we use $\sigma(\cdot)$ to denote $\sigma^{[1]}(\cdot)$. We may now use a Taylor expansion (see Theorem ?? in Appendix ??) of $\sigma(\cdot)$ around the origin,

$$ \sigma(u) = \sigma(0) + \dot{\sigma}(0)\, u + \ddot{\sigma}(0)\, \frac{u^2}{2} + O\big(u^3\big), \tag{5.15} $$

with $O(h^k)$ denoting a function such that $O(h^k)/h^k$ goes to a constant as $h \to 0$. We can now use (5.14) and (5.15) to represent the model as,
$$ f_\theta(x_1, x_2) = x_1 x_2 \Big(1 + O\big(\lambda(x_1^2 + x_2^2)\big)\Big). $$

Hence as $\lambda \to 0$ the desired goal (5.12) becomes exact. Note that this continuous multiplication gate is mostly a theoretical construct, and for many popular activation functions the condition $\ddot{\sigma}(0) \neq 0$ does not hold. Nevertheless, this problem can be overcome by introducing biases to shift the origin and hence use a different input into the activation. Below we use this construction to further argue about the power of deep learning models.
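As a quick numerical sanity check, we can instantiate the construction with an activation that satisfies $\ddot{\sigma}(0) \neq 0$. The softplus function $\sigma(u) = \ln(1 + e^u)$ works since $\ddot{\sigma}(0) = 1/4$; this choice and the function names are ours, for illustration only:

```python
import math

def softplus(u):
    return math.log1p(math.exp(u))

def mult_gate(x1, x2, lam=1e-2):
    """The four-unit multiplication gate (5.13)-(5.14) with softplus.

    For softplus the second derivative at 0 is 1/4, so
    mu = 1 / (4 * lam**2 * (1/4)) = 1 / lam**2.
    """
    sigma_dd0 = 0.25
    num = (softplus(lam * (x1 + x2)) + softplus(-lam * (x1 + x2))
           - softplus(lam * (x1 - x2)) - softplus(-lam * (x1 - x2)))
    return num / (4 * lam**2 * sigma_dd0)

# With a small lambda the output is close to the exact product.
approx = mult_gate(1.5, -2.0)   # close to -3.0
```

Shrinking `lam` further reduces the error, matching the $\lambda \to 0$ limit above.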
Feature Focus with Neural Networks
We now contrast the usage of feedforward neural networks with the more classic practice of feature engineering, where one would consider the data and add additional features by transforming existing features. See also Section ?? where feature engineering is illustrated for simple neural networks. As an example we consider a case where there are originally $p$ features $x_1, \ldots, x_p$ and we wish to construct $\tilde{p} = p(p+1)/2$ features based on all possible pairwise interactions² (multiplications) $x_i x_j$ for $i, j = 1, \ldots, p$. For instance if $p = 1{,}000$ then we arrive at $\tilde{p} \approx 500{,}000$. Clearly for non-small $p$ we quickly arrive at a huge number of transformed features $\tilde{p}$.
Let us contrast the usage of two alternatives. On the one hand, consider a linear model acting on the transformed features $\tilde{x}$, where for simplicity we ignore the bias (intercept). On the other hand, consider a neural network with a single hidden layer acting on the original features $x$. In the linear model, we have $\tilde{f}_\theta(\tilde{x}) = \tilde{w}^\top \tilde{x}$, where the learned weight vector $\tilde{w}$ has $\tilde{p}$ parameters. In the single hidden layer neural network, there are $p$ inputs, $q = 1$ output, and $N_1$ units in the hidden layer. Thus the number of parameters is $N_1 \times p + N_1 + N_1 + 1$.
It is often the case that not all interactions (product features) are relevant. As an example, let us assume that only a fraction $\alpha$ of the interactions are relevant. We now argue that the neural network model with $N_1$ hidden units sufficiently large, but not necessarily huge, can in principle capture these interactions. To do so, revisit the multiplication example presented above, where 4 hidden units were needed to approximate an interaction (multiplication). While the construction using weight parameters as in (5.13) is artificial, the example hints at the fact that with $N_1 \approx 4 \alpha p$ hidden units we may be able to capture the key interactions. In such a framework, the basic linear model still requires the full set of parameters. With this, compare the number of parameters,

$$ \underbrace{\tfrac{1}{2}\, p(p+1)}_{\text{Linear model}} \qquad \text{vs.} \qquad \underbrace{4 \alpha p(p+2) + 1}_{\text{Neural network}}. $$
² Typically in statistics, interactions refer to terms such as $x_i x_j$ for $i \neq j$. However, in this example we also allow for terms such as $x_i^2$.
Observe that $p^2$ is the dominant term in both models, but for $\alpha < 1/8$ and large $p$, the neural network model has fewer parameters. Keep in mind that often $\alpha$ is very small while $p$ may grow. Returning to the numerical case of $p = 1{,}000$, if $\alpha = 0.02$ (20 significant interactions) then the linear model has an order of $500{,}000$ parameters while the neural network only has an order of $80{,}000$ parameters and is thus parsimonious in comparison to the linear model.
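The two parameter counts above follow from the formulas just compared; a short sketch (function names are ours) makes the comparison concrete for $p = 1{,}000$ and $\alpha = 0.02$:

```python
def linear_model_params(p):
    """Linear model on all p(p+1)/2 pairwise products (no intercept)."""
    return p * (p + 1) // 2

def hidden_layer_params(p, alpha):
    """Single hidden layer network with N_1 = 4*alpha*p hidden units:
    N_1*p weights + N_1 biases + N_1 output weights + 1 output bias."""
    n1 = round(4 * alpha * p)
    return n1 * p + n1 + n1 + 1

p, alpha = 1000, 0.02
lin = linear_model_params(p)          # 500,500 parameters
net = hidden_layer_params(p, alpha)   # about 80,000 parameters
```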
Improvement in Expressivity by an Increase of Depth
Despite the fact that Theorem 5.1 states that almost any function can be approximated using a neural network model with a single hidden layer, practice and research have shown that to gain high expressive power, this model might require a very large number of units ($N_1$ needs to be very large). Hence gaining significant expressive power may require a very large number of parameters. The power of deep learning then arises via repeated composition of non-linear activation functions, that is, via an increase of depth (an increase of $L$).
Note first that if the identity activation function is used in each hidden layer, then the network reduces to a shallow neural network,

$$ f_\theta(x) = S^{[L]}\big(\widetilde{W} x + \tilde{b}\big), $$

where,³

$$ \widetilde{W} = W^{[L]} W^{[L-1]} \cdots W^{[1]}, \qquad \text{and} \qquad \tilde{b} = \sum_{\ell=1}^{L} \Bigg( \prod_{\tilde{\ell} = \ell+1}^{L} W^{[\tilde{\ell}]} \Bigg) b^{[\ell]}. $$
In the case where the identity function is also used for the output layer, the model reduces to a linear (affine) model. Thus, we gain nothing by going deeper and adding multiple layers with identity activations. The expressivity of the neural network comes from the composition of non-linear activation functions. The repeated composition of such functions has significant expressive power and can reduce the number of units needed in each layer in comparison to a network with a single hidden layer. A consequence is that the parameter space is reduced as well.
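The collapse of identity-activation layers into a single affine map can be verified directly. The sketch below (our own illustrative code) runs a three-layer pass with identity activations and checks it against the collapsed $\widetilde{W}$ and $\tilde{b}$:

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = [4, 6, 5, 3]  # N_0, N_1, N_2, N_3 for an L = 3 layer network
Ws = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
bs = [rng.normal(size=sizes[i + 1]) for i in range(3)]
x = rng.normal(size=4)

# Sequential forward pass with identity activations in every layer.
a = x
for W, b in zip(Ws, bs):
    a = W @ a + b

# Collapsed form: W_tilde = W3 W2 W1 and
# b_tilde = W3 W2 b1 + W3 b2 + b3, as in the formula above.
W_tilde = Ws[2] @ Ws[1] @ Ws[0]
b_tilde = Ws[2] @ Ws[1] @ bs[0] + Ws[2] @ bs[1] + bs[2]

assert np.allclose(a, W_tilde @ x + b_tilde)
```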
Let us consider an artificial example to demonstrate why exploiting the depth of the network is crucial for modelling complex relationships between input and output. We revisit the previous example involving models using interactions of order 2 (i.e., $x_i x_j$). Let us consider a higher complexity model by exploiting potential high-order interactions, namely products of $r$ inputs, similarly to the discussion in Section ??.
Consider a fully connected network with $p$ inputs, $q = 1$ output, and $L \ge 2$ layers of the same size ($N^{[\ell]} = N$ in each layer). We re-use the idea appearing in (5.13) that with $N \approx 4 \alpha p$ hidden units we may be able to capture the $\alpha p$ relevant interactions of order $r = 2$. Then by moving forward in the network, the subsequent addition of a layer with $N \approx 4 \alpha p$ hidden units will capture interactions of order $r = 2^2$, and so on, until we capture interactions of order $r = 2^L$ at the output layer. Hence to achieve interactions of order $r$ we may require $L \ge \log_2 r$, e.g., $L = \lceil \log_2 r \rceil$. Such a network is depicted in Figure 5.5, where the number of
³ Note that in the expression $\prod_{\tilde{\ell}=\ell+1}^{L} W^{[\tilde{\ell}]}$ we assume that the product multiplies the matrices in left to right order $W^{[L]} W^{[L-1]} W^{[L-2]} \cdots$. Further, for $\ell = L$ the product is taken as the identity matrix.
parameters is,

$$ \underbrace{N \times p + N}_{\text{First hidden layer}} \;+\; \underbrace{(L-2) \times (N^2 + N)}_{\text{Inner hidden layers}} \;+\; \underbrace{N + 1}_{\text{Output layer}} \;\approx\; L N^2. $$
For example, assume we wish to have a model for $p = 1{,}000$ features that supports about 20 meaningful interactions of order $r = 500$. Hence we can consider $\alpha = 0.02$. With a model not involving hidden layers (e.g., a linear model or a logistic model), as shown in Section ??, we cannot specialize for an order of 20 interactions and thus we require a full model of order $p^r / r! \approx 10^{365}$ parameters. In contrast, with a deep model we require about $L = 10 \ge \log_2 500$ layers and $N = 4 \alpha p = 80$ neurons per layer, and thus the total number of parameters is at the order of $L N^2 = 57{,}600$. This deep construction, which can capture a desired set of meaningful interactions, is clearly more feasible and efficient than the shallow construction of astronomical size.
Figure 5.5: A deep learning model with $L = 10$ hidden layers may express many meaningful interactions for the inputs.
5.3 Activation Function Alternatives
As presented in (5.2), each layer $\ell$ of the network incorporates an activation function, which is generally a non-linear transformation of $z^{[\ell]}$ to arrive at $a^{[\ell]}$. For most layers of the network, the activation function $S^{[\ell]}(\cdot)$ is composed of a sequence of identical scalar valued activation functions as in (5.3). That is, the $\ell$-th layer incorporates a scalar activation function $\sigma^{[\ell]} : \mathbb{R} \to \mathbb{R}$ which is applied to each of the $N_\ell$ coordinates of $z^{[\ell]}$ separately. In some models, all the scalar activation functions across all layers are of the same form, while in other models different layers sometimes incorporate different forms of scalar activation functions.
Scalar Activations and their Derivatives
When it comes to the choice of the scalar activation functions, there are different heuristic considerations that depend on model expressivity and learning ability. These considerations are often not based on theory, yet practice and experience over the years have shown that some scalar activation functions perform much better than others. In Section 5.5 we outline how such choices interface with optimization, and the associated vanishing gradient problem is discussed in Section 5.4. Here we simply outline the key activation functions.
At the onset of the development of deep learning, in the late 1950s, the step scalar activation function was used. That is,
\[
\sigma_{\text{Step}}(u) = \begin{cases} -1 & u < 0, \\ +1 & u \ge 0. \end{cases} \tag{5.16}
\]
However, $\sigma_{\text{Step}}$ is not used in modern neural network models, primarily because its derivative $\dot\sigma_{\text{Step}}(u) = 0$ for all $u \neq 0$. Indeed the derivative of activation functions is important since it is used in the backpropagation algorithm (see the next section) to compute the gradient of the loss function with respect to the model parameters.
The sigmoid activation function (also known as the logistic function), used heavily in Chapter 3, see (??), is a much more popular choice. Note also that its derivative $\dot\sigma_{\text{Sig}}$ can be expressed in terms of the function $\sigma_{\text{Sig}}$ itself,
\[
\sigma_{\text{Sig}}(u) = \frac{e^u}{1 + e^u} = \frac{1}{1 + e^{-u}}, \quad \text{with} \quad \dot\sigma_{\text{Sig}}(u) = \sigma_{\text{Sig}}(u)\bigl(1 - \sigma_{\text{Sig}}(u)\bigr).
\]
A similar popular function is the hyperbolic tangent, denoted tanh,
\[
\sigma_{\text{Tanh}}(u) = \frac{e^u - e^{-u}}{e^u + e^{-u}}, \quad \text{with} \quad \dot\sigma_{\text{Tanh}}(u) = 1 - \sigma_{\text{Tanh}}(u)^2.
\]
Both $\sigma_{\text{Sig}}$ and $\sigma_{\text{Tanh}}$ share similar qualitative properties with $\sigma_{\text{Step}}$. They are non-decreasing and bounded. As $u \to \infty$ both functions converge to unity just like $\sigma_{\text{Step}}$, and as $u \to -\infty$ the sigmoid function converges to 0 while the tanh function converges to $-1$ like $\sigma_{\text{Step}}$. This minor quantitative difference is not significant as it may generally be compensated by learned weights and biases or a shifting and scaling of the functions, yet in practice when the output $\hat{y}$ is a probability in $[0, 1]$ using $\sigma_{\text{Sig}}$ is much more common. See Figure 5.6 for plots of several activation functions.
Figure 5.6: Several common scalar activation functions. (a) The step and tanh functions have a range of $(-1, 1)$ while the sigmoid function has a range of $(0, 1)$. We also plot a sigmoid scaled to the range $(-1, 1)$ so it can be compared to tanh. (b) The ReLU activation function and the leaky ReLU variant with different leaky ReLU parameters.
In earlier applications of deep learning, it was a matter of empirical research, practice, and heuristics to choose between models or layers that use $\sigma_{\text{Sig}}$, $\sigma_{\text{Tanh}}$, or similar forms. However, in more recent years, a completely different type of scalar activation function became popular,
namely the rectified linear unit⁴ or ReLU,
\[
\sigma_{\text{ReLU}}(u) = \max(0, u) = \begin{cases} 0 & u < 0, \\ u & u \ge 0, \end{cases} \quad \text{with} \quad \dot\sigma_{\text{ReLU}}(u) = \mathbb{1}\{u \ge 0\} = \begin{cases} 0 & u < 0, \\ 1 & u \ge 0. \end{cases} \tag{5.17}
\]
Note that while $\sigma_{\text{ReLU}}(u)$ is not differentiable at $u = 0$ we still arbitrarily define the derivative at $u = 0$ to be the same as the derivative for $u > 0$.
As we describe in Section 5.5, when parameters are initialized properly, the unboundedness of $\sigma_{\text{ReLU}}$ often presents an advantage in training over bounded activation functions such as $\sigma_{\text{Sig}}$ since it overcomes a training problem known as the vanishing gradient problem. However, in certain cases it also introduces a problem called dying ReLU since the derivative is 0 for negative inputs. While in practice this is often not considered a major problem, to handle dying ReLU one may use a related activation function, leaky ReLU, parameterized by a fixed small $\varepsilon > 0$ (e.g., $\varepsilon = 0.01$) and defined via,
\[
\sigma_{\text{LeakyReLU}}(u) = \max(0, u) + \min(0, \varepsilon u) = \begin{cases} \varepsilon u & u < 0, \\ u & u \ge 0, \end{cases} \quad \text{with} \quad \dot\sigma_{\text{LeakyReLU}}(u) = \mathbb{1}\{u \ge 0\} + \varepsilon \mathbb{1}\{u < 0\} = \begin{cases} \varepsilon & u < 0, \\ 1 & u \ge 0. \end{cases}
\]
Observe that when $\varepsilon = 0$, this is just the ReLU activation function. Another variant called PReLU (parametric ReLU) considers the leaky ReLU parameter $\varepsilon$ as a learned parameter. That is, the gradient based optimization for the parameters of the network also includes improvement steps of $\varepsilon$, incorporating it as part of the parameters for the $\ell$-th layer, $\theta^{[\ell]}$.
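The scalar activations above and their derivatives translate directly to code. The sketch below (our own NumPy illustration, not from the source) implements each pair and, as a sanity check, verifies the analytic derivatives of sigmoid and tanh against centered finite differences.

```python
import numpy as np

def sigmoid(u):      return 1.0 / (1.0 + np.exp(-u))
def sigmoid_dot(u):  s = sigmoid(u); return s * (1.0 - s)   # sig(u)(1 - sig(u))
def tanh_dot(u):     return 1.0 - np.tanh(u)**2             # 1 - tanh(u)^2
def relu(u):         return np.maximum(0.0, u)
def relu_dot(u):     return np.where(u >= 0.0, 1.0, 0.0)
def leaky_relu(u, eps=0.01):     return np.where(u < 0.0, eps * u, u)
def leaky_relu_dot(u, eps=0.01): return np.where(u < 0.0, eps, 1.0)

# Sanity check: analytic derivatives vs. centered finite differences
u = np.linspace(-3.0, 3.0, 101)
h = 1e-6
assert np.allclose(sigmoid_dot(u), (sigmoid(u + h) - sigmoid(u - h)) / (2 * h), atol=1e-8)
assert np.allclose(tanh_dot(u), (np.tanh(u + h) - np.tanh(u - h)) / (2 * h), atol=1e-8)
```

The same finite-difference check would fail for ReLU only at the single point $u = 0$, where the derivative is set by convention.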
Note that feedforward models that only use piecewise affine scalar activation functions such as the step, ReLU, leaky ReLU, or PReLU have some mathematical appeal. When only such activations are used in a deep learning model, the complete model can be described as a piecewise affine function of the input. That is, the trained model implicitly partitions the input space $\mathbb{R}^p$ into polytopes (regions defined by intersections of half spaces), and for each polytope, when the input $x$ is in that polytope, the response of the model is a fixed affine transformation of $x$. This property is simply a consequence of the fact that a composition of piecewise affine functions remains a piecewise affine function. There are no immediate practical applications for this property, but it further hints at the expressive power of neural network models, complementing the description in Section 5.2.
Also note that many other forms of scalar activation functions have been introduced and experimented with. These include the arctan, softsign, softplus, ELU, SELU, and swish among others. Figure 5.6 presents the key activation functions $\sigma_{\text{Step}}$, $\sigma_{\text{Sig}}$, $\sigma_{\text{Tanh}}$, $\sigma_{\text{ReLU}}$, and $\sigma_{\text{LeakyReLU}}$ (with $\varepsilon = 0.01$), along with a few of these alternatives.
⁴Note that the source of the name is from electrical engineering where a rectifier is a device that converts alternating current to direct current.
Non-Scalar Activations and their Derivatives
Some layers also use non-scalar activation functions. That is, $S^{[\ell]} : \mathbb{R}^{N_\ell} \to \mathbb{R}^{N_\ell}$ is a vector to vector function that cannot be decomposed as in (5.3). The most common example of this is the softmax activation function, typically used for classification in the last layer $\ell = L$. We now denote $N_L = K$ since we often deal with classification of $K$ classes as in Section ??. In such a case,
\[
a^{[L]} = S_{\text{Softmax}}(z^{[L]}). \tag{5.18}
\]
The $\mathbb{R}^K \to \mathbb{R}^K$ softmax activation function is defined as,
\[
S_{\text{Softmax}}(z) = \frac{1}{\sum_{i=1}^K e^{z_i}} \begin{bmatrix} e^{z_1} & \cdots & e^{z_K} \end{bmatrix}^\top,
\]
which was also defined in (??) of Section ??. Note that other examples of non-scalar activations include max pooling layers, introduced in Section ?? of the next chapter.
As will become evident in the next section, denoting the training loss by $C$, the gradient $\partial C / \partial z^{[L]}$ needs to be evaluated as part of the backpropagation algorithm. One approach for this evaluation is
\[
\frac{\partial C}{\partial z^{[L]}} = \underbrace{\frac{\partial a^{[L]}}{\partial z^{[L]}}}_{N_L \times N_L} \underbrace{\frac{\partial C}{\partial a^{[L]}}}_{N_L \times 1},
\]
where we use the multivariate chain rule (see also Appendix ??) and keep in mind that $a^{[L]}$ is a vector, $z^{[L]}$ is a vector, and $C$ is a scalar. With this approach, when the last layer is a softmax as in (5.18), one may essentially need to evaluate $\partial a^{[L]} / \partial z^{[L]}$ which is the transpose of the Jacobian of $S_{\text{Softmax}}(\cdot)$. The elements of this Jacobian can be represented using the (scalar) quotient rule for differentiation, and are,
\[
[J_{S_{\text{Softmax}}}(z)]_{ij} = \frac{\partial}{\partial z_j}\, \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}} = \begin{cases} [S_{\text{Softmax}}(z)]_i \bigl(1 - [S_{\text{Softmax}}(z)]_i\bigr), & i = j, \\[4pt] -[S_{\text{Softmax}}(z)]_i \, [S_{\text{Softmax}}(z)]_j, & i \neq j. \end{cases} \tag{5.19}
\]
However, one rarely uses (5.19) since in the typical case where the loss function is the cross entropy loss (see also Section ??), we have a direct expression for $\partial C / \partial z^{[L]}$. To see this, suppose the label $y$ equals $k$, an element from $\{1, \ldots, K\}$. In this case, as was shown in (??) from Section ??, the loss for a specific observation is,
\[
C = -\log\,[S_{\text{Softmax}}(z^{[L]})]_k = \log \sum_{i=1}^K e^{z_i^{[L]}} - z_k^{[L]}.
\]
Now to obtain $\partial C / \partial z^{[L]}$ we compute the derivative with respect to $z_j^{[L]}$ for every $j = 1, \ldots, K$. This yields,
\[
\frac{\partial C}{\partial z_j^{[L]}} = \frac{e^{z_j^{[L]}}}{\sum_{i=1}^K e^{z_i^{[L]}}} - \mathbb{1}\{j = k\} = [S_{\text{Softmax}}(z^{[L]})]_j - \mathbb{1}\{j = k\}.
\]
Hence in this case the direct expression is,
\[
\frac{\partial C}{\partial z^{[L]}} = S_{\text{Softmax}}(z^{[L]}) - e_k, \tag{5.20}
\]
where $e_k$ is the $K$-dimensional vector with 1 in the $k$-th position and 0 in all other coordinates. Note that in practice when using deep learning frameworks, it is often recommended to use (5.20) directly in such a case.
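The two routes to $\partial C/\partial z^{[L]}$ agree, and this is easy to confirm numerically. The sketch below (ours) builds the softmax Jacobian as in (5.19), multiplies it by $\partial C/\partial a^{[L]}$ for the cross entropy loss, and compares with the direct expression (5.20).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
K, k = 5, 2                              # K classes, true label k
z = rng.normal(size=K)
s = softmax(z)

# Jacobian of softmax: diagonal s_i(1 - s_i), off-diagonal -s_i s_j, as in (5.19)
J = np.diag(s) - np.outer(s, s)

# dC/da for the cross entropy loss C = -log a_k
dC_da = np.zeros(K)
dC_da[k] = -1.0 / s[k]

indirect = J @ dC_da                     # chain rule route via (5.19)
direct = s.copy(); direct[k] -= 1.0      # direct route (5.20): softmax(z) - e_k
assert np.allclose(indirect, direct)
```

The direct route avoids forming the $K \times K$ Jacobian altogether, which is why frameworks prefer it.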
5.4 The Backpropagation Algorithm
Now that we understand the model, we focus on gradient computation so as to facilitate learning using variants of gradient descent as covered in Chapter ??. A key algorithm is the backpropagation algorithm which implements backward mode automatic differentiation, overviewed in Section ?? of Chapter ??. We now build the related backpropagation algorithm of deep learning. We start with a general recursive model and then specialize to feedforward neural networks where the parameters are weights and biases.
Backpropagation for the General Recursive Model
It is instructive to first consider a general recursive feedforward model as appearing in (5.1). For such a model, the recursive step is of the form $a^{[\ell]} = f^{[\ell]}_{\theta^{[\ell]}}(a^{[\ell-1]})$. However, in this section it is convenient to use notation that treats the function $f^{[\ell]}$ separately as a function of $a^{[\ell-1]}$ and of the parameter $\theta^{[\ell]}$:
\[
f^{[\ell]}(\,\cdot\, ; \theta^{[\ell]}) : \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}, \quad \text{and} \quad f^{[\ell]}(a^{[\ell-1]} ; \,\cdot\,) : \Theta^{[\ell]} \to \mathbb{R}^{N_\ell}.
\]
Using this notation, the recursive step is,
\[
a^{[\ell]} = f^{[\ell]}(a^{[\ell-1]} ; \theta^{[\ell]}), \quad \text{for } \ell = 1, \ldots, L, \tag{5.21}
\]
where $a^{[0]} = x$ and $\hat{y} = a^{[L]}$. Given a single data sample $(x, y)$ we assume there is a loss function which depends on the given parameters $\theta$, on the label value $y$, and on the output of the model, $a^{[L]}$. We denote this loss via $C(a^{[L]}, y \,;\, \theta)$.
Our purpose is to optimize the loss with respect to $\theta = (\theta^{[1]}, \ldots, \theta^{[L]})$. For this, we require the gradient with respect to $\theta$ and we denote its components via,
\[
g_\theta^{[\ell]} := \frac{\partial C(a^{[L]}, y \,;\, \theta)}{\partial \theta^{[\ell]}}. \tag{5.22}
\]
A key aspect of the automatic differentiation setup is that we are presented with computable expressions, or code, for evaluation of certain derivatives. In our context these are the derivative of the loss $C$ with respect to $a^{[L]}$, and the derivatives of $f^{[\ell]}(\,\cdot\, ; \,\cdot\,)$ with respect to the input arguments (both layer input and parameters). These known derivatives are denoted via,
\[
\dot{C}(u) := \frac{\partial C(u, y \,;\, \theta)}{\partial u}, \qquad \dot{f}_a^{[\ell]}(u) := \frac{\partial f^{[\ell]}(u \,;\, \theta^{[\ell]})}{\partial u}, \qquad \dot{f}_\theta^{[\ell]}(u) := \frac{\partial f^{[\ell]}(a^{[\ell-1]} \,;\, u)}{\partial u}. \tag{5.23}
\]
Note that the shape of these expressions varies based on the domain and co-domain of $C(\cdot)$ and $f(\,\cdot\, ; \,\cdot\,)$. The derivative $\dot{C}(u)$ is typically a gradient and thus vector valued with length $N_L$. The derivative $\dot{f}_a^{[\ell]}(u)$ is typically an $N_{\ell-1} \times N_\ell$ matrix obtained via a transpose of a Jacobian and thus matrix valued. This is because the input to the layer is $N_{\ell-1}$ dimensional and the output argument is $N_\ell$ dimensional. Finally, the derivative $\dot{f}_\theta^{[\ell]}(u)$ may take on
various shapes depending on the form of $\theta$. See Appendix ?? for a review of matrix derivatives.

En route to computing the desired gradients (5.22) we require intermediate gradients of the loss with respect to the activation values $a^{[1]}, \ldots, a^{[L]}$. Keeping in mind that $a^{[L]} = \hat{y}$ is a function of these activation values, the intermediate gradients⁵ are denoted,
\[
\zeta^{[\ell]} := \frac{\partial C(a^{[L]}, y \,;\, \theta)}{\partial a^{[\ell]}}, \quad \ell = 1, \ldots, L. \tag{5.24}
\]

Figure 5.7: The variables and flow of information in the backpropagation algorithm for the general recursive model.
Now based on the multivariate chain rule, the recursive step (5.21), and the definitions above, we observe,
\[
\zeta^{[\ell]} = \frac{\partial a^{[\ell+1]}}{\partial a^{[\ell]}} \frac{\partial C}{\partial a^{[\ell+1]}} = \dot{f}_a^{[\ell+1]}(a^{[\ell]}) \, \zeta^{[\ell+1]}, \qquad g_\theta^{[\ell]} = \frac{\partial a^{[\ell]}}{\partial \theta^{[\ell]}} \frac{\partial C}{\partial a^{[\ell]}} = \dot{f}_\theta^{[\ell]}(\theta^{[\ell]}) \, \zeta^{[\ell]}. \tag{5.25}
\]
Note that in the application of the chain rule above, we use the notation specified in Appendix ??. Hence once the activation values $a^{[1]}, \ldots, a^{[L]}$ are populated via forward propagation of (5.21), backward computation can be carried out via,
\[
\zeta^{[\ell]} = \begin{cases} \dot{C}(a^{[L]}), & \ell = L, \\ \dot{f}_a^{[\ell+1]}(a^{[\ell]}) \, \zeta^{[\ell+1]}, & \ell = L-1, \ldots, 1, \end{cases} \tag{5.26}
\]
and at each step the gradient can be obtained via $g_\theta^{[\ell]} = \dot{f}_\theta^{[\ell]}(\theta^{[\ell]}) \, \zeta^{[\ell]}$.
⁵These are also called adjoints. See (??) in Chapter ??.
This process is summarized in Algorithm 5.1. See also Figure 5.7 for an illustration of the
flow of information.
Algorithm 5.1: Backpropagation for the general recursive model

Input: Dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$, objective function $C(\cdot) = C(\cdot \,;\, \mathcal{D})$, and parameter values $\theta = (\theta^{[1]}, \ldots, \theta^{[L]})$
Output: gradients of the loss $g_\theta^{[1]}, \ldots, g_\theta^{[L]}$

1. Compute $a^{[\ell]}$ for $\ell = 1, \ldots, L$ using (5.21) (Forward pass)
2. Compute $\zeta^{[L]} = \dot{C}(a^{[L]})$
3. Compute $g_\theta^{[L]} = \dot{f}_\theta^{[L]}(\theta^{[L]}) \, \zeta^{[L]}$
4. for $\ell = L-1, \ldots, 1$ do
5.   Compute $\zeta^{[\ell]} = \dot{f}_a^{[\ell+1]}(a^{[\ell]}) \, \zeta^{[\ell+1]}$
6.   Compute $g_\theta^{[\ell]} = \dot{f}_\theta^{[\ell]}(\theta^{[\ell]}) \, \zeta^{[\ell]}$
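Algorithm 5.1 can be written compactly in code. The sketch below (a minimal illustration of ours, not the authors' implementation) represents each layer by the function $f^{[\ell]}$ together with two vector-Jacobian products that apply $\dot f_a^{[\ell]}$ and $\dot f_\theta^{[\ell]}$ to a given $\zeta$, which is how backward mode automatic differentiation is typically organized in practice.

```python
import numpy as np

def backprop_general(x, y, layers, loss_dot):
    """Algorithm 5.1. Each layer is a dict with:
         'f':      (a_prev, theta) -> a
         'vjp_a':  (a_prev, theta, zeta) -> f_a-dot applied to zeta
         'vjp_th': (a_prev, theta, zeta) -> f_theta-dot applied to zeta."""
    a = [x]
    for layer, theta in layers:                 # forward pass (step 1)
        a.append(layer['f'](a[-1], theta))
    L = len(layers)
    zeta = loss_dot(a[L], y)                    # step 2
    grads = [None] * L
    for l in range(L, 0, -1):                   # steps 3-6
        layer, theta = layers[l - 1]
        grads[l - 1] = layer['vjp_th'](a[l - 1], theta, zeta)
        zeta = layer['vjp_a'](a[l - 1], theta, zeta)
    return grads

# A layer f(a ; W) = tanh(W a) and its two vector-Jacobian products
tanh_layer = {
    'f':      lambda a, W: np.tanh(W @ a),
    'vjp_a':  lambda a, W, z: W.T @ ((1 - np.tanh(W @ a)**2) * z),
    'vjp_th': lambda a, W, z: np.outer((1 - np.tanh(W @ a)**2) * z, a),
}

rng = np.random.default_rng(1)
layers = [(tanh_layer, rng.normal(size=(3, 2))),
          (tanh_layer, rng.normal(size=(1, 3)))]
x, y = rng.normal(size=2), np.array([0.5])
grads = backprop_general(x, y, layers, loss_dot=lambda a, y: a - y)  # C = ||a - y||^2 / 2
```

A finite-difference perturbation of any single weight reproduces the corresponding entry of `grads`, which is a standard way to validate such an implementation.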
An Unrolled Example
To get a better feel for Algorithm 5.1, the operation of backpropagation, and the associated notation, we consider a simple hypothetical example as illustrated in Figure 5.8. This is a feedforward neural network with $L = 4$, arbitrary input dimension $p$, output dimension $q = 1$, and a single neuron on each hidden layer. That is, $N_0 = p$ and $N_1, N_2, N_3, N_4 = 1$. Assume the loss function is
\[
C(a^{[L]}, y \,;\, \theta) = \frac{1}{2}\bigl(a^{[L]} - y\bigr)^2,
\]
and assume a model structure,
\[
a^{[\ell]} = f^{[\ell]}(a^{[\ell-1]} ; \theta^{[\ell]}) = \sigma^{[\ell]}\bigl(\theta^{[\ell]\top} a^{[\ell-1]}\bigr) \quad \text{for } \ell = 1, 2, 3, 4, \tag{5.27}
\]
similarly to the neural network structure of Section 5.1 where affine transformations are followed by activation functions, yet without bias terms. Since $N_0 = p$, we have that $\Theta^{[1]} = \mathbb{R}^p$, and since $N_1, \ldots, N_4$ are all unity, we have that $\Theta^{[2]}$, $\Theta^{[3]}$, and $\Theta^{[4]}$ are each $\mathbb{R}$ and the transpose in (5.27) is not needed for $\ell = 2, 3, 4$.
Figure 5.8: A simple hypothetical example with $L = 4$ and scalar hidden units: $x \in \mathbb{R}^p$ is transformed via $a^{[1]} = \sigma^{[1]}(\theta^{[1]\top} a^{[0]})$, $a^{[2]} = \sigma^{[2]}(\theta^{[2]} a^{[1]})$, $a^{[3]} = \sigma^{[3]}(\theta^{[3]} a^{[2]})$, and $\hat{y} = a^{[4]} = \theta^{[4]} a^{[3]}$.
In this hypothetical illustrative example, we use various activation functions. Namely,
\[
\sigma^{[1]}(\cdot) = \sigma_{\text{ReLU}}(\cdot), \qquad \sigma^{[2]}(\cdot) = \sigma_{\text{Tanh}}(\cdot), \qquad \sigma^{[3]}(\cdot) = \sigma_{\text{Sig}}(\cdot),
\]
and for the last step, we use the identity function, i.e., $\sigma^{[4]}(u) = u$. With the model specified we now have computable expressions for known derivatives as in (5.23), see also Section 5.3,
\[
\begin{aligned}
\dot{C}(u) &= u - y, \\
\dot{f}_\theta^{[4]}(u) &= a^{[3]}, & \dot{f}_a^{[4]}(u) &= \theta^{[4]}, \\
\dot{f}_\theta^{[3]}(u) &= a^{[2]} \underbrace{\sigma_{\text{Sig}}(u a^{[2]})\bigl(1 - \sigma_{\text{Sig}}(u a^{[2]})\bigr)}_{\dot\sigma_{\text{Sig}}(u a^{[2]})}, & \dot{f}_a^{[3]}(u) &= \theta^{[3]} \dot\sigma_{\text{Sig}}(\theta^{[3]} u), \\
\dot{f}_\theta^{[2]}(u) &= a^{[1]} \underbrace{\bigl(1 - \sigma_{\text{Tanh}}(u a^{[1]})^2\bigr)}_{\dot\sigma_{\text{Tanh}}(u a^{[1]})}, & \dot{f}_a^{[2]}(u) &= \theta^{[2]} \dot\sigma_{\text{Tanh}}(\theta^{[2]} u), \\
\dot{f}_\theta^{[1]}(u) &= a^{[0]} \underbrace{\mathbb{1}\{u^\top a^{[0]} \ge 0\}}_{\dot\sigma_{\text{ReLU}}(u^\top a^{[0]})}.
\end{aligned} \tag{5.28}
\]
Note that in this example we choose to treat $\dot{f}_\theta^{[1]}(u)$ as a $p$ dimensional vector, while the other functions in this example are scalar valued both in their domain and co-domain. Using these computable expressions for the derivatives, Algorithm 5.1 can be used to compute the gradient of the loss function with respect to $\theta$. For illustration we unroll the algorithm where we use the notation $u \leftarrow v$ to indicate assignment of $v$ to the variable $u$.
Step 1: We compute forward propagation,
\[
a^{[1]} \leftarrow \sigma_{\text{ReLU}}(\theta^{[1]\top} x), \quad a^{[2]} \leftarrow \sigma_{\text{Tanh}}(\theta^{[2]} a^{[1]}), \quad a^{[3]} \leftarrow \sigma_{\text{Sig}}(\theta^{[3]} a^{[2]}), \quad \hat{y} = a^{[4]} \leftarrow \theta^{[4]} a^{[3]}.
\]
Steps 2 and 3:
\[
\zeta^{[4]} \leftarrow a^{[4]} - y, \qquad g_\theta^{[4]} \leftarrow a^{[3]} \zeta^{[4]}.
\]
Then the loop in steps 4–6 executes for three iterations for $\ell = 3, 2, 1$:

Iteration $\ell = 3$ (steps 5 and 6):
\[
\zeta^{[3]} \leftarrow \theta^{[4]} \zeta^{[4]}, \qquad g_\theta^{[3]} \leftarrow a^{[2]} \dot\sigma_{\text{Sig}}(\theta^{[3]} a^{[2]})\, \zeta^{[3]}.
\]
Iteration $\ell = 2$ (steps 5 and 6):
\[
\zeta^{[2]} \leftarrow \theta^{[3]} \dot\sigma_{\text{Sig}}(\theta^{[3]} a^{[2]})\, \zeta^{[3]}, \qquad g_\theta^{[2]} \leftarrow a^{[1]} \dot\sigma_{\text{Tanh}}(\theta^{[2]} a^{[1]})\, \zeta^{[2]}.
\]
Iteration $\ell = 1$ (steps 5 and 6):
\[
\zeta^{[1]} \leftarrow \theta^{[2]} \dot\sigma_{\text{Tanh}}(\theta^{[2]} a^{[1]})\, \zeta^{[2]}, \qquad g_\theta^{[1]} \leftarrow x \, \dot\sigma_{\text{ReLU}}(\theta^{[1]\top} x)\, \zeta^{[1]}.
\]
By expanding out the resulting expressions and using the chain rule, we may verify that each of the intermediate outputs $g_\theta^{[4]}$, $g_\theta^{[3]}$, $g_\theta^{[2]}$, and $g_\theta^{[1]}$ yields the correct gradient expression.
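The unrolled updates can also be checked numerically. In the sketch below (our illustration; the values of $x$, $y$, and $\theta^{[1]}, \ldots, \theta^{[4]}$ are arbitrary choices, not from the source) we run the forward pass and the backward assignments exactly as written, and then compare $g_\theta^{[2]}$ against a central finite difference of the loss.

```python
import numpy as np

sig = lambda u: 1 / (1 + np.exp(-u))
sig_dot = lambda u: sig(u) * (1 - sig(u))
tanh_dot = lambda u: 1 - np.tanh(u)**2

x = np.array([0.2, -0.4, 0.1]); y = 0.7              # p = 3, arbitrary values
th1 = np.array([0.5, -0.3, 0.8]); th2, th3, th4 = 0.9, -1.1, 0.6

def forward(th1, th2, th3, th4):
    a1 = max(0.0, th1 @ x)                           # sigma_ReLU
    a2 = np.tanh(th2 * a1)                           # sigma_Tanh
    a3 = sig(th3 * a2)                               # sigma_Sig
    return a1, a2, a3, th4 * a3                      # identity output

a1, a2, a3, a4 = forward(th1, th2, th3, th4)
z4 = a4 - y;                      g4 = a3 * z4       # steps 2 and 3
z3 = th4 * z4;                    g3 = a2 * sig_dot(th3 * a2) * z3
z2 = th3 * sig_dot(th3 * a2) * z3; g2 = a1 * tanh_dot(th2 * a1) * z2
z1 = th2 * tanh_dot(th2 * a1) * z2
g1 = x * (1.0 if th1 @ x >= 0 else 0.0) * z1

# Finite-difference check on theta[2]
h = 1e-6
lossp = 0.5 * (forward(th1, th2 + h, th3, th4)[3] - y)**2
lossm = 0.5 * (forward(th1, th2 - h, th3, th4)[3] - y)**2
assert abs((lossp - lossm) / (2 * h) - g2) < 1e-8
```

The same check succeeds for each of $g_\theta^{[1]}, \ldots, g_\theta^{[4]}$, mirroring the chain rule expansion described above.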
Accounting for $\delta^{[\ell]}$ Instead of $\zeta^{[\ell]}$

Algorithm 5.1 summarizes backpropagation since it deals with gradient evaluation for the general recursive model (5.21). However, it does not make any use of the more specific structure of (5.2) which has an affine transformation followed by a non-linear activation function. It turns out that in the context of deep learning models as in (5.2), one can simplify the algorithm by keeping track of an alternative set of intermediate derivative values, $\delta^{[\ell]} \in \mathbb{R}^{N_\ell}$, instead of $\zeta^{[\ell]}$. These are,
\[
\delta^{[\ell]} := \frac{\partial C(a^{[L]}, y \,;\, \theta)}{\partial z^{[\ell]}}, \quad \ell = 1, \ldots, L,
\]
where we observe that in contrast to the $\zeta^{[\ell]}$ values from above, these values are derivatives of the loss with respect to the $z^{[\ell]}$ values instead of with respect to the $a^{[\ell]}$ values. Usage of $\delta^{[\ell]}$ is standard in backpropagation for deep learning and serves as the basis for the main backpropagation algorithm that we present below. With this notation, the key recursive relationship for backward computation is,
\[
\delta^{[\ell]} = \frac{\partial a^{[\ell]}}{\partial z^{[\ell]}} \frac{\partial z^{[\ell+1]}}{\partial a^{[\ell]}} \frac{\partial C}{\partial z^{[\ell+1]}} = \frac{\partial a^{[\ell]}}{\partial z^{[\ell]}} \frac{\partial z^{[\ell+1]}}{\partial a^{[\ell]}} \delta^{[\ell+1]}, \quad \ell = L-1, \ldots, 1, \tag{5.29}
\]
which in comparison to the left hand side of (5.25) breaks up the step $a^{[\ell]} \to a^{[\ell+1]}$ into two steps: $a^{[\ell]} \to z^{[\ell+1]}$ followed by $z^{[\ell+1]} \to a^{[\ell+1]}$. Further, for the final layer, $\ell = L$, we have,
\[
\delta^{[L]} = \frac{\partial a^{[L]}}{\partial z^{[L]}} \frac{\partial C}{\partial a^{[L]}}. \tag{5.30}
\]
Backpropagation for Fully Connected Networks
We now expand and adapt Algorithm 5.1 using the key recursive relationships for $\delta^{[\ell]}$, (5.29) and (5.30). Consider first the component $\partial a^{[\ell]} / \partial z^{[\ell]}$ associated with the (vector) activation function $S^{[\ell]}(\cdot)$. In a typical layer $\ell$, the vector activation function is composed of identical scalar activation functions as in (5.3) and hence the transposed Jacobian (see also (??) in Appendix ??) is,
\[
\frac{\partial a^{[\ell]}}{\partial z^{[\ell]}} = J_{S^{[\ell]}}(z^{[\ell]})^\top = \text{Diag}\bigl(\dot\sigma^{[\ell]}(z^{[\ell]})\bigr). \tag{5.31}
\]
Here $\text{Diag}(\cdot)$ transforms a vector into a diagonal matrix and $\dot\sigma^{[\ell]}(\cdot)$ is the derivative of the scalar activation which is interpreted as operating element wise on the input vector. Examples of scalar activation functions and their derivatives are in Section 5.3.
In other cases, $S^{[\ell]}(\cdot)$ is not separable as in (5.3) and the transposed Jacobian of $S^{[\ell]}$ does not have a simple diagonal form as in (5.31). One such potential Jacobian is computed for the softmax in (5.19). However, the most common application of softmax is in the last layer $\ell = L$ together with cross entropy loss. In such a case, using (5.20) directly in place of (5.30) is more efficient and practical.
Continuing now with the recursive relationship (5.29), we consider the component $\partial z^{[\ell+1]} / \partial a^{[\ell]}$. Here since $z^{[\ell+1]} = W^{[\ell+1]} a^{[\ell]} + b^{[\ell+1]}$, we have
\[
\frac{\partial z^{[\ell+1]}}{\partial a^{[\ell]}} = W^{[\ell+1]\top}.
\]
Hence, similarly to (5.26), putting the pieces together we have the recursive relationship,
\[
\delta^{[\ell]} = \begin{cases} \dfrac{\partial C}{\partial z^{[L]}}, & \ell = L, \\[6pt] \text{Diag}\bigl(\dot\sigma^{[\ell]}(z^{[\ell]})\bigr) W^{[\ell+1]\top} \delta^{[\ell+1]}, & \ell = L-1, \ldots, 1. \end{cases} \tag{5.32}
\]
As discussed above, the case $\ell = L$ can be computed using the two components of (5.30) separately or can be computed according to (5.20) in the case of softmax combined with cross entropy loss.
Now with a recursive relationship supporting backpropagation of the $\delta^{[\ell]}$ values, we are ready to deal with the desired derivative values of the parameters $\theta^{[\ell]} = (W^{[\ell]}, b^{[\ell]})$. Define,
\[
g_W^{[\ell]} = \frac{\partial C}{\partial W^{[\ell]}}, \quad \text{and} \quad g_b^{[\ell]} = \frac{\partial C}{\partial b^{[\ell]}}, \quad \ell = 1, \ldots, L.
\]
Here $g_W^{[\ell]}$ is an $N_\ell \times N_{\ell-1}$ matrix and $g_b^{[\ell]}$ is an $N_\ell$-vector. Paralleling the right hand equation in (5.25) (which involves $\zeta^{[\ell]}$ for the more general model), we seek equations to retrieve these target values (gradient components) in terms of $\delta^{[\ell]}$. This is done via,
\[
g_W^{[\ell]} = \frac{\partial C}{\partial z^{[\ell]}} \frac{\partial z^{[\ell]}}{\partial W^{[\ell]}} = \delta^{[\ell]} a^{[\ell-1]\top}, \quad \text{and} \quad g_b^{[\ell]} = \frac{\partial z^{[\ell]}}{\partial b^{[\ell]}} \frac{\partial C}{\partial z^{[\ell]}} = \delta^{[\ell]}, \tag{5.33}
\]
where the first expression results from (??) and the second expression from (??), both in Appendix ??.
All of these relationships are packaged in the backpropagation algorithm for fully connected
networks. See also Figure 5.9 for an illustration of the flow of information.
Algorithm 5.2: Backpropagation for fully connected networks

Input: Dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$, objective function $C(\cdot) = C(\cdot \,;\, \mathcal{D})$, and parameter values $\theta = (\theta^{[1]}, \ldots, \theta^{[L]})$
Output: derivatives of the loss $(g_W^{[1]}, g_b^{[1]}), \ldots, (g_W^{[L]}, g_b^{[L]})$

1. Compute $a^{[\ell]}$ and $z^{[\ell]}$ for $\ell = 1, \ldots, L$ (Forward pass)
2. Compute $\delta^{[L]} = \partial C / \partial z^{[L]}$
3. Set $g_W^{[L]} = \delta^{[L]} a^{[L-1]\top}$ and set $g_b^{[L]} = \delta^{[L]}$
4. for $\ell = L-1, \ldots, 1$ do
5.   Compute $\delta^{[\ell]} = \text{Diag}\bigl(\dot\sigma^{[\ell]}(z^{[\ell]})\bigr) W^{[\ell+1]\top} \delta^{[\ell+1]}$
6.   Set $g_W^{[\ell]} = \delta^{[\ell]} a^{[\ell-1]\top}$ and set $g_b^{[\ell]} = \delta^{[\ell]}$
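Algorithm 5.2 maps directly to a few lines of NumPy. The sketch below is our own instantiation (single sample, tanh hidden layers, identity output with squared error loss, so that $\delta^{[L]} = a^{[L]} - y$); it follows the algorithm's steps literally, storing $z^{[\ell]}$ and $a^{[\ell]}$ on the forward pass and the $\delta^{[\ell]}$ values on the backward pass.

```python
import numpy as np

def backprop_fc(x, y, Ws, bs):
    """Algorithm 5.2 for tanh hidden layers and an identity output layer,
       with squared error loss C = ||a[L] - y||^2 / 2 (so dC/dz[L] = a[L] - y)."""
    L = len(Ws)
    a, zs = [x], []
    for l in range(L):                            # step 1: forward pass
        z = Ws[l] @ a[-1] + bs[l]
        zs.append(z)
        a.append(z if l == L - 1 else np.tanh(z))
    delta = a[-1] - y                             # step 2: dC/dz[L]
    gW, gb = [None] * L, [None] * L
    gW[L - 1], gb[L - 1] = np.outer(delta, a[L - 1]), delta   # step 3
    for l in range(L - 2, -1, -1):                # steps 4-6
        delta = (1 - np.tanh(zs[l])**2) * (Ws[l + 1].T @ delta)
        gW[l], gb[l] = np.outer(delta, a[l]), delta
    return gW, gb

rng = np.random.default_rng(2)
sizes = [4, 5, 3, 1]                              # N_0, ..., N_L
Ws = [rng.normal(size=(m, n)) / np.sqrt(n) for n, m in zip(sizes, sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
x, y = rng.normal(size=4), np.array([1.0])
gW, gb = backprop_fc(x, y, Ws, bs)
```

Note that the product with $\text{Diag}(\dot\sigma^{[\ell]}(z^{[\ell]}))$ in step 5 is realized as element-wise multiplication, avoiding the explicit diagonal matrix.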
Backpropagation on a Whole Mini Batch
In Section ?? we discussed the concept of iterative optimization using mini-batches. With this approach, instead of considering all of the data points, we only use $n_b$ samples and use the gradient $\sum_i \nabla C_i / n_b$, summed over the samples $i$ in the mini-batch. See also equation (??) in Chapter ??.
Figure 5.9: The variables and flow of information in the backpropagation algorithm for fully connected networks.
In fact, earlier in this chapter, in (5.9), we introduced notation for executing the forward pass simultaneously on a whole mini-batch. The key here is to properly organize the activation values and the data values in matrices. This notation can be further extended to adapt the backpropagation algorithm for computation of the mini-batch gradient. Hence one may also specify a form of Algorithm 5.2, suitable for mini-batches.
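One way to realize this matrix organization (a sketch of ours, not the book's exact notation) is to stack the $n_b$ samples as columns of a matrix $X$; the forward pass then applies each affine map to all samples at once, with the bias vector broadcast across columns.

```python
import numpy as np

def forward_batch(X, Ws, bs):
    """Forward pass on a whole mini-batch: X is p x n_b (samples as columns).
       Broadcasting via b[:, None] adds b to every column of W @ A."""
    A = X
    for W, b in zip(Ws, bs):
        A = np.tanh(W @ A + b[:, None])   # tanh layers, for illustration
    return A

rng = np.random.default_rng(3)
p, n_b = 4, 8
Ws = [rng.normal(size=(6, 4)), rng.normal(size=(2, 6))]
bs = [rng.normal(size=6), rng.normal(size=2)]
X = rng.normal(size=(p, n_b))

A = forward_batch(X, Ws, bs)
# Each column of A matches a per-sample forward pass
col0 = np.tanh(Ws[1] @ np.tanh(Ws[0] @ X[:, 0] + bs[0]) + bs[1])
assert np.allclose(A[:, 0], col0)
```

The backward pass extends in the same way, with the $\delta^{[\ell]}$ vectors also stacked as columns and $g_W^{[\ell]}$ obtained by summing the per-sample outer products.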
While specific implementation details are not our focus, we mention that in most practical
deep learning frameworks, the common interface for gradient evaluation using backpropaga-
tion is based on input tensors, where each tensor is associated with a mini-batch. Then one
of the indices of the tensor indexes specific data samples within the mini-batch. For example
common formats for image data are based on 4 dimensional tensors, with a format referred
to as NHWC or NCHW. Here ‘N’ represents the index within the mini-batch, while ‘H’, ‘W’,
and ‘C’ are for height, width, and channels as appropriate for color image data. Note that
the concept of channels is taken from the context of convolutional neural networks which
are the topic of study in Chapter ??.
Vanishing and Exploding Gradients
The key backpropagation recursions are the forward step $a^{[\ell]} = S^{[\ell]}\bigl(W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}\bigr)$ as in (5.4) and the backward step $\delta^{[\ell]} = \text{Diag}\bigl(\dot\sigma^{[\ell]}(z^{[\ell]})\bigr) W^{[\ell+1]\top} \delta^{[\ell+1]}$ as in (5.32). From a practical perspective, these steps are sometimes subject to instability when the number of layers $L$ is large.
To see this, let us first simplify the situation by ignoring the activation functions and assuming that $W^{[\ell]}$ is of a fixed square dimension with the same weight matrix $W$, for $\ell = 1, \ldots, L-1$, and the last weight matrix is $W^{[L]}$. Further, ignore the bias terms $b^{[\ell]}$. In this simplified case,
\[
\hat{y} = a^{[L]} = W^{[L]} W^{[L-1]} W^{[L-2]} \cdots W^{[3]} W^{[2]} W^{[1]} x = W^{[L]} W^{L-1} x, \tag{5.34}
\]
where $W^{L-1}$ is the $(L-1)$-th power of $W$.
As is well known in linear algebra and systems theory, unless the maximal eigenvalues of $W$ are exactly of unit magnitude, as $L$ grows we have that $\hat{y}$ either vanishes (towards 0) or explodes (with values of increasing magnitude). As a further simplified illustration of this, if $W = wI$ (a constant multiple of the identity matrix), then $\hat{y} = W^{[L]} w^{L-1} x$, and for any $w \neq 1$, a vanishing $\hat{y}$ or exploding $\hat{y}$ phenomenon persists. This illustration shows that for non-small network depths (large $L$), instability issues may arise in the forward pass.
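The effect is easy to observe numerically. The sketch below (our illustration) applies $W = wI$ repeatedly as in the simplified case of (5.34), for $w$ slightly below and slightly above 1.

```python
import numpy as np

def repeated_forward(w, L, x):
    """Apply W = w*I a total of L-1 times to x, as in the simplified (5.34)."""
    W = w * np.eye(len(x))
    a = x
    for _ in range(L - 1):
        a = W @ a
    return a

x = np.ones(3)
small = np.linalg.norm(repeated_forward(0.9, 50, x))   # w < 1: vanishes like 0.9^49
large = np.linalg.norm(repeated_forward(1.1, 50, x))   # w > 1: explodes like 1.1^49
assert small < 1e-2 and large > 1e2
```

Even at a modest depth of 50, the two runs differ by several orders of magnitude, and the gap widens geometrically with $L$.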
The same type of instability problem can then also persist in the backward pass since the backward recursion $\delta^{[\ell]} = \text{Diag}\bigl(\dot\sigma^{[\ell]}(z^{[\ell]})\bigr) W^{[\ell+1]\top} \delta^{[\ell+1]}$ also includes repeated matrix multiplications, and if for simplicity we ignore the activation functions and again take a constant matrix $W$, then,
\[
\delta^{[\ell]} = \bigl(W^\top\bigr)^{L-\ell} \delta^{[L]}. \tag{5.35}
\]
Hence there is often a vanishing or exploding nature of $\delta^{[\ell]}$ for large $L$ and low values of $\ell$ (the first layers of the network). Now with (5.33) in mind, we see that in general, the gradient values $g_W^{[\ell]}$ and $g_b^{[\ell]}$ may get smaller and smaller (vanishing) or larger and larger (exploding) as we go backward with every layer during backpropagation. These are respectively called the vanishing gradient and exploding gradient problems.

In the worst case, vanishing gradients may completely stop the neural network from training, while exploding gradients may throw parameter values in arbitrary directions. This may result in oscillations around the minima or even overshooting the optimum again and again. Another impact of exploding gradients is that huge values of the gradients may cause numeric overflow resulting in incorrect computations or introductions of NaN floating point values (“not a number”).
Gradient descent improvements such as RMSProp, integrated in ADAM (see Section ??), can help normalize such variation in the gradients. Nevertheless, numerical instability can still persist. Further, with activation functions such as sigmoid or tanh, in cases of inputs far from 0 the gradient components of $\text{Diag}\bigl(\dot\sigma^{[\ell]}(z^{[\ell]})\bigr)$ may also vanish. Activation functions such as ReLU or leaky ReLU handle such problems, yet the overarching phenomenon associated with repeated matrix multiplication as exemplified in (5.34) and (5.35) still persists. A key strategy for mitigating such a problem is based on weight initialization as we discuss in the section below.

To handle exploding gradients, sometimes gradient clipping is employed. This approach adjusts the gradient so that it, or its individual components, are not too big in magnitude. One approach is to clip each coordinate of the gradient so that it does not exceed a pre-specified value in absolute value. This approach can obviously change the direction of the gradient since some coordinates may be clipped while others are not. Another approach is to scale the whole gradient by its norm so that its magnitude is at a fixed threshold. This maintains the direction of the gradient as originally computed.
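Both clipping variants are a few lines each. The sketch below is our own illustration (the threshold value is arbitrary): the first clips coordinate-wise and may rotate the gradient, while the second rescales by the norm and preserves the direction.

```python
import numpy as np

def clip_by_value(g, c):
    """Clip each coordinate of g to [-c, c]; may change g's direction."""
    return np.clip(g, -c, c)

def clip_by_norm(g, c):
    """Rescale g so its norm is at most c; preserves g's direction."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)

g = np.array([3.0, -4.0])            # norm 5
v = clip_by_value(g, 2.0)            # coordinate-wise: [2, -2]
n = clip_by_norm(g, 2.0)             # rescaled: g * (2/5), norm 2
assert np.allclose(v, [2.0, -2.0])
assert np.isclose(np.linalg.norm(n), 2.0)
assert np.allclose(n, g * 0.4)       # same direction as g
```

Note that `clip_by_value` has turned a gradient pointing in direction $(3, -4)$ into one pointing in direction $(1, -1)$, while `clip_by_norm` has not.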
5.5 Weight Initialization
While gradient based optimization generally works in advancing towards local minima, as highlighted above, vanishing gradients or exploding gradients may significantly hinder progress. In fact, starting with initial values that are either constant or 0 for the weights and bias parameters may throw the learning process off. Such constant initial parameters may impose symmetry on the activation values of the hidden units and in turn prohibit the model from exploiting its expressive power.

Random initialization enables us to break any potential symmetries and is almost always preferable. Below we outline specific principles of random parameter initialization, yet even if these are not applied, the most basic random initialization approach is to set all parameters of the weight matrices $W^{[1]}, \ldots, W^{[L]}$ as independent and identically distributed standard normal random variables and to set all the entries of the bias vectors $b^{[1]}, \ldots, b^{[L]}$ at 0.
In view of the potential vanishing gradient and exploding gradient problems highlighted above, there is room for smarter weight initialization techniques. Specifically, a general principle that is followed focuses on the activation values $a^{[1]}, \ldots, a^{[L]}$ associated with the initial weight and bias parameters. If we momentarily consider these activation values as random entries, an overarching goal is that the distribution of each entry of $a^{[\ell+1]}$ is approximately similar to that of each entry of $a^{[\ell]}$. In such a case, recursing the forward pass (5.4), a form of distributional stability can persist when the number of layers $L$ is large. With such an approach, if we choose the distribution of the initial weight matrix parameters judiciously, vanishing and exploding gradients may be mitigated at the onset of learning.
More concretely, the goal of equating the distribution of $a^{[\ell+1]}$ entries with $a^{[\ell]}$ entries is viewed via the first two moments of the distribution (mean and variance). Specifically, with activation function $\sigma(\cdot)$ we have in the $\ell$-th layer,
\[
a_i^{[\ell]} = \sigma\bigl(z_i^{[\ell]}\bigr), \quad \text{where} \quad z_i^{[\ell]} = \sum_{j=1}^{N_{\ell-1}} w_{i,j}^{[\ell]} a_j^{[\ell-1]} + b_i^{[\ell]}. \tag{5.36}
\]
To develop an initialization approach, we assume that all $a_j^{[\ell-1]}$ values for $j = 1, \ldots, N_{\ell-1}$ are identically distributed with mean 0, some variance $\breve{v}$, and are statistically independent. Further, we wish to initialize parameters randomly such that $a_i^{[\ell]}$ is also with a mean of 0 and the same variance $\breve{v}$. Our strategy for this is to initialize the bias $b^{[\ell]}$ with entries 0 and the weight parameters $w_{i,j}^{[\ell]}$ as normally distributed random variables with a mean of 0 and some specified variance.
It turns out that when the activation function, $\sigma(\cdot)$, is tanh, a sensible variance to use for $w_{i,j}^{[\ell]}$ is $1/N_{\ell-1}$. This form of initialization is called Xavier initialization. Further, when the activation function is ReLU, a sensible variance to use is $2/N_{\ell-1}$ and this form of initialization is called He initialization. Another commonly used heuristic is to use variance $2/(N_{\ell-1} + N_\ell)$ which is the harmonic mean of $\frac{1}{N_{\ell-1}}$ and $\frac{1}{N_\ell}$. In any case, all of these are heuristics, often implemented in practical deep learning frameworks.
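Each heuristic amounts to choosing the variance of a zero-mean normal draw. The sketch below (our own illustration; the rule names are just labels) draws a weight matrix for a layer with fan-in $N_{\ell-1}$ and fan-out $N_\ell$ under each rule and checks the empirical variance.

```python
import numpy as np

def init_weights(n_in, n_out, rule, rng):
    """Zero-mean normal weights with variance set by the chosen heuristic."""
    var = {"xavier": 1.0 / n_in,              # tanh layers
           "he": 2.0 / n_in,                  # ReLU layers
           "xavier_avg": 2.0 / (n_in + n_out)}[rule]
    return rng.normal(0.0, np.sqrt(var), size=(n_out, n_in))

rng = np.random.default_rng(0)
W = init_weights(400, 200, "he", rng)
# Empirical variance of the 80,000 draws should be near 2/400 = 0.005
assert abs(W.var() - 2.0 / 400) < 1e-3
```

Bias vectors are initialized at 0 in all three cases, as described above.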
Derivation of Xavier initialization
To get a feel for the considerations behind the $1/N_{\ell-1}$ variance rule of Xavier initialization we now present the derivation of this rule. The approach is to equate activation variances at some $\breve{v}$, based on (5.36). To do so, we use an approximation of $\sigma_{\text{Tanh}}(\cdot)$ as $\sigma(u) = u$. This is a first-order Taylor approximation of the tanh activation function. With this approximation and with 0 bias entries, (5.36) yields,
\[
\text{Var}\bigl(a_i^{[\ell]}\bigr) = \text{Var}\bigl(z_i^{[\ell]}\bigr) = \text{Var}\Bigl(\sum_{j=1}^{N_{\ell-1}} w_{i,j}^{[\ell]} a_j^{[\ell-1]}\Bigr) = \sum_{j=1}^{N_{\ell-1}} \text{Var}\bigl(w_{i,j}^{[\ell]} a_j^{[\ell-1]}\bigr). \tag{5.37}
\]
This is based on the assumption that all random variables, both activation values and weight parameters, are statistically independent. Now to evaluate the variance summands on the right hand side, we use the following property of the variance of a product of two independent random variables $X$ and $Y$,
\[
\text{Var}(XY) = \mathbb{E}[X]^2 \text{Var}(Y) + \text{Var}(X)\mathbb{E}[Y]^2 + \text{Var}(X)\text{Var}(Y).
\]
With this we obtain,
\[
\text{Var}\bigl(w_{i,j}^{[\ell]} a_j^{[\ell-1]}\bigr) = \mathbb{E}\bigl[w_{i,j}^{[\ell]}\bigr]^2 \text{Var}\bigl(a_j^{[\ell-1]}\bigr) + \text{Var}\bigl(w_{i,j}^{[\ell]}\bigr)\mathbb{E}\bigl[a_j^{[\ell-1]}\bigr]^2 + \text{Var}\bigl(w_{i,j}^{[\ell]}\bigr)\text{Var}\bigl(a_j^{[\ell-1]}\bigr).
\]
Now since we seek activation values with a mean of 0 and a variance of $\breve{v}$, and the weights are initialized with mean 0, the first two terms above vanish and we have $\text{Var}\bigl(w_{i,j}^{[\ell]} a_j^{[\ell-1]}\bigr) = \text{Var}\bigl(w_{i,j}^{[\ell]}\bigr)\, \breve{v}$. Now also assuming this variance $\breve{v}$ for activations of layer $\ell$, and combining in (5.37), we have,
\[
\breve{v} = \sum_{j=1}^{N_{\ell-1}} \text{Var}\bigl(w_{i,j}^{[\ell]}\bigr)\, \breve{v}.
\]
By further requiring that all weight initialization variance entries for layer $\ell$ are constant, say at $\text{Var}\bigl(w_{i,j}^{[\ell]}\bigr) = \breve{w}$, we have $\breve{v} = N_{\ell-1} \breve{w} \breve{v}$. Then, this shows that setting $\breve{w} = 1/N_{\ell-1}$ achieves the desired result.
Further Insight Regarding Vanishing or Exploding Values
The derivation of the Xavier initialization offers further insight about the nature of vanishing or exploding values. Assume that the input features vector $x$ is also random, with each entry having the same variance $\breve v_x$. Then, by recursing the forward pass (5.4), continuing to approximate the activation function as the identity, and using the variance calculations above, we have,
\[
\mathrm{Var}\big(a^{[L]}_j\big) = \bigg[\prod_{\ell=1}^{L} N_{\ell-1}\, \breve w\bigg]\, \breve v_x.
\]
This further highlights that setting $\breve w = 1/N_{\ell-1}$ yields stability of the variance of the outputs, which is especially important when the number of layers $L$ is large. Other choices where $N_{\ell-1}\, \breve w < 1$ across all layers $\ell$ may yield vanishing activations, and similarly if $N_{\ell-1}\, \breve w > 1$ we may observe exploding activations.
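The effect of this product can be checked numerically. The following is a minimal Python sketch (not from the book; the layer width, depth, and trial count are arbitrary hypothetical choices) that propagates unit-variance inputs through a deep linear network with zero-mean weights of variance $c/N_{\ell-1}$ per layer, so that $N_{\ell-1}\,\breve w = c$ at every layer:

```python
import math
import random

random.seed(1)

def output_variance(num_layers, width, c, trials=500):
    """Estimate Var(a_1^[L]) for a deep linear network (identity activation,
    zero biases, unit-variance inputs) with i.i.d. zero-mean weights of
    variance c / width, so that N_{l-1} * w_breve = c at every layer."""
    outs = []
    for _ in range(trials):
        a = [random.gauss(0.0, 1.0) for _ in range(width)]   # Var(x_j) = 1
        for _ in range(num_layers):
            a = [sum(random.gauss(0.0, math.sqrt(c / width)) * aj for aj in a)
                 for _ in range(width)]
        outs.append(a[0])
    m = sum(outs) / trials
    return sum((o - m) ** 2 for o in outs) / trials

L, N = 8, 10
v_xavier = output_variance(L, N, 1.0)   # c = 1 (Xavier): stays near 1
v_small = output_variance(L, N, 0.5)    # c < 1: roughly 0.5^8, vanishing
v_large = output_variance(L, N, 2.0)    # c > 1: roughly 2^8, exploding
print(v_xavier, v_small, v_large)
```

With $c = 1$ the estimated output variance stays near $\breve v_x = 1$, while $c < 1$ and $c > 1$ give variances near $c^L$, vanishing or exploding geometrically in the depth $L$.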
5.6 Batch Normalization
In Section ?? we discussed standardization of data. As seen in equations (??) and (??), this is simply a transformation of each feature of the input data based on subtraction of the mean and division by the standard deviation. There are multiple benefits in carrying out standardization, one of which is a reshaping of the loss landscape. As an illustration, in Section ?? we explored the effect that such standardization has on simple problems. It turns out that with much more complicated deep neural networks, standardization, also called normalization, may also be very beneficial. In this section we present an extended idea called batch normalization, where the outputs of intermediate hidden layers are also normalized in an adaptive manner.
The overarching idea of batch normalization is to normalize (or standardize) not just the input data but also individual neuron values within the intermediate hidden layers or final layer of the network. In a nutshell, returning to the display (5.2) and taking $j$ as an index of a neuron in layer $\ell$, we may wish to have either $z^{[\ell]}_j$ or $a^{[\ell]}_j$ exhibit near-normalized values over the input dataset. Such normalization of the neuron values then yields more consistent training and mitigates vanishing or exploding gradient problems. It also has a slight regularization effect which may prevent overfitting.
Our exposition here outlines normalization of the $z^{[\ell]}_j$ values, yet the reader should keep in mind that one may choose to do so for the $a^{[\ell]}_j$ values instead. When applying the batch normalization technique that we describe here, the output of the training process involves further parameters, some of which are trained via optimization (gradient descent), while others are based on running averages computed during the training process. We now present the details.
The Idea of Per Unit Normalization
The main idea of batch normalization is to consider neuron $j$ in layer $\ell$ and, instead of using $z^{[\ell]}_j$ as in (5.2), to use a transformed version $\tilde z^{[\ell]}_j$. Such a transformation takes place both at training time and when using the model in production, with subtle differences between the two cases as we describe below. The transformation aims to position the $\tilde z^{[\ell]}_j$ values so that they have approximately zero mean and unit standard deviation over the data. Further, the transformation involves a correction using trainable parameters.
The transformation requires estimates of the unit's mean and standard deviation so that the unit value can be normalized. During training, at a given training epoch and for a given mini-batch of size $n_b$, such estimates are obtained via,
\[
\hat\mu^{[\ell]}_j = \frac{1}{n_b} \sum_{i=1}^{n_b} z^{[\ell](i)}_j \qquad \text{and} \qquad \hat\sigma^{[\ell]}_j = \sqrt{\frac{1}{n_b} \sum_{i=1}^{n_b} \Big(z^{[\ell](i)}_j - \hat\mu^{[\ell]}_j\Big)^2}, \tag{5.38}
\]
where $z^{[\ell](i)}_j$ is the value at unit $j$, at layer $\ell$, and sample $i$ within the mini-batch, prior to carrying out normalization.
With $\hat\mu^{[\ell]}_j$ and $\hat\sigma^{[\ell]}_j$ available, we compute
\[
\bar z^{[\ell](i)}_j = \frac{z^{[\ell](i)}_j - \hat\mu^{[\ell]}_j}{\sqrt{\big(\hat\sigma^{[\ell]}_j\big)^2 + \varepsilon}}, \tag{5.39}
\]
where $\varepsilon > 0$ is a small fixed quantity that ensures that we do not divide by zero. At this point $\bar z^{[\ell](i)}_j$ has nearly zero mean and nearly unit standard deviation over the data samples $i$ in the mini-batch (nearly, rather than exactly, only due to $\varepsilon$).
Now finally, an additional transformation takes place in the form,
\[
\tilde z^{[\ell](i)}_j = \gamma^{[\ell]}_j\, \bar z^{[\ell](i)}_j + \beta^{[\ell]}_j, \tag{5.40}
\]
where $\gamma^{[\ell]}_j$ and $\beta^{[\ell]}_j$ are trainable parameters. A consequence of (5.39) and (5.40) is that $\tilde z^{[\ell](i)}_j$ has a standard deviation of approximately $\gamma^{[\ell]}_j$ and a mean of approximately $\beta^{[\ell]}_j$ over the data samples $i$ in the mini-batch. These parameters are respectively initialized at $1$ and $0$, and then as training progresses, $\gamma^{[\ell]}_j$ and $\beta^{[\ell]}_j$ are updated using the same learning mechanisms applied to the weights and biases of the network. Namely, they are updated using backpropagation and gradient based learning. Specific backpropagation details are presented at the end of this section.
As presented above, each unit or neuron has its own set of trainable parameters, $\gamma^{[\ell]}_j$ and $\beta^{[\ell]}_j$. However, in practice, multiple neurons in the same layer often share the same batch normalization parameters. This implies adjusting the mean and standard deviation estimates (5.38) to have summations over multiple neurons $j$. For example, in convolutional neural networks as presented in the next chapter, all neurons of the same channel typically have the same batch normalization applied to them.
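As a concrete illustration, here is a minimal Python sketch of steps (5.38)–(5.40) for a single unit over a mini-batch (the batch values, $\gamma$, and $\beta$ below are arbitrary hypothetical numbers):

```python
import math

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Batch-normalize the values z = [z^(1), ..., z^(n_b)] of a single unit
    over a mini-batch, following (5.38)-(5.40)."""
    n_b = len(z)
    mu = sum(z) / n_b                                        # (5.38): batch mean
    var = sum((zi - mu) ** 2 for zi in z) / n_b              # (5.38): batch variance
    z_bar = [(zi - mu) / math.sqrt(var + eps) for zi in z]   # (5.39)
    z_tilde = [gamma * zb + beta for zb in z_bar]            # (5.40)
    return z_tilde, z_bar, mu, var

z = [2.0, 4.0, 6.0, 8.0]
z_tilde, _, mu, var = batch_norm_forward(z, gamma=2.0, beta=1.0)
print(mu, var)                            # 5.0 5.0
print([round(v, 3) for v in z_tilde])     # mean approx. beta, std approx. gamma
```

Over the mini-batch, the transformed values have mean approximately $\beta = 1$ and standard deviation approximately $\gamma = 2$, as stated above.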
Batch Normalization in Production
When using the model in production we need to be able to apply the model to a single input data sample, in which case evaluation of $\hat\mu^{[\ell]}_j$ and $\hat\sigma^{[\ell]}_j$ as in (5.38) is impossible. Instead, averages from the training set are collected during training time and these are supplied and deployed with the model. Practically, since parameters are updated during a training run and this updating in turn affects the $z^{[\ell]}_j$ values, averages are often collected in parallel to training. A common technique is to use exponential smoothing as in (??) in Chapter ??, and apply it to the sequence of computed values $\hat\mu^{[\ell]}_j$ and $\hat\sigma^{[\ell]}_j$ between mini-batches during training. The result of this exponential smoothing, denoted here as $\bar{\hat\mu}^{[\ell]}_j$ and $\bar{\hat\sigma}^{[\ell]}_j$, is then deployed together with the model.
As a summary, in a deployed model, each unit ($j$ in layer $\ell$), or set of units, to which batch normalization is applied is deployed with the trained $\gamma^{[\ell]}_j$ and $\beta^{[\ell]}_j$ values, as well as with the exponentially smoothed estimates $\bar{\hat\mu}^{[\ell]}_j$ and $\bar{\hat\sigma}^{[\ell]}_j$. Then in production,
\[
\tilde z^{[\ell]}_j = \gamma^{[\ell]}_j\, \frac{z^{[\ell]}_j - \bar{\hat\mu}^{[\ell]}_j}{\sqrt{\big(\bar{\hat\sigma}^{[\ell]}_j\big)^2 + \varepsilon}} + \beta^{[\ell]}_j, \tag{5.41}
\]
is used in place of $z^{[\ell]}_j$.
Interestingly, returning to (5.2) and observing that the vector $z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}$ is an affine function of the previous activation vector, we see that when combined with (5.41) we are left with an affine transformation $\tilde z^{[\ell]} = \widetilde W^{[\ell]} a^{[\ell-1]} + \tilde b^{[\ell]}$, with a modified weight matrix $\widetilde W^{[\ell]}$ and bias vector $\tilde b^{[\ell]}$. Hence, at least in principle, deployment of batch normalization of this sort can be done without the production model needing to be aware of batch normalization at all, since we can encode it in the weight matrices and bias vectors.⁶
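A small Python sketch (with hypothetical numbers) makes this folding concrete: applying (5.41) after the affine map $z = Wa + b$ gives the same output as a single folded affine map with $\widetilde W$ and $\tilde b$:

```python
import math

def fold_batch_norm(W, b, gamma, beta, mu, sigma, eps=1e-5):
    """Fold per-unit batch-norm parameters into the affine layer, so that
    gamma_j * ((W a + b)_j - mu_j) / sqrt(sigma_j^2 + eps) + beta_j
    equals (W_tilde a + b_tilde)_j."""
    W_tilde, b_tilde = [], []
    for j, row in enumerate(W):
        s = gamma[j] / math.sqrt(sigma[j] ** 2 + eps)   # per-unit scale
        W_tilde.append([s * w for w in row])
        b_tilde.append(s * (b[j] - mu[j]) + beta[j])
    return W_tilde, b_tilde

# A tiny check on hypothetical numbers: both routes give the same output.
W = [[1.0, -2.0], [0.5, 3.0]]
b = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.3, -0.1]
mu, sigma = [0.4, 1.2], [2.0, 0.7]
a = [0.6, -1.1]

z = [sum(w * ai for w, ai in zip(row, a)) + bj for row, bj in zip(W, b)]
bn = [g * (zj - m) / math.sqrt(s ** 2 + 1e-5) + be
      for g, be, m, s, zj in zip(gamma, beta, mu, sigma, z)]
Wt, bt = fold_batch_norm(W, b, gamma, beta, mu, sigma)
folded = [sum(w * ai for w, ai in zip(row, a)) + bj for row, bj in zip(Wt, bt)]
print([round(x, 6) for x in bn] == [round(x, 6) for x in folded])  # True
```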
Backpropagation of Batch Normalization Parameters
We close this section with an exposition of how the batch normalization parameters $\gamma^{[\ell]}_j$ and $\beta^{[\ell]}_j$ can be updated as part of the backpropagation algorithm. To simplify the notation we omit the superscript $[\ell]$ and the subscript $j$, as the batch normalization parameters are either neuron-specific or are shared by a group of neurons. Our focus is thus on one pair $\gamma$ and $\beta$, for which we require $\partial C/\partial\gamma$ and $\partial C/\partial\beta$ respectively. We also require computing the intermediate derivative values $\partial C/\partial z^{(i)}$ for backpropagation to operate.
Figure 5.10: A schematic of the computational graph for batch normalization at an arbitrary layer. The goal is to compute the gradients of the loss with respect to $\gamma$, $\beta$, and each $z^{(i)}$.
At each iteration, the network estimates the mean $\hat\mu$ and the standard deviation $\hat\sigma$ corresponding to the current batch. In the forward pass, given inputs (outputs of the previous layer) $z^{(i)}$ for $i = 1, \ldots, n_b$, we calculate the outputs of the batch normalization procedure, $\tilde z^{(i)}$. Then, the backpropagation pass propagates the gradient of the loss function $C(\cdot)$ through this transformation and computes the gradients with respect to the parameters $\gamma$ and $\beta$. We then also compute the intermediate derivatives $\partial C/\partial z^{(i)}$.
A schematic of the computational graph associated with the batch normalization steps is presented in Figure 5.10. Backpropagation of the higher layers yields the gradient $\partial C/\partial\tilde z^{(i)}$. With this we want to compute $\partial C/\partial\gamma$, $\partial C/\partial\beta$, and $\partial C/\partial z^{(i)}$.
⁶ This consolidation of batch normalization into the weight matrix and bias vector is often not done in practice, especially due to the fact that $W$ is often a convolution matrix as described in the next chapter.
The gradients $\partial C/\partial\gamma$ and $\partial C/\partial\beta$ are simple. By applying the chain rule with (5.40), we get
\[
\frac{\partial C}{\partial \gamma} = \sum_{i=1}^{n_b} \frac{\partial C}{\partial \tilde z^{(i)}}\, \bar z^{(i)}, \qquad \text{and} \qquad \frac{\partial C}{\partial \beta} = \sum_{i=1}^{n_b} \frac{\partial C}{\partial \tilde z^{(i)}}. \tag{5.42}
\]
In order to compute $\partial C/\partial z^{(i)}$, we also need to evaluate $\partial C/\partial\hat\mu$, $\partial C/\partial\hat\sigma^2$, and $\partial C/\partial\bar z^{(i)}$ since,
\[
\frac{\partial C}{\partial z^{(i)}} = \frac{\partial C}{\partial \bar z^{(i)}}\, \frac{\partial \bar z^{(i)}}{\partial z^{(i)}} + \frac{\partial C}{\partial \hat\sigma^2}\, \frac{\partial \hat\sigma^2}{\partial z^{(i)}} + \frac{\partial C}{\partial \hat\mu}\, \frac{\partial \hat\mu}{\partial z^{(i)}}. \tag{5.43}
\]
With the multivariate chain rule and (5.39) and (5.38) we have,
\[
\frac{\partial C}{\partial \bar z^{(i)}} = \frac{\partial C}{\partial \tilde z^{(i)}}\, \gamma,
\]
\[
\frac{\partial C}{\partial \hat\sigma^2} = \sum_{i=1}^{n_b} \frac{\partial C}{\partial \bar z^{(i)}}\, \frac{\partial \bar z^{(i)}}{\partial \hat\sigma^2} = \sum_{i=1}^{n_b} \frac{\partial C}{\partial \bar z^{(i)}}\, \big(z^{(i)} - \hat\mu\big) \Big(-\frac{1}{2}\Big) \big(\hat\sigma^2 + \varepsilon\big)^{-3/2},
\]
\[
\frac{\partial C}{\partial \hat\mu} = \sum_{i=1}^{n_b} \frac{\partial C}{\partial \bar z^{(i)}}\, \frac{\partial \bar z^{(i)}}{\partial \hat\mu} + \frac{\partial C}{\partial \hat\sigma^2}\, \frac{\partial \hat\sigma^2}{\partial \hat\mu} = \sum_{i=1}^{n_b} \frac{\partial C}{\partial \bar z^{(i)}}\, \frac{-1}{\sqrt{\hat\sigma^2 + \varepsilon}} + \frac{\partial C}{\partial \hat\sigma^2}\, \frac{-2 \sum_{i=1}^{n_b} \big(z^{(i)} - \hat\mu\big)}{n_b}.
\]
These formulas present us with an explicit expression by transforming (5.43) to
\[
\frac{\partial C}{\partial z^{(i)}} = \frac{\partial C}{\partial \bar z^{(i)}}\, \frac{1}{\sqrt{\hat\sigma^2 + \varepsilon}} + \frac{\partial C}{\partial \hat\sigma^2}\, \frac{2\big(z^{(i)} - \hat\mu\big)}{n_b} + \frac{\partial C}{\partial \hat\mu}\, \frac{1}{n_b}. \tag{5.44}
\]
This representation of (5.44) can then be integrated with backpropagation through the neural network.
5.7 Mitigating Overfitting with Dropout and Regularization
In Section ?? we discussed the challenges and tradeoffs associated with overfitting and the need for generalization. On the one hand we seek a model that will make use of the available data and properly capture the dependence on the input features, while on the other hand we wish to avoid overfitting. As presented in Section ??, one general approach for mitigating overfitting is called regularization, where as in (??) we may augment the loss function $C(\cdot)$ with a regularization term $R_\lambda(\theta)$, and optimize $\min_\theta\, C(\theta; \mathcal{D}) + R_\lambda(\theta)$ in place of just minimizing the loss function. Such a practice using additive regularization is possible with deep neural networks as well. However, there is also an alternative popular approach called dropout which is specific to deep neural networks. We first describe the dropout approach and then highlight relationships between dropout, ensemble methods, and the addition of regularization terms.
Dropout
The idea of dropout is to randomly zero out certain neurons during the training process.
This allows training to focus on multiple random subsets of the parameters and yields a
form of regularization.
With dropout, at any backpropagation iteration (forward pass and backward pass) on a mini-batch, only some random subset of the neurons is active. Practically, neurons in layer $\ell$,
Figure 5.11: An illustration of dropout during training for a network with $L = 3$ layers, $p = 5$ input features, and $q = 1$ output. In each iteration during training, the network is transformed to the network on the right, where the dropped-out units are randomly selected.
for $\ell = 0, \ldots, L-1$, have a specified probability $p^{[\ell]}_{\text{keep}} \in (0, 1]$, where if $p^{[\ell]}_{\text{keep}} = 1$ dropout does not affect the layer, and otherwise each neuron $i$ of the layer is "dropped out" with probability $1 - p^{[\ell]}_{\text{keep}}$. This is simply a zeroing out of the neuron activation $a^{[\ell]}_i$, as we illustrate in Figure 5.11.
In the forward pass, when we get to the neurons of layer $\ell + 1$, all the neurons in layer $\ell$ that were zeroed out do not participate in the computation. Specifically, the update for a neuron $j$ in the next layer, assuming a scalar activation function $\sigma(\cdot)$, becomes,
\[
a^{[\ell+1]}_j = \sigma\bigg( b^{[\ell+1]}_j + \sum_{i \text{ kept}} w^{[\ell+1]}_{i,j}\, a^{[\ell]}_i \bigg). \tag{5.45}
\]
If $p^{[\ell]}_{\text{keep}} = 1$, all neurons are kept and the summation is over $i = 1, \ldots, N_\ell$ as in (5.7); otherwise the summation is over the random subset of kept neurons. The Bernoulli coin flips determining which neurons are kept and which are not are all carried out independently within the layer, across layers, and throughout the training iterations.
In the backward pass, the effect of dropping out a neuron is evident via (5.33). When neuron $i$ is dropped out in layer $\ell$, the weights $w^{[\ell+1]}_{i,j}$ for all neurons $j = 1, \ldots, N_{\ell+1}$ are updated based on the gradient $[g^{[\ell]}_W]_{i,j}$, which is set at $0$. With a pure gradient descent optimizer this means that the weights $w^{[\ell+1]}_{i,j}$ are not updated at all during the given iteration, whereas with a momentum based optimizer such as ADAM it means that the descent step for those weights has a smaller magnitude; see Section ?? for a review of such optimizers.
At the end of training, even with dropout implemented, the trained model still has a complete set of weight matrices without any zeroed out elements, similar to the case in which we do not have dropout. Hence, to account for the fact that neuron $i$ in layer $\ell$ only took part in a proportion $p^{[\ell]}_{\text{keep}}$ of the iterations during training, when using the model in production (test time), we would like to use a weight matrix $\widetilde W^{[\ell+1]} = p^{[\ell]}_{\text{keep}} W^{[\ell+1]}$ in place of $W^{[\ell+1]}$. The rationale here is to have the production forward pass similar to the training pass. Namely, such a production forward pass has,
\[
a^{[\ell+1]}_j = \sigma\bigg( b^{[\ell+1]}_j + \sum_{i=1}^{N_\ell} p^{[\ell]}_{\text{keep}}\, w^{[\ell+1]}_{i,j}\, a^{[\ell]}_i \bigg), \tag{5.46}
\]
and this serves as an approximation to (5.45). To see the basis of this approximation, treat the summands in (5.45), $w^{[\ell+1]}_{i,j} a^{[\ell]}_i$, as identically distributed random variables, say with expected value $\mu_j$. Now observe that the expected value of the summation (over a random number of elements) in (5.45) and the expected value of the summation in (5.46) are both $N_\ell\, p^{[\ell]}_{\text{keep}}\, \mu_j$.
In practice, the training–production pair (5.45)–(5.46) is not typically used per se. The more practical alternative, instead of remembering $p^{[\ell]}_{\text{keep}}$ and deploying it with the production model, is to modify the training forward pass to use the reciprocal of $p^{[\ell]}_{\text{keep}}$ as a scaling factor of the weights. Namely, the training forward pass is,
\[
a^{[\ell+1]}_j = \sigma\bigg( b^{[\ell+1]}_j + \sum_{i \text{ kept}} \frac{1}{p^{[\ell]}_{\text{keep}}}\, w^{[\ell+1]}_{i,j}\, a^{[\ell]}_i \bigg). \tag{5.47}
\]
This form allows us to use the resulting model normally in production without having to take dropout into consideration at all. Namely, the production forward pass is,
\[
a^{[\ell+1]}_j = \sigma\bigg( b^{[\ell+1]}_j + \sum_{i=1}^{N_\ell} w^{[\ell+1]}_{i,j}\, a^{[\ell]}_i \bigg). \tag{5.48}
\]
With this setup, the expected values of the summations in (5.47) and (5.48) agree, and further, the forward step (5.48) agrees with the standard model as in (5.2), (5.4), and (5.7).
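The agreement between (5.47) and (5.48) can be illustrated with a minimal Python sketch (a hypothetical single-output layer; the identity activation is used by default so that the expectation argument applies exactly):

```python
import random

random.seed(3)

def dropout_forward(a_prev, W, b, p_keep, sigma=lambda u: u, train=True):
    """Inverted-dropout forward step (5.47); with train=False it is the
    plain production step (5.48). Here W[j] holds the weights feeding
    unit j of the next layer."""
    if train:
        # Keep each unit with probability p_keep and rescale the kept ones.
        a_prev = [a / p_keep if random.random() < p_keep else 0.0 for a in a_prev]
    return [sigma(bj + sum(w * ai for w, ai in zip(row, a_prev)))
            for row, bj in zip(W, b)]

# Hypothetical small layer: the production pass (5.48) matches the
# training pass (5.47) on average over many dropout draws.
a, W, b, p = [1.0, 2.0, 3.0], [[0.2, -0.1, 0.4]], [0.05], 0.8
prod = dropout_forward(a, W, b, p, train=False)[0]
avg = sum(dropout_forward(a, W, b, p, train=True)[0]
          for _ in range(20000)) / 20000
print(prod, round(avg, 2))  # both close to 1.25
```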
In practice, this simple and easy to implement idea of dropout has improved the performance of deep neural networks in many empirically tested cases. It is now an integral part of deep learning training. We now explore the idea a bit further through the viewpoint of ensemble methods.
Viewing Dropout as an Ensemble Approximation
Dropout can be viewed as an approximation of an ensemble method, a general concept from machine learning. Let us first present an overview of ensemble methods, or ensemble learning, and then argue why dropout serves as an approximation.

In general, when we seek a model $\hat y = f_\theta(x)$, we may use the same dataset to train multiple models that all try to achieve the same task. We may then combine the models into an ensemble (model). The latter is usually more accurate than each of the individual models.
Let us illustrate this in the case of a scalar output model. We can choose to use $M$ models and denote each model via $\hat y^{\{i\}} = f^{\{i\}}_{\theta^{\{i\}}}(x)$ for $i = 1, \ldots, M$, where $\theta^{\{i\}}$ is taken here as the set of parameters of the $i$-th model. The ensemble model on an input $x$ is then the average,
\[
f_\theta(x) = \frac{1}{M} \sum_{i=1}^{M} f^{\{i\}}_{\theta^{\{i\}}}(x), \qquad \text{where} \quad \theta = \big(\theta^{\{1\}}, \ldots, \theta^{\{M\}}\big). \tag{5.49}
\]
Clearly $f_\theta(\cdot)$ is more computationally costly since it requires $M$ models instead of a single model. This implies storing $M$ parameter sets, training $M$ times instead of once, and evaluating $M$ models in (5.49) instead of one during production. Nevertheless, there are benefits.
To illustrate the strength of this technique, assume for simplicity that the models are homogenous in nature and only differ due to randomness in the training process and not the model choice or hyper-parameters. In this case, for some fixed unseen input $\tilde x$ we may treat the output of model $i$, denoted $\hat y^{\{i\}}_{\theta^{\{i\}}}(\tilde x)$, as a random variable that is identical in distribution to every other model output $\hat y^{\{j\}}_{\theta^{\{j\}}}(\tilde x)$, yet generally not independent. We further assume that any pair of model outputs is identically distributed to any other pair. In this case we may denote,
\[
\mathrm{E}\Big[\hat y^{\{i\}}_{\theta^{\{i\}}}(\tilde x)\Big] = \mu, \qquad \mathrm{Var}\Big(\hat y^{\{i\}}_{\theta^{\{i\}}}(\tilde x)\Big) = \sigma^2, \qquad \text{and} \qquad \mathrm{cor}\Big(\hat y^{\{i\}}_{\theta^{\{i\}}}(\tilde x),\, \hat y^{\{j\}}_{\theta^{\{j\}}}(\tilde x)\Big) = \rho,
\]
respectively, where $\mathrm{cor}(\cdot, \cdot)$ is the correlation between two models $i \neq j$ and is assumed to be the same for all $i$, $j$ pairs. Such an assumption on the correlation also imposes⁷ a lower bound on the correlation,
\[
-\frac{1}{M-1} \le \rho, \qquad \text{or} \qquad 0 \le \rho + \frac{1-\rho}{M}. \tag{5.50}
\]
We can now evaluate the mean and variance of the ensemble model (5.49). Namely,
\[
\mathrm{E}\big[f_\theta(\tilde x)\big] = \frac{1}{M}\, \mathrm{E}\bigg[\sum_{i=1}^{M} f^{\{i\}}_{\theta^{\{i\}}}(\tilde x)\bigg] = \mu, \tag{5.51}
\]
and further, noting that $\rho\sigma^2$ is the covariance between any two models, we obtain,⁸
\[
\mathrm{Var}\big(f_\theta(\tilde x)\big) = \frac{1}{M^2}\, \mathrm{Var}\bigg(\sum_{i=1}^{M} f^{\{i\}}_{\theta^{\{i\}}}(\tilde x)\bigg) = \frac{1}{M^2} \Big( M \sigma^2 + M(M-1)\rho\sigma^2 \Big) = \bigg( \rho + \frac{1-\rho}{M} \bigg) \sigma^2. \tag{5.52}
\]
With (5.50) it is confirmed that, as required, the variance expression in (5.52) is non-negative. Further, we see that as the number of models in the ensemble, $M$, grows, the variance of the ensemble model converges to $\rho\sigma^2$. Since in addition to (5.50), $\rho \le 1$ and practically $\rho < 1$, this limiting variance is less than $\sigma^2$. For example, if $\rho = 0.5$, as $M$ grows the variance of the estimator drops by $50\%$.
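The variance expression (5.52) is easy to tabulate; the sketch below evaluates it for a few hypothetical values of $M$ and $\rho$:

```python
def ensemble_variance(M, rho, sigma2):
    """The ensemble variance (rho + (1 - rho) / M) * sigma^2 from (5.52)."""
    return (rho + (1.0 - rho) / M) * sigma2

# With rho = 0.5 the variance drops towards rho * sigma^2 = 0.5 as M grows,
print(ensemble_variance(1, 0.5, 1.0))        # 1.0 (a single model)
print(ensemble_variance(10, 0.5, 1.0))       # 0.55
print(ensemble_variance(10 ** 6, 0.5, 1.0))  # essentially 0.5
# and at the lower bound rho = -1/(M-1) of (5.50) the variance vanishes.
print(ensemble_variance(5, -0.25, 1.0))      # 0.0
```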
Putting the computational costs aside, these properties of ensemble models make them very attractive, because the bias does not change as shown in (5.51), but the variance decreases;
⁷ This may be shown based on the constraint that the covariance matrix is a positive semi-definite matrix.
⁸ The variance of a sum of random variables is the sum of the elements of the covariance matrix.
recall also the general discussion of the bias–variance tradeoff in Section ??. Nevertheless, deep learning models are not easily amenable to ensemble models in the form of (5.49) because the number of parameters and computational cost (both for training and production) is too high. Training a single model may sometimes take days, and the computational cost of a single evaluation $f^{\{i\}}_{\theta^{\{i\}}}(\tilde x)$ is also non-negligible. This is where dropout comes in.
We may loosely view dropout as an ensemble of $M$ models, where $M$ is the number of training iterations. In a practical training scenario, $M$ can be on the order of hundreds, thousands, or tens of thousands, depending on the number of epochs during training and the size of the mini-batch in comparison to the number of training samples. For example, if there are $n = 10^5$ training samples and the mini-batch size is $n_b = 100$, then each epoch has $1000$ iterations, and if we execute, say, $1000$ epochs of training, then $M = 10^6$. Then, since the production model involves weights accumulated throughout all $10^6$ iterations, we may loosely view the production model as an ensemble model and we may expect its variance to be reduced from $\sigma^2$ to approximately $\rho\sigma^2$. Now clearly, each of the $M$ iterations did not execute training of the model fully but rather only trained for a single iteration. Hence this ensemble view of dropout is merely a heuristic description.
Addition of Regularization Terms and Weight Decay
In addition to dropout, as already introduced in Section ??, the addition of a regularization term is another key approach to prevent overfitting and improve generalization performance. Augmenting the loss with a regularization term $R_\lambda(\theta)$ restricts the flexibility of the model, and this restriction is sometimes needed to prevent overfitting. In the context of deep learning, and especially when ridge regression style regularization is applied, this practice is sometimes called weight decay when considering gradient based optimization. We now elaborate.
Take the original loss function $C(\theta)$ and augment it to be $\tilde C(\theta) = C(\theta) + R_\lambda(\theta)$. In our discussion here, let us focus on the ridge regression type regularization with parameter $\lambda > 0$, and,
\[
R_\lambda(\theta) = \frac{\lambda}{2}\, R(\theta), \qquad \text{with} \quad R(\theta) = \|\theta\|^2 = \theta_1^2 + \ldots + \theta_d^2.
\]
Here, for notational simplicity, we consider all the $d$ parameters of the model as scalars, $\theta_i$ for $i = 1, \ldots, d$ (not to be confused with the notation used above for the parameters of a model as part of an ensemble). Nevertheless, note that typically regularization is only applied to the weight matrix parameters and not to the bias vectors. Further, we may even restrict regularization to certain layers and not others.
Now assume we execute basic gradient descent steps as in (??). With a learning rate $\alpha > 0$, the update at iteration $t$ is,
\[
\theta^{(t+1)} = \theta^{(t)} - \alpha\, \nabla \tilde C\big(\theta^{(t)}\big).
\]
In our ridge regression style penalty case we have $\nabla \tilde C(\theta) = \nabla C(\theta) + \lambda\theta$, and hence the gradient descent update can be represented as
\[
\theta^{(t+1)} = (1 - \alpha\lambda)\, \theta^{(t)} - \alpha\, \nabla C\big(\theta^{(t)}\big). \tag{5.53}
\]
Now the interesting aspect of (5.53), assuming that $\alpha\lambda < 2$, is that it involves shrinkage, or weight decay, directly on the parameters in addition to gradient based learning. That is, independently of the value of the gradient $\nabla C(\theta^{(t)})$, in every iteration, (5.53) continues to decay the parameters, each time multiplying the previous parameter by a factor $1 - \alpha\lambda$.
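The equivalence between a gradient step on the penalized objective and the decay form (5.53) can be checked directly; in the Python sketch below (with a hypothetical one-parameter quadratic loss), both recursions produce the same trajectory:

```python
def grad_C(theta):
    # Gradient of a hypothetical loss C(theta) = (theta - 3)^2 / 2.
    return theta - 3.0

alpha, lam = 0.1, 0.5
theta_reg = theta_decay = 10.0
for _ in range(200):
    # Explicit ridge penalty: a gradient step on C(theta) + (lam/2) * theta^2.
    theta_reg = theta_reg - alpha * (grad_C(theta_reg) + lam * theta_reg)
    # Weight decay form (5.53): shrink by (1 - alpha*lam), then a plain step on C.
    theta_decay = (1 - alpha * lam) * theta_decay - alpha * grad_C(theta_decay)

# Both converge to 2.0, the minimizer of (theta - 3)^2 / 2 + 0.25 * theta^2.
print(round(theta_reg, 6), round(theta_decay, 6))
```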
This weight decay phenomenon can then be extended algorithmically to enforce regularization not directly via the addition of a regularization term, but rather simply by augmenting the gradient descent updates to include weight decay. For example, we may consider popular gradient based algorithms such as ADAM in Section ?? and the other algorithms in Section ??, and in each case add an additional step which incorporates multiplying the weights by a constant less than, but close to, unity.
Notes and References
The origins of deep learning date back to the same era during which the digital computer materialised. In fact, early ideas of artificial neural networks were first introduced in 1943 with [39]. Then, in the post WWII era, Frank Rosenblatt's perceptron was the first working implementation of a neural network model, [49]. The perceptron and follow up work drew excitement in the 1960's, yet with the 1969 paper [41], there was formal analysis that shone a negative light on the limited capabilities of single layer neural networks, and this eventually resulted in a decline of interest, which is sometimes termed the "AI winter" of 1974–1980. Before this period there were even implementations of deep learning architectures with [28], and with [27], where 8 layers were implemented. In 1967 such multi-layer perceptrons were even trained using stochastic gradient descent, [2].
Some attribute the end of the 1974–1980 AI winter to a few developments that drew attention and resulted in impressive results. Some include Hopfield networks, which are recurrent in nature (see Chapter ??), and also the formalism and implementation of the backpropagation algorithm in 1981, [57]. In fact, early ideas of backpropagation can be attributed to a PhD thesis in 1970, [36] by S. Linnainmaa. See also our notes and references on early developments of reverse mode automatic differentiation at the end of Chapter ??. In parallel to this revival of interest in artificial intelligence in the early 1980's there were many developments that led to today's contemporary convolutional neural networks. See the notes and references at the end of Chapter ??. Historical accounts of deep learning can be found in [51] and [52], as well as a website by A. Kurenkov.⁹ The book [18] was a key reference on neural networks, summarizing developments up to the turn of the twenty first century. The 2015 Nature paper by Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, [33], captures more contemporary developments.
Positive results about the universal approximation ability of neural networks, such as Theorem 5.1 presented in this chapter, appear in [10] for a class of sigmoid activations and in [22] for a larger class of non-polynomial functions. With such results, it became evident that with a single hidden layer, neural networks are very expressive. Still, the practical insight to add more hidden layers to increase expressivity arose with the work of Geoffrey Hinton et al. in 2006 [20], Yoshua Bengio et al. in 2006 [6], and other influential works such as [5], [8], and [32]. The big explosion was in 2012 with AlexNet in [29].
Our example of a multiplication gate as in Figure 5.4 comes from [34]. Our motivation to use this example, and the analysis we present around Figure 5.5 dealing with the expressivity of deeper networks, is motivated by a 2017 talk of Niklas Wahlström.¹⁰ Beyond our elementary presentation, many researchers have tried to provide theoretical justifications for why deep neural networks are so powerful. Some justifications of the power of deep learning are in [53] using information theory, and in [9], [12], [47], and [55], using other mathematical reasoning approaches. See also [40] and [46] for surveys of theoretical analysis of neural networks.
In terms of activation functions, the sigmoid function was the most popular function in early neural architectures, with the tanh function serving as an alternative. The popularity of ReLU grew with [42], especially after its useful application in the AlexNet convolutional neural network [29]. Note that ReLU was first used in 1969; see [13]. Other activation functions such as leaky ReLU were introduced in [38], as well as parameterized activation functions such as PReLU, studied in [19] in order to mitigate vanishing and exploding gradient issues. A general survey of activation functions is in [11].
The backpropagation algorithm can be attributed to [50], yet has earlier origins with general backward mode automatic differentiation, surveyed in [4] (see also the notes of Chapter ??). Our presentation in Algorithm 5.2 is specific to the precise form of feedforward networks that we considered, yet variants can be implemented. Importantly, with the advent of automatic differentiation frameworks such as TensorFlow [1], followed by PyTorch [44], Flux.jl [24], JAX [7], and others, the use of backpropagation as part of deep learning has become standard. Such software frameworks automatically implement backpropagation as a special case of backward mode automatic differentiation where the computational graph is constructed, often automatically based on code. Early deep learning work, up to about 2014, did not have such software frameworks available, and hence "hand coding" of backpropagation was more delicate and time consuming. It is fair to say that with the proliferation of deep learning software frameworks, innovations in research and industry accelerated greatly. We
⁹ https://www.skynettoday.com/overviews/neural-net-history.
¹⁰ See https://www.it.uu.se/katalog/nikwa778/talks/DL_EM2017_online.pdf.
also note that properly considering matrix analysis aspects of backpropagation often requires matrix calculus, for which useful references are [16] and [45].
Extensive discussion of vanishing and exploding gradients is in [43]. Related aspects of training, including weight initialization, are discussed in chapter 7 of [48]. The Xavier initialization technique was introduced in [15]. Related initialization techniques and their analysis are studied in [19]. The idea of batch normalization was initially introduced in [26] and analyzed in [3] and [25]. Since then, batch normalization has been extended in several ways, including instance normalization in [56]. A survey is in [23]. Our analysis of backpropagation of batch normalization parameters is from [26].
Dropout was initially introduced in [21] and [54]. Analysis of dropout as an ensemble approximation is in [17]. See also [35] for an overview of ensemble methods. Further study of dropout and its implications is in [14]. Regularization in feedforward networks was reviewed in [31] and analysis of weight decay is in [30] and [37].
Bibliography
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
[2] S. Amari. A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, 1967.
[3] S. Arora, Z. Li, and K. Lyu. Theoretical analysis of auto rate-tuning by batch normalization. arXiv:1812.03981, 2018.
[4] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research, 2018.
[5] Y. Bengio. Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2009.
[6] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 2006.
[7] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: Composable Transformations of Python+NumPy Programs. http://github.com/google/jax, 2018.
[8] D. C. Cireșan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 2010.
[9] N. Cohen, O. Sharir, and A. Shashua. On the Expressive Power of Deep Learning: A Tensor Analysis, 2016.
[10] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 1989.
[11] S. R. Dubey, S. K. Singh, and B. B. Chaudhuri. Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing, 2022.
[12] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, 2016.
[13] K. Fukushima. Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics, 1969.
[14] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.
[15] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
[16] G. H. Golub and C. F. Van Loan. Matrix Computations. JHU Press, 2013.
[17] K. Hara, D. Saitoh, and H. Shouno. Analysis of dropout learning regarded as ensemble learning. In Artificial Neural Networks and Machine Learning – ICANN 2016: 25th International Conference on Artificial Neural Networks, Barcelona, Spain, September 6-9, 2016, Proceedings, Part II 25, 2016.
[18] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 1998.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
[20] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
[21] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
[22] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 1991.
[23] L. Huang, J. Qin, Y. Zhou, F. Zhu, L. Liu, and L. Shao. Normalization techniques in training DNNs: Methodology, analysis and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[24] M. Innes. Flux: Elegant machine learning with Julia. Journal of Open Source Software, 2018.
[25] S. Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. Advances in Neural Information Processing Systems, 2017.
[26] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
[27] A. G. Ivakhnenko. Polynomial theory of complex systems. IEEE Transactions on Systems, Man, and Cybernetics, 1971.
[28] A. G. Ivakhnenko and V. G. Lapa. Cybernetic Predicting Devices. Purdue Univ Lafayette Ind School of Electrical Engineering, appearing in The Defense Technical Information Center, 1966.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 2012.
[30] A. Krogh and J. Hertz. A Simple Weight Decay Can Improve Generalization. In Advances in Neural Information Processing Systems, 1991.
[31] J. Kukačka, V. Golkov, and D. Cremers. Regularization for deep learning: A taxonomy. arXiv:1710.10686, 2017.
[32] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 2009.
[33] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
[34] H. W. Lin, M. Tegmark, and D. Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 2017.
[35] A. Lindholm, N. Wahlström, F. Lindsten, and T. B. Schön. Machine Learning - A First Course for Engineers and Scientists. Cambridge University Press, 2022.
[36] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970.
[37] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017.
[38] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning, 2013.
[39] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 1943.
[40] G. Menghani. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Computing Surveys, 2023.
[41] M. Minsky and S. A. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969.
[42] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, 2010.
[43] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 2013.
[44] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
[45] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook. Technical University of Denmark, 2012.
[46] T. Poggio, A. Banburski, and Q. Liao. Theoretical issues in deep networks. Proceedings of the National Academy of Sciences, 2020.
[47] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and When Can Deep but Not Shallow Networks Avoid the Curse of Dimensionality: a Review, 2017.
[48] S. J. D. Prince. Understanding Deep Learning. MIT Press, 2023.
[49] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
[50] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986.
[51] J. Schmidhuber. Annotated history of modern AI and deep learning. arXiv:2212.11279, 2022.
[52] T. J. Sejnowski. The Deep Learning Revolution. MIT Press, 2018.
[53] R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv:1703.00810, 2017.
[54] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014.
[55] M. Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, 2016.
[56] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022, 2016.
[57] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In System Modeling and Optimization: Proceedings of the 10th IFIP Conference, 1981, 2005.