See the latest book content here.

4 General Fully Connected Neural Networks

Learning outcomes from this chapter

The full neural network
Forward, backward, chain-rule
Universal Approximation Theorems
Activation function and derivative
Matrix representation
Automatic differentiation
Dropout, Mini-batch and batch-normalization
Vanishing gradients and overcoming them
Basic implementation from first principles

4.1 The full neural network

4.1.1 The architecture of neural networks

The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error.

Let’s consider the following architecture.

The left most layer is called the input layer, and the neurons within the layer are called input neurons. A neuron of this layer is of a special kind since it has no input and it only outputs an $x_{j}$ value the $j$ th features.
The right most or output layer contains the output neurons (just one here). The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs. We do not observe (directly) what goes out of this layer.

Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons.

The design of the input and output layers in a network is often straightforward: as many neurons in the input layer than the number of explanatory/features variables; as many neurons in the output layer than the number of possible values for the response variable (if it is qualitative).

But there can be quite an art to the design of the hidden layers. Rectiﬁed linear units are an excellent default choice of hidden unit. Rectiﬁed linear units are easy to optimize because they are so similar to linear units.

When the output from one layer is used as input to the next layer (with no loops), we speak about feedforward neural networks. Other models are called recurrent neural networks.

Deep feedforward networks, also often called feedforward neural networks,or multilayer perceptrons (MLPs), are the quintessential deep learning models.

4.1.2 Goals

The goal of a feedforward network is to approximate some function $f^{*}$ . A feedforward network deﬁnes a mapping $y = f (x; θ)$ and learns the value of the parameters $θ$ that result in the best function approximation.
The function $f$ is composed of a chain of functions: $f = f^{(k)} (f^{(k - 1)} (\dots f^{(1)})),$ where $f^{(1)}$ is called the firstlayer, and so on. The depth of the network is $k$ . The ﬁnal layer of a feedforward network is called the output layer.

Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many unit that act in parallel, each representing a vector-to-scalar function.

Architecture of the network is an art:

how many layers the network should contain
how these layers should be connected to each other
how many units should be in each layer.

4.1.3 One Layer NN

In this shalow neural network, we have:

$x_{1}, x_{2}, x_{3}$ are inputs of a Neural Network. These elements are scalars and they are stacked vertically. This also represents an input layer.
Variables in a hidden layer are not seen in the input set.
The output layer consists of a single neuron only $\hat{y}$ is the output of the neural network.

Let precise some notation that we will use

${\begin{array}{rcl} z_{1}^{[1]} & = & {w_{1}^{[1]}}^{T} x + b_{1}^{[1]} a_{1}^{[1]} = σ (z_{1}^{[1]}) \\ z_{2}^{[1]} & = & {w_{2}^{[1]}}^{T} x + b_{2}^{[1]} a_{2}^{[1]} = σ (z_{2}^{[1]}) \\ z_{3}^{[1]} & = & {w_{3}^{[1]}}^{T} x + b_{3}^{[1]} a_{3}^{[1]} = σ (z_{3}^{[1]}) \\ z_{4}^{[1]} & = & {w_{4}^{[1]}}^{T} x + b_{4}^{[1]} a_{4}^{[1]} = σ (z_{4}^{[1]}) \end{array}$

where $x = (x_{1}, x_{2}, x_{3})^{T}$ and $w_{j}^{[1]} = (w_{j, 1}^{[1]}, w_{j, 2}^{[1]}, w_{j, 3}^{[1]}, w_{j, 4}^{[1]})^{T}$ (for $j = 1, \dots, 4$ ).

Then, the output layer is defined by:

$\begin{array}{rcl} z_{1}^{[2]} & = & {w_{1}^{[2]}}^{T} a^{[1]} + b_{1}^{[2]} a_{1}^{[2]} = σ (z_{1}^{[2]}) \end{array}$

where $a^{[1]} = (a_{1}^{[1]}, \dots, a_{4}^{[1]})^{T}$ and $w_{1}^{[2]} = (w_{1, 1}^{[2]}, w_{1, 2}^{[2]}, w_{1, 3}^{[2]}, w_{1, 4}^{[2]})^{T}$

One can use matrix representation for efficiency computation:

$\begin{matrix} (1) & [\begin{matrix} - & {w_{1}^{[1]}}^{T} & - \\ - & {w_{2}^{[1]}}^{T} & - \\ - & {w_{3}^{[1]}}^{T} & - \\ - & {w_{4}^{[1]}}^{T} & - \end{matrix}] [\begin{matrix} x_{1} \\ x_{2} \\ x_{3} \end{matrix}] + [\begin{matrix} b_{1}^{[1]} \\ b_{2}^{[1]} \\ b_{3}^{[1]} \\ b_{4}^{[1]} \end{matrix}] = [\begin{matrix} {w_{1}^{[1]}}^{T} x + b_{1}^{[1]} \\ {w_{2}^{[1]}}^{T} x + b_{2}^{[1]} \\ {w_{3}^{[1]}}^{T} x + b_{3}^{[1]} \\ {w_{4}^{[1]}}^{T} x + b_{4}^{[1]} \end{matrix}] = [\begin{matrix} z_{1}^{[1]} \\ z_{2}^{[1]} \\ z_{3}^{[1]} \\ z_{4}^{[1]} \end{matrix}] \end{matrix}$

and by defining
$W^{[1]} = [\begin{matrix} - & {w_{1}^{[1]}}^{T} & - \\ - & {w_{2}^{[1]}}^{T} & - \\ - & {w_{3}^{[1]}}^{T} & - \\ - & {w_{4}^{[1]}}^{T} & - \end{matrix}] b^{[1]} = [\begin{matrix} b_{1}^{[1]} \\ b_{2}^{[1]} \\ b_{3}^{[1]} \\ b_{4}^{[1]} \end{matrix}] z^{[1]} = [\begin{matrix} z_{1}^{[1]} \\ z_{2}^{[1]} \\ z_{3}^{[1]} \\ z_{4}^{[1]} \end{matrix}] a^{[1]} = [\begin{matrix} a_{1}^{[1]} \\ a_{2}^{[1]} \\ a_{3}^{[1]} \\ a_{4}^{[1]} \end{matrix}]$ we can write $z^{[1]} = W^{[1]} x + b^{[1]}$ and then by applying Element-wise Independent activation function $σ (\cdot)$ to the vector $z^{[1]}$ (meaning that $σ (\cdot)$ are applied independently to each element of the input vector $z^{[1]}$ ) we get:

$a^{[1]} = σ (z^{[1]}) .$ The output layer can be computed in the similar way:

$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$

where

$W^{[2]} = [\begin{matrix} w_{1, 1}^{[2]} \\ w_{1, 2}^{[2]} \\ w_{1, 3}^{[2]} \\ w_{1, 4}^{[2]} \end{matrix}] b^{[2]} = [\begin{matrix} b_{1}^{[2]} \\ b_{2}^{[2]} \\ b_{3}^{[2]} \\ b_{4}^{[2]} \end{matrix}]$

and finally:

$a^{[2]} = σ (z^{[2]}) ⟶ \hat{y}$

So far we have used:

The superscript number $^{[i]}$ for denoting the layer number and the subscript number $_{j}$ denotes the neuron number in a particular layer
$x$ is the input vector consisting of 3 features.
$W_{j}^{[i]}$ is the weight vector associated with neuron $j$ present in the layer $i$
$b_{j}^{[i]}$ is the bias associated with neuron j present in the layer i.
$z_{j}^{[i]}$ is the intermediate output associated with neuron $j$ present in the layer $i$ .
$a_{j}^{[i]}$ is the final output associated with neuron $j$ present in the layer $i$ .
As an example $σ (\cdot)$ is the sigmoid activation function

Using this notation the forward-propagation equations are:

${\begin{array}{rcl} z^{[1]} & = & W^{[1]} x + b^{[1]} \\ a^{[1]} & = & σ (z^{[1]}) \\ z^{[2]} & = & W^{[2]} a^{[1]} + b^{[2]} \\ \hat{y} & = & a^{[2]} = σ (z^{[2]}) \end{array}$

Why non-linear Activation is important. Consider this neural network without activation functions:

${\begin{array}{rcl} z^{[1]} & = & W^{[1] T} x + b^{[1]} \\ \hat{y} & = & z^{[2]} = W^{[2] T} z^{[1]} + b^{[2]} \end{array}$ Then, it follows

${\begin{array}{rcl} z^{[1]} & = & W^{[1]} x + b^{[1]} \\ \hat{y} & = & z^{[2]} = W^{[2]} W^{[1]} x + W^{[2]} b^{[1]} + b^{[2]} \\ \hat{y} & = & z^{[2]} = W_{n e w} x + b_{n e w} \end{array}$ The output is then a linear combination of a new weight matrix, input and a new bias. Thus if we use an identity activation function then the Neural Network will output linear output of the input. Indeed,a composition of two linear functions is a linear function and so we lose the representation power of a NN. However, a linear activation function is generally recommended and implemented in the output layer in case of regression.

4.1.4 N layers Neural Network

We can generalize this simple previous neural network to a Multi-layer fully-connected neural networks by sacking more layers get a deeper fully-connected neural network defining by the following equations:

${\begin{array}{rcl} a^{[1]} & = & g^{[1]} (W^{[1]} x + b^{[1]}) \\ a^{[2]} & = & g^{[2]} (W^{[2]} a^{[1]} + b^{[2]}) \\ \dots & = & \dots \\ a^{[r - 1]} & = & g^{[r - 1]} (W^{[r - 1]} a^{[r - 2]} + b^{[r - 1]}) \\ \hat{y} = a^{[r]} & = & g^{[r]} (W^{[r]} a^{[r - 1]} + b^{[r]}) \end{array}$

This neural network is composed of $r$ layers based on $r$ weight matrices $W^{[1]}, \dots, W^{[r]}$ and $r$ bias vectors $b^{[1]}, \dots, b^{[r]}$ . The $r$ activation functions noted $g^{[r]}$ might be different for each layer $r$ . The number of neurons in each layer could be also be not equal and will be noted $m_{r}$ . Using this notation:

The total number of neurons in the network: $m_{1} + \dots + m_{r}$
Total number of parameters in this network is $(d + 1) m_{1} + (m_{1} + 1) m_{2} + \dots + (m_{r - 1} + 1) * m_{r}$
$W^{[1]} \in ℜ^{m_{1} \times d}$ , $W^{[k]} \in ℜ^{m_{k} \times m_{k - 1}}$ ( $k = 2, \dots, r - 1$ ) and $W^{[r]} \in ℜ^{1 \times m_{r - 1}}$

How do we count layers in a neural network?:

When counting layers in a neural network we count hidden layers as well as the output layer, but we don’t count an input layer.

It is a four layer neural network with three hidden layers.

4.1.5 Vectorizing Across Multiple Training Examples

So far we have defined our Neural Network using only one inpute feature vector $x$ to generate prediction $\hat{y}$ $x ⟶ a^{[2]} = \hat{y}$ Let consider $m$ training example $x^{[1]}, \dots, x^{[m]}$ and we can use our NN to get for each training example $x^{[i]}$ a prediction

$x^{(i)} ⟶ a^{[2] (i)} = \hat{y} i = 1, \dots m$ One can use a loop for getting all the predictions. However, matrix representation will help us to overcome the computational issue of using loop strategy.

Let first define the matrix $X$ which every column is a feature vector for one training sample:

$X = [\begin{matrix} | & | & \dots & | \\ x^{(1)} & x^{(2)} & \dots & x^{(m)} \\ | & | & \dots & | \end{matrix}] .$

Then, we define the matrix $Z^{[1]}$ with columns $z^{[1] (1)} \dots z^{[1] (m)}$

$Z^{[1]} = [\begin{matrix} | & | & \dots & | \\ z^{[1] (1)} & z^{[1] (2)} & \dots & z^{[1] (m)} \\ | & | & \dots & | \end{matrix}] .$ The activation matrix for the first hidden layer $A^{[1]}$ is defined similary by:

$A^{[1]} = [\begin{matrix} | & | & \dots & | \\ a^{[1] (1)} & a^{[1] (2)} & \dots & a^{[1] (m)} \\ | & | & \dots & | \end{matrix}],$ where for example the element in the first row and in the first column of a matrix $A^{[1]}$ is an activation of the first hidden unit and the first training example. For clarification:

$A^{[1]} = [\begin{matrix} 1^{s t} u n i t o f 1. t r . e x a m p l e & 1^{s t} u n i t o f 2^{n d} t r . e x a m p l e & \dots & 1^{s t} u n i t o f m^{t h} . t r . e x a m p l e \\ 2^{n d} u n i t o f 1^{s t} t r . e x a m p l e & 2^{n d} u n i t o f 2^{n d} t r . e x a m p l e & \dots & 2^{n d} u n i t o f m^{t h} t r . e x a m p l e \\ t h e l a s t u n i t o f 1^{s t} t r . e x a m p l e & t h e l a s t u n i t o f 2^{n d} t r . e x a m p l e & \dots & t h e l a s t u n i t o f m^{t h} t r . e x a m p l e \end{matrix}]$ Based on this matrix representation we get:

${\begin{array}{rcl} Z^{[1]} & = & W^{[1]} X + b^{[1]} \\ A^{[1]} & = & σ (Z^{[2]}) \\ Z^{[2]} & = & W^{[2]} A^{[1]} + b^{[2]} \\ A^{[2]} & = & σ (Z^{[2]}) \end{array}$

One can notice that we add $b^{[1]} \in ℜ^{4 \times 1}$ to $W^{[1]} X \in ℜ^{4 \times m}$ , which is strictly not allowed following the rules of linear algebra. In practice however, this addition is performed using broadcasting. By defining

${\tilde{b}}^{[1]} = [\begin{matrix} | & | & \dots & | \\ b^{[1]} & b^{[1]} & \dots & b^{[1]} \\ | & | & \dots & | \end{matrix}] .$ we can compute:

$Z^{[1]} = W^{[1]} X + {\tilde{b}}^{[1]}$

4.2 Universal Approximation Theorems

Leshno and Schocken (1991) has showed that a neural network with one (possibly huge) hidden layer can uniformly approximate any continuous function on a compact set if the activation function is not a polynomial (i.e. tanh, logistic, and ReLU all work, as do sin,cos, exp, etc.).

Let $ψ (\cdot)$ be any non-polynomial function (an activation function).
Let define $f : K ⟶ ℜ$ be any continuous function on a compact set $K \subset ℜ^{m}$

Then $\forall ϵ > 0$ , there exists an integer $N$ (the number of hidden units), and parameters $v_{i}$ , $b_{i}$ $\in ℜ$ such that the function $F (x) = \sum_{i = 1}^{N} v_{i} ψ (w_{i}^{T} x + b_{i})$

satisfies $| F (x) - f (x) | > ϵ$ for all $x \in K$ .

Leshno and Schocken (1991) has noted that this doesn’t work without the bias term $b_{i}$ . They call it the threshold term.

4.3 Activation function and derivative

Which activation functions to use in the hidden layers ?
Which activation functions to use in the output unit of the Neural Network ?

Let first create a simple plot function for each activation function and its derivative.

library(ggplot2)
f <- function(x) {x}
plot_activation_function <- function(f, title, range){
  ggplot(data.frame(x=range), mapping=aes(x=x)) + 
    geom_hline(yintercept=0, color='red', alpha=1/4) +
    geom_vline(xintercept=0, color='red', alpha=1/4) +
    stat_function(fun=f, colour = "dodgerblue3") +
    ggtitle(title) +
    scale_x_continuous(name='x') +
    scale_y_continuous(name='') +
    theme(plot.title = element_text(hjust = 0.5))
}

4.3.1 Sigmoid function

$σ (z) = g (z) = \frac{1}{1 + e^{- z}}$

Its derivative :

$\frac{d}{d z} σ (z) = σ (z) (1 - σ (z))$

f <- function(x){1 / (1 + exp(-x))}
df <- function(x){f(x)*(1-f(x))}
plotf <- plot_activation_function(f, 'Sigmoid', c(-4,4))
plotdf <- plot_activation_function(df, 'Derivative', c(-4,4))

library(ggpubr)
ggarrange(plotf,plotdf,nrow=1,ncol=2)

Figure 4.1: Sigmoid and derivative function

4.3.2 ReLU function

A recent invention which stands for Rectified Linear Units.

$R e L U (z) = max (0, 𝑧)$

Despite its name and appearance, it’s not linear and provides the same benefits as Sigmoid but with better performance.

Its derivative :

$\frac{d}{d z} R e L U (z) = {\begin{matrix} 1 i f z > 0 \\ 0 i f z < 0 \\ u n d e f i n e d i f z = 0 \end{matrix}$

rec_lu_func <- function(x){ ifelse(x < 0 , 0, x )}
drec_lu_func <- function(x){ ifelse(x < 0 , 0, 1)}
plotf <- plot_activation_function(rec_lu_func, 'ReLU', c(-4,4))
plotdf <- plot_activation_function(drec_lu_func, 'Derivative', c(-4,4))

ggarrange(plotf,plotdf,nrow=1,ncol=2)

Figure 4.2: ReLU and derivative function

4.3.3 LeakyRelu function

Leaky Relu is a variant of ReLU. Instead of being 0 when $z < 0$ , a leaky ReLU allows a small, non-zero, constant gradient $α$ (usually, $α = 0.01$ ). However, the consistency of the benefit across tasks is presently unclear.

$L e a k l y R e L U (z) = max (α z, 𝑧)$

Its derivative :

$\frac{d}{d z} L e a k l y R e L U (z) = {\begin{cases} α & i f z < 0 \\ 1 & i f z \geq 0 \end{cases}$

rec_lu_func <- function(x){ ifelse(x < 0 , 0.01*x, x )}
drec_lu_func <- function(x){ ifelse(x < 0 , 0.01, 1)}
plotf <- plot_activation_function(rec_lu_func, 'LeaklyReLU', c(-4,4))
plotdf <- plot_activation_function(drec_lu_func, 'Derivative', c(-4,4))

ggarrange(plotf,plotdf,nrow=1,ncol=2)

Figure 4.3: LeaklyReLU and derivative function

4.3.4 Tanh function

Tanh squashes a real-valued number to the range $[- 1, 1]$ . It’s non-linear. But unlike Sigmoid, its output is zero-centered. Therefore, in practice the tanh non-linearity is always preferred to the sigmoid nonlinearity.

$t a n h (z) = \frac{e^{z} - e^{- z}}{e^{z} + e^{- z}}$

Its derivative :

$\frac{d}{d z} t a n h (z) = 1 - t a n h (z)^{2}$

tanh_func <- function(x){tanh(x)}
dtanh_func <- function(x){1-(tanh(x))**2}
plotf <- plot_activation_function(tanh_func, 'TanH', c(-4,4))
plotdf <- plot_activation_function(dtanh_func, 'Derivative', c(-4,4))

ggarrange(plotf,plotdf,nrow=1,ncol=2)

Figure 4.4: Tanh and derivative function

4.4 Forward, backward and chain-rule

4.4.1 Simple Case with two layers

Let derive the backpropagation algorithm for the following neural network by considering ReLu activations function for all the neurons of the first Layer and the identity function for the output layer.

Remind the main equations:

${\begin{array}{rcl} z^{[1]} & = & W^{[1]} x + b^{[1]} \\ a^{[1]} & = & R e L u (Z^{[1]}) \\ z^{[2]} & = & W^{[2]} a^{[1]} + b^{[2]} \\ \hat{y} = a^{[2]} & = & g (z^{[2]}) \end{array}$

Let consider a regression framework and consider the identity function for the output activation function: $g (x) = x$ . In this framework it is natural to use the lesat square cost function:

$J = \frac{1}{2} (y - \hat{y})^{2}$ Let precise some dimension of our objects:

$d$ is the number of features and $x \in ℜ^{d}$
$m_{1}$ number of neurons in layer 1 and so $W^{[1]} \in ℜ^{m_{1} \times d}$
$m_{2} = 1$ number of neurons in output layer and so $W^{[2]} \in ℜ^{1 \times m_{1}}$

Computing derivatives using Chain Rule using Backward strategy:

-(1) Compute $\frac{\partial J}{\partial W_{i}^{[2]}}$ then get vectorize version $\frac{\partial J}{\partial W^{[2]}}$

-(2) Compute $\frac{\partial J}{\partial W_{i j}^{[1]}}$ then get vectorize version $\frac{\partial J}{\partial W^{[1]}}$

-(3) Compute $\frac{\partial J}{\partial Z_{i}^{[1]}}$ then get vectorize version $\frac{\partial J}{\partial Z^{[1]}}$

-(4) Compute $\frac{\partial J}{\partial a_{i}^{[1]}}$ then get vectorize version $\frac{\partial J}{\partial a^{[1]}}$

Step 1:

$\begin{array}{rcl} \frac{\partial J}{\partial W_{i}^{[2]}} & = & \frac{\partial J}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial W_{i}^{[2]}} \\ = & (\hat{y} - y) \frac{\partial \hat{y}}{\partial W_{i}^{[2]}} \\ = & (\hat{y} - y) a_{i}^{[1]} \end{array}$ where $\hat{y} = \sum_{i = 1}^{m_{1}} W_{i}^{[2]} a_{i}^{[1]} + b^{[2]}$ $\frac{\partial J}{\partial W^{[2]}} = (\hat{y} - y) a^{[1] T} \in ℜ^{1 \times m_{1}}$

Similarly:

$\frac{\partial J}{\partial b^{[2]}} = (\hat{y} - y) \in ℜ$

Step 2:

$\begin{array}{rcl} \frac{\partial J}{\partial W_{i j}^{[1]}} & = & \frac{\partial J}{\partial z_{i}^{[1]}} \frac{\partial z_{i}^{[1]}}{\partial W_{i j}^{[1]}} \\ = & \frac{\partial J}{\partial z_{i}^{[1]}} x_{j} \end{array}$ where $z_{i}^{[1]} = \sum_{k = d}^{m_{1}} W_{i k}^{[1]} x_{k} + b_{i}^{[1]}$

$\frac{\partial J}{\partial W^{[1]}} = \frac{\partial J}{\partial z^{[1]}} x^{T} \in ℜ^{m_{1} \times d}$

Step 3:

$\begin{array}{rcl} \frac{\partial J}{\partial z_{i}^{[1]}} & = & \frac{\partial J}{\partial a_{i}^{[1]}} \frac{\partial a_{i}^{[1]}}{\partial z_{i}^{[1]}} \\ = & \frac{\partial J}{\partial a_{i}^{[1]}} 1_{{z_{i}^{[1]} \geq 0}} \end{array}$

$\frac{\partial J}{\partial z^{[1]}} = \frac{\partial J}{\partial a^{[1]}} ⊙ σ^{^{'}} (z)$ where $σ^{^{'}} (\cdot)$ is the element-wise derivative of the activation function $σ$ (here $R e L U$ function}) and $⊙$ denotes the element-wise product of two vectors of the same dimensionality.

Step 4:

$\begin{array}{rcl} \frac{\partial J}{\partial a_{i}^{[1]}} & = & \frac{\partial J}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a_{i}^{[1]}} \\ = & (\hat{y} - y) w_{i}^{[2]} \end{array}$

where $\hat{y} = \sum_{i = 1}^{m_{1}} W_{i}^{[2]} a_{i}^{[1]} + b^{[2]}$

$\frac{\partial J}{\partial a^{[1]}} = (\hat{y} - y) W^{[2] T}$

These steps enable us to define the Back-propagation algorithm for this simple Neural Network.

4.4.2 Back-propagation Algorithm

Algorithm : Back-propagation for two-layer neural netwoks

Compute the values of $z^{[1]}$ , $a^{[1]}$ and $\hat{y}$ using forward pass
Compute

$\begin{array}{rcl} δ^{[2]} & = & \frac{\partial J}{\partial \hat{y}} = (\hat{y} - y) \\ δ^{[1]} & = & \frac{\partial J}{\partial Z^{[1]}} = (W^{[2] T} (\hat{y} - y)) ⊙ 1_{{z^{[1]} \geq 0}} \end{array}$

Compute

$\begin{array}{rcl} \frac{\partial J}{\partial W^{[2]}} & = & δ^{[2]} a^{[1] T} \\ \frac{\partial J}{\partial b^{[2]}} & = & δ^{[2]} \\ \frac{\partial J}{\partial W^{[1]}} & = & δ^{[1]} x^{T} \\ \frac{\partial J}{\partial b^{[1]}} & = & δ^{[1]} \end{array}$

4.4.3 General Case with r layers

Consider the general case of a fully-connected Multi-layer networks defining by the following equations:

${\begin{array}{rcl} a^{[0]} & = & x \\ z^{[1]} & = & W^{[1]} a^{[0]} + b^{[1]} \\ a^{[1]} & = & R e L u (Z^{[1]}) \\ z^{[2]} & = & W^{[2]} a^{[1]} + b^{[2]} \\ a^{[2]} & = & R e L u (Z^{[2]}) \\ \dots & = & \dots \\ z^{[r - 1]} & = & W^{[r - 1]} a^{[r - 2]} + b^{[r - 1]} \\ a^{[r - 1]} & = & R e L u (Z^{[r - 1]}) \\ z^{[r]} & = & W^{[r]} a^{[r - 1]} + b^{[r]} \\ \hat{y} = a^{[r]} & = & z^{[r]} \\ J & = & \frac{1}{2} (y - \hat{y})^{2} \end{array}$

4.4.4 Back-propagation multi-layer

First, note that the updated parameter of interests (the weights and bias) depend of intermediate following intermediate variables: $z^{[k]} = W^{[k]} a^{[k - 1]} + b^{[k]}, k \in {1, \dots, r}$

Then the cost function depends of weights and bias via the intermediate variables $z^{[k]}$ . Using chain rule we get

${\begin{array}{rcl} \frac{\partial J}{\partial W^{[k]}} & = & \frac{\partial J}{\partial z^{[k]}} a^{[k - 1] T} \\ \frac{\partial J}{\partial b^{[k]}} & = & \frac{\partial J}{\partial z^{[k]}} \end{array}$

Using similar notation as our simple NN, we define $δ^{[k]} = \frac{\partial J}{\partial z^{[k]}}$ and compute it in a backward manner from $k = r$ to 1.

k=r:

$δ^{[r]} = \frac{\partial J}{\partial z^{[r]}} = (z^{[r]} - y)$

k<r:

$δ^{[k]} = \frac{\partial J}{\partial z^{[k]}} = \frac{\partial J}{\partial a^{[k]}} ⊙ {ReLU}^{^{'}} (z^{[k]})$ By noting that $z^{[k + 1]} = W^{[k + 1]} a^{[k]} + b^{[k + 1]}$ and assuming we have computed $δ^{[k + 1]}$ then we try to compute $δ^{[k]}$ . First note that

$\frac{\partial J}{\partial a^{[k]}} = W^{[k + 1] T} \frac{\partial J}{\partial z^{[k + 1]}}$ then we get

$\begin{array}{rcl} δ^{[k]} & = & (W^{[k + 1] T} \frac{\partial J}{\partial z^{[k + 1]}}) ⊙ {ReLU}^{^{'}} (z^{[k]}) \\ = & (W^{[k + 1] T} δ^{[k + 1]}) ⊙ {ReLU}^{^{'}} (z^{[k]}) \end{array}$

Algorithm : Back-propagation for multi-layer

Compute the values of $z^{[k]}$ , $a^{[k]}$ for $k = 1, \dots, r$ and $J$ using forward pass
for $k = r$ to $1$ do

if $k = r$ then compute $δ^{[r]} = \frac{\partial J}{\partial z^{[r]}}$

else compute $δ^{[k]} = \frac{\partial J}{\partial z^{[k]}} = (W^{[k + 1] T} δ^{[k + 1]}) ⊙ {ReLU}^{^{'}} (z^{[k]})$

Compute

$\begin{array}{rcl} \frac{\partial J}{\partial W^{[k]}} & = & δ^{[k]} a^{[k - 1] T} \\ \frac{\partial J}{\partial b^{[k]}} & = & δ^{[k]} \end{array}$

4.5 Weight Initialization

Using our notation, let consider a neural network with $L$ layer ( $L - 1$ hidden layers and $1$ input and output layer each). The parameter are updated during the training step and are stored in:

$W^{[l]}$ weight matrix of dimension $m_{l} \times m_{l - 1}$ ( $m_{l}$ is the size of the layer $l$ )
$b^{[l]}$ bias vectors odf dimension $m_{l} \times 1$

Updated biases and weights

${\begin{array}{rcl} b^{[l]} & := & b^{[l]} - α \frac{\partial J}{\partial b^{[l]}} \\ W^{[l]} & := & W^{[l]} - α \frac{\partial J}{\partial W^{[l]}} \end{array}$

where the gradients are computed using backpropagation technique

4.5.1 General practice

The biases are initialized with 0 and weights are initialized with random numbers.

What if weights are initialized with 0? or even same constant value

Let consider a neural network with two hidden units and initialize the biases to 0 and all the weights to a constant value $γ$ .

By propagating an input sample $(x_{1}, x_{2})$ the output of both hidden units will be the same: $R e L U (γ x_{1} + γ x_{2})$ . This will have identical influence on the cost function, which will lead to identical gradients. Thus, this makes hidden units symmetric and the neural network will perform very poorly.

A nice animation post on the influence of the weight initialization could be found here Initializing neural networks

Random initialization enables us to break the symmetry. However, initializing much high or low value can result in slower optimization. In practice the natural the weights are randomly gerenated from standard normal distribution. However, while working with a (deep) network can potentially lead to 2 issues: vanishing gradients or exploding gradients.

Vanishing gradients In case of deep networks, for any activation function, $| \frac{\partial J}{\partial W^{[l]}} |$ will get smaller and smaller as we go backwards with every layer during back propagation. The earlier layers are the slowest to train in such a case. Thus, the update is minor and results in slower convergence. This makes the optimization of the loss function slow. In the worst case, this may completely stop the neural network from training further.

More specifically, in case of $s i g m o i d (z)$ and $t a n h (z)$ , if your weights are large, then the gradient will be vanishingly small, effectively preventing the weights from changing their value. This is because abs(dW) will increase very slightly or possibly get smaller and smaller every iteration. With $R e L U (z)$ vanishing gradients are generally not a problem as the gradient is 0 for negative (and zero) inputs and 1 for positive inputs

Exploding gradients This is the exact opposite of vanishing gradients. Consider you have non-negative and large weights. When these weights are multiplied along the layers, they cause a large change in the cost. Thus, the gradients are also going to be large. This means that the changes in $W^{[l]}$ will be in huge steps. This may result in oscillating around the minima or even overshooting the optimum again and again and the model will never learn!

Another impact of exploding gradients is that huge values of the gradients may cause number overflow resulting in incorrect computations or introductions of NaN’s. This might also lead to the loss taking the value NaN

4.5.2 Solution

For networks that are not too deep, ReLU or leaky RELU activation functions are exploited, as they are relatively robust to the vanishing/exploding gradient issue. In the case of leaky RELU’s, they never have 0 gradient. Thus they never die and training continues.

For deep networks,heuristic to initialize the weights depending on the non-linear activation function are generally used. The most common practice is to draw the element of the matrix $W^{[l]}$ from normal distribution with variance $k / m_{l - 1}$ , where $k$ depends on the activation function. While these heuristics do not completely solve the exploding/vanishing gradients issue, they help mitigate it to a great extent.

for $R e L U$ activation: $k = 2$
for $t a n h$ activation: $k = 1$ . The heuristic is called Xavier initialization. It is similar to the previous one, except that $k$ is 1 instead of 2.
Another commonly used heuristic is to draw from normal distribution with variance $2 / (m_{l - 1} + m_{l})$

Note also that Gradient Clipping is another way of dealing with the exploding gradient problem.

Finally, the bias terms can be safely initialized to 0 as the gradients with respect to bias depend only on the linear activation of that layer, and not on the gradients of the deeper layers.

4.6 Batch-normalization and mini-batch

4.6.1 Normalizing inputs to speed up learning

Another common technique in Deep Learning is to normalize the input data. It often leads to a better performance because gradient descent converges faster after normalization.

Let consider two inputs $x_{1}$ and $x_{2}$ .

- Compute mean and variance of training data

$μ_{j} = \frac{1}{m} \sum_{i = 1}^{m} x_{j}^{[i]}, σ_{j}^{2} = \frac{1}{m} \sum_{i = 1}^{m} (x_{j}^{[i]} - μ_{j})^{2}$

Normalize

${\tilde{x}}_{j}^{[i]} = \frac{x_{j}^{[i]} - μ_{j}}{σ_{j}}$

Note that we also have to use $μ_{j}$ and $σ_{j}^{2}$ to normalize validation/test data.

Why normalized the inputs data

For example the inputs $x_{1}$ and $x_{2}$ are not normalized then the corresponding cost function depending of the parameters $w_{1}$ and $w_{2}$ can be considered as unormalized. Thus it leads to slower convergence.

4.6.2 Batch Normalization

Not only the input data can be normalized. The inputs of each layer are generally normalized too. This technique is called Batch Normalization (BN). It has been introduced in 2015 and it is one of the most efficient techniques for training deep neural networks. Batch normalization enables to use higher learning rate without getting issues with vanishing or exploding gradients. Moreover, it has also a slight regularization effect.

4.6.3 Batch Normalization on each mini-batch

For every mini-batch, do the following:

Compute mean and variance for every unit $j$ in all layers $l$

$μ_{j}^{(l)} = \frac{1}{m_{b a t c h}} \sum_{i = 1}^{m_{b a t c h}} z_{j}^{(l) [i]}, (σ_{j}^{(l)})^{2} = \frac{1}{m} \sum_{i = 1}^{m} (z_{j}^{(l) [i]} - μ_{j}^{(l)})^{2}$ where $z_{j}^{(l) [i]}$ is the hidden unit before the activation

Normalize every unit $j$ in all layers $l$

${\bar{z}}_{j}^{[i]} = \frac{z_{j}^{(l) [i]} - μ_{j}^{(l)}}{\sqrt{(σ_{j}^{(l)})^{2} + ϵ}}$ where $ϵ$ is used for numerical stability.

Scale and shift every unit

${\tilde{z}}_{j}^{[i]} = γ_{j}^{(l)} {\bar{z}}_{j}^{[i]} + β_{j}^{(l)}$ where $γ_{l}^{(l}$ and $β_{j}^{(l)}$ are learned parameters ( called batch normalization layer ) that allow the new variable to have any mean and standard deviation.

Why batch normalization ?

The motivation of the prinsep paper is based on internal covariante shift (see for more details Ioffe, Sergey, and Christian Szegedy. ``Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.’’ International Conference on Machine Learning. 2015. arxiv version
It has been recently shown that it makes the loss landscape more smooth and easier to optimize (see Santurkar, Shibani, et al. ”How does batch normalization help optimization?.” Advances in Neural Information Processing Systems. 2018) arxiv version.

4.7 Dropout

Dropout is a popular and efficient regularization technique.

Dropout is a regularization technique where we randomly drop units.
The term dropout refers to dropping out units (hidden and visible) in a neural network.
By dropping a unit out, meaning temporarily removed it from the network, along with all its incoming and outgoing connections.
The choice of which units to drop is random.

4.7.1 Main idea

Consider the following network to be trained

1st iteration: Keep and update each unit with probability $p$ , drop remanding ones
2st iteration: Keep and update another random selection of units, drop remanding units.
Continue in the same manner.
Use all units. Weight multiplied by $p$ .

4.7.2 Why does this avoid overfitting?

In fact, Dropout can be viewed as an ensemble member with two clever approximations.

For a neural network with $M$ units there are $2^{M}$ possible thinned neural networks. Consider this as our ensamble.

Approximation 1: At each iteration we sample one ensemble member and update it. Most of the networks will never be updated since $2^{M} >>$ the number of iterations.

At test time we would need to average over all $2^{M}$ which is not feasible when $2^{M}$ is huge.

Approximation 2: Instead, at test time we evaluate the full neural network where the weight are multiplied by $p$ .

Figure 4.5: Dropout

It has been empirically shown that this is a good approximation of the average of all ensemble members.

4.8 Basic implementation from first principles

4.8.1 Toy Example: handwritten digits

A simple network to classify handwritten digits

We want to classify images ( $28 \times 28 = 784$ pixels) such as these

into 10 classes (0 to 9)

To recognize individual digits we will use a three-layer neural network:

The input pixels are greyscale, with a value of 0.0 representing white, a value of 1.0 representing black, and in between values representing gradually darkening shades of grey.

We number the output neurons from 0 through 9, and figure out which neuron has the highest activation value.

Page built: 2021-03-04 using R version 4.0.3 (2020-10-10)

See the latest book content here.

4 General Fully Connected Neural Networks

4.1 The full neural networkCopy link

4.1.1 The architecture of neural networksCopy link

4.1.2 GoalsCopy link

4.1.3 One Layer NNCopy link

4.1.4 N layers Neural NetworkCopy link

4.1.5 Vectorizing Across Multiple Training ExamplesCopy link

4.2 Universal Approximation TheoremsCopy link

4.3 Activation function and derivativeCopy link

4.3.1 Sigmoid functionCopy link

4.3.2 ReLU functionCopy link

4.3.3 LeakyRelu functionCopy link

4.3.4 Tanh functionCopy link

4.4 Forward, backward and chain-ruleCopy link

4.4.1 Simple Case with two layersCopy link

4.4.2 Back-propagation AlgorithmCopy link

4.4.3 General Case with r layersCopy link

4.4.4 Back-propagation multi-layerCopy link

4.5 Weight InitializationCopy link

4.5.1 General practiceCopy link

4.5.2 SolutionCopy link

4.6 Batch-normalization and mini-batchCopy link

4.6.1 Normalizing inputs to speed up learningCopy link

4.6.2 Batch NormalizationCopy link

4.6.3 Batch Normalization on each mini-batchCopy link

4.7 DropoutCopy link

4.7.1 Main ideaCopy link

4.7.2 Why does this avoid overfitting?Copy link

4.8 Basic implementation from first principlesCopy link

4.8.1 Toy Example: handwritten digitsCopy link

4.1 The full neural network

4.1.1 The architecture of neural networks

4.1.2 Goals

4.1.3 One Layer NN

4.1.4 N layers Neural Network

4.1.5 Vectorizing Across Multiple Training Examples

4.2 Universal Approximation Theorems

4.3 Activation function and derivative

4.3.1 Sigmoid function

4.3.2 ReLU function

4.3.3 LeakyRelu function

4.3.4 Tanh function

4.4 Forward, backward and chain-rule

4.4.1 Simple Case with two layers

4.4.2 Back-propagation Algorithm

4.4.3 General Case with r layers

4.4.4 Back-propagation multi-layer

4.5 Weight Initialization

4.5.1 General practice

4.5.2 Solution

4.6 Batch-normalization and mini-batch

4.6.1 Normalizing inputs to speed up learning

4.6.2 Batch Normalization

4.6.3 Batch Normalization on each mini-batch

4.7 Dropout

4.7.1 Main idea

4.7.2 Why does this avoid overfitting?

4.8 Basic implementation from first principles

4.8.1 Toy Example: handwritten digits