
5.7 Mitigating Overfitting with Dropout and Regularization
Notes and References
The origins of deep learning date back to the same era in which the digital computer materialised. Early ideas of artificial neural networks were first introduced in 1943 with [39]. Then, in the post-WWII era, Frank Rosenblatt's perceptron became the first working implementation of a neural network model [49]. The perceptron and follow-up work drew excitement in the 1960s, yet the formal analysis in the 1969 paper [41] shone a negative light on the limited capabilities of single layer neural networks, and this eventually resulted in a decline of interest, sometimes termed the "AI winter" of 1974–1980. Before this period there were already implementations of deep learning architectures, with [28], and with [27], where 8 layers were implemented. In 1967 such multi-layer perceptrons were even trained using stochastic gradient descent [2].
Some attribute the end of the 1974–1980 AI winter to a few developments that drew attention and yielded impressive results. These include Hopfield networks, which are recurrent in nature (see Chapter ??), as well as the formalization and implementation of the backpropagation algorithm in 1981 [57]. In fact, early ideas of backpropagation can be attributed to the 1970 PhD thesis of S. Linnainmaa [36]. See also our notes and references on early developments of reverse mode automatic differentiation at the end of Chapter ??. In parallel to this revival of interest in artificial intelligence in the early 1980s, there were many developments that led to today's contemporary convolutional neural networks; see the notes and references at the end of Chapter ??. Historical accounts of deep learning can be found in [51] and [52], as well as on a website by A. Kurenkov (https://www.skynettoday.com/overviews/neural-net-history). The book [18] was a key reference on neural networks, summarizing developments up to the turn of the twenty-first century. The 2015 Nature paper by Yann LeCun, Yoshua Bengio, and Geoffrey Hinton [33] captures more contemporary developments.
Positive results about the universal approximation ability of neural networks, such as Theorem 5.1 presented in this chapter, appear in [10] for a class of sigmoid activations and in [22] for a larger class of non-polynomial activation functions. With such results, it became evident that even with a single hidden layer, neural networks are very expressive. Still, the practical insight that adding more hidden layers increases expressivity arose with the work of Geoffrey Hinton et al. in 2006 [20], Yoshua Bengio et al. in 2006 [6], and other influential works such as [5], [8], and [32]. The big explosion came in 2012 with AlexNet [29].
Our example of a multiplication gate as in Figure 5.4 comes from [34]. Our use of this example, and the analysis we present around Figure 5.5 dealing with the expressivity of deeper networks, is motivated by a 2017 talk of Niklas Wahlström (https://www.it.uu.se/katalog/nikwa778/talks/DL_EM2017_online.pdf). Beyond our elementary presentation, many researchers have tried to provide theoretical justifications for why deep neural networks are so powerful. Some justifications of the power of deep learning appear in [53], using information theory, and in [9], [12], [47], and [55], using other mathematical reasoning approaches. See also [40] and [46] for surveys of theoretical analysis of neural networks.
In terms of activation functions, the sigmoid function was the most popular in early neural architectures, with the tanh function serving as an alternative. The popularity of ReLU grew with [42], especially after its successful application in the AlexNet convolutional neural network [29]. Note that ReLU was first used in 1969; see [13]. Other activation functions such as the leaky ReLU were introduced in [38], as well as parameterized activation functions such as the PReLU studied in [19], in order to mitigate vanishing and exploding gradient issues. A general survey of activation functions is in [11].
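As a quick reminder, these ReLU variants differ only in how they treat negative inputs. In a standard parameterization (not tied to any particular cited work), the leaky ReLU uses a small fixed slope α, often taken around 0.01, while the PReLU treats α as a parameter learned during training:
\[
\operatorname{ReLU}(z) = \max(0, z),
\qquad
\operatorname{LeakyReLU}_{\alpha}(z) =
\begin{cases}
z, & z \ge 0,\\
\alpha z, & z < 0,
\end{cases}
\]
with the PReLU sharing the piecewise form of the leaky ReLU but with α trained alongside the network weights, so that gradients do not vanish entirely for negative pre-activations.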
The backpropagation algorithm can be attributed to [50], yet it has earlier origins in general backward mode automatic differentiation, surveyed in [4] (see also the notes of Chapter ??). Our presentation in Algorithm 5.2 is specific to the precise form of feedforward networks that we considered, yet variants can be implemented. Importantly, with the advent of automatic differentiation frameworks such as TensorFlow [1], followed by PyTorch [44], Flux.jl [24], JAX [7], and others, the use of backpropagation as part of deep learning has become standard. Such software frameworks automatically implement backpropagation as a special case of backward mode automatic differentiation in which the computational graph is constructed, often automatically, from code.
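As a minimal sketch of this mechanism (using PyTorch purely as an example framework; the network size, data, and loss below are arbitrary illustrative choices), operations applied to tensors flagged with requires_grad are recorded as a computational graph, and a single call to backward() runs backward mode automatic differentiation, i.e., backpropagation, populating the gradients of all parameters:

```python
import torch

# Parameters of a one-hidden-layer feedforward network.
# requires_grad=True asks the framework to record operations on these tensors.
W1 = torch.randn(3, 2, requires_grad=True)
b1 = torch.zeros(3, requires_grad=True)
W2 = torch.randn(1, 3, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)

x = torch.tensor([1.0, -2.0])     # input
y = torch.tensor([0.5])           # target

h = torch.relu(W1 @ x + b1)       # hidden layer with ReLU activation
y_hat = W2 @ h + b2               # network output
loss = (y_hat - y).pow(2).mean()  # squared error loss

loss.backward()                   # backward mode automatic differentiation
print(W1.grad)                    # gradient of the loss with respect to W1
```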
Early deep learning work up to about 2014 did not have such software frameworks available, and hence "hand coding" of backpropagation was more delicate and time consuming. It is fair to say that with the proliferation of deep learning software frameworks, innovations in research and industry accelerated greatly. We