
6.6 Beyond Classification
Notes and References
Before we outline notes and references associated with explicit details of this chapter, here is a brief description of early convolutional neural network developments. Initial ideas originated in the 1950s and 1960s with the study of the visual cortices of animals, primarily by Hubel and Wiesel over a series of publications including [21] and [22]. Early concrete models that have some similarity with
modern convolutional neural networks are the 1980 neocognitron [12] for pattern recognition, as well as the 1988 time delay neural network [31] for speech recognition. In the 1990s convolutional neural networks saw industrial applications for the first time with [17] for handwritten character recognition and [5] for signature verification. Other significant early works include [32] for written digit recognition, [56] for face recognition, and [58] for phoneme recognition. Finally we mention
that the LeNet-5 model developed in the late 1990s by Yann LeCun et al. [34] is recognized as an early form of contemporary convolutional neural networks; it was used for classifying 28 × 28 grayscale images of handwritten digits. We also mention that in 1989, with [32] and [33], LeCun et al. developed the first multi-layered convolutional networks for handwritten character recognition trained using backpropagation.
The structure of convolutional layers in neural networks as we present in this chapter solidified around the 2012–2016 period and best fits the VGG model [50]. This model followed the pivotal AlexNet model [28] from 2012, which was specifically designed for training on two parallel GPUs.
Other notable convolutional architectures of this period are the GoogLeNet or inception network model of [53], the batch normalization inception model [24], which uses batch normalization of layer inputs, and ResNets, which were introduced in [19]. All of these models competed in the ImageNet challenges of that era, with each model effectively outperforming those that came before it. Other developments included the SqueezeNet model of [23], which marked a key milestone in reducing the parameter count and memory footprint of convolutional networks without compromising accuracy; this model achieved AlexNet-level accuracy with far fewer parameters and a much smaller memory footprint. See also the Network-in-Network model of [37], which inspired the inception networks, and [52], which uses a dropout mechanism to reduce overfitting in convolutional layers. See [42] for a comprehensive survey of convolutional neural networks of that time, as well as the more recent survey [36]. Closer to the publication of this book, paradigms such as EfficientNet appeared in [54]; see also the more recent version, EfficientNetV2, in [55].
Ideas of dilation in convolutional networks were introduced in [64] for dense prediction, where the goal is to compute a label for each pixel in the image. Furthermore, dilation for residual networks was introduced in [65]. See also the discussion of group normalization in [62].
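As a minimal illustration (not from the cited works), a dilated convolution inserts gaps of size dilation − 1 between kernel taps, enlarging the receptive field without adding parameters; the one-dimensional sketch below, with a hypothetical helper `dilated_conv1d`, shows this effect:

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """Valid-mode 1-D convolution of signal x with kernel w at a given dilation."""
    k = len(w)
    span = (k - 1) * dilation + 1          # effective kernel extent
    out_len = len(x) - span + 1
    return np.array([
        sum(w[j] * x[i + j * dilation] for j in range(k))
        for i in range(out_len)
    ])

x = np.arange(8.0)                          # [0, 1, ..., 7]
w = np.array([1.0, 1.0, 1.0])               # 3-tap averaging-style kernel

# With dilation 1 each output sums 3 adjacent samples; with dilation 2
# the same 3 taps span 5 samples, skipping every other one.
print(dilated_conv1d(x, w, dilation=1))    # [ 3.  6.  9. 12. 15. 18.]
print(dilated_conv1d(x, w, dilation=2))    # [ 6.  9. 12. 15.]
```

Note that the dilation-2 output is shorter: the effective kernel extent grows, so fewer valid positions remain.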
A general overview of linear time invariant systems can be found in standard texts such as [29], which is also useful for understanding basic filtering. The book [1] provides a more mathematically rigorous foundation and can also be useful for understanding the delta function in continuous time. The probabilistic interpretation of a convolution is standard and can be found in any elementary probability textbook such as [45]. The multiplication of polynomials interpretation, coupled with the study of the fast Fourier transform, can be found in [8]. A simple explanation of the representation of discrete convolutions in terms of Toeplitz matrices can be found in [4]. For analysis of convolutions in classic image processing applications, as well as many other classic image processing techniques, see [25]. The Sobel filter is one of many convolution-based filtering operations. It was developed by Sobel and Feldman and presented at a 1968 scientific talk; see [51] for a historical review.
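As a minimal numerical sketch (not from the cited references), the polynomial-multiplication and Toeplitz-matrix interpretations mentioned above agree with the direct discrete convolution:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])      # coefficients of 1 + 2t + 3t^2
h = np.array([4.0, 5.0])           # coefficients of 4 + 5t

# Interpretation 1: polynomial multiplication. Convolving the coefficient
# vectors gives the coefficients of the product (1 + 2t + 3t^2)(4 + 5t).
y = np.convolve(x, h)

# Interpretation 2: the same convolution as a Toeplitz matrix acting on x.
# Each column of H is a shifted, zero-padded copy of h.
n, m = len(x), len(h)
H = np.zeros((n + m - 1, n))
for j in range(n):
    H[j:j + m, j] = h
y_toeplitz = H @ x

assert np.allclose(y, y_toeplitz)
print(y)                            # [ 4. 13. 22. 15.]
```

Both views produce the coefficients of the product polynomial 4 + 13t + 22t² + 15t³, matching the direct convolution sum.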
The rise of convolutional neural networks drove the development of many paradigms using these networks for different tasks. In terms of object detection, early works are [15] and [16], and recent work in this direction is [59], where the YOLOv7 model enhances the landmark YOLO (you only look once) work of [43]. A recent survey on object detection can be found in [70]. The important area of semantic segmentation has received much attention, with notable papers being [44] (U-net) as well as [41]. Instance segmentation is studied in [3], [18], and [38]. For additional recent surveys of the subsequent developments in semantic and instance segmentation see [57] and [40]. See also [13] for a survey of video semantic segmentation. Influential work on identification (face recognition) is in [46], and early ideas of siamese networks are from [7]; see also [20] and [61].
Over the years, many effective network visualization methods were developed for understanding inner layers and derived features. Before convolutional networks rose to great popularity, the work in [10] introduced a technique aimed at optimizing the input to maximize the activity