
6.6 Beyond Classification
Notes and References
Before we outline notes and references associated with explicit details of this chapter, here is a brief
description of early convolutional neural network developments. Initial ideas originated in the 1950s
and 1960s with the study of the visual cortices of animals, primarily by Hubel and Wiesel over a
series of publications including [195] and [196]. Early concrete models that have some similarity
with modern convolutional neural networks are the 1980 neocognitron [127] for pattern recognition,
as well as the 1988 time delay neural network [247] for speech recognition. In the 1990s convolutional
neural networks saw industrial applications for the first time with [154] for handwritten character
recognition and [65] for signature verification. Other significant early works include [250] for written
digit recognition, [402] for face recognition, and [412] for phoneme recognition. Finally we mention
that the LeNet-5 model developed in the late 1990s by Yann LeCun et al. [252] is recognized as an
early form of contemporary convolutional neural networks and it was used for classifying 28 × 28
size images of grayscale handwritten digits. We also mention that in 1989 with [250] and [251],
LeCun et al. developed the first multi-layered convolutional networks for handwritten character
recognition trained using backpropagation.
The structure of convolutional layers in neural networks as we present in this chapter solidified
around the 2012–2016 period and best fits the VGG model [380]. This model followed the pivotal
AlexNet model [239] from 2012, which was specifically designed for training on two parallel GPUs.
Other notable convolutional architectures of this period are the GoogLeNet or inception network
model of [394], the batch normalization inception model [203] which uses batch normalization of
layer inputs, and ResNets which were introduced in [172]. All of these models competed in the
ImageNet challenges of that era with the results from each model effectively outperforming those
that came prior to it. Other developments included the SqueezeNet model of [199], which marked a
key milestone in reducing the parameter count and memory footprint of convolutional networks without
compromising accuracy; this model achieved AlexNet-level accuracy with far fewer parameters
and a much smaller memory footprint. Also, see the Network-in-Network model of [260] which
inspired the inception networks and [388] that uses a dropout mechanism to reduce overfitting on
convolutional layers. See [344] for a comprehensive survey of convolutional neural networks of
that time as well as the more recent survey [258]. In times closer to the publication of this book,
paradigms such as EfficientNet appeared in [395]; see also the more recent version, EfficientNet v2,
in [396].
Ideas of dilation in convolutional networks were introduced in [435] for dense prediction, where the
goal is to compute a label for each pixel in the image. Furthermore, dilation for residual networks is
introduced in [436]. See also the discussion of group normalization in [426].
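To make the notion of dilation concrete, here is a minimal one-dimensional sketch in plain Python. The function name `dilated_conv1d`, the cross-correlation convention, and the absence of padding are illustrative assumptions and are not taken from [435] or [436]. With dilation d, the kernel taps are spaced d samples apart, so a kernel of length K covers d(K − 1) + 1 input samples without adding any weights.

```python
def dilated_conv1d(x, w, dilation=1):
    """Valid 1D cross-correlation with dilation (hypothetical helper).

    Computes y[n] = sum_k w[k] * x[n + dilation * k] for every n at which
    the dilated kernel fits entirely inside x (no padding).
    """
    span = dilation * (len(w) - 1) + 1  # input samples covered by the kernel
    return [
        sum(wk * x[n + dilation * k] for k, wk in enumerate(w))
        for n in range(len(x) - span + 1)
    ]


x = [0, 1, 2, 3, 4, 5, 6, 7]  # a linear ramp
w = [1, 0, -1]                # three-tap difference kernel

dense = dilated_conv1d(x, w, dilation=1)    # computes x[n] - x[n + 2]
dilated = dilated_conv1d(x, w, dilation=2)  # computes x[n] - x[n + 4]
```

The same three weights compare samples two positions apart in the first call and four positions apart in the second, which is how dilation enlarges the receptive field while the parameter count stays fixed.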
A general overview of linear time invariant systems can be found in standard texts such as [244], which
is also useful for understanding basic filtering. The book [14] can provide a more mathematically
rigorous foundation and can also be useful for understanding the delta function in continuous time.
The probabilistic interpretation of a convolution is standard and can be found in any elementary
probability textbook such as [355]. The multiplication of polynomials interpretation, also coupled
with the study of the fast Fourier transform, can be found in [93]. A simple explanation of the
representation of discrete convolutions in terms of Toeplitz matrices can be found in [56]. For
analysis of convolutions in classic image processing applications as well as many other classic image
processing techniques see [207]. The Sobel filter is one of many convolution based filtering operations.
It was developed by Sobel and Feldman, and presented at a 1968 scientific talk; see [383] for a
historical review.
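The polynomial, Toeplitz, and probabilistic interpretations of the discrete convolution mentioned above can be checked against each other on a small example. The following is a minimal sketch in plain Python; the helper names `convolve`, `toeplitz_matrix`, and `matvec` are hypothetical and chosen only for illustration, and the indexing convention is the full convolution y[n] = Σ_k x[k] h[n − k].

```python
from fractions import Fraction


def convolve(x, h):
    """Full discrete convolution: y[n] = sum_k x[k] * h[n - k]."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def toeplitz_matrix(h, n):
    """Matrix T of shape (len(h) + n - 1, n) with T[i][j] = h[i - j], zero elsewhere."""
    rows = len(h) + n - 1
    return [
        [h[i - j] if 0 <= i - j < len(h) else 0 for j in range(n)]
        for i in range(rows)
    ]


def matvec(T, x):
    """Plain matrix-vector product."""
    return [sum(t * xi for t, xi in zip(row, x)) for row in T]


# Polynomial view: coefficients of (1 + 2t + 3t^2)(1 + t), lowest degree first.
p = [1, 2, 3]
q = [1, 1]
product = convolve(p, q)

# Toeplitz view: the same convolution written as a matrix-vector product.
toeplitz_product = matvec(toeplitz_matrix(q, len(p)), p)

# Probabilistic view: distribution of the sum of two independent fair dice.
die = [Fraction(1, 6)] * 6     # index k corresponds to face k + 1
two_dice = convolve(die, die)  # index k corresponds to a sum of k + 2
```

Here `product` and `toeplitz_product` agree and equal the coefficients of 1 + 3t + 5t² + 3t³, while `two_dice[5]` is the probability that the two dice sum to 7.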
The rise of convolutional neural networks drove the development of many paradigms using these
networks for different tasks. In terms of object detection, early works are [135] and [136] and recent
work in this direction is [415] where the YOLOv7 model enhances the landmark YOLO (you only look
once) work of [346]. A recent survey on object detection can be found in [454]. The important area
of semantic segmentation has received much attention with notable papers being [352] (U-Net),
as well as [312]. Instance segmentation is studied in [45], [170], and [267]. For additional recent
surveys of the subsequent developments in semantic and instance segmentation see [405] and [290].
See also [129] for a survey of video semantic segmentation. Influential work on identification (face
recognition) is in [368] and early ideas of Siamese networks are from [84]; see also [192] and [425].
Over the years, many effective network visualization methods were developed for understanding
inner layers and derived features. Before the era of great popularity of convolutional networks, the