Mathematical Engineering
of Deep Learning
Benoit Liquet, Sarat Moka and Yoni Nazarathy
March 3, 2026
6 Convolutional Neural Networks
While offering generality and versatility, the fully connected feedforward neural networks
described in the previous chapter are often too general to be effective in their own right. For
many applications, such dense architectures can have too many parameters and are not able
to generalize well. This is especially the case when considering vision, sound, or similar data
for which the spatial orientation of pixels or features is a key defining attribute. For such
data, learned rules associated with certain features often need to be repeated systematically.
Convolutional neural networks offer an ability to do so by training convolutional filters that
can be applied in a spatially homogeneous manner. Such networks yield a significant reduction
in the number of trained parameters. Understanding convolutional neural networks requires a
grasp of the convolution operation and how it is incorporated in a deep learning architecture
together with the concept of channels and additional operations such as max-pooling. This
chapter covers the main details of such convolutional neural networks as well as specific
convolutional architectures that have by now become standard. As the main application of
convolutional neural networks is images, we also outline key tasks of deep learning in image
processing applications.
We start the chapter with an overview in Section 6.1 where we introduce general concepts
of convolutional neural networks. We first touch on filtering in signal processing and then
consider a high level view of the VGG19 network as a concrete example. In Section 6.2 we
study basics of the convolution operation both in one dimension as well as more generally.
Towards that end we relate convolutions to systems theory, to probability distributions,
and to multiplication of polynomials. In Section 6.3 we focus on a single convolutional
layer. First we motivate such a layer and then focus on details such as padding, stride, and
dilation. We then introduce the concept of channels and the way that volume convolutions are
carried out. In Section 6.4 we put the pieces together and discuss how multiple convolutional
layers, and other layers such as max-pooling and fully connected layers, are combined into a
network model. Here again, the VGG19 serves as a complete example. In Section 6.5 we
describe common landmark architectures and key ideas of convolutional neural networks.
The ideas and architectures surveyed include inception networks, ResNets, as well as ways for
interpreting the meaning of internal features of the networks. We close with Section 6.6 where
we discuss the various tasks that one may consider for vision problems beyond classification.
These tasks include object localization, face identification, segmentation models, and others.
6.1 Overview of Convolutional Neural Networks
Convolutional neural networks (CNNs) are designed to handle grid-structured data, such as
image data, where there is a strong local dependency between the neighboring items of the
grid. For instance, in an image, there is a high chance that adjacent pixels carry similar
properties. Such a grid based structure is also present in many sequential data formats such
as text and sound, where a strong correlation exists among adjacent items. Even though this
chapter as well as most of the literature on convolutional neural networks focuses on image
data, these networks are suitable for working with any temporal, spatial, or spatiotemporal
data.
Convolutional neural networks are computationally more efficient than the fully connected neural networks studied in Chapter ?? and are more suitable for grid-structured data. This is primarily because convolutional neural networks require fewer parameters than fully connected networks, with a parameter structure focused on feature (pixel) locality. For instance, consider a classification task for detecting cats within a dataset consisting of images of different animals. Such data exhibits two key properties:
Translation invariance: The classification decision for each image is independent of the
position of the animal on the image. A cat is a cat irrespective of whether it appears
at the top or at the bottom of an image.
Locality: The classification decision does not really depend on a pixel that is far away
from the animal on the image. A cat is still a cat irrespective of whether far away
pixels correspond to a building or a tree.
Ideally, we want our neural network to take advantage of these two properties. When using the fully connected neural networks of Chapter ?? for images, the first step is to convert each image to an input features vector. By doing this we may lose both of the above mentioned structural properties of images. On the other hand, convolutional neural networks retain and exploit these properties while generally using fewer parameters.
Filtering
To understand convolutional neural networks it is helpful to have a basic understanding
of filtering, a well-established field in signal processing, and particularly in the subfields
of image processing and computer vision. Filtering is a method or process that removes
certain unwanted information from a signal or an image, or alternatively enhances it by
accentuating certain information. Taking image processing as an example domain, filtering
applies mathematical operations on input images, with the most common operation being
the convolution. A convolution can be viewed as an operation between two mathematical
objects, such as two matrices or two functions, where one object represents an image and
the other a filter. All of Section 6.2 is devoted to a basic introduction of convolutions.
In the field of signal processing, each filter is often custom designed depending on the specific
task at hand. For example, a popular filter, called the Sobel filter, is useful for the task of edge
detection. Figure 6.1 illustrates filtering for extracting edges in an image using two Sobel
filters applied on an input image appearing in (a). In (b) we detect the vertical edges and in
(c) we detect the horizontal edges. By adding the outputs of these two filtering operations,
we get the final image shown in (d) which captures most of the vertical and horizontal edge
information. More details on edge detection using Sobel filters are in Section 6.2. Beyond
edge detection, there are several other tasks that traditional image processing filters can
offer, including blurring, smoothing, sharpening, and accentuating images, and each of these
is achieved using a custom designed filter.
Convolutional neural networks build upon the classic ideas of filtering using convolutional
layers. Each convolutional layer is made up of one or more filters, also known as kernels,
each of which aims to extract a particular feature of the input to the layer. Early layers
of the network usually aim to detect lower-level features such as edges, while the
later layers focus on higher-level features such as identifying cats, dogs, cars, etc. The final
hidden layer, ultimately, provides a summary of the input image. Then for example, when
considering classification tasks, the final hidden layer is used to classify the input image into
different classes.
Figure 6.1: Edge detection using the Sobel filter: (a) input image, (b) vertical edges, (c) horizontal edges, (d) the sum of (b) and (c).
The filtering operation at each layer in a convolutional neural network is similar to classical
filtering illustrated with edge detection above. However, unlike classical filtering, filters in
the convolutional neural network are learned rather than designed. A training dataset is
used for learning the filters before using the network for image processing. Here, learning a
filter means learning the entries of the matrix that represents the filter, and these entries are
called weights, as the filters play a role similar to the weight matrices of fully connected neural networks, covered in the previous chapter.
An Example: VGG19
To get a feel for convolutional neural networks let us consider the task of classifying color images. As an example assume input images are of dimension 3 × 224 × 224, where 3 is the number of channels (red, green, and blue), and 224 × 224 specifies the pixel dimensions. Hence the number of input features is p = 3 × 224 × 224 ≈ 150,000. Assume we wish to use such networks for classification of K = 1,000 possible classes. If we are to use a fully connected network with no hidden layer, as in the multinomial regression model of Section ??, we already use p × K + K learned parameters. This is an order of 150 million parameters. Further, deeper networks that extend the multinomial regression model by adding more
layers as in Chapter ??, generally require even more parameters. Yet limiting the number of parameters in any machine learning model is important since it bounds computational time, limits usage of computational resources, and reduces overfitting while respecting training data limitations. We now explore what can be done with a convolutional neural network for such a task using approximately the same number of parameters.

The 3 × 224 × 224 input dimensions agree with inputs of the VGG19 model, first touched upon in Section ??. The VGG19 model has about 144 million parameters (similar to the dense p × K + K multinomial regression case). However, in contrast to the dense single layer model, VGG19 spans 19 trainable layers! This depth makes the model much more expressive; see Section ?? for a discussion on the benefits of depth in networks. Indeed, convolutional neural networks such as VGG19 are specifically suited for image tasks and have a relatively low number of parameters, which allows them to be deep.

1 VGG stands for Visual Geometry Group, the group at Oxford University that created the network.
Figure 6.2: The VGG19 network architecture. An input x is a 3 × 224 × 224 color image. It is processed through a series of convolutional layers followed by fully connected layers. The successive internal representations shown in the figure have dimensions 64 × 224 × 224, 64 × 112 × 112, 128 × 112 × 112, 128 × 56 × 56, 256 × 56 × 56, 256 × 28 × 28, 512 × 28 × 28 (Layer 12), 512 × 14 × 14, 512 × 14 × 14, and 512 × 7 × 7, followed by fully connected layers of sizes 4096, 4096, and 1000. The resulting output, ŷ_1, . . . , ŷ_1000, is a vector of probabilities indicating the class of the image.
While Sections 6.2, 6.3, and 6.4 introduce the components of convolutional neural networks in detail, at this point let us informally explore the VGG19 model illustrated in Figure 6.2. Like the feedforward networks of Chapter ??, it is composed of layers where data flows between layers downstream from the input x to the output ŷ. However, unlike feedforward fully connected layers, most layers are not composed of a dense matrix multiplication as in equation (??) of Chapter ??, but are rather made of filtering operations implemented via a combination of convolutions and non-linear activation functions. Only the final layers are dense layers.
The rectangular boxes in Figure 6.2 represent neurons, also known as internal features, that are computed via the successive application of convolutional layers. Each such box is in a sense an image or a tensor, yet unlike the input with 3 channels, these internal representations generally have a different number of channels (also known as feature maps), not directly corresponding to color values but rather to internal features. As an example
consider the layer ℓ = 12, pointed at with a red arrow in the figure. That layer has 512 channels, each containing a 28 × 28 pixel “image”.
The network also incorporates operations called max-pooling without learned parameters,
discussed in detail in Section 6.3. These operations are generally used to reduce dimensions
as data flows downstream in the network. Importantly, and quite characteristically of
convolutional networks, the VGG19 network starts with a succession of convolutional layers
with interleaved max-pooling operations, and towards the end, has dense layers that are
similar to the layers of the feedforward networks of Chapter ??.
6.2 The Convolution Operation
The convolution operation is a key component of convolutional neural networks. In this
section we study convolutions via various mathematical and engineering viewpoints. We
consider linear time invariant systems, probability distributions, multiplication of polynomials,
and the general representation of a convolution as a linear operation. We then consider
multi-dimensional convolutions and focus on an engineering filtering example, the Sobel filter, used above in Figure 6.1.
A convolution can be viewed as an operation on two functions which creates a third function.
In finite domains, these functions may be represented as vectors, matrices, or tensors. We
begin the presentation by focusing on one dimensional convolutions. Suppose f, g : Z → R are two functions (or sequences) with discrete domains. Then the convolution between f and g is defined as
\[
(f * g)(t) = \sum_{\tau \in \mathbb{Z}} f(t - \tau)\, g(\tau), \qquad t \in \mathbb{Z}. \tag{6.1}
\]
In other words, the convolution f ∗ g between f and g at a point t is obtained by taking the summation of the product of the two functions after one of them is flipped at the origin and then shifted by t. The convolution is commutative, namely, (f ∗ g)(t) = (g ∗ f)(t). This property can be easily observed via a change of variables in the summation of (6.1).
In case f and g have continuous domains, say f, g : R → R, the convolution between f and g is defined as
\[
(f * g)(t) = \int_{\mathbb{R}} f(t - \tau)\, g(\tau)\, d\tau, \qquad t \in \mathbb{R}. \tag{6.2}
\]
In both the discrete convolution (6.1) and the continuous convolution (6.2) we assume that the summation or integral, respectively, converges. In the context of deep learning we focus on convolutions of vectors, matrices, and tensors, in which case (6.1) is used on a finite domain and hence always converges. We now present a few viewpoints of one dimensional convolutions before stepping up to multi-dimensional cases.
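For readers who wish to experiment, the following minimal Python sketch (function and variable names are our own) evaluates the sum in (6.1) directly for two finitely supported sequences and checks the result against NumPy's built-in convolution routine.

```python
import numpy as np

def conv1d(f, g):
    """Discrete convolution of two finitely supported sequences,
    evaluating (f * g)(t) = sum_tau f(t - tau) g(tau) as in (6.1)."""
    n = len(f) + len(g) - 1            # support of the result
    z = np.zeros(n)
    for t in range(n):
        for tau in range(len(g)):
            if 0 <= t - tau < len(f):  # f is zero outside its support
                z[t] += f[t - tau] * g[tau]
    return z

f = np.array([1.0, 2.0, 3.0])
g = np.array([4.0, 5.0, 6.0, 7.0])
print(conv1d(f, g))                    # direct evaluation of (6.1)
print(np.convolve(f, g))               # same result via NumPy
print(np.allclose(conv1d(f, g), conv1d(g, f)))  # commutativity
```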
Convolutions in LTI Systems
To get a feel for the importance of convolutions we consider Linear Time Invariant (LTI)
systems. These objects are key in classic control theory and signal processing, and they have
influenced machine learning, eventually leading to the development of convolutional neural
networks. An LTI system, denoted here by
L
(
·
), maps an input signal
x
=
{x
(
t
) :
t R or Z}
to an output signal y via y = L(x). LTI systems satisfy the linearity and time invariance properties:

Linearity: For any two input signals x_1(t) and x_2(t) and scalars α_1 and α_2,
\[
L(\alpha_1 x_1 + \alpha_2 x_2) = \alpha_1 L(x_1) + \alpha_2 L(x_2).
\]
Time invariance: When the shifted (delayed by τ) signal x̃(t) = x(t − τ) is given as an input, then the corresponding output signal ỹ = L(x̃) is ỹ(t) = y(t − τ), where y = L(x). Namely, the output of the shifted input is the shifted output of the original input.
An important input signal to consider for any LTI system is the impulse signal. In the
discrete time case, the impulse signal, denoted by δ(t), takes 1 at t = 0 and 0 for any other t, that is,
\[
\delta(t) =
\begin{cases}
1, & \text{if } t = 0, \\
0, & \text{otherwise.}
\end{cases}
\]
Now, the output of the system when the input is the impulse signal is called the impulse response and is denoted here as w = L(δ). It turns out that the operation of any LTI system on any input signal x is equivalent to a convolution of x with the impulse response w. That is, L(x) = w ∗ x.
To see this, using this impulse function, any signal x = {x(t) : t ∈ Z} can be represented as,
\[
x(t) = \sum_{\tau=-\infty}^{\infty} x(\tau)\, \delta(t - \tau),
\]
where observe that δ(t − τ) takes 1 at t = τ and 0 otherwise. Consequently, y(t) is equal to,
\[
L(x)(t) = \sum_{\tau=-\infty}^{\infty} x(\tau)\, L(\delta(t - \tau)) = \sum_{\tau=-\infty}^{\infty} x(\tau)\, L(\delta)(t - \tau) = \sum_{\tau=-\infty}^{\infty} x(\tau)\, w(t - \tau) = (w * x)(t),
\]
where the first and second equalities respectively follow from the linearity and the time invariance properties of LTI systems.
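As a small numerical check of this fact, the sketch below uses an example LTI system of our own choosing, a three-point moving average; feeding it the impulse signal recovers the impulse response w, and applying the system to an arbitrary input agrees with the convolution w ∗ x.

```python
import numpy as np

def moving_average(x):
    """An example LTI system: y(t) = (x(t) + x(t-1) + x(t-2)) / 3."""
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        window = x[max(0, t - 2):t + 1]
        y[t] = np.sum(window) / 3.0
    return y

# Impulse response: apply the system to the impulse signal delta(t).
delta = np.zeros(8)
delta[0] = 1.0
w = moving_average(delta)          # equals [1/3, 1/3, 1/3, 0, ...]

# The system acting on any input equals convolution with w.
x = np.array([1.0, 4.0, -2.0, 0.5, 3.0, 2.0, -1.0, 0.0])
via_system = moving_average(x)
via_convolution = np.convolve(x, w[:3])[:len(x)]   # truncate to the same support
print(np.allclose(via_system, via_convolution))    # True
```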
A similar result exists for continuous time LTI systems, where the impulse response is the output of the system when the input is a generalized impulse function, δ(t), called the Dirac delta function. Generally, convolutional neural networks do not rely on such continuous time representations. Nevertheless, we mention it here for completeness because most treatments of LTI systems use the delta function.
2 The Dirac delta function is not an R → R function in the standard sense but is rather a generalized function. It is a mathematical abstraction which allows one to describe an object, δ(t), that satisfies δ(t) = 0 for t ≠ 0 as well as ∫_{−∞}^{∞} δ(t) dt = 1. No such standard R → R function exists, but the formalism of generalized functions allows us to treat δ(t) as though it were a standard function. Conceptually, we may also consider δ(t) as the limit of a Gaussian density centered at zero, with the standard deviation approaching zero.
Convolutions in Probability
Convolutions also appear naturally in the context of probability. This is when considering the distribution of a random variable that is a sum of two independent random variables. For example consider ξ = ξ_1 + ξ_2 for two discrete valued independent random variables ξ_1 and ξ_2 with probability mass functions f_1(t) and f_2(t), respectively. Then manipulating the probabilities and noting that P(A | B) is the conditional probability of event A given event B, we have,
\[
\begin{aligned}
P(\xi = t) &= P(\xi_1 + \xi_2 = t) \\
&= \sum_{\tau=-\infty}^{\infty} P(\xi_1 = \tau)\, P(\xi_1 + \xi_2 = t \mid \xi_1 = \tau) \\
&= \sum_{\tau=-\infty}^{\infty} P(\xi_1 = \tau)\, P(\xi_2 = t - \tau) \\
&= \sum_{\tau=-\infty}^{\infty} f_1(\tau)\, f_2(t - \tau) \\
&= (f_1 * f_2)(t).
\end{aligned}
\]
In other words, the probability mass function of ξ is equal to the convolution of the probability mass functions of ξ_1 and ξ_2.

3 Analogous results exist for continuous random variables where probability density functions are used in place of probability mass functions and a continuous convolution such as (6.2) is applied.
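A quick numerical illustration (the dice example is our own, not from the text): the probability mass function of the sum of two independent fair dice is the convolution of the two individual probability mass functions.

```python
import numpy as np

# PMF of a single fair six-sided die on the values 1, ..., 6.
die = np.full(6, 1.0 / 6.0)

# PMF of the sum xi = xi_1 + xi_2 of two independent dice, via convolution.
pmf_sum = np.convolve(die, die)     # supported on the values 2, ..., 12

for value, prob in zip(range(2, 13), pmf_sum):
    print(f"P(sum = {value:2d}) = {prob:.4f}")
print("total probability:", pmf_sum.sum())   # sums to 1
```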
Multiplication of Polynomials and the Convolution Matrix
The convolution also arises when multiplying polynomials. Consider two example polynomials f(r) = f_0 + f_1 r + f_2 r^2, and g(r) = g_0 + g_1 r + g_2 r^2 + . . . + g_5 r^5, and their product polynomial z(r) = f(r)g(r). The degree of z(r) is 2 + 5 = 7 with coefficients z_0, . . . , z_7, as follows,
\[
\begin{aligned}
z_0 &= f_0 g_0, \\
z_1 &= f_0 g_1 + f_1 g_0, \\
z_2 &= f_0 g_2 + f_1 g_1 + f_2 g_0, \\
z_3 &= f_0 g_3 + f_1 g_2 + f_2 g_1, \\
z_4 &= f_0 g_4 + f_1 g_3 + f_2 g_2, \\
z_5 &= f_0 g_5 + f_1 g_4 + f_2 g_3, \\
z_6 &= f_1 g_5 + f_2 g_4, \\
z_7 &= f_2 g_5.
\end{aligned}
\tag{6.3}
\]
With these expressions it is evident that, if we were to set f_t = 0 for t ∉ {0, 1, 2} and similarly g_t = 0 for t ∉ {0, 1, 2, 3, 4, 5}, then we could denote,
\[
z_t = \sum_{\tau=-\infty}^{\infty} f_\tau\, g_{t-\tau} = (f * g)_t.
\]
Further, it is also useful to consider the following alternative finite sum representation of z_t given by,
\[
z_t = \sum_{i+j=t} f_i\, g_j,
\]
where the sum is over (i, j) pairs with i + j = t and further requiring i ∈ {0, 1, 2} and j ∈ {0, 1, 2, 3, 4, 5}. In this context we can also view z as a vector of length 8, f as a vector
of length 3 and g as a vector of length 6. We can then create an 8 × 6 Toeplitz matrix T(f) that encodes the values of f such that z = T(f)g. More specifically,
\[
\begin{pmatrix} z_0 \\ z_1 \\ z_2 \\ z_3 \\ z_4 \\ z_5 \\ z_6 \\ z_7 \end{pmatrix}
=
\underbrace{\begin{pmatrix}
f_0 & 0 & 0 & 0 & 0 & 0 \\
f_1 & f_0 & 0 & 0 & 0 & 0 \\
f_2 & f_1 & f_0 & 0 & 0 & 0 \\
0 & f_2 & f_1 & f_0 & 0 & 0 \\
0 & 0 & f_2 & f_1 & f_0 & 0 \\
0 & 0 & 0 & f_2 & f_1 & f_0 \\
0 & 0 & 0 & 0 & f_2 & f_1 \\
0 & 0 & 0 & 0 & 0 & f_2
\end{pmatrix}}_{T(f)}
\begin{pmatrix} g_0 \\ g_1 \\ g_2 \\ g_3 \\ g_4 \\ g_5 \end{pmatrix}.
\]
This shows that the convolution of f with g is a linear transformation given by the matrix-vector product T(f)g. Since convolutions are commutative operations, we can also represent the output z as z = T(g)f where T(g) is an 8 × 3 Toeplitz matrix. In this case,
\[
\begin{pmatrix} z_0 \\ z_1 \\ z_2 \\ z_3 \\ z_4 \\ z_5 \\ z_6 \\ z_7 \end{pmatrix}
=
\underbrace{\begin{pmatrix}
g_0 & 0 & 0 \\
g_1 & g_0 & 0 \\
g_2 & g_1 & g_0 \\
g_3 & g_2 & g_1 \\
g_4 & g_3 & g_2 \\
g_5 & g_4 & g_3 \\
0 & g_5 & g_4 \\
0 & 0 & g_5
\end{pmatrix}}_{T(g)}
\begin{pmatrix} f_0 \\ f_1 \\ f_2 \end{pmatrix}.
\]
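The polynomial example can be verified numerically; the sketch below (with names of our choosing) computes the coefficients z_0, . . . , z_7 with np.convolve and reproduces them as the matrix-vector product T(f)g, building the Toeplitz matrix with SciPy.

```python
import numpy as np
from scipy.linalg import toeplitz

# Coefficients of f(r) = f0 + f1 r + f2 r^2 and g(r) = g0 + ... + g5 r^5.
f = np.array([1.0, 2.0, 3.0])
g = np.array([1.0, -1.0, 0.5, 2.0, 0.0, 4.0])

# Coefficients z0, ..., z7 of the product polynomial, as in (6.3).
z_conv = np.convolve(f, g)

# The same convolution as a linear map: z = T(f) g with T(f) of size 8 x 6.
first_col = np.concatenate([f, np.zeros(len(g) - 1)])    # (f0, f1, f2, 0, ..., 0)
first_row = np.concatenate([[f[0]], np.zeros(len(g) - 1)])
T_f = toeplitz(first_col, first_row)                      # 8 x 6 Toeplitz matrix
z_matrix = T_f @ g

print(np.allclose(z_conv, z_matrix))   # True: both give z0, ..., z7
```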
At this point, having seen that convolutions may be encoded via Toeplitz matrices such as T(f) or T(g), we see that the convolution operation is a linear operation. The same also holds for multi-dimensional generalizations which we discuss now.

4 This is a matrix with constant values on the diagonals. Observe that an n × m Toeplitz matrix requires at most n + m − 1 parameters.
Multi-dimensional Generalizations
The convolution operation (6.1) can be generalized to multivariate functions. In fact, for deep learning, convolutions are almost always multivariate. Suppose f, g : Z^d → R are two multivariate functions with discrete domains. Then the convolution between f and g is a commutative operation given by
\[
(f * g)(u) = \sum_{v \in \mathbb{Z}^d} f(u - v)\, g(v) = \sum_{v \in \mathbb{Z}^d} f(v)\, g(u - v), \qquad u \in \mathbb{Z}^d. \tag{6.4}
\]
This is a direct extension of (6.1) with the shifting and the flipping of the functions carried out across all dimensions. Also, similarly to the univariate case over a continuous domain, shown in (6.2), multivariate convolutions have continuous domain representations and these are not presented here because convolutional neural networks use discrete domain convolutions.
The applications presented above for univariate convolutions, namely LTI systems, addition of independent random variables, and multiplication of polynomials, also extend to multivariate
cases. Specifically, the probability law of the sum of two independent random vectors can be
obtained via a convolution, the coefficients of the product of multivariate polynomials can
be obtained via a convolution, and the action of an LTI system operating on a multivariate
input signal can be represented via a convolution. This last case is particularly important
for this chapter since one often considers a multivariate convolution.
Any vector, matrix, or tensor can be seen as a function from Z^d to R with d = 1 for vectors, d = 2 for matrices, and d ≥ 3 for general tensors. As a result, the convolution between two vectors, two matrices, or two tensors respectively returns a third vector, matrix, or tensor.
In particular for the d = 2 case, suppose W = [w_{i,j}], for i = 1, . . . , K_h and j = 1, . . . , K_v, is a K_h × K_v matrix and x = [x_{i,j}], for i = 1, . . . , M_h and j = 1, . . . , M_v, is an M_h × M_v matrix. In this scenario, f and g in (6.4) can be seen as functions from Z^2 to R by using the matrices W and x respectively by assigning zeros outside the range of their indices. Then we denote the convolution f ∗ g as W ∗ x. By ignoring obvious zeros in W ∗ x, we can represent this convolution as a matrix of dimension
\[
(M_h + K_h - 1) \times (M_v + K_v - 1). \tag{6.5}
\]
To see how such output dimensions arise, refer to the analogy in the one dimensional polynomial multiplication (6.3), where we consider an example with input dimensions 3 and 6 (for second and fifth degree polynomials), and thus the output dimension is 3 + 6 − 1 = 8, matching a 7th degree polynomial.
While (6.5) describes the dimensions of such classical convolutions, the convolution operation used in deep learning differs. In convolutional neural networks, the dimensions of one matrix are smaller than the corresponding dimensions of the other matrix; namely, K_h ≤ M_h and K_v ≤ M_v. In this special case, taking the dimension of W as K_h × K_v and the dimension of x as M_h × M_v, the convolution W ∗ x is usually defined to be a matrix of dimension
\[
(M_h - K_h + 1) \times (M_v - K_v + 1), \tag{6.6}
\]
where now for output at i′ = 1, . . . , M_h − K_h + 1 and j′ = 1, . . . , M_v − K_v + 1, the convolution action is,
\[
z_{i',j'} = (W * x)_{i',j'} = \sum_{i=0}^{K_h - 1} \sum_{j=0}^{K_v - 1} w_{K_h - i,\, K_v - j}\; x_{i' + i,\, j' + j}. \tag{6.7}
\]
Observe that the convolution in (6.7) is a submatrix of the result one would get if applying (6.4). Further note that in this case, ∗ is not a commutative operation. Figure 6.3 (a) presents a schematic of the convolution operation (6.7) where W is of dimension K_h × K_v = 3 × 3 and x is of dimension M_h × M_v = 6 × 7. Here the output z = W ∗ x is a 4 × 5 dimensional matrix according to (6.6).
The green entry z_{1,1} in Figure 6.3 (a) is based on the green values in x and all of W. In more detail, Figure 6.3 (b) presents the computation of the first element z_{1,1}. For this we consider the flipped W to obtain another 3 × 3 matrix and then take the element-wise product with the sub-matrix of dimension 3 × 3 at the top left corner of the matrix x, shown in (a) and also shown in green in (b). Similarly, in (a), the red entry z_{1,2} is obtained by sliding the window to the right by one pixel on the matrix x to consider the next 3 × 3 sub-matrix (denoted in red).
5 We use the subscripts h and v in (M_h, M_v) or (K_h, K_v) throughout this chapter. These subscripts stand for ‘horizontal’ (rows) and ‘vertical’ (columns) respectively.
Figure 6.3: (a) Convolution between two matrices W and x to get z = W ∗ x. The dimensions of W and x are K_h × K_v = 3 × 3 and M_h × M_v = 6 × 7, respectively. The dimension of the output z is (M_h − K_h + 1) × (M_v − K_v + 1) = 4 × 5. (b) Computation of the first element z_{1,1} of the convolution W ∗ x. Here, ⊙ denotes the element-wise product between two matrices of the same dimension and Σ denotes the summation of all the elements of a matrix.
The convolution operation continues with this process until we reach the top right corner of x to obtain the last element z_{1,5} of the first row of z. To obtain the second row, z_{2,1}, . . . , z_{2,5}, we repeat the same process by moving the window one row down on x. Ultimately, after the window is at 4 horizontal positions and 5 vertical positions, the 4 × 5 dimensional output z is obtained. Note that from an implementation perspective, especially when using graphical processing units (GPUs), the computation of z can be parallelized by carrying out the operations illustrated in Figure 6.3 (b) simultaneously for each output z_{i,j}.
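A direct, unoptimized implementation of (6.7) reads as follows (the function name conv2d_valid is ours); it flips the kernel at the origin, slides it over all valid positions, and is checked against SciPy's two dimensional convolution in 'valid' mode.

```python
import numpy as np
from scipy.signal import convolve2d

def conv2d_valid(W, x):
    """Convolution of a K_h x K_v kernel W with an M_h x M_v input x as in (6.7)."""
    K_h, K_v = W.shape
    M_h, M_v = x.shape
    W_flipped = W[::-1, ::-1]                        # flip the kernel at the origin
    z = np.zeros((M_h - K_h + 1, M_v - K_v + 1))
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            patch = x[i:i + K_h, j:j + K_v]          # K_h x K_v window of the input
            z[i, j] = np.sum(W_flipped * patch)      # element-wise product and sum
    return z

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))                          # kernel, as in Figure 6.3
x = rng.normal(size=(6, 7))                          # input
z = conv2d_valid(W, x)
print(z.shape)                                       # (4, 5), matching (6.6)
print(np.allclose(z, convolve2d(x, W, mode="valid")))  # True
```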
Now suppose W and x are 3-dimensional tensors with respective dimensions K_c × K_h × K_v and M_c × M_h × M_v, where similarly to before W is smaller than x in the sense that K_c ≤ M_c, K_h ≤ M_h, and K_v ≤ M_v. Here the new dimension sizes K_c and M_c are referred to as the depth of the corresponding tensor. For instance, if x denotes a color image then the depth M_c = 3 is attributed to the red, blue, and green components of the image. In this 3-dimensional setup, (6.7) can be generalized to provide a volume convolution, W ∗ x, with
output dimension,
\[
(M_c - K_c + 1) \times (M_h - K_h + 1) \times (M_v - K_v + 1). \tag{6.8}
\]
Here, for k′ = 1, . . . , M_c − K_c + 1, i′ = 1, . . . , M_h − K_h + 1, and j′ = 1, . . . , M_v − K_v + 1,
\[
(W * x)_{k',i',j'} = \sum_{k=0}^{K_c - 1} \sum_{i=0}^{K_h - 1} \sum_{j=0}^{K_v - 1} W_{K_c - k,\, K_h - i,\, K_v - j}\; x_{k' + k,\, i' + i,\, j' + j}. \tag{6.9}
\]
For deep learning, an important special scenario is the case in which the depths of both W and x are the same, namely K_c = M_c. In this case the depth is also called the number of input channels. In such a scenario, the dimensions in (6.8) have a depth of 1 and thus the output of the volume convolution W ∗ x defined by (6.9) can be viewed as a matrix of dimension (6.6). This convolution can also be represented as,
\[
W * x = \sum_{i=1}^{K_c} W^{(i)} * x^{(i)}, \tag{6.10}
\]
where the ∗ on the right hand side denotes the matrix convolution as in (6.7) and the summation is element-wise. Here the matrices that are convolved are W^{(i)}, which is the i-th matrix along the depth of W, and x^{(i)}, which is the i-th matrix along the depth of x (also called the i-th input channel).
Edge Detection Revisited
From an engineering viewpoint, convolutions implement filters, and in the context of image
processing (of monochrome images) these are often two dimensional convolutions. We now
explore the operation of one such engineered filter, the Sobel filter for edge detection, first
mentioned in Section 6.1 and applied in Figure 6.1.
Suppose an input image x = [x_{i,j}] is of dimension M_h × M_v. As we have seen earlier, edge detection involves two separate operations, namely, vertical edge detection and horizontal edge detection, exemplified in Figure 6.1 (b) and (c) respectively. With Sobel filtering, each of these operations is a convolution of x with a 3 × 3 kernel matrix given by either,
\[
W^{(-)} =
\begin{pmatrix}
1 & 2 & 1 \\
0 & 0 & 0 \\
-1 & -2 & -1
\end{pmatrix},
\quad \text{or} \quad
W^{(|)} =
\begin{pmatrix}
1 & 0 & -1 \\
2 & 0 & -2 \\
1 & 0 & -1
\end{pmatrix},
\]
for horizontal or vertical edge detections, respectively.
for horizontal or vertical edge detections, respectively. The actual entries of
W
()
and
W
()
are part of the Sobel filter design and were engineered
6
to achieve edge detection.
Such filters were developed via engineering intuition, trial and error, and experimentation.
From our perspective the actual entries of
W
()
and
W
()
are merely an example since
in convolutional neural networks, the values of filters (weights in convolutional layers) are
automatically learned during training.
Suppose y^{(−)} and y^{(|)} are the outputs corresponding to horizontal edge detection and vertical edge detection, respectively. These outputs are each computed using (6.7) with W
replaced by W^{(−)} and W^{(|)} respectively, both using the same input image x. The overall edge detection can be obtained by superimposing the two outputs as the pixel-wise sum y^{(−)} + y^{(|)}, or average (y^{(−)} + y^{(|)})/2. In case of color images, one may apply Sobel filters separately on each color component, or seek other generalizations and use the convolution formulas (6.9) or (6.10).
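As a concrete sketch of the Sobel computation described above (using a synthetic image of a bright square rather than the photograph of Figure 6.1), the two kernels are convolved with the input and the outputs are superimposed.

```python
import numpy as np
from scipy.signal import convolve2d

# Sobel kernels for horizontal and vertical edge detection.
W_horizontal = np.array([[ 1,  2,  1],
                         [ 0,  0,  0],
                         [-1, -2, -1]], dtype=float)
W_vertical = np.array([[1, 0, -1],
                       [2, 0, -2],
                       [1, 0, -1]], dtype=float)

# A synthetic grey scale image: a bright square on a dark background.
x = np.zeros((64, 64))
x[20:44, 20:44] = 1.0

# Each output is a convolution of the image with one of the kernels, as in (6.7).
y_horizontal = convolve2d(x, W_horizontal, mode="valid")
y_vertical = convolve2d(x, W_vertical, mode="valid")

# Superimpose the two outputs as the pixel-wise sum described above.
edges = y_horizontal + y_vertical
print(edges.shape)           # (62, 62) = (64 - 3 + 1, 64 - 3 + 1)
print(np.abs(edges).max())   # strong responses along the boundary of the square
```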
6.3 Building a Convolutional Layer
In Chapter ??, we have seen the construction of general fully connected neural networks,
each of which consists of a series of layers where every neuron in a given layer is connected
to every neuron in the next layer. These networks are general in the sense that they are
structure agnostic, that is, there are no specific assumptions made about the structure of
the input. This property makes fully connected neural networks versatile. However, they are
inadequate when dealing with specific applications, such as image classification, where the
input has rich structural properties.
Convolutional neural networks make use of the aforementioned two key properties of grid-
structured data, namely translation invariance and locality. As a result, the number of
parameters to learn in convolutional neural networks is significantly smaller than that
of corresponding fully connected neural networks. Convolutional layers are based on the
convolution operation, and in this section we focus on building a single convolutional layer.
Motivating a Convolutional Layer
Convolutional neural networks are designed so that the spatial properties of the image data are inherited from one layer to the next. Therefore, for image processing, it is better to represent both the input and output of a convolutional layer as images. As we are familiar from Chapter ?? with fully connected neural networks, to build a convolutional layer, we begin with a fully connected layer and then we show how the number of learned parameters is reduced using translation invariance and locality.
Consider an input dataset consisting of two dimensional grey scaled images x of dimension M_h^[0] × M_v^[0]. For the time being, we focus on the first hidden layer of this fully connected network and the superscript [0] denotes that x is an input to this layer. Each input image x is a matrix with the (i, j)-th element denoting the pixel value at the (i, j)-th location on the image. When treating x as an input to a fully connected neural network, it is represented as an M_h^[0] · M_v^[0] dimensional vector consisting of all the elements of x. Since such a matrix to vector conversion is executed in a consistent manner, without loss of generality we can continue to index the elements of the vector x via tuples (i, j) ∈ {1, . . . , M_h^[0]} × {1, . . . , M_v^[0]}.

7 This can be in column major or row major form, and the specific choice between the two is insignificant as long as consistency is maintained.
We wish to represent the output of the first layer also as an image, in this instance having dimension M_h^[1] × M_v^[1]. Thus, as with the input, the output vector a^[1] can also be represented as a matrix, indexed by tuples (i′, j′) ∈ {1, . . . , M_h^[1]} × {1, . . . , M_v^[1]}. As described in Chapter ??, the output a^[1] is composed of an affine transformation of x parameterized by W^[1] and b^[1] composed with a non-linear activation function S^[1](·); see (??). Here, with our image based indexing we represent each element of W^[1] as w^{[1]}_{(i′,j′),(i,j)}
and each element of b^[1] as b^{[1]}_{i′,j′}. With this notation, the output of the layer is
\[
a^{[1]} = S^{[1]}\big(z^{[1]}\big), \quad \text{where} \quad z^{[1]}_{i',j'} = \sum_{(i,j)} w^{[1]}_{(i',j'),(i,j)}\, x_{i,j} + b^{[1]}_{i',j'}. \tag{6.11}
\]
8 This requires a non-prime number of neurons in the first layer.
It is useful to represent each element of z^[1] slightly differently. For this fix (i′, j′) and reindex the terms in the summation by setting (i″, j″) for each (i, j) such that
\[
i = i' + i'', \quad \text{and} \quad j = j' + j''.
\]
Now z^{[1]}_{i′,j′} can be represented as
\[
z^{[1]}_{i',j'} = \sum_{(i'',j'')} w^{[1]}_{(i',j'),(i'+i'',\,j'+j'')}\, x_{i'+i'',\, j'+j''} + b^{[1]}_{i',j'}, \tag{6.12}
\]
where in the summation, (i″, j″) ∈ {1 − i′, . . . , M_h^[0] − i′} × {1 − j′, . . . , M_v^[0] − j′}. Observe that generally these indices, i″ and j″, take on both positive and negative values as they reflect the offset relative to i′ and j′ respectively.
We now return to the first structural property of image data, namely, translation invariance.
With this property, we expect that any shift in x results only in a shift of the output. As
an illustration, let us revisit edge detection and consider a pelican in flight as shown in
Figure 6.4. In (a) we see an input to an edge detection filter and in (b) we have the output.
Similarly, (c) and (d) are input–output pairs of a similar image. Observe that the pairs
(a)-(b) and (c)-(d) are essentially the same, except for the fact that the position of the
pelican in the output depends only on its position in the input. In other words, the filtering
operation’s action on the object is generally independent of the location of the object in the
image.
In mathematical terms, such translation invariance implies that the weights w^{[1]}_{(i′,j′),(i′+i″,j′+j″)} must be independent of the output indices (i′, j′) because (i′, j′) is the pixel location in the output image. With the change of variables, we can use i″ and j″ as relative offsets to that pixel coordinate instead of absolute coordinates. We can then define a smaller set of parameters made of weights w_{i″,j″} and a scalar bias b such that for all output coordinates (i′, j′), the original parameters are
\[
w^{[1]}_{(i',j'),(i'+i'',\,j'+j'')} = w_{i'',j''} \quad \text{and} \quad b^{[1]}_{i',j'} = b.
\]
This simplifies the expression for z^{[1]}_{i′,j′} in (6.12) to be,
\[
z^{[1]}_{i',j'} = \sum_{(i'',j'')} w_{i'',j''}\, x_{i'+i'',\, j'+j''} + b. \tag{6.13}
\]
The expression (6.13) already indicates a significant reduction in the number of parameters to learn in comparison to (6.12). To see this observe that in (6.12) our weights potentially vary based on i′ and j′ whereas in (6.13) they do not.
We now see further reduction of the parameters by invoking the second structural property, namely locality. Viewed in terms of pixels, this property states that a pixel x_{i,j} is not significantly influenced by far away pixels. A motivational illustration is in Figure 6.5
Figure 6.4: Edge detection of images with a pelican to illustrate the property of translation invariance.
consisting of pelicans and seagulls, with each individual bird enclosed in a red box. Generally,
the structural property of locality implies that if we are seeking information about one of
these specific birds, then it is sufficient to know the pixel information only within the box
that is enclosing the bird. Similarly, at a much finer level when we seek information about
edges or similar features, it is often enough to consider 1, 2, or 3 neighboring pixels in each
direction yielding convolution kernels of size 3 × 3, 5 × 5, or 7 × 7 respectively.
Figure 6.5: Images of birds to illustrate the property of locality. The pixel information within each red box is typically sufficient for understanding the characteristics of the bird inside the box.
To mathematically enforce locality for the evaluation of z^{[1]}_{i′,j′}, we ignore the pixel values x_{i′+i″, j′+j″} for i″ < 0, j″ < 0, i″ ≥ K_h, and j″ ≥ K_v for some chosen K_h, K_v > 0; e.g. K_h, K_v at 3, 5, or 7. Equivalently, we set w_{i″,j″} = 0 for all (i″, j″) with i″ ∉ {0, . . . , K_h − 1} and j″ ∉ {0, . . . , K_v − 1}. This further reduces the layer’s affine transformation to be,
\[
z^{[1]}_{i',j'} = \underbrace{\sum_{i''=0}^{K_h - 1} \sum_{j''=0}^{K_v - 1} w_{i'',j''}\, x_{i'+i'',\, j'+j''}}_{\star} + b, \tag{6.14}
\]
where the first term marked by ⋆ is essentially a convolution W ∗ x with W denoting a kernel matrix of dimension K_h × K_v. Hence, the operation of the layer can be represented as
\[
a^{[1]} = S^{[1]}\big(z^{[1]}\big), \quad \text{where} \quad z^{[1]} = (W * x) + b, \tag{6.15}
\]
where the addition of the scalar bias b is element-wise to each element of the matrix W ∗ x.

9 In Chapter ?? the notation W is used for weight matrices whereas here it is a (generally) smaller kernel matrix. Note that it implicitly defines a weight matrix, not directly used in computation.
Note that the convolution operation in (6.14) and (6.15) is slightly different from (6.7) studied in the previous section. To see this difference, recall that the (i′, j′)-th element of the convolution operation (6.7) is given by
\[
\sum_{i''=0}^{K_h - 1} \sum_{j''=0}^{K_v - 1} w_{K_h - i'',\, K_v - j''}\; x_{i'+i'',\, j'+j''},
\]
and compare this with the summation marked by ⋆ in (6.14). Hence in our case, W ∗ x is the conventional convolution if we replace each w_{i″,j″} with w_{K_h−i″, K_v−j″}; i.e., flipping at the origin. In the context of neural networks, such a replacement only implies reindexing of the learned parameters and has no effect on the network structure or its performance. For instance, if we observe the edge detection operation illustrated in Figure 6.1, the filter w is flipped only once, and after that for any input x we obtain an element-wise product between the flipped w and sub-matrices of x. Therefore, learning a filter and learning its flipped version are equivalent. As a result, in deep learning, the flipping operation is avoided for simplicity. In any case, the kernel matrix W is still called a convolutional kernel.
In summary we have seen that at its core, a single convolutional layer involves the following actions on the input x. First it is convolved with a convolution kernel W. Then the result is shifted by a scalar bias b. Finally an activation function S^[1](·) is applied. These actions are summarized in (6.15). Note that when the input dimension is M_h^[0] × M_v^[0], using (6.6), the dimension of the output is,
\[
M_h^{[1]} \times M_v^{[1]} = \big(M_h^{[0]} - K_h + 1\big) \times \big(M_v^{[0]} - K_v + 1\big). \tag{6.16}
\]
For an illustration of the reduction in the number of parameters that a convolutional layer has in comparison to a fully connected layer, consider an example with input dimension M_h^[0] × M_v^[0] = 224 × 224 and a case with kernel dimension K_h × K_v = 3 × 3. Here with (6.16), the output dimension is M_h^[1] × M_v^[1] = 222 × 222. If we were to seek the same size of output dimension with a fully connected layer, we have 222 × 222 = 49,284 neurons. Since the input size is 224 × 224 = 50,176, the dimension of the weight matrix is the product of the input size and output size (number of neurons), and together with the bias vector (one entry
for each neuron) we have 2,472,923,268 parameters. In contrast, in the convolutional layer there are only 3 × 3 + 1 = 10 parameters. While on its own, such a single convolutional layer is certainly not as expressive as the fully connected layer with 2.5 billion learned parameters, as we see below, combining convolutional layers in tandem yields very powerful networks with much fewer parameters than their fully connected counterparts.
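The arithmetic behind this comparison is easily reproduced; the short sketch below (variable names are ours) evaluates the output dimension via (6.16) and the two parameter counts.

```python
# Input and kernel dimensions for the example above.
M_h0, M_v0 = 224, 224
K_h, K_v = 3, 3

# Output dimension of the convolutional layer, equation (6.16).
M_h1, M_v1 = M_h0 - K_h + 1, M_v0 - K_v + 1
print(M_h1, M_v1)                              # 222 222

# Fully connected layer producing the same output size: weights + biases.
n_inputs = M_h0 * M_v0                         # 50,176 input features
n_neurons = M_h1 * M_v1                        # 49,284 output neurons
dense_params = n_inputs * n_neurons + n_neurons
print(dense_params)                            # 2,472,923,268

# Convolutional layer: one 3 x 3 kernel plus a single scalar bias.
conv_params = K_h * K_v + 1
print(conv_params)                             # 10
```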
Alterations to the Convolution: Padding, Stride, and Dilation
The convolution appearing in (6.14) is often tweaked and modified in the context of image data. Specifically, alterations to the convolution operation, known as padding, stride, and dilation, are sometimes employed. For a fixed kernel of dimension K_h × K_v, the combination of these modifications allows us to control the output size as well as the effective input size. Before diving into the details, we mention that these alterations are parameterized by non-negative integer pairs, (p_h, p_v) for padding, (s_h, s_v) for stride, and (d_h, d_v) for dilation, where the subscript h is for height and the subscript v is for width.
In the basic convolution operation above, the absence of padding, stride, and dilation corresponds to a selection of (0, 0) for padding, together with a selection of (1, 1) for both stride and dilation. Such a choice yields output dimension as in (6.16). However, when increasing these integers (typically by small single digit numbers), the output dimension formula (6.16) is generalized to,
\[
M_h^{[1]} \times M_v^{[1]} = \left(1 + \left\lfloor \frac{M_h^{[0]} - d_h(K_h - 1) - 1 + p_h}{s_h} \right\rfloor\right) \times \left(1 + \left\lfloor \frac{M_v^{[0]} - d_v(K_v - 1) - 1 + p_v}{s_v} \right\rfloor\right),
\tag{6.17}
\]
where ⌊u⌋ represents the largest integer not greater than u. We now introduce and motivate each of these alterations separately and develop (6.17). The reader may verify that with the aforementioned default settings (0 for padding, and 1 for stride and dilation), (6.17) reduces to (6.16).
To motivate padding, recall the edge detection example above. Due to the convolution operation, the output image dimension is smaller than the input image dimension. In particular, since the filter dimension K_h × K_v is 3 × 3 (Sobel filter), when the input dimension is M_h^[0] × M_v^[0], the output dimension is equal to (M_h^[0] − 2) × (M_v^[0] − 2) as in (6.16). Hence we see a slight reduction of the image size at the output. Since convolutional neural networks typically consist of several convolutional layers, the dimension reductions in each of these layers can accumulate, making the overall downstream dimension undesirably small. Padding is a simple solution to overcome this problem by adding extra zero-valued pixels around the input so that the effective input dimension is higher, and the desired output dimension is obtained.
To illustrate padding consider the example in Figure 6.3 (a). Here a convolutional layer with a kernel of dimension 3 × 3 is applied to inputs of dimension 6 × 7. Without padding, for each input we get an output of dimension 4 × 5. Now suppose we increase the dimension of the input to 8 × 9 by adding zeros around the input image. Then when we apply the convolution on the modified input, the output dimension is 6 × 7, which is equal to the unpadded input image dimension. Figure 6.6 illustrates this operation.
More generally, again suppose that the input dimension is M_h^[0] × M_v^[0] and the kernel dimension is K_h × K_v. Further suppose that each input image is modified by adding p_h rows,
roughly half on the top and half on the bottom, and p_v columns, roughly half on the left and half on the right. Then it is easy to check that (6.16) is modified so that the output dimension is
\[
M_h^{[1]} \times M_v^{[1]} = \big(M_h^{[0]} - K_h + p_h + 1\big) \times \big(M_v^{[0]} - K_v + p_v + 1\big). \tag{6.18}
\]
Note that setting (p_h, p_v) = (K_h − 1, K_v − 1) is a mechanism for ensuring that the input and the output are of the same dimension. Also note that typically convolutional neural networks are designed to have kernels of odd height and odd width. Hence it is common to pad with exactly p_h/2 rows of zeros on the top and p_h/2 rows of zeros on the bottom, and similarly with p_v/2 columns of zeros on the left and on the right, as shown in Figure 6.6. This helps maintain spatial symmetry while conducting convolutions.

Figure 6.6: Illustration of convolution with padding. In this example a 3 × 3 convolution with a padding setting of (p_h, p_v) = (2, 2) maintains the same dimensions for the output z as the input x.
The convolutions we presented up to now involved shifts of the convolution kernel by one pixel at a time. This is called a convolution with a stride of one, or (s_h, s_v) = (1, 1). However, in many applications, we may wish to slide the convolution kernel with bigger steps in order to either reduce the computational cost, or to reduce the dimension of the output of the convolutional layer. This is achieved by adjusting the stride size (s_h, s_v) to be greater than one.
As a toy example consider Figure 6.7 where the dimension of the input is 10 × 10 (potentially after padding), and the kernel is of dimension 3 × 3. For this example let us use a hypothetical stride setting of (s_h, s_v) = (5, 4). This setting implies that the convolution kernel is shifted in each step by 5 pixels down, or 4 pixels to the right. As usual we start from the top-left corner, placing the 3 × 3 convolution kernel on the input image to compute the first element of the output. After computing each element, we move the convolution kernel by 4 pixels to
the right and compute the next element of the row. Once a row of the output is finished, we move the convolution kernel downwards by 5 pixels and repeat the horizontal shifting for the next row of the output. Each time we compute an element of the output, we make sure there are enough selected input pixels for the convolution kernel.

Figure 6.7: Illustration of a convolution with stride settings (s_h, s_v) = (5, 4). In this hypothetical example there is no overlap, yet in practice one often uses smaller stride settings.
Note that in this example, for ease of presentation in the figure, we chose stride settings greater than the size of the convolution kernel and this implies no overlap of the sliding windows. However, in practice one typically uses stride settings of size 2, 3, or similar small steps, smaller than K_h and K_v, and this yields overlap in the convolution multiplications. In general the effect of a stride is in data reduction, allowing us to create outputs that are smaller in dimensions than the input, yet capture the essential information. A second mechanism for such reductions is pooling, a concept described in Section 6.4.
The alteration of convolutions with stride settings (s_h, s_v) modifies the output dimension equation from (6.18) to
\[
M_h^{[1]} \times M_v^{[1]} = \left(1 + \left\lfloor \frac{M_h^{[0]} - K_h + p_h}{s_h} \right\rfloor\right) \times \left(1 + \left\lfloor \frac{M_v^{[0]} - K_v + p_v}{s_v} \right\rfloor\right). \tag{6.19}
\]
The expression results from the fact that the number of elements computed in each row of the output after computing the first element of the row is equal to the number of rightwards moves allowed. With an effective input row size of M_v^[0] + p_v, this number is ⌊(M_v^[0] − K_v + p_v)/s_v⌋. Adding 1 due to the first element yields the width of the output; similarly for the height.
We now focus on dilation, a technique for increasing the receptive field. The receptive field of an individual filter is marked by the dimensions of the window in the input x that affect a single pixel in the output. For example with a standard 3 × 3 convolution, the receptive field is 3 × 3 since each pixel in W ∗ x is influenced by a 3 × 3 window in x. When layers are
composed, the receptive field has a more general meaning since as data propagates down
the network, the receptive field grows.
Dilation increases the receptive field by spreading out the elements of the kernel matrix W via the insertion of zeros between elements. This alteration allows the kernel to cover a larger area of the input image without increasing the number of learned parameters. The level of dilation is determined by the settings (d_h, d_v), where dilation converts a kernel of size K_h × K_v to a kernel of size K̃_h × K̃_v = (d_h(K_h − 1) + 1) × (d_v(K_v − 1) + 1). Specifically, dilation adds d_h − 1 all-zero rows between each pair of adjacent rows from the original kernel, and similarly adds d_v − 1 columns. Thus the number of all-zero rows added is (K_h − 1)(d_h − 1), and similarly (K_v − 1)(d_v − 1) for columns. See Figure 6.8 for an example with (d_h, d_v) = (2, 2).
Figure 6.8: Illustration of the dilation operation with (d_h, d_v) = (2, 2) extending a 3 × 3 convolution filter to create a receptive field of 5 × 5.
Overall, together with a padding of size (p_h, p_v) and a stride of size (s_h, s_v), a dilation factor of (d_h, d_v) implies that the output dimension is determined by (6.17). To see this, replace K_h and K_v in (6.19) with the effective kernel sizes K̃_h and K̃_v, respectively.
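The output dimension formula (6.17) is straightforward to encode as a helper function, one spatial dimension at a time; the sketch below (with our own naming) reproduces the examples discussed in this section.

```python
from math import floor

def conv_output_size(M, K, padding=0, stride=1, dilation=1):
    """Output size along one dimension, following equation (6.17)."""
    return 1 + floor((M - dilation * (K - 1) - 1 + padding) / stride)

# Defaults (no padding, stride 1, dilation 1) reduce (6.17) to (6.16).
print(conv_output_size(6, 3), conv_output_size(7, 3))          # 4 5, as in Figure 6.3
# Padding (p_h, p_v) = (2, 2) preserves the input dimensions, as in Figure 6.6.
print(conv_output_size(6, 3, padding=2), conv_output_size(7, 3, padding=2))  # 6 7
# Dilation (2, 2) gives an effective 5 x 5 kernel, as in Figure 6.8.
print(conv_output_size(6, 3, dilation=2), conv_output_size(7, 3, dilation=2))  # 2 3
```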
Inputs with Multiple Channels
So far in this section we have looked at the case where each input is a matrix, usually representing a grey scale image. However, convolutional networks often deal with inputs comprised of multiple channels. For instance, a color image has three channels representing the red, green, and blue components. When we have such data with multiple channels, the input to a convolutional layer is no longer a matrix but is rather represented as a three dimensional tensor. We denote this tensor’s dimensions via M_c^[0] × M_h^[0] × M_v^[0], where the depth M_c^[0] denotes the number of channels, and the other two numbers are for the height and width
dimensions as used previously. Hence, for color images we have M_c^[0] = 3 and further, as we describe in the sequel, for hidden layers we often have more than 3 input channels to the layer.
Figure 6.9: Graphical representation of the typical convolution operation with K_c = M_c^[0]. In this example there are M_c^[0] = 3 input channels and we use an M_c^[0] × 3 × 3 convolution kernel.
To deal with inputs with multiple channels we often conduct a volume convolution as in (6.9). For this we use a kernel W with depth greater than one, which is a three dimensional tensor with dimensions denoted via K_c × K_h × K_v, such that K_c ≤ M_c^[0], K_h ≤ M_h^[0], and K_v ≤ M_v^[0]. In fact, the typical case is to set K_c = M_c^[0], where the output is a matrix and the convolution is as in (6.10).
Namely for input tensor x, the convolution W ∗ x is a matrix which is computed via an element-wise sum of the two dimensional convolutions W^{(i)} ∗ x^{(i)} for i = 1, . . . , M_c^[0]. Each W^{(i)} ∗ x^{(i)} is a matrix of dimension M_h^[1] × M_v^[1] as in (6.17). This two dimensional convolution is based on the i-th channel in the input tensor, denoted x^{(i)}, and on W^{(i)}, which is the corresponding K_h × K_v dimensional matrix matching channel i in the convolution kernel tensor W. Note that the same settings of padding, stride, and dilation are applied across all the channels. Figure 6.9 illustrates such a volume convolution for the case of M_c^[0] = K_c = 3.
After the volume convolution is carried out, a single scalar bias term, b, is added to each element of the matrix W ∗ x. Then a (generally) non-linear activation function S^[1](·) is
applied. Hence the action of the convolution on multiple input channels parallels (6.15) and is,
\[
a^{[1]} = S^{[1]}\big(z^{[1]}\big), \quad \text{where} \quad z^{[1]} = \sum_{i=1}^{M_c^{[0]}} W^{(i)} * x^{(i)} + b. \tag{6.20}
\]
Outputs with Multiple Channels
Until now, regardless of the number of input channels, the output is a matrix, denoted via a^[1] in (6.20). This is because, so far, there is only one kernel, possibly a tensor, operating on the input to the convolutional layer. However, most popular convolutional neural networks have convolutional layers with multiple kernels operating on the input simultaneously. In this case, the output of the layer is a collection of matrices denoted by a^{[1]}_{(j)} for j = 1, . . . , M_c^[1], where M_c^[1] is the number of output channels (also known as feature maps). Consequently, the output can be viewed as a 3-dimensional tensor of dimension M_c^[1] × M_h^[1] × M_v^[1].
Figure 6.10: Illustration of a convolutional layer with 3 input channels and 2 output channels.
In this case, the convolutional layer is parameterized by multiple kernels W^{(j)} for j = 1, . . . , M_c^[1], each with a scalar bias term b^{(j)}. In particular, the kernel W^{(j)} and bias term b^{(j)} correspond to the output channel j. With this notation, the operation of the layer can be represented as,
\[
a^{[1]}_{(j)} = S^{[1]}\big(z^{[1]}_{(j)}\big), \quad \text{where} \quad z^{[1]}_{(j)} = \sum_{i=1}^{M_c^{[0]}} W^{(j),(i)} * x^{(i)} + b^{(j)}, \tag{6.21}
\]
for j = 1, . . . , M_c^[1], where similarly to (6.20), W^{(j),(i)} is the matrix corresponding to the i-th input channel for the j-th kernel. See Figure 6.10 for an illustration in the case of M_c^[0] = 3 and M_c^[1] = 2 (3 input channels and 2 output channels).
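Putting (6.21) together, the following sketch (function and variable names are ours) applies a convolutional layer with M_c^[0] = 3 input channels and M_c^[1] = 2 output channels to a random input, summing per-channel two dimensional convolutions and adding a scalar bias per output channel before a ReLU activation; as discussed earlier, the kernel flipping is omitted.

```python
import numpy as np

def conv_layer(x, kernels, biases):
    """x: (M_c0, M_h0, M_v0) input tensor; kernels: (M_c1, M_c0, K_h, K_v);
    biases: (M_c1,). Returns the (M_c1, M_h1, M_v1) output tensor of (6.21)."""
    M_c0, M_h0, M_v0 = x.shape
    M_c1, _, K_h, K_v = kernels.shape
    M_h1, M_v1 = M_h0 - K_h + 1, M_v0 - K_v + 1
    z = np.zeros((M_c1, M_h1, M_v1))
    for j in range(M_c1):                      # loop over output channels
        for i in range(M_c0):                  # sum over input channels
            for a in range(M_h1):
                for b in range(M_v1):
                    patch = x[i, a:a + K_h, b:b + K_v]
                    z[j, a, b] += np.sum(kernels[j, i] * patch)
        z[j] += biases[j]                      # one scalar bias per output channel
    return np.maximum(z, 0.0)                  # ReLU activation S(.)

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 6, 7))                 # 3 input channels
kernels = rng.normal(size=(2, 3, 3, 3))        # 2 output channels, 3 x 3 kernels
biases = rng.normal(size=2)
a1 = conv_layer(x, kernels, biases)
print(a1.shape)                                # (2, 4, 5)
```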
It is a common practice to use the same dimension K_c × K_h × K_v for all kernels W^{(j)} of the layer, with the same settings of padding (p_h, p_v), stride (s_h, s_v), and dilation (d_h, d_v) for all the channels. In that case, the dimension M_h^[1] × M_v^[1] of each output channel is given by (6.17).
As an illustrative hypothetical example of multiple output channels, assume that the input
to the first layer is a color image with three channels. One kernel can be used to extract
horizontal edges in each input channel of the image while another kernel of the same size
extracts vertical edges. In that case, the output has two channels where one consists of
horizontal edges and the other consists of the vertical edges. More generally, in trained
networks, we can think of different channels of the output as different feature extractions
from the input. These channels jointly help in overall feature extraction for the whole
network.
6.4 Building a Convolutional Neural Network
We have now acquired all the crucial elements necessary for constructing convolutional
neural networks, such as the VGG19 model depicted in Figure 6.2. We now put the pieces
together for constructing a convolutional neural network that, in addition to convolutional
layers, includes fully connected layers, as studied in Chapter ??, and pooling layers described
in this section. This section also offers complete details of the previously introduced VGG19
network, serving as an illustrative example. It also introduces fully convolutional networks,
an architecture that uses convolutional layers in place of fully connected layers.
A convolutional neural network is generally deep with multiple layers, similar to feedforward
networks studied in Chapter
??
. Unlike feedforward networks which consist of only fully
connected layers, convolutional neural networks have different types of layers, of which some
are trainable and the others are not, and the trainable layers are further broken up into
convolutional layers and dense layers. Using the notation of Chapter ??, we use $L$ for the number of layers, and decompose $L$ as
$$L = L_{\text{train}} + L_{\text{pool}}, \quad \text{where} \quad L_{\text{train}} = L_{\text{conv}} + L_{\text{dense}}.$$
Here $L_{\text{train}}$ counts the number of trainable layers, whereas $L_{\text{pool}}$ counts the number of layers that do not have trainable parameters. Further, the trainable layers are either convolutional layers, counted by $L_{\text{conv}}$, or fully connected layers, counted by $L_{\text{dense}}$. It is important to note that in terms of naming conventions, in some instances the depth of the network is taken as $L$, whereas in others it is taken as $L_{\text{train}}$. For example, in the VGG19 network,
$$L = 24, \quad L_{\text{train}} = 19, \quad L_{\text{pool}} = 5, \quad L_{\text{conv}} = 16, \quad L_{\text{dense}} = 3, \tag{6.22}$$
yet the network is called VGG19 and not “VGG24”.
Similar to a feedforward network, the goal of a convolutional neural network is to approximate some unknown function $f(\cdot)$. For instance, for classification of image data with animal faces, the function value $f(x)$ for any given image $x$ may yield a probability vector with the
highest weight on the index associated with the label of the image $x$. A convolutional neural network defines a mapping $f_\theta(\cdot)$ and learns the values of the unknown parameters $\theta$ that ideally result in $f(x) \approx f_\theta(x)$ for as many input images $x$ as possible. In general, similar to equation (??) for feedforward networks, the approximating function $f_\theta(\cdot)$ is recursively composed as
$$f_\theta(x) = f^{[L]}_{\theta^{[L]}}\Big(f^{[L-1]}_{\theta^{[L-1]}}\big(\ldots\big(f^{[1]}_{\theta^{[1]}}(x)\big)\ldots\big)\Big),$$
where for each $\ell$, the function $f^{(\ell)}_{\theta^{[\ell]}}(\cdot)$ is associated with the $\ell$th layer and depends on the layer's parameters $\theta^{[\ell]} \in \Theta^{[\ell]}$. Note that for layers that are not trainable (as counted via $L_{\text{pool}}$), the parameter space $\Theta^{[\ell]}$ is empty.
In general, similarly to feedforward networks, it is useful to denote the neuron activations of the network via $a^{[1]}, a^{[2]}, \ldots, a^{[L]}$, where $a^{[L]} = \hat{y}$ is the output, and for $\ell = 1, \ldots, L-1$,
$$a^{[\ell]} = f^{[\ell]}_{\theta^{[\ell]}}\big(a^{[\ell-1]}\big),$$
with $a^{[0]} = x$. We mention that the shape of the neurons per layer $a^{[\ell]}$ varies as it is sometimes a tensor (of order 3) and sometimes a vector, depending on the type of layer.
Convolutional Layers
When the $\ell$-th layer of the network is a convolutional layer, then $f^{(\ell)}_{\theta^{[\ell]}}(\cdot)$ uses (6.21), treating $a^{[\ell-1]}$ as the input. In this case the input and output are generally 3-tensors as we have seen in the previous sections. In particular,
$$f^{(\ell)}_{\theta^{[\ell]}}: \mathbb{R}^{M_c^{[\ell-1]} \times M_h^{[\ell-1]} \times M_v^{[\ell-1]}} \to \mathbb{R}^{M_c^{[\ell]} \times M_h^{[\ell]} \times M_v^{[\ell]}},$$
maps $a^{[\ell-1]}$ of dimension $M_c^{[\ell-1]} \times M_h^{[\ell-1]} \times M_v^{[\ell-1]}$ to $a^{[\ell]}$ of dimension $M_c^{[\ell]} \times M_h^{[\ell]} \times M_v^{[\ell]}$.
Now (6.21) is implemented for the $M_c^{[\ell]}$ output channels and this operation can be represented as,
$$f^{(\ell)}_{\theta^{[\ell]}}\big(a^{[\ell-1]}\big) = S^{[\ell]}\Bigg[\,\underbrace{b^{[\ell]}_{(j)} + \sum_{i=1}^{M_c^{[\ell-1]}} W^{[\ell]}_{(j),(i)} \star a^{[\ell-1]}_{(i)}}_{z^{[\ell]}_{(j)}}\,\Bigg]_{j=1,\ldots,M_c^{[\ell]}},$$
where the input tensor has $M_c^{[\ell-1]}$ channels and the output tensor has $M_c^{[\ell]}$ channels. Using similar notation to (6.21), the kernel $W^{[\ell]}_{(j)}$ is of dimension $K_c^{[\ell]} \times K_h^{[\ell]} \times K_v^{[\ell]}$ (the same for all output channels $j$), where the kernel matrix for the $i$-th input channel and $j$-th output channel is denoted $W^{[\ell]}_{(j),(i)}$. The tensor after the volume convolutions and bias term is denoted using the notation $\{z^{[\ell]}_{(j)}\}_{j=1,\ldots,M_c^{[\ell]}}$, where each $z^{[\ell]}_{(j)}$ is a matrix of dimension $M_h^{[\ell]} \times M_v^{[\ell]}$.
Note that the activation function $S^{[\ell]}(\cdot)$ is now considered as a function applied on a tensor of dimension $M_c^{[\ell]} \times M_h^{[\ell]} \times M_v^{[\ell]}$. It is typically an element-wise application of scalar activation functions $\sigma^{[\ell]}(\cdot)$, similarly to previous feedforward examples. In fact, the common activation function is $\sigma_{\text{ReLU}}(\cdot)$; see Section ??.
Observe that the number of learned parameters for the layer is,
$$M_c^{[\ell]} \cdot \big( M_c^{[\ell-1]} \cdot K_h^{[\ell]} \cdot K_v^{[\ell]} + 1 \big), \tag{6.23}$$
since there are $M_c^{[\ell]}$ kernels (one per output channel), each of dimension $K_c^{[\ell]} \times K_h^{[\ell]} \times K_v^{[\ell]}$ where $K_c^{[\ell]} = M_c^{[\ell-1]}$ (the number of input channels to the layer), and since each output channel adds a scalar bias term.
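As a quick check of (6.23), a one-line helper (a name of our own, for illustration only) reproduces the parameter counts listed in Table 6.1 for the first two convolutional layers of VGG19.

    def conv_layer_params(c_in, c_out, kh, kv):
        # (6.23): one kernel of size c_in x kh x kv plus a scalar bias, per output channel.
        return c_out * (c_in * kh * kv + 1)

    # First two convolutional layers of VGG19 (3 x 3 kernels):
    print(conv_layer_params(3, 64, 3, 3))   # 1792
    print(conv_layer_params(64, 64, 3, 3))  # 36928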
Pooling Layers
As mentioned above, there are also non-trainable layers counted by $L_{\text{pool}}$ and these are typically pooling layers. The main idea of a pooling layer is to reduce the height and width of the input tensor $a^{[\ell-1]}$ to achieve a lower dimensional output tensor $a^{[\ell]}$ while retaining the same number of channels. The operation of the layer can be summarized with a function,
$$f^{(\ell)}_{\text{pool}}: \mathbb{R}^{M_c^{[\ell-1]} \times M_h^{[\ell-1]} \times M_v^{[\ell-1]}} \to \mathbb{R}^{M_c^{[\ell]} \times M_h^{[\ell]} \times M_v^{[\ell]}},$$
with $M_c^{[\ell-1]} = M_c^{[\ell]}$, $M_h^{[\ell-1]} > M_h^{[\ell]}$, and $M_v^{[\ell-1]} > M_v^{[\ell]}$.
Generally, for some fixed channel $j$ and pixel coordinates of the output $(i, k)$, a pooling operation operates on pixels from a window in the input denoted via $\mathcal{I}_{(i,k)}$. Here $\mathcal{I}_{(i,k)}$ is a set of pixel coordinates in the input that are mapped to the specific output pixel $(i, k)$. There are two popular pooling techniques used in practice, namely, max-pooling and average-pooling. For each channel $j$, the pooling operation can be summarized as,
$$\big[a^{[\ell]}_{(j)}\big]_{i,k} =
\begin{cases}
\displaystyle \max_{(i',k') \in \mathcal{I}_{(i,k)}} \big[a^{[\ell-1]}_{(j)}\big]_{i',k'}, & \text{(max-pooling)}\\[2ex]
\displaystyle \frac{1}{|\mathcal{I}_{(i,k)}|} \sum_{(i',k') \in \mathcal{I}_{(i,k)}} \big[a^{[\ell-1]}_{(j)}\big]_{i',k'}. & \text{(average-pooling)}
\end{cases}$$
As is evident, max-pooling takes the maximal pixel value within the window as the output,
while average pooling averages pixel values within the window for the output.
Figure 6.11: An example of pooling with a $2 \times 2$ window. (a) Max-pooling. (b) Average-pooling.
The specifics of the pooling operation define exactly how
I
(i,k)
is determined. Generally,
similar to the convolution operation and its alternations with stride and padding, we may
view pooling as moving a small window over the input to compute an output. The way in
which this window moves implicitly defines
I
(i,k)
. As a concrete example see Figure 6.11
which illustrates a case of pooling with a window of dimensions 2
×
2. With this,
|I
(i,k)
|
= 4,
and then each output pixel (i, k) is computed based on all 4 pixels (i
, k
) I
(i,k)
from the
input image which form a 2
×
2 window in
a
[1]
(j)
. A typical pooling stride of the window
which covers all input pixels while forming non-overlapping windows is to shift each time
with the size of the window as in Figure 6.11. In general other pooling stride settings are
also possible.
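The following minimal NumPy sketch implements max-pooling and average-pooling for a single channel with non-overlapping $2 \times 2$ windows, i.e. a pooling stride equal to the window size as in Figure 6.11; the function name pool2d is ours for illustration.

    import numpy as np

    def pool2d(a, window=2, mode="max"):
        # Pool one channel a (H x V) over non-overlapping window x window blocks;
        # each block corresponds to one index set I_(i,k).
        H, V = a.shape
        Mh, Mv = H // window, V // window
        blocks = a[:Mh * window, :Mv * window].reshape(Mh, window, Mv, window)
        if mode == "max":
            return blocks.max(axis=(1, 3))
        return blocks.mean(axis=(1, 3))

    a = np.arange(16.0).reshape(4, 4)
    print(pool2d(a, mode="max"))   # [[ 5.  7.] [13. 15.]]
    print(pool2d(a, mode="mean"))  # [[ 2.5  4.5] [10.5 12.5]]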
The idea of pooling interplays with the notion that the initial layers of a convolutional
network focus on pixel level features similar to edge detection, and as we progress towards
the final layers of the network, the information is aggregated to address general questions
about the whole image. Thus deeper layers are less sensitive to translation changes on
the input image compared to the initial layers. For instance, the answer to a question “is
there a bird in the photo?” is the same for both images in Figure 6.5, even though the
corresponding outputs from the initial layers look different. Pooling layers are applied after
convolutional layers to help achieve such aggregation by reducing the spatial dimension of
the outputs. In addition, the dimension reduction which pooling layers offer is important
from a computational perspective.
We now return to the notion of a receptive field, previously discussed in the context of dilation and with respect to a single convolution. Now we consider it in the context of a whole network. In particular, we consider the receptive field of a derived feature. Consider a neuron in the network, $\big[a^{[\ell]}_{(j)}\big]_{i,k}$, for layer $\ell$, channel $j$, and pixel coordinates $(i, k)$. This neuron or activation is a derived feature inside the network. Using the dimensions and specifications of the layers up to that neuron, namely $1, \ldots, \ell$, it can be determined which input pixels from the input image $x$ affect the value of $\big[a^{[\ell]}_{(j)}\big]_{i,k}$. For example, if the neuron is at a first layer involving a $5 \times 5$ convolution kernel, then the value of the neuron is only determined by 25 pixels in the input image. However, if layer $\ell$ is a hidden layer with multiple convolutions and pooling layers prior to it, it may be that $\big[a^{[\ell]}_{(j)}\big]_{i,k}$ is determined by the whole input image $x$ or a significant portion of it. In general, pooling layers help increase the receptive field of neurons of hidden layers. This allows the derived features towards the end of the network to depend on the whole input image, or significant parts of it.
Fully Connected Layers
When the $\ell$-th layer of the network is a fully connected layer then the operation of $f^{(\ell)}_{\theta^{[\ell]}}(\cdot)$ is as in (??) of Chapter ??. Such layers are typically deployed at the end of the network. This is because the typical task of the last layers of a convolutional neural network is to address general questions, such as classification of the objects in the image. Note that since fully connected layers operate on vectors as the input, in cases where the previous layer has a tensor as output, the tensor is flattened to a vector.
It is common to adapt the final fully connected layers of convolutional networks for specific tasks. For example, the VGG19 model can have the final layers fine tuned for tasks such as object localization discussed in Section 6.6. In doing so, we may take the network trained for classification, and then fine tune it for the other task by only training the fully connected
layers. This is sometimes called freezing the layers that are not trained during training.
Similarly, convolutional networks that were trained on generic images from a general domain,
such as ImageNet, can be fine tuned by training the final layers on more specific images
from a specific domain (e.g., only on specific animal images of a certain type). This process,
also used in other non-convolutional models, is called transfer learning.
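As a rough sketch of freezing and transfer learning, the following hedged PyTorch snippet loads a VGG19 pretrained on ImageNet from torchvision, freezes the convolutional part, and replaces the last fully connected layer for a hypothetical new 10-class task; the exact weight-loading argument depends on the torchvision version.

    import torch.nn as nn
    import torchvision.models as models

    # Load a pretrained VGG19 (argument name may vary between torchvision versions).
    model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)

    # Freeze the convolutional part: its parameters receive no gradient updates.
    for param in model.features.parameters():
        param.requires_grad = False

    # Replace the last fully connected layer for, say, 10 classes, and train only
    # the fully connected ("classifier") part on the new dataset.
    model.classifier[6] = nn.Linear(4096, 10)
    trainable = [p for p in model.parameters() if p.requires_grad]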
VGG19 Revisited
We now take a closer look at the architecture of our running example network, VGG19.
While this is not the most modern convolutional architecture, it is instructive to consider it
here since it falls directly within the paradigms discussed above. Other popular convolutional
architectures are surveyed in the next section. We have seen in (6.22) the counts of different layer types in VGG19, which has $L = 24$ layers of which 19 are trainable. Table 6.1 provides complete details.
Each input to the network is a color image $x$ of dimension $M_c^{[0]} \times M_h^{[0]} \times M_v^{[0]} = 3 \times 224 \times 224$.
In the basic form we present here, the network is configured for a classification task with $K = 1{,}000$ classes. Thus, the output $\hat{y}$ of the network is a probability vector of length $1{,}000$, where the $i$th element, $\hat{y}_i$, denotes the probability that $x$ is of class $i \in \{1, \ldots, K\}$.
In this architecture all the convolutional kernels in the network are of the same dimension, $K_h \times K_v = 3 \times 3$. The padding and stride settings are the same for all the convolutional layers with $(p_h, p_v) = (2, 2)$ for padding and $(s_h, s_v) = (1, 1)$ for stride. There is no dilation, i.e. $(d_h, d_v) = (1, 1)$. With these settings, it is evident from (6.17) that for each convolutional layer, the input height and width dimensions are identical to the output height and width dimensions. Thus with this network, height and width dimensions are reduced only via pooling. All the pooling layers are max-pooling using $2 \times 2$ windows that are moved with a stride of $(2, 2)$ without padding. Thus each such pooling layer halves the height and width dimensions. The dimensions start at $224 \times 224$ and are halved using the sequence 224, 112, 56, 28, 14, and 7. Yet as layers progress, more channels are added, where we start with 3 channels in the input and increase to 64 channels in the first layer. Then after some of the pooling layers, we double the number of channels so that eventually by layer $\ell = 12$ there are 512 channels.
We see from Table 6.1 that the tensor output of the 21st layer, which is a max-pooling layer, is flattened to a vector of length $512 \times 7 \times 7 = 25{,}088$ that is given as input to the first fully connected layer, layer $\ell = 22$. In terms of activation functions, the architecture uses the Rectified Linear Unit (ReLU) activation function for all the hidden trainable layers and soft-max for the output layer, so that the output assigns a probability to each of the possible 1,000 classes.
In the original VGG19 paper,$^{10}$ the network was trained on the ImageNet dataset and nowadays when one uses this network, one often uses a pretrained version. In the original paper the input images were preprocessed by subtracting the mean red, green, and blue value, computed over the entire ImageNet training set, from each pixel. This type of preprocessing is needed in production (test time) as well. Note that in the original paper, to obtain the input size $224 \times 224$, input images were randomly cropped from rescaled training images, one
$^{10}$ The VGG19 architecture achieved state-of-the-art performance on the ImageNet classification task in 2014, with a top-5 error rate of 7.3%. This network is often used as a pre-trained model for transfer learning tasks, where the lower layers are fixed and the higher layers are fine tuned for a specific task.
Table 6.1: Specifications of the VGG19 architecture. The number of learned parameters for the convolutional layers is computed using (6.23). The number of learned parameters for a fully connected layer with input size $N^{[\ell-1]}$ and output size $N^{[\ell]}$ is $N^{[\ell-1]} \cdot N^{[\ell]} + N^{[\ell]}$; see Section ?? for more details on the learned parameters of fully connected layers.
Layer Number    Type of Layer    Output Dimension    Number of Neurons    Number of Learned Parameters
0 Input 3 × 224 × 224 - -
1 Convolution 64 × 224 × 224 3, 211, 264 1, 792
2 Convolution 64 × 224 × 224 3, 211, 264 36, 928
3 Max-pooling 64 × 112 × 112 802, 816 0
4 Convolution 128 × 112 × 112 1, 605, 632 73, 856
5 Convolution 128 × 112 × 112 1, 605, 632 147,584
6 Max-pooling 128 × 56 × 56 401, 408 0
7 Convolution 256 × 56 × 56 802, 816 295, 168
8 Convolution 256 × 56 × 56 802, 816 590, 080
9 Convolution 256 × 56 × 56 802, 816 590, 080
10 Convolution 256 × 56 × 56 802, 816 590, 080
11 Max-pooling 256 × 28 × 28 200, 704 0
12 Convolution 512 × 28 × 28 401, 408 1, 180, 160
13 Convolution 512 × 28 × 28 401, 408 2, 359, 808
14 Convolution 512 × 28 × 28 401, 408 2, 359, 808
15 Convolution 512 × 28 × 28 401, 408 2, 359, 808
16 Max-pooling 512 × 14 × 14 100, 352 0
17 Convolution 512 × 14 × 14 100, 352 2, 359, 808
18 Convolution 512 × 14 × 14 100, 352 2, 359, 808
19 Convolution 512 × 14 × 14 100, 352 2, 359, 808
20 Convolution 512 × 14 × 14 100, 352 2, 359, 808
21 Max-pooling 512 × 7 × 7 25, 088 0
Flattening to a vector of length 25, 088
22 Fully connected 4, 096 4, 096 102, 764, 544
23 Fully connected 4, 096 4, 096 16, 781, 312
24 Fully connected 1, 000 1, 000 4, 097, 000
Total: 16,391,656 Total: 143,667,240
crop per image at each iteration of the stochastic gradient descent optimization algorithm.
This type of data augmentation is further discussed in Chapter ??.
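A minimal sketch of this preprocessing step, assuming the per-channel training-set means are available as a length-3 vector mean_rgb (a name we introduce here for illustration) and that the image has already been rescaled so that a crop fits:

    import numpy as np

    def vgg_preprocess(image, mean_rgb, crop=224, rng=np.random.default_rng()):
        # image: H x W x 3 array with min(H, W) >= crop.
        # Subtract the per-channel mean (computed over the training set) and take a random crop.
        H, W, _ = image.shape
        top = rng.integers(0, H - crop + 1)
        left = rng.integers(0, W - crop + 1)
        patch = image[top:top + crop, left:left + crop, :].astype(np.float64)
        return patch - mean_rgb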
One by One Convolutions and Fully Convolutional Networks
A one by one convolutional layer is a special case of a convolutional layer where we apply $K_h^{[\ell]} \times K_v^{[\ell]} = 1 \times 1$ dimensional kernel matrices on all the input channels. At first glance, if one returns to the basics of two dimensional convolutions as in (6.7), it may seem like a one by one convolution is nothing but a scalar multiplication. However, since now there are $K_c^{[\ell]}$ (or $M_c^{[\ell-1]}$) channels at play, the one by one convolution allows us to create a linear combination of the input channels. For example, in image processing when one converts a
red, green, and blue color image into a monochrome (black and white) image, one way to do
so is to define each monochrome pixel as a linear combination of the three color pixel values,
and this is a one by one convolution.
One obvious application of one by one convolutions is for the reduction of depth (number
of channels) inside convolutional neural networks without changing the spatial dimension.
Return to the VGG19 architecture in Table 6.1 and observe that from layer 0 to layer 21
depth either stays the same or grows (starting at 3 and reaching 512). However, in contrast
to VGG19 that flattens layer 21, say we wanted to have a layer, which we call a depth
reduction layer, straight after layer 21, which reduces the depth from 512 channels to a lower
number. This can be viewed as a non-linear projection of the 512 channels in layer 21 onto a
tensor of lower dimension with less channels. Clearly, one by one convolutions offer a natural
way for such depth reduction where we set the number of one by one convolution kernels as
the desired number of output channels of the reduction layer. So for example in VGG19 if
we would have wanted layer 22 to be a tensor of dimension 8
×
7
×
7 instead of the fully
connected layer as in Table 6.1, then we would introduce 8 one dimensional convolutions for
that layer. The total parameter count for that layer would be 8
×
512 + 8 = 4
,
112, where
the additional +8 is for the bias term of each of the 8 one by one convolutions.
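The depth-reduction idea can be expressed compactly: a one by one convolutional layer with $C_{\text{in}}$ input channels and $C_{\text{out}}$ output channels is just a $C_{\text{out}} \times C_{\text{in}}$ matrix applied at every pixel location. A NumPy sketch (function name ours) illustrating the hypothetical 512 to 8 channel reduction discussed above:

    import numpy as np

    def one_by_one_conv(x, W, b):
        # x: (C_in, H, V) input tensor; W: (C_out, C_in) one by one kernels; b: (C_out,).
        # Each output channel is a (biased) linear combination of the input channels.
        return np.einsum('ji,ihv->jhv', W, x) + b[:, None, None]

    rng = np.random.default_rng(0)
    x = rng.standard_normal((512, 7, 7))
    W = rng.standard_normal((8, 512))
    b = rng.standard_normal(8)
    print(one_by_one_conv(x, W, b).shape)  # (8, 7, 7); parameter count 8 * 512 + 8 = 4,104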
Importantly, one by one convolutions also allow us to represent fully connected layers as convolutional layers. To see this, recall that a fully connected layer relies on an affine transformation on some input vector, say $x$ of length $N$, to obtain $z = Wx + b$, where $W$ is the weight matrix with $N$ columns, and $b$ is the bias vector. In that case, the $j$-th element of $z$ is,
$$z_j = \sum_{i=1}^{N} W_{j,i}\, x_i + b_j. \tag{6.24}$$
Now return to (6.21) and consider a one by one convolution on a volume $x$ of dimension $N \times 1 \times 1$. In this case $x_{(i)}$ can simply be represented as $x_i$ and the $\star$ operation can be replaced by multiplication. Omitting the superscript [1] in (6.21), we have,
$$z_{(j)} = \sum_{i=1}^{N} W_{(j),(i)} \cdot x_i + b_{(j)}. \tag{6.25}$$
Hence, we see that the fully connected operation
(6.24)
and the one by one convolution
operation (6.25) are essentially identical.
In general a convolutional network that does not have fully connected layers and has all
trained weights and biases associated with convolutional layers is called a fully convolutional
network. In essence a non fully convolutional network such as VGG19 may be transformed
into a fully convolutional network by replacing the fully connected layers using one by one
convolutions. This process is sometimes termed convolutionalization. For example, for VGG19
this means transforming layers 22, 23, and 24, as in Table 6.1, into convolutional layers. There
are multiple reasons for convolutionalization and multiple advantages to fully convolutional
architectures. Primarily, the representation of fully connected layers as convolutional layers
allows us to stack multiple parallel outputs or intermediate channels in a single tensor.
Dropout, Batch Normalization, and Group Normalization
Some of the techniques introduced in Chapter
??
for fully connected neural networks are also
applicable in convolutional networks. We now discuss two such techniques, namely dropout
and batch normalization. We also highlight group normalization which is a variant of batch
normalization in the context of convolutional neural networks.
Recall from Section
??
that dropout is a simple regularization technique where during each
forward pass in the training, only a random subset of the neurons (randomly selected for
that iteration) is used. In convolutional networks, we can still employ dropout for the fully
connected layers but not for the convolutional layers. This is because in convolutional layers,
the neurons have spatial orientation, and dropping out individual neurons could disrupt the
spatial structure.
Batch normalization, introduced in Section ??, often accelerates learning. The key idea is a shifting and scaling transformation using additional learned parameters as in (??) of Chapter ??, which generally maintains the activation values in a dynamic range near 0. For convolutional neural networks, batch normalization at a convolutional layer $\ell$ is usually applied on each channel $z^{[\ell]}_{(j)}$ of the convolution output $z^{[\ell]}$ before the corresponding activation is applied. That is, for two learned scalar parameters $\gamma^{[\ell]}_j$ and $\beta^{[\ell]}_j$, the $j$th channel matrix of dimension $M_h^{[\ell]} \times M_v^{[\ell]}$ after the normalization is given by
$$\tilde{z}^{[\ell]}_{(j)} = \gamma^{[\ell]}_j\, \bar{z}^{[\ell]}_{(j)} + \beta^{[\ell]}_j, \tag{6.26}$$
for each $j = 1, \ldots, M_c^{[\ell]}$, where as before we use $+$ for addition of the scalar to every element of the matrix. Here, the matrix being transformed has $(i', j')$-element $\big[\bar{z}^{[\ell]}_{(j)}\big]_{i',j'}$ that is computed similarly to (??) by subtracting the mean and then dividing by the square-root of the variance plus a small constant $\varepsilon$, where the mean and variance are computed for the same element $(i', j')$ of the $j$th channel of the convolution output $z^{[\ell]}_{(j)}$ over the entire mini-batch, similar to (??).
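The following sketch follows the per-element description of (6.26) given above: it standardizes each position of one channel over the mini-batch and then scales and shifts by the channel's learned $\gamma$ and $\beta$. Note that many library implementations instead pool the mean and variance over the spatial positions of the channel as well.

    import numpy as np

    def batch_norm_channel(z, gamma, beta, eps=1e-5):
        # z: mini-batch of one channel's pre-activations, shape (batch, Mh, Mv).
        mean = z.mean(axis=0)          # per-position mean over the mini-batch
        var = z.var(axis=0)            # per-position variance over the mini-batch
        z_bar = (z - mean) / np.sqrt(var + eps)
        return gamma * z_bar + beta    # gamma, beta: scalars learned for this channel

    rng = np.random.default_rng(0)
    z = rng.standard_normal((32, 14, 14))     # mini-batch of 32 for one channel
    z_tilde = batch_norm_channel(z, gamma=1.0, beta=0.0)
    print(z_tilde.mean(), z_tilde.std())      # approximately 0 and 1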
A variant that has gained popularity is called group normalization. Here, instead of normalizing each channel (applying (6.26) on a standardized $\bar{z}^{[\ell]}_{(j)}$), the channels of the convolution output are divided into a set of groups, and then the mean and variance values are computed for each group over a mini-batch, and similarly a form of (6.26) is applied per group. Hence the learned parameters ($\gamma$'s and $\beta$'s) are per group in a layer. Note that group normalization is identical to batch normalization when the number of groups is equal to the number of channels, but otherwise it reduces the number of learned parameters.
Understanding Inner Layers and Derived Features
Recall an elementary example from Section ?? where we estimated a simple linear regression coefficient $\hat{\beta}_1$ to have a value of 8.27. In that simple example, the interpretation of the estimated parameter was clear: a unit increase of the feature implies an increase of the output by 8.27. Thus with linear models, beyond using the model for prediction, the actual learned
parameters have meaning. Ideally, for deep learning models in general, and particularly
for convolutional neural networks, we would also like to have such an interpretation of the
learned parameters. Namely, what information do we know based on the learned convolution
kernels, weight matrices, and bias vectors? However, convolutional (and deep) models with millions of parameters are much more involved, and simple direct interpretability is typically not attainable.$^{11}$
While direct interpretability is not possible, there are multiple techniques for visualizing
convolutional neural networks. We now briefly summarize some overarching approaches
which we dichotomize as either weight based or feature based. The weight based approach
focuses on visualizing the learned convolution kernels of the network. The feature based
approach focuses on the activation values in specific channels and has several variants.
Starting with the weight based approach, visualizing the weights of convolution kernels with $K_c$ at most 3 is possible just by treating the kernel as a red, green, and blue image and displaying it. For many architectures this is possible at the first layer since the input has three channels (hence $K_c = 3$). In fact, for many trained models, the color image visualization of first layer convolution kernels shows that these filters are similar in nature to simple engineered filters such as edge detectors. On the other hand, for layers down the network, there are often more than three channels, and while we may try to use data reduction techniques to visualize the associated filters (each with $K_c > 3$), such a visualization is typically not fruitful.
Continuing to the feature based approach, we focus on the values of activations in specific
channels in the network. A simple mechanism is to apply different categories of images and
examine which neurons or activations are most excited by which category. A slightly more
sophisticated feature based approach is via the application of occlusions (covering part of
the view). The basic idea is to first consider a non-occluded image, and then occlude the
image by covering up some interesting part such as a face of a person. We then compare the
difference in neuron activation values for the non-occluded and occluded inputs. Neurons
for which the difference in activation values is significant may then be interpreted as being
sensitive to the occluded part of the image (e.g. to a face).
All the aforementioned approaches are simple in the sense that they do not rely on an
additional model, but rather just use the trained model under study. However, there are
multiple approaches that execute additional optimization for better interpretability insights.
As an illustration, let us see one such approach stemming from a landmark paper.$^{12}$ In
addition to the methodological contribution, the work of this paper also highlighted important
structural aspects of trained convolutional neural networks. Specifically, it was shown that
initial layers of the network generally seek simple visual features such as corners, colors, and
edges, while later layers of the network find much more refined artifacts such as faces, or
specific objects.
Consider Figure 6.12 which illustrates a visual interpretation of some channels within a
trained convolutional network. The network has many channels across multiple layers, and
here we present only a few of those channels, focusing on a pair of arbitrary channels within
each of the layers 2, 3, 4, and 5. Before we outline how the visualization in this figure was
created, let us interpret it. Each channel that we visualize has a 3
×
3 grid of synthesized
images (channel visualization) as well as a matching 3
×
3 grid of parts of images from a
$^{11}$ We mention that there is a whole field dealing with interpretable machine learning. In this subsection our goal is only to present a glimpse of the area.
$^{12}$ See "Visualizing and Understanding Convolutional Networks" by M. Zeiler and R. Fergus, [66].
$^{13}$ Image is adapted from figure 2 of "Visualizing and understanding convolutional networks", [66], with thanks to M. D. Zeiler and R. Fergus.
Figure 6.12: Visualization of the meaning of channels of a trained model.$^{13}$ We present two arbitrary channels for each of layers 2, 3, 4, and 5, and for each channel we see the 9 images that yield the 9 highest activation values. The gray background images (channel visualization) are processed via a deconvolution network from feature space back to pixel space. The original receptive field color images are the associated receptive fields in the images that excite those activations. It can be seen that initial layers search for more elementary features and layers deeper in the network search for more refined features. Importantly, it appears that the type of features searched for in each channel is generally homogeneous (although this is not always the case, as is evident with the top channel presented for layer 5).
dataset (original receptive field). These channel visualizations and original receptive fields
can serve as a visual interpretation of what the specific channel detects.
For example, we see that the two channels visualized in layer 2 detect simple features with
one channel focusing on edges and another channel focusing on circles. As we advance deeper
in the network we see that the type of visual patterns detected are much more complex. For
example, the two channels presented for layer 4 detect parts of animals, and the channels
of layer 5 detect such representations as well. Note however, that one of the channels in
layer 5 that we present appears to detect either faces or car wheels even though these are
very different objects. Hence any attempt to categorize channels based on their “meaning”
alone is far from absolute. Nevertheless, a visual representation such as that in Figure 6.12
can help to understand the function of individual channels within the network.
Let us now indicate how a visualization such as that in Figure 6.12 can be created. We may focus on any arbitrary specific channel $j$ in layer $\ell$. A validation set of images is run through the network and for each image we consider the activation matrix $a^{[\ell]}_{(j)}$ of the channel. We compute $\eta^{[\ell]}_{(j)} = \max_{i,k} \big[a^{[\ell]}_{(j)}\big]_{i,k}$, where the maximum is over the pixel coordinates $(i, k) \in \{1, \ldots, M_h^{[\ell]}\} \times \{1, \ldots, M_v^{[\ell]}\}$. We also keep the coordinates that attain this maximum, denoted here via $i^*$ and $k^*$. The idea is to find the neuron, or activation, that is maximally activated by each image in the validation set. Doing so for all images in the validation set, we then take the 9 images that achieve the maximal $\eta^{[\ell]}_{(j)}$ values and these are the 9 images used for visualization of that channel. Now for each image out of those 9 images we take the coordinates $(i^*, k^*)$ of the maximally activated neuron, and determine the receptive field of that neuron within the input image. We then present the receptive field
part of the input image for each of the 9 images. This visualization then illustrates the 9 most significant image patches for the channel in question.
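A short sketch of the selection step just described: for one fixed channel, compute $\eta^{[\ell]}_{(j)}$ and the maximizing coordinates per validation image, and keep the 9 images with the largest values (function name ours, for illustration).

    import numpy as np

    def top_images_for_channel(activations):
        # activations: list over validation images of the channel's activation matrices.
        # Returns indices of the 9 images with the largest maximal activation eta,
        # together with the (i*, k*) coordinates attaining the maximum in each.
        etas, coords = [], []
        for a in activations:
            idx = np.unravel_index(np.argmax(a), a.shape)
            coords.append(idx)
            etas.append(a[idx])
        top9 = np.argsort(etas)[::-1][:9]
        return top9, [coords[t] for t in top9]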
As for the channel visualization part (gray background images) of Figure 6.12, a more sophisticated process is carried out on each of the 9 selected images per channel. A type of network, called a deconvolution architecture, is constructed in parallel to the original convolutional network. This combined architecture enables transforming "feature space" back to "pixel space" for individual input images and specific neuron locations. That is, an input image to the original network is first processed. Then with a specific neuron $(i^*, k^*)$ in channel $j$ of layer $\ell$ specified, the deconvolution architecture returns an image associated with the receptive field of that particular neuron in "pixel space". While we do not specify the details of this particular deconvolution architecture, let us mention how it is used for the channel visualization. Each gray background channel visualization image in Figure 6.12 is an output in pixel space, resulting from the associated neuron $(i^*, k^*)$ in channel $j$ of layer $\ell$ specified to the deconvolution architecture. For this, all other neurons in channel $j$ of layer $\ell$ are set to 0, except for $\big[a^{[\ell]}_{(j)}\big]_{i^*,k^*}$ which is activated. The deconvolution architecture then works backwards in the network from layer $\ell$ to layer $\ell - 1$, and back, all the way until presenting the result in pixel space. The benefit of such channel visualization images is that they allow us to see how a single neuron $\big[a^{[\ell]}_{(j)}\big]_{i^*,k^*}$ "appears" in "pixel space". Importantly, we see that the 9 most activated neurons in the same channel are generally of the same nature.
6.5 Inception, ResNets, and Other Landmark Architectures
So far we covered the key components of convolutional neural networks and considered the
VGG19 model as one concrete network example. In this section we highlight other landmark
architectures within the world of convolutional neural networks. Our main goal is to highlight
ideas stemming from these architectures.
In general, the book avoids historical accounts as much as possible, yet in the context of
network architectures, some knowledge of the historical progression might be practically
useful. We thus begin with a brief historical account naming key architectures. We then
focus on three architectural ideas, namely the network within a network (inception network
also known as GoogLeNet), residual connections (ResNets), and efficient model scaling as
in EfficientNet. See also the notes and references at the end of the chapter for further
information.
A Brief Historical Account
As with many ideas in deep learning, with convolutional networks one can find quite early
roots. In this case, early convolutional networks include the Neocognitron from the late
1970’s and early 1980’s and LeNet-5 worked on during the mid 1980’s until the late 1990’s.
Both of these networks already encompass many of the ideas presented in this chapter, yet
in those days computation power was lacking and ease of implementation with software was
much less advanced.
The architecture that really advanced deep learning as a whole, and particularly convolu-
tional neural networks is AlexNet from 2012. At the time, it was a breakthrough in image
classification, achieving state-of-the-art performance on the ImageNet dataset. The archi-
tecture consists of five convolutional layers followed by three fully connected layers. It also
uses two parallel computation streams allowing the network to execute parallel forward
propagation and backpropagation, using two state of the art GPUs of the time. The work
on AlexNet also introduced several innovations that are now commonplace, such as the use
of ReLU activation functions and dropout regularization. While today, AlexNet is probably
not the first off-the shelf model that one would use, it can still be cast as the first “modern
convolutional neural network”. From a research and applied perspective, it was the success of
AlexNet that sparked the start of the deep learning era. After the introduction and success of
AlexNet, hundreds (and now many thousands) of researchers, both applied and theoretical,
shifted focus towards deep learning. This heavy research effort accelerated advances in the
field.
Architectures that followed AlexNet include ZFNet in 2013, VGGNet (including VGG19) in
2014, GoogLeNet in 2014, and ResNet in 2015. This short sequence of advances marks the
main evolution of convolutional architectures to what they are today. In more recent years,
vision tasks have also been tackled by non-convolutional networks using transformers. For
such ideas see Chapter
??
, describing transformers in the context of sequence or language
models, and see Chapter
??
where we highlight how ideas from different deep learning
domains interplay. Nevertheless, convolutional networks remain the bread and butter of
modern computer vision. A recent advance that we cover below is EfficientNet. This set of
models tries to optimally scale models to balance performance and model size.
Inception and Networks within a Network
The inception network, also called GoogLeNet, works by composing multiple sub-networks
into a bigger network. This idea is sometimes called a network within a network. Each
sub-network is called an inception module and such a module uses multiple filter sizes in
parallel.
Figure 6.13: One form of an inception module, playing a part in an inception network. The key idea is parallel computation of various paths followed by a concatenation of the outputs from all paths.
Figure 6.13 illustrates an example of one such inception module. In this example a volume of previous activations of dimension $192 \times 28 \times 28$ is transformed to an output volume of dimension $256 \times 28 \times 28$ (the number of channels grows from 192 to 256). Inside the inception module, there are four parallel paths, each operating independently and producing its own set of output channels. Then the outputs of these paths are concatenated.
The different paths of the inception module are designed to handle different scales and resolutions. The first path has $1 \times 1$ convolutions and yields 64 output channels. The second path has $1 \times 1$ convolutions followed by $3 \times 3$ convolutions. This path produces 128 output channels (with an intermediate number of 96 channels). The third path is similar, yet uses $5 \times 5$ convolutions instead of $3 \times 3$. It results in 32 output channels with 16 intermediate channels. Finally, the last path starts with a max-pooling operation with a pooling stride of 1, such that there is no reduction in spatial dimension, but rather only a non-linear operation. This is then followed by $1 \times 1$ convolutions which reduce the 192 channels entering this path to 32 output channels. All convolutions have ReLU activations and, where needed, there is padding such that the desired spatial dimensions are respected. It should be noted that when the inception network was developed, mass experiments were conducted to seek near-optimal settings for this inception module and similar ones.
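A hedged PyTorch sketch of the module in Figure 6.13 follows; the layer settings are inferred from the figure and the description above, and the padding choices are ours so that the $28 \times 28$ spatial dimensions are preserved.

    import torch
    import torch.nn as nn

    class InceptionModule(nn.Module):
        # Four parallel paths whose outputs are concatenated along the channel
        # dimension: 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels.
        def __init__(self):
            super().__init__()
            relu = nn.ReLU()
            self.path1 = nn.Sequential(nn.Conv2d(192, 64, 1), relu)
            self.path2 = nn.Sequential(nn.Conv2d(192, 96, 1), relu,
                                       nn.Conv2d(96, 128, 3, padding=1), relu)
            self.path3 = nn.Sequential(nn.Conv2d(192, 16, 1), relu,
                                       nn.Conv2d(16, 32, 5, padding=2), relu)
            self.path4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                       nn.Conv2d(192, 32, 1), relu)

        def forward(self, x):
            paths = (self.path1, self.path2, self.path3, self.path4)
            return torch.cat([p(x) for p in paths], dim=1)

    x = torch.randn(1, 192, 28, 28)
    print(InceptionModule()(x).shape)  # torch.Size([1, 256, 28, 28])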
The essence of the inception network is to interconnect such inception modules in series. The
concatenated channels that result as output from one module are given as input to the next
module. One aspect of this interconnection is that the number of channels generally grows
down the network. To mitigate such channel explosion, one by one convolutional layers are
placed between some of the inception modules. Such layers have been termed bottleneck
layers. Another aspect introduced with such networks was intermediate loss functions. Here
the idea is that in addition to the final loss function at the exit of the network, the loss is
also computed at various intermediate “exit points” and the gradient based optimization
uses the sum of all loss functions.
Empirically, when introduced in 2014, GoogLeNet outperformed the other networks of the time and, importantly, these types of networks appear to strike a balance between accuracy and computational efficiency. The original GoogLeNet has about 6.8 million parameters, much less than the 143.7 million parameters of VGG19. GoogLeNet can be viewed as having 22 parameterized layers with a total of 9 inception modules, two convolutional layers as initial layers, and only a single dense layer at the output. Practically, these days when one wishes to use an off the shelf trained convolutional neural network, some variant of GoogLeNet is often a prime choice.
Residual Connections
Recall early discussions in Section
??
, and in particular Figure
??
. There we claimed that in
general, as model complexity grows we expect training error to decrease simply because our
model is able to capture more complex relationships. With deep learning one would also
hope to see this type of phenomena when adding layers. However, this is only partially true.
Empirically it has been observed that when deep learning models get extremely deep with
dozens or hundreds of layers, training error actually starts to increase. In other words, as
we add more layers to a neural network, its training error initially decreases, but after a
certain depth, the network’s accuracy on the training set starts to saturate and sometimes
degrades. One reason for this phenomenon, which is often termed a degradation problem,
stems from vanishing and exploding gradient issues. When a gradient is backpropagated
through multiple layers, it can become extremely small, causing the weights in the earlier
layers to receive almost no updates during training. As a result, the network’s ability to
learn and generalize is reduced. See Section
??
for a discussion of vanishing and exploding
gradients.
Some further insight into the computational problems for very deep models is as follows.
We may hypothesize that good parameters, $\theta^{[\ell]}$, for layer $\ell$ are such that the operation of the layer $f^{[\ell]}_{\theta^{[\ell]}}(\cdot)$ is approximately an identity. Namely, the input to the layer, $a^{[\ell-1]}$, and the output, $a^{[\ell]}$, are ideally very similar. This can be hypothesized because with deep models we would expect individual layers to only apply minor variations to their inputs. If we accept such a hypothesis then we immediately get insight into some of the numerical and computational problems that learning entails. Specifically, learning functions close to $f^{[\ell]}_{\theta^{[\ell]}}(u) = u$ is often not trivial. For example, consider a pure convolutional layer and observe that for it to be an identity function the convolution kernel needs to be all zeros except for a single entry that is 1. Iterating over the parameters of convolutional layers until they become close to such an identity requires many gradient descent steps and can run into numerical problems.
Figure 6.14: A shortcut connection (residual connection) as part of a residual network.
An approach to overcome this problem is to use shortcut connections as in Figure 6.14. Here
the key idea is to take the input before a given layer (or sequence of layers), and bypass
the layer (or the sequence of layers). Then the bypassed information is added to the output
down the network, typically before the application of an activation function. Note that as
we bypass layers, it may be that channel dimensions are different. In such a case, we use one
by one convolutions, and similarly we may use padding, stride, and pooling to adjust the
spatial dimensions if needed.
Mathematically, and continuing with the hypothesis that layers should be close to identity functions, we may view the shortcut connection approach as a means to set the bypassed layer (or layers) to a function that approximately outputs zero. To see this, return to Figure 6.14 and assume that $r(u) \approx 0$. This then makes the operation of the whole sequence of layers with a bypass close to the identity. Specifically, in the figure we bypass $\tilde{\ell}$ layers, and if $r(u) \approx 0$, then $a^{[\ell + \tilde{\ell}]} \approx a^{[\ell]}$. Due to this reason the shortcut connections are also sometimes called residual connections and the whole architecture is called a ResNet. The usage of the term "residual" implies that by adding a shortcut connection, we are now learning $r(u)$ as a deviation from zero, or a residual.
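A minimal sketch of the block in Figure 6.14, with the two convolutions abstracted as shape-preserving functions supplied by the caller (for example $3 \times 3$ convolutions with padding so dimensions are preserved):

    import numpy as np

    def relu(u):
        return np.maximum(u, 0.0)

    def residual_block(u, conv1, conv2):
        # Residual branch r(u): convolution, ReLU, convolution, as in Figure 6.14.
        r = conv2(relu(conv1(u)))
        # Shortcut (identity) connection added before the final activation.
        return relu(u + r)

    # If conv1 and conv2 have near-zero weights, r(u) is close to 0 and the block
    # is close to the identity, which is the intuition discussed above.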
When the ResNet idea was introduced in 2015, networks of depths of dozens and even more
than one hundred layers were able to be efficiently trained. This elegant and simple idea
allows us to learn residuals instead of actual transformations. Ideas from ResNets propagated
to other aspects of deep learning beyond convolutional neural networks, such as for example
some sequence models presented in the next chapter. There are also models that combine
residual connections and inception modules, and these models are near the state of the art
of convolutional neural networks.
EfficientNet Models
EfficientNet is a family of convolutional neural network architectures that were developed with
the aim of providing better accuracy and efficiency in terms of model size and computation
cost. The key idea is to systematically scale up the dimensions of the network’s parameters
(such as depth, width, and resolution) in a balanced way, while also introducing a new
compound scaling method that optimizes these dimensions based on a set of pre-defined
constraints. This allows users to choose which form of EfficientNet model they want, in a
way that balances the number of parameters and the performance of the model. Figure 6.15 plots the parameter count vs. performance tradeoffs of EfficientNet models. The models are named B0, B1, \ldots, B7, where B0 is the most lightweight model in terms of parameter count, and B7 is the most computationally demanding model. It is seen that EfficientNet dominates other popular models.
6.6 Beyond Classification
The sections above focus on the internals of convolutional neural networks. For simplicity in
those sections, we discuss the task of image classification, e.g. determining if an image is
that of a cat or a dog. However, there are several other important image analysis tasks that
are also handled with convolutional neural networks. These tasks deal with analysis and
understanding of an image including the location of objects, the count of objects, separating
between different semantic features of the image, and more. Our purpose in this section is to
highlight such tasks. For this we present a brief overview of key computer vision developments
that use convolutional neural networks for tasks beyond classification.
$^{14}$ Image thanks to M. Tan and Q. V. Le, taken from "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks", [54]. See also [55].
Figure 6.15: Performance of various convolutional models as well as EfficientNet.$^{14}$
In terms of the input data, it is important to keep in mind that not all data is of the form of
monochrome or color images. Within computer vision, one often deals with image sequences
(short movies), or images that have more than 3 channels. For example, some images may
also have a distance channel capturing the distance from the camera per pixel. Further,
non-image data can also be handled via convolutional networks. One such example is fMRI
(functional magnetic resonance imaging) data which is 4 dimensional in nature as it records
the state of physical locations in three dimensions over time. Nevertheless, most of our
attention in this section is restricted to images.
Convolutional Networks and Key Computer Vision Tasks
As mentioned above, classification serves as a simple and useful example. For an input image $x$, a convolutional neural network $f_\theta(\cdot)$ has output $\hat{y} = f_\theta(x)$, which is a vector of probabilities where the highest probability typically determines the appropriate label for the image. As was evident from our detailed study of the VGG19 model in Section 6.4 and other architectures of Section 6.5, initial layers of the model $f_\theta(\cdot)$ are typically convolutional, and the final layers are typically fully connected layers. These final layers help transform the internal derived features in the network into the output vector of probabilities $\hat{y}$. When one considers tasks other than classification, it is often common to replace the final layers of the network with other layers such that the output $\hat{y}$ suits the desired task. With such a replacement we typically keep the initial layers as is.
$^{15}$ Image (b) is thanks to J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, taken from "You only look once: Unified, real-time object detection", [43]. Image (c) is thanks to H. Lai, S. Xiao, Y. Pan, Z. Cui, J. Feng, C. Xu, J. Yin, and S. Yan, taken from "Deep recurrent regression for facial landmark detection", [30]. Image (d) is attributed to B. Palac under the creative commons license and available via Wikimedia Commons. Image (e) is thanks to K. He, G. Gkioxari, P. Dollár, and R. Girshick, taken from "Mask R-CNN", [18].
Figure 6.16: Illustrations$^{15}$ of some common computer vision tasks beyond classification: (a) Object localization. (b) Object detection. (c) Landmark detection. (d) Semantic segmentation. (e) Instance segmentation. (f) Identification (face recognition).
Let us now get a feel for some of these tasks and in each case consider some possible structure for the output $\hat{y}$. Figure 6.16 illustrates key computer vision tasks for images. In
(a) we see object localization which is the task of identifying the location of an object in an
image, as well as possibly the type of object in which case the task is called localization and
classification. In (b) we see object detection which is the task of detecting multiple instances
of an object in an image, also separating between the objects and classifying their type. In
(c) we see landmark detection which is the task of identifying the specific pixel locations
of landmarks in an image. In (d) we see semantic segmentation which is the process of
classifying each individual pixel to be of a different class from a finite set of classes (pixel
wise classification). In (e) we see instance segmentation which finds different instances of
objects in the image and separates pixels to be of different instances. Finally, in (f) we see
the task of identification or more specifically face recognition which determines if an image
is that of a specific instance (or person).
Let us now consider possible forms of the output $\hat{y}$. For object localization, (a) in Figure 6.16, $\hat{y}$ needs to contain information about a bounding box which locates the object. This can be in the form of $(\hat{y}_x, \hat{y}_y, \hat{y}_h, \hat{y}_w)$, where $\hat{y}_x$ and $\hat{y}_y$ are the coordinates of (say) the upper left corner of the bounding box and $\hat{y}_h$, $\hat{y}_w$ are the height and width of the bounding box, respectively.
This information can also be augmented with probabilities for the respective classes (types of objects), including the possibility of having no object. For object detection, (b) in the figure, a collection of multiple bounding boxes needs to be supplied. For (c), landmark detection, a list of coordinates of the locations of landmarks comprises the output. For (d), semantic segmentation, each pixel location in the input image $x$ has an associated probability vector of classes in the output $\hat{y}$. Hence in this case, $\hat{y}$ can be represented as a tensor with width and height dimensions the same as the input image, and a depth dimension which is the number of classes in the segmentation. For (e), instance segmentation, the output is similar to that of semantic segmentation, but instead of recording probabilities of classes, the depth dimension of the output $\hat{y}$ is used for determining the specific instance of any given pixel. Finally, in the case of identification, or face recognition, as in (f) of Figure 6.16, the output is often just a probability as in a binary classifier, since the task is to determine if a face image matches a given pre-stored template or not. Note that in this case, the input $x$ is typically composed of two images, where one image, say $x_a$, is the template of the person (e.g. a stored image in a security database), and the other image, say $x_o$, is the image presented for comparison.
There are many ideas that have gone into developing architectures for handling tasks (a)–(f). Some of these ideas stem from vision analysis research, prior to the era of deep learning,
while other ideas evolved in parallel to deep learning in recent years. Object localization
and classification as in (a) is a particularly simple example and for this we provide more
details below. Similarly, identification (face recognition) is also worth consideration and
we provide more details below. Landmark detection (c) is handled easily also in a similar
spirit to object localization and classification; we omit the details. The other tasks including
object detection (b), semantic segmentation (d), and instance segmentation (e), are each big
topics of their own and we leave investigation of these for further reading. See the notes and
references at the end of the chapter.
Object Localization
To get a feel for object localization assume that we wish to train a convolutional neural network that operates on an input image $x$ and determines if the image contains a bird or a plane (classification). The model's second goal is to determine the specific location $(\hat{y}_x, \hat{y}_y, \hat{y}_h, \hat{y}_w)$ of that object (localization). Images with multiple birds or planes are not considered. Images without a bird and without a plane are possible and in this case the output yields nothing. One way to encode the output is
$$\hat{y} = \big(\hat{p}_{\text{nothing}},\ \hat{p}_{\text{bird}},\ \hat{p}_{\text{plane}},\ \hat{y}_x,\ \hat{y}_y,\ \hat{y}_h,\ \hat{y}_w\big),$$
where as in standard classification examples $(\hat{p}_{\text{nothing}}, \hat{p}_{\text{bird}}, \hat{p}_{\text{plane}})$ is a probability vector, and the other coordinates define a bounding box.
Here an output that has $\hat{p}_{\text{nothing}}$ greater than each of $\hat{p}_{\text{bird}}$ and $\hat{p}_{\text{plane}}$ implies a prediction of no bird and no plane. On the contrary, if $\hat{p}_{\text{bird}}$ is the highest probability then the output implies there is a bird, located in the bounding box $(\hat{y}_x, \hat{y}_y, \hat{y}_h, \hat{y}_w)$. Similarly for the other class, plane.
In terms of training data, for each input image we denote the output as $y$, where images without a bird or a plane are labeled as $y = (1, 0, 0, *, *, *, *)$, where the $*$ are "do not care" values. Images with a bird are labeled as $y = (0, 1, 0, y_x, y_y, y_h, y_w)$, where the bounding box $(y_x, y_y, y_h, y_w)$ is typically based on a manual determination by a human annotator. Similarly, images with a plane are labeled as $y = (0, 0, 1, y_x, y_y, y_h, y_w)$.
We now construct a loss function that captures closeness of $\hat{y}$ and $y$. For this we first separate the classification and localization objectives into a loss $C_{\text{classification}}(\theta\,;\, \hat{y}, y)$ and a loss $C_{\text{localization}}(\theta\,;\, \hat{y}, y)$. The former depends only on the probability components in $\hat{y}$ and $y$, and the latter depends only on the bounding box components in $\hat{y}$ and $y$. For the classification loss, we use categorical cross entropy as in (??). For the localization loss, we use a mean squared error as in (??), or some variant, applied to the four bounding box components. The two separate losses are then combined such that the loss for a specific observation is,
$$C_{\text{classification}}(\theta\,;\, \hat{y}, y) + \gamma \cdot (1 - y_1) \cdot C_{\text{localization}}(\theta\,;\, \hat{y}, y),$$
where $\gamma > 0$ is a hyper-parameter used to weigh the two losses, taken as $\gamma = 1$ by default. Observe that $y_1 = 1$ when the label is nothing and is otherwise 0, and thus for labels in the training data without a bird or a plane only the classification objective is used.
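A small sketch of this combined loss for a single observation, using the encoding $\hat{y} = (\hat{p}_{\text{nothing}}, \hat{p}_{\text{bird}}, \hat{p}_{\text{plane}}, \hat{y}_x, \hat{y}_y, \hat{y}_h, \hat{y}_w)$ introduced above; the example numbers are made up for illustration.

    import numpy as np

    def localization_loss(y_hat, y, gamma=1.0):
        # y_hat, y: length-7 vectors (p_nothing, p_bird, p_plane, y_x, y_y, y_h, y_w).
        eps = 1e-12
        cross_entropy = -np.sum(y[:3] * np.log(y_hat[:3] + eps))
        if y[0] == 1:   # label "nothing": bounding box entries are "do not care"
            return cross_entropy
        box_error = np.mean((y_hat[3:] - y[3:]) ** 2)
        return cross_entropy + gamma * box_error

    y_hat = np.array([0.1, 0.8, 0.1, 0.40, 0.35, 0.20, 0.30])
    y     = np.array([0.0, 1.0, 0.0, 0.42, 0.33, 0.22, 0.28])
    print(localization_loss(y_hat, y))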
To perform object localization, say with a model like VGG19, the network can be modified
by adding additional layers at the end of the architecture to predict the coordinates of the
bounding box. This can be achieved by attaching a regression head to the output of the final
convolutional layer of the network. The regression head consists of fully connected layers
that predict the coordinates of the bounding box. Such simple modifications of networks
that were otherwise designed for classification are always possible.
Face Recognition, Siamese Networks, and Triplet Loss
Let us get a feel for how identification (face recognition) as in Figure 6.16 (f) can be implemented both in production and training. First let us consider the simplified use of such a task. Say a face identification system needs to be able to recognize faces where in production one may have an anchor face image $x_a$ stored. With each use, the anchor needs to be compared to another image $x_o$. For example, every "login" is based on a new $x_o$ image and the system needs to determine if $x_o$ is of the same person as $x_a$ or not. In contrast to other tasks discussed in the book, here we do not have the ability to train a network for a particular person (or face), and similarly we do not have many different face images of the same person. Hence this setup requires a slightly different architecture.
One type of architecture useful for this task is a siamese network, illustrated schematically in Figure 6.17. The idea is that two parallel replicas of a convolutional neural network $f_\theta(\cdot)$ are used, one applied on $x_a$ and the other on $x_o$. The output of each of these networks is an embedding vector. Now since we have two embedding vectors, $f_\theta(x_a)$ and $f_\theta(x_o)$, we can compare them and see if they are likely associated with face images of the same person or not. One approach for this comparison is as in Figure 6.17, using a comparison network $f^c_{\theta^c}(\cdot, \cdot)$ for binary classification (output is a probability) with parameters $\theta^c$. Hence in production we can determine same if the output probability $f^c_{\theta^c}\big(f_\theta(x_a), f_\theta(x_o)\big)$ is greater than a threshold, or otherwise determine different. The comparison network is not too complex, and is often a shallow logistic regression model or similar. With such an architecture, the learned
Figure 6.17: A schematic of a siamese network architecture for identification (face recognition). The two parallel convolutional neural networks both share the same parameters $\theta$, and one operates on $x_a$ while the other operates on $x_o$. The outputs of these networks are embedding vectors. These are then compared via a comparison module, which may be a neural network with parameters $\theta_c$ and has output indicating if same or different.
With such an architecture, the learned parameters $\theta$ and $\theta_c$ are not designed for one particular face, but rather for any possible face.
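To make the architecture concrete, here is a minimal PyTorch sketch of such a siamese verifier. The class name SiameseVerifier, the embedding dimension of 128, and the choice of a single linear layer with a sigmoid as the "shallow logistic regression" comparison head are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SiameseVerifier(nn.Module):
    """Two weight-sharing embedding passes f_theta plus a small comparison head f^c_{theta_c}."""
    def __init__(self, embedding_net, embedding_dim=128):
        super().__init__()
        self.f_theta = embedding_net                  # shared CNN producing embedding vectors
        self.comparison = nn.Sequential(              # shallow logistic-regression-style comparison head
            nn.Linear(2 * embedding_dim, 1),
            nn.Sigmoid())

    def forward(self, x_a, x_o):
        e_a = self.f_theta(x_a)                       # embedding of the anchor image x_a
        e_o = self.f_theta(x_o)                       # embedding of the other image x_o
        return self.comparison(torch.cat([e_a, e_o], dim=1))  # probability of "same"

# In production: declare "same" when the returned probability exceeds a chosen threshold.
```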
Let us describe a simplified approach for training such a model. We can first treat the parameters $\theta$ as known, say from a pretrained or fine-tuned model, and focus on learning the parameters of the (smaller) comparison network $\theta_c$. For this, the training data can be of the form $\mathcal{D} = (x_a^{(1)}, x_o^{(1)}, y^{(1)}), \ldots, (x_a^{(n)}, x_o^{(n)}, y^{(n)})$, where each tuple $(x_a^{(i)}, x_o^{(i)}, y^{(i)})$ has an anchor image $x_a^{(i)}$, another image $x_o^{(i)}$, and a binary label $y^{(i)} \in \{0, 1\}$ with the value based on the images being different (0) or same (1). With such a dataset, all that is required is to train the binary classifier $f^c_{\theta_c}(\cdot)$.
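Continuing the SiameseVerifier sketch above, a minimal training loop for this stage could look as follows; freezing the embedding network, the use of Adam, and the learning rate are again illustrative choices rather than prescriptions.

```python
import torch
import torch.nn as nn

def train_comparison_head(model, loader, epochs=5, lr=1e-3):
    """Train only theta_c on tuples (x_a, x_o, y) with y = 1 for "same" and 0 for "different"."""
    for p in model.f_theta.parameters():      # theta is treated as known, so freeze it
        p.requires_grad = False
    optimizer = torch.optim.Adam(model.comparison.parameters(), lr=lr)
    bce = nn.BCELoss()                        # binary cross entropy on the output probability
    for _ in range(epochs):
        for x_a, x_o, y in loader:
            p_same = model(x_a, x_o).squeeze(1)
            loss = bce(p_same, y.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```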
A related useful concept for training siamese networks is the triplet loss. Say for simplicity that now our goal is to learn $\theta$ for $f_\theta(\cdot)$, ignoring the comparison network. For this we can set up a slightly different dataset of the form $\mathcal{D}_{\text{triplet}} = (x_a^{(1)}, x_d^{(1)}, x_s^{(1)}), \ldots, (x_a^{(n)}, x_d^{(n)}, x_s^{(n)})$, where now each $(x_a^{(i)}, x_d^{(i)}, x_s^{(i)})$ has an anchor face image $x_a^{(i)}$ as before, and also has two additional images with $x_d^{(i)}$ being a face image of a different person, and $x_s^{(i)}$ being a face image of the same person (not the exact same image as $x_a^{(i)}$). Now by applying $f_\theta(\cdot)$ on each element of this dataset, we can construct a loss function for observation $i$ as,
$$
C_i\big(\theta\,; x_a^{(i)}, x_s^{(i)}, x_d^{(i)}\big) = \max\Big\{\underbrace{\big\|f_\theta(x_a^{(i)}) - f_\theta(x_s^{(i)})\big\|^2}_{d_{\text{same}}} \;-\; \underbrace{\big\|f_\theta(x_a^{(i)}) - f_\theta(x_d^{(i)})\big\|^2}_{d_{\text{different}}} \;+\; \alpha,\; 0\Big\}, \qquad (6.27)
$$
where $\alpha > 0$ is some hyper-parameter called the margin and the Euclidean norm $\|\cdot\|$ can in principle be replaced by a different distance metric as well.
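A minimal PyTorch version of (6.27), averaged over a batch of triplets, might look as follows; the function name, the default margin of 0.2, and the use of squared Euclidean distances computed per row are illustrative assumptions.

```python
import torch

def triplet_loss(f_theta, x_a, x_s, x_d, alpha=0.2):
    """Sketch of the triplet loss (6.27) for a batch of (anchor, same, different) images."""
    e_a, e_s, e_d = f_theta(x_a), f_theta(x_s), f_theta(x_d)
    d_same = ((e_a - e_s) ** 2).sum(dim=1)       # squared distance: anchor vs. same-person embedding
    d_different = ((e_a - e_d) ** 2).sum(dim=1)  # squared distance: anchor vs. different-person embedding
    return torch.clamp(d_same - d_different + alpha, min=0.0).mean()
```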
Let us understand the motivation behind the triplet loss (6.27). Our desire is that the embedding associated with the anchor image $x_a^{(i)}$ and the embedding associated with the image of the same person $x_s^{(i)}$ be close to each other, and hence $d_{\text{same}}$ should ideally be small. Similarly we wish the embedding of $x_a^{(i)}$ and the embedding of $x_d^{(i)}$ to be distant from each other, and this motivates the negative sign in front of the $d_{\text{different}}$ term, which we ideally want to be large.
Now in general, when we have such an optimization with two competing criteria, $d_{\text{same}}$ which we want to be small, and $d_{\text{different}}$ which we want to be large, one approach to capture such a desire via a loss is by pre-determining a margin $\alpha$ and considering cases where,
$$
d_{\text{same}} - d_{\text{different}} \le -\alpha, \qquad (6.28)
$$
as being "admissible" and otherwise "inadmissible". We can then assign a loss of 0 to admissible cases, and assign a loss that depends on $\theta$ for the inadmissible cases. This is achieved with the $\max\{\cdot, 0\}$ operation since if (6.28) is satisfied, the loss in (6.27) is 0. In contrast, when (6.28) is not satisfied (inadmissible), the loss in (6.27) is $d_{\text{same}} - d_{\text{different}} + \alpha$. Hence when using gradient descent based learning of $\theta$ for minimization of $\sum_{i=1}^{n} C_i(\theta\,;\mathcal{D}_{\text{triplet}})$, at any iteration, we drive the loss down for the inadmissible observations.
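As an illustrative numerical check (the numbers are ours, not from a trained model), take $\alpha = 0.2$. A triplet with $d_{\text{same}} = 0.5$ and $d_{\text{different}} = 1.0$ satisfies (6.28) since $0.5 - 1.0 = -0.5 \le -0.2$, so its loss in (6.27) is $\max\{-0.3,\, 0\} = 0$ and it does not affect the gradient. A triplet with $d_{\text{same}} = 0.9$ and $d_{\text{different}} = 1.0$ violates (6.28) and contributes a loss of $0.9 - 1.0 + 0.2 = 0.1$.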
The triplet loss with a properly curated dataset $\mathcal{D}_{\text{triplet}}$ has been effectively used for state of the art face recognition training. We note that when curating this dataset it is often important to preprocess the images so that $x_a^{(i)}$ and $x_d^{(i)}$ are not acutely different. We also note that with the use of the triplet loss we can add a comparison network $f^c_{\theta_c}(\cdot)$ which is trained as a binary classifier, after training $\theta$ with the triplet loss. In other cases, using the cosine distance between the two embedding vectors $f_\theta(x_a)$ and $f_\theta(x_o)$ suffices in production.
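A sketch of this last production option, using cosine similarity (one minus the cosine distance) with an illustrative threshold that would in practice be tuned on validation data:

```python
import torch.nn.functional as F

def same_person(f_theta, x_a, x_o, threshold=0.8):
    """Decide "same"/"different" directly from the embeddings, with no comparison network."""
    e_a, e_o = f_theta(x_a), f_theta(x_o)
    similarity = F.cosine_similarity(e_a, e_o, dim=1)  # in [-1, 1]; larger means more alike
    return similarity > threshold                      # boolean decision per image pair
```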
Notes and References
Before we outline notes and references associated with explicit details of this chapter, here is a brief description of early convolutional neural network developments. Initial ideas originated in the 1950s and 1960s with the study of the visual cortices of animals, primarily by Hubel and Wiesel over a series of publications including [21] and [22]. Early concrete models that have some similarity with modern convolutional neural networks are the 1980 neocognitron [12] for pattern recognition, as well as the 1988 time delay neural network [31] for speech recognition. In the 1990s convolutional neural networks saw industrial applications for the first time with [17] for handwritten character recognition and [5] for signature verification. Other significant early works include [32] for written digit recognition, [56] for face recognition, and [58] for phoneme recognition. Finally we mention that the LeNet-5 model developed in the late 1990s by Yann LeCun et al. [34] is recognized as an early form of contemporary convolutional neural networks and it was used for classifying 28 × 28 images of grayscale handwritten digits. We also mention that in 1989, with [32] and [33], LeCun et al. developed the first multi-layered convolutional networks for handwritten character recognition trained using backpropagation.
The structure of convolutional layers in neural networks as we present in this chapter solidified at around the 2012–2016 period and best fits the VGG model [50]. This model followed the pivotal AlexNet model [28] from 2012 which was specifically designed for training on two parallel GPUs. Other notable convolutional architectures of this period are the GoogLeNet or inception network model of [53], the batch normalization inception model [24] which uses batch normalization of layer inputs, and ResNets which were introduced in [19]. All of these models competed in the ImageNet challenges of that era with the results from each model effectively outperforming those that came prior to it. Other developments included the SqueezeNet model of [23], which marked a key milestone in reducing the parameter size and memory footprint of convolutional networks without compromising accuracy; this model achieved AlexNet-level accuracy with far fewer parameters and a much smaller memory footprint. Also see the Network-in-Network model of [37], which inspired the inception networks, and [52], which uses the dropout mechanism to reduce overfitting in convolutional layers. See [42] for a comprehensive survey of convolutional neural networks of that time as well as the more recent survey [36]. In times closer to the publication of this book, paradigms such as EfficientNet appeared in [54]; see also the more recent version, EfficientNet v2, in [55].
Ideas of dilation in convolutional networks were introduced in [64] for dense prediction, where the
goal is to compute a label for each pixel in the image. Furthermore, dilation for residual networks is
introduced in [65]. See also the discussion of group normalization in [62].
A general overview of linear time invariant systems can be found in standard texts such as [29] which is also useful for understanding basic filtering. The book [1] can provide a more mathematically rigorous foundation and can also be useful for understanding the delta function in continuous time. The probabilistic interpretation of a convolution is standard and can be found in any elementary probability textbook such as [45]. The multiplication of polynomials interpretation, also coupled with the study of the fast Fourier transform, can be found in [8]. A simple explanation of the representation of discrete convolutions in terms of Toeplitz matrices can be found in [4]. For analysis of convolutions in classic image processing applications as well as many other classic image processing techniques see [25]. The Sobel filter is one of many convolution based filtering operations. It was developed by Sobel and Feldman, and presented at a 1968 scientific talk; see [51] for an historical review.
The rise of convolutional neural networks drove the development of many paradigms using these networks for different tasks. In terms of object detection, early works are [15] and [16] and recent work in this direction is [59] where the YOLOv7 model enhances the landmark YOLO (you only look once) work of [43]. A recent survey on object detection can be found in [70]. The important area of semantic segmentation has received much attention with notable papers being [44] (U-net), as well as [41]. Instance segmentation is studied in [3], [18], and [38]. For additional recent surveys of the subsequent developments in semantic and instance segmentation see [57] and [40]. See also [13] for a survey of video semantic segmentation. Influential work on identification (face recognition) is in [46] and early ideas of siamese networks are from [7]; see also [20] and [61].
Over the years, many effective network visualization methods were developed for understanding inner layers and derived features. Before the era of great popularity of convolutional networks, the work in [10] introduced a technique aimed at optimizing the input to maximize the activity
of hidden neurons in a deep neural network. For convolutional neural networks, the deconvolution architecture in [66], based on previous work in [67], was significant as it was the first work where effective visualization of internal layers was made possible. Other related important papers in this direction are [39], which introduced a technique called network inversion, and [2], which introduced a framework called network dissection. A generally useful survey on visual interpretability for deep learning is [68]. Other related ideas that we do not discuss in this book include deep dreaming$^{16}$ and directly using convolutional networks for neural style transfer, initially introduced in [14], with further developments reported in [9]. See also the related generative models of Chapter ??.
In terms of real world applications, these days convolutional neural networks are used in many scenarios. For image classification applications of convolutional neural networks, see for example [47] dealing with traffic sign recognition, and [35] for medical image classification, among many others. For a review of advances in image classification, refer to [6]. The most basic application of convolutional neural networks is with 3-dimensional tensors as appropriate for color images, yet there are other cases as well. In [69] 4-dimensional fMRI data is studied. Also, videos are analyzed in [26] by treating the entire video as a bag of short clips. In particular, see [11], [49], and [60] for video-based action recognition; a brief summary of such methods is listed in [63]. In general, techniques for analyzing video data vary depending on the task at hand; see [48] for a brief survey of such tasks and the corresponding methods. We also mention that transformer models, as introduced in Chapter ??, have been applied to images and managed to surpass the performance of convolutional networks in certain cases when trained with huge datasets. See [27] for a survey as well as the notes and references at the end of Chapter ??.
$^{16}$ This blog post is credited for introducing the concept of deep dreaming: https://blog.research.google/2015/06/inceptionism-going-deeper-into-neural.html.
Bibliography
[1] P. J. Antsaklis and A. N. Michel. Linear systems. Springer, 1997.
[2]
D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying
interpretability of deep visual representations. In Proceedings of the IEEE conference
on computer vision and pattern recognition, 2017.
[3]
D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee. YOLACT: Real-Time Instance Segmentation.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
[4]
S. Boyd and L. Vandenberghe. Introduction to applied linear algebra: Vectors, matrices,
and least squares. Cambridge university press, 2018.
[5]
J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification
using a "siamese" time delay neural network. Advances in neural information processing
systems, 1993.
[6]
L. Chen, S. Li, Q. Bai, J. Yang, S. Jiang, and Y. Miao. Review of image classification
algorithms based on convolutional neural networks. Remote Sensing, 2021.
[7]
S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively,
with application to face verification. In 2005 IEEE computer society conference on
computer vision and pattern recognition (CVPR’05), 2005.
[8]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms.
MIT press, 2022.
[9]
V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style.
arXiv:1610.07629, 2016.
[10]
D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of
a deep network. University of Montreal, 2009.
[11]
C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional Two-Stream Network Fusion
for Video Action Recognition. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2016.
[12]
K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern
recognition unaffected by shift in position. Biological Cybernetics, 1980.
[13]
A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, P. Martinez-
Gonzalez, and J. Garcia-Rodriguez. A Survey on Deep Learning Techniques for Image
and Video Semantic Segmentation. Applied Soft Computing, 2018.
[14]
L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style.
arXiv:1508.06576, 2015.
[15]
R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on
Computer Vision, 2015.
[16]
R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate
object detection and semantic segmentation. In Proceedings of the IEEE conference on
computer vision and pattern recognition, 2014.
[17]
I. Guyon, P. Albrecht, Y. LeCun, J. Denker, and W. Hubbard. Design of a neural
network character recognizer for a touch terminal. Pattern Recognition, 1991.
[18]
K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the
IEEE International Conference on Computer Vision, 2017.
[19]
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
[20]
G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S. Z. Li, and T. Hospedales. When
Face Recognition Meets with Deep Learning: An Evaluation of Convolutional Neural
Networks for Face Recognition. In Proceedings of the IEEE International Conference
on Computer Vision Workshops, 2015.
[21]
D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat’s striate
cortex. The Journal of physiology, 1959.
[22]
D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional
architecture in the cat’s visual cortex. The Journal of Physiology, 1962.
[23]
F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer.
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size.
arXiv:1602.07360, 2016.
[24]
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In International conference on machine learning, 2015.
[25] B. Jähne. Digital image processing. Springer Science & Business Media, 2005.
[26]
A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-
Scale Video Classification with Convolutional Neural Networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[27] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah. Transformers
in vision: A survey. ACM Computing Surveys (CSUR), 2022.
[28]
A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep
convolutional neural networks. Advances in neural information processing systems,
2012.
[29] H. Kwakernaak and R. Sivan. Modern signals and systems. Prentice Hall, 1991.
[30]
H. Lai, S. Xiao, Y. Pan, Z. Cui, J. Feng, C. Xu, J. Yin, and S. Yan. Deep recurrent
regression for facial landmark detection. IEEE Transactions on Circuits and Systems
for Video Technology, 2016.
[31]
K. J. Lang. A time-delay neural network architecture for speech recognition. Technical
Report, Carnegie-Mellon University, 1988.
[32]
Y. LeCun, B. Boser, J. Denker, D. Henderson, W. Hubbard, and L. Jackel. Handwritten
digit recognition with a back-propagation network. Advances in neural information
processing systems, 1989.
[33]
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and
L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural
Computation, 1989.
[34]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 1998.
[35]
Q. Li, W. Cai, X. Wang, Y. Zhou, D. D. Feng, and M. Chen. Medical Image Classification
with Convolutional Neural Network. In 2014 13th International Conference on Control
Automation Robotics & Vision (ICARCV), 2014.
[36]
Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou. A survey of convolutional neural networks:
Analysis, applications, and prospects. IEEE Transactions on Neural Networks and
Learning Systems, 2021.
[37] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
[38]
S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path Aggregation Network for Instance
Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2018.
[39]
A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting
them. In Proceedings of the IEEE conference on computer vision and pattern recognition,
2015.
[40]
S. Minaee, Y. Y. Boykov, F. Porikli, A. J. Plaza, N. Kehtarnavaz, and D. Terzopoulos.
Image Segmentation Using Deep Learning: A Survey. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2021.
[41]
H. Noh, S. Hong, and B. Han. Learning Deconvolution Network for Semantic Seg-
mentation. In Proceedings of the IEEE International Conference on Computer Vision,
2015.
[42]
W. Rawat and Z. Wang. Deep convolutional neural networks for image classification: A
comprehensive review. Neural Computation, 2017.
[43]
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified,
real-time object detection. In Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016.
[44]
O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical
image segmentation. In Medical Image Computing and Computer-Assisted Intervention–
MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015,
Proceedings, Part III 18, 2015.
[45] S. M. Ross. A first course in probability. Pearson, 2014.
[46]
F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face
recognition and clustering. In Proceedings of the IEEE conference on computer vision
and pattern recognition, 2015.
[47]
P. Sermanet and Y. LeCun. Traffic Sign Recognition with Multi-Scale Convolutional
Networks. In The 2011 International Joint Conference on Neural Networks, 2011.
[48]
V. Sharma, M. Gupta, A. Kumar, and D. Mishra. Video Processing Using Deep Learning
Techniques: A Systematic Literature Review. IEEE Access, 2021.
[49]
K. Simonyan and A. Zisserman. Two-Stream Convolutional Networks for Action
Recognition in Videos. Advances in Neural Information Processing Systems, 2014.
[50]
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv:1409.1556, 2014.
[51]
I. Sobel. History and definition of the sobel operator. Retrieved from the World Wide
Web, 2014.
[52]
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout:
a simple way to prevent neural networks from overfitting. The journal of machine
learning research, 2014.
[53]
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE
conference on computer vision and pattern recognition, 2015.
[54]
M. Tan and Q. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural
Networks. In International Conference on Machine Learning, 2019.
[55]
M. Tan and Q. V. Le. EfficientNetV2: Smaller Models and Faster Training. International
Conference on Machine Learning, PMLR, 2021.
[56]
A. C. Tsoi. Face recognition: A convolutional neural-network approach. IEEE Transac-
tions on Neural Networks, 1997.
[57]
I. Ulku and E. Akagündüz. A Survey on Deep Learning-Based Architectures for
Semantic Segmentation on 2D Images. Applied Artificial Intelligence, 2022.
[58]
A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang. Phoneme recognition
using time-delay neural networks. In Backpropagation. 2013.
[59]
C. Y. Wang, A. Bochkovskiy, and H. Y. M. Liao. YOLOv7: Trainable Bag-of-Freebies
Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[60]
L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal
Segment Networks: Towards Good Practices for Deep Action Recognition. In European
Conference on Computer Vision, 2016.
[61]
H. Wu, Z. Xu, J. Zhang, W. Yan, and X. Ma. Face Recognition Based on Convolution
Siamese Networks. In 2017 10th International Congress on Image and Signal Processing,
BioMedical Engineering and Informatics (CISP-BMEI), 2017.
[62]
Y. Wu and K. He. Group normalization. In Proceedings of the European conference on
computer vision (ECCV), 2018.
[63]
G. Yao, T. Lei, and J. Zhong. A Review of Convolutional-Neural-Network-Based Action
Recognition. Pattern Recognition Letters, 2019.
[64]
F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions.
arXiv:1511.07122, 2015.
[65]
F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 472–480, 2017.
[66]
M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In
European conference on computer vision, 2014.
[67]
M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid
and high level feature learning. In 2011 International Conference on Computer Vision,
2011.
[68]
Q. Zhang and S. Zhu. Visual interpretability for deep learning: a survey. Frontiers of
Information Technology & Electronic Engineering, 2018.
[69]
Y. Zhao, X. Li, W. Zhang, S. Zhao, M. Makkie, M. Zhang, Q. Li, and T. Liu. Modeling
4D fMRI Data via Spatio-Temporal Convolutional Neural Networks (ST-CNN). In
Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st
International Conference, 2018, Proceedings, Part III 11, 2018.
[70]
Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye. Object Detection in 20 Years: A Survey.
Proceedings of the IEEE, 2023.