Mathematical Engineering
of Deep Learning
Benoit Liquet, Sarat Moka and Yoni Nazarathy
March 3, 2026
3 Simple Neural Networks
In the previous chapter we explored machine learning in general and also focused on linear models trained via gradient-based optimization. In this chapter we move up a notch and consider logistic regression and multinomial (softmax) regression models used for regression and classification. These models often play a key role in statistical modeling and are also very important from a deep learning perspective. Logistic regression and multinomial regression models are shallow neural networks, but they also involve non-linear activation functions, and hence understanding their structure and means of training is a gateway to understanding general deep learning networks.
In exploring logistic and multinomial regression, we are introduced to some of the basic mathematical engineering elements of deep learning. These include the sigmoid activation function, the cross-entropy loss, the softmax function, and the convexity properties that these models enjoy. We also use the opportunity to consider simple but non-trivial autoencoders that generalize PCA, introduced in the previous chapter.
In Section 3.1 we consider the statistical viewpoint of the logistic regression model. In the
process we see how binary cross entropy loss results from maximum likelihood estimation. In
Section 3.2 we consider the same model, only this time as a shallow neural network trained
via gradient descent. In Section 3.3 we adapt the model to multi-class classification. Here we
introduce the softmax function as well as the categorical cross entropy loss. In Section 3.4
we investigate feature engineering for such models and see how additional features extend the resulting classifiers to have non-linear decision boundaries. In Section 3.5 we move on to unsupervised learning and consider simple non-linear autoencoder models. In that section we also discuss general autoencoder concepts and describe a few applications of autoencoders. Note that a reader returning to Section 3.5 after reading Chapter 5 may also envision how the simple shallow autoencoders that we present there can be generalized to deep autoencoders.
3.1 Logistic Regression in Statistics
Logistic regression can be viewed as the simplest non-linear neural network model. However,
outside of the context of deep learning, logistic regression is a very popular statistical model.
In this section we present logistic regression via a statistical viewpoint. We show how to
estimate parameters via maximum likelihood estimation and in the process are introduced to
the (binary) cross entropy loss function common in deep learning models including logistic
regression, but also beyond.
Note that the statistical view of logistic regression also incorporates parameter uncertainty intervals, hypothesis tests, and other statistical inference aspects. Literature for such statistical inference aspects of logistic regression is provided at the end of the chapter.
The Model
Logistic regression is a model for predicting the probability that a binary response is 1. It is suitable for classification tasks, a concept described in Section ??, as well as for prediction of proportions or probabilities. From a statistical perspective, it is defined by assuming that the distribution of the binary response variable, $y$, given the features, $x$, follows a Bernoulli distribution¹ with success probability $\phi(x)$ that is defined below. That is, if we represent the random variables of the feature vector and the (scalar) response via $X$ and $Y$ respectively, then
$$\mathbb{P}(Y = 1 \mid X = x) = \phi(x) \qquad \text{and} \qquad \mathbb{P}(Y = 0 \mid X = x) = 1 - \phi(x). \tag{3.1}$$
To capture the relationship between $x$ and $\phi(x)$, the model uses the odds of $\phi(x)$, namely $\phi(x)/\bigl(1 - \phi(x)\bigr)$, via the log odds represented as the logit function,
$$\text{Logit}(u) = \log \frac{u}{1 - u}. \tag{3.2}$$
Using this notation, the logistic regression model assumes a linear (affine) relationship between the feature vector $x$ and the log odds of $\phi(x)$. Namely,
$$\text{Logit}\bigl(\phi(x)\bigr) = b + w^\top x. \tag{3.3}$$
Here for feature vector $x \in \mathbb{R}^p$, the parameter space is $\Theta = \mathbb{R} \times \mathbb{R}^p$ and the parameters $\theta \in \Theta$ are denoted via $\theta = (b, w)$. In this case, the number of parameters is $d = p + 1$, similarly to the linear regression models of Section ??. Like the linear regression model introduced in Section ??, the scalar parameter $b$ is called the intercept or bias and the vector parameter $w$ is called the regression parameter or weight vector.
The fact that (3.3) presents $\text{Logit}(\cdot)$ on the left hand side, in contrast to simply the expected response $y$ as one would have in linear regression, sets logistic regression as a non-trivial form of a generalized linear model (GLM). We do not discuss general GLMs further; nevertheless it is good to note that in GLM terminology, $\text{Logit}(\cdot)$ plays the role of a link function. See notes at the end of this chapter for details and references.
A closely related function to $\text{Logit}(\cdot)$, central to logistic regression and deep learning, is the sigmoid function, also called the logistic function. It is denoted $\sigma_{\text{Sig}}(\cdot)$ and is also discussed in Section ?? where it is plotted in Figure ??. Its expression is,
$$\sigma_{\text{Sig}}(u) = \frac{1}{1 + e^{-u}}. \tag{3.4}$$
It has domain $\mathbb{R}$, range $(0, 1)$, and is monotonically increasing. Note that the logit function and the sigmoid function are inverses. That is, $\sigma_{\text{Sig}}\bigl(\text{Logit}(u)\bigr) = \text{Logit}\bigl(\sigma_{\text{Sig}}(u)\bigr) = u$.
Applying $\sigma_{\text{Sig}}(\cdot)$ to the representation of the model in (3.3) we obtain the common representation of the logistic regression model,
$$\phi(x) = \mathbb{P}\bigl(Y = 1 \mid X = x \,;\, \theta = (b, w)\bigr) = \sigma_{\text{Sig}}(b + w^\top x) = \frac{1}{1 + e^{-(b + w^\top x)}}. \tag{3.5}$$
¹ This is a probability distribution of a random variable having only two outcomes, 0 or 1. It is a special case of the binomial distribution where the parameter for the “number of trials” is set at 1.
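As a quick numerical sketch of (3.4) and (3.5), the snippet below (with made-up illustrative parameter values, not fitted ones) checks that the logit and sigmoid functions are inverses and evaluates $\phi(x)$:

```python
import numpy as np

def sigmoid(u):
    """The sigmoid function sigma_Sig(u) = 1 / (1 + exp(-u)) of (3.4)."""
    return 1.0 / (1.0 + np.exp(-u))

def logit(u):
    """The log odds log(u / (1 - u)) of (3.2), the inverse of the sigmoid."""
    return np.log(u / (1.0 - u))

# The logit and sigmoid functions are inverses of one another.
u = np.linspace(-5, 5, 101)
assert np.allclose(logit(sigmoid(u)), u)

# The model (3.5): phi(x) = sigmoid(b + w^T x), with illustrative parameters.
b, w = -1.0, np.array([0.5, 2.0])
x = np.array([1.0, 0.25])
phi = sigmoid(b + w @ x)
print(phi)  # 0.5, since b + w^T x = 0 for these values
```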
Side Note: The Logistic Distribution
When considering the statistical logistic regression model, it is also interesting to represent the relationship between $X$ and $Y$ via a continuous latent variable, denoted via $Z$. A latent variable is an unobserved quantity. In the case of logistic regression, $Z$ allows us to use the linear model representation,
$$Z = b + w^\top X + \epsilon, \qquad \text{with} \qquad \begin{cases} Y = 1, & \text{if } Z \ge 0, \\ Y = 0, & \text{if } Z < 0. \end{cases} \tag{3.6}$$
Here the noise component $\epsilon$ follows a (standard) logistic distribution whose cumulative distribution function, $F_\epsilon(u) = \mathbb{P}(\epsilon \le u)$, is given by $F_\epsilon(u) = \sigma_{\text{Sig}}(u)$. Such a representation agrees with (3.5) since,
$$\begin{aligned}
\phi(x) = \mathbb{P}(Y = 1 \mid X = x \,;\, \theta) &= \mathbb{P}(Z \ge 0 \mid X = x \,;\, \theta) \\
&= \mathbb{P}(b + w^\top x + \epsilon \ge 0) \\
&= \mathbb{P}(b + w^\top x - \epsilon \ge 0) \\
&= \mathbb{P}(\epsilon \le b + w^\top x) = \sigma_{\text{Sig}}(b + w^\top x).
\end{aligned}$$
Note that in the step between the second line and the third line we use the fact that the (standard) logistic distribution of $\epsilon$ is symmetric about 0 and hence $\epsilon$ and $-\epsilon$ are equal in distribution.
We do not use the latent representation (3.6) any further. Yet it may be interesting to note that if one chooses to use a Gaussian distribution for $\epsilon$ in (3.6) in place of the logistic distribution, the model turns out to be a probit regression model instead of (3.3). Hence such a latent representation of the model is generally interesting.
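The latent representation (3.6) lends itself to a simple Monte Carlo check. The sketch below, with arbitrary illustrative parameters, draws logistic noise and compares the empirical frequency of $Y = 1$ with the sigmoid expression from (3.5):

```python
import numpy as np

rng = np.random.default_rng(0)
b, w = -0.5, np.array([1.5])   # illustrative parameters
x = np.array([0.8])

# Draw epsilon from a standard logistic distribution and form Z as in (3.6).
n = 200_000
eps = rng.logistic(loc=0.0, scale=1.0, size=n)
z = b + w @ x + eps
y = (z >= 0).astype(float)

# The empirical P(Y = 1 | X = x) should match sigma_Sig(b + w^T x) from (3.5).
empirical = y.mean()
theoretical = 1.0 / (1.0 + np.exp(-(b + w @ x)))
print(empirical, theoretical)  # close, up to Monte Carlo error
```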
Estimation Using the Maximum Likelihood Principle
When statistical assumptions are imposed, maximum likelihood estimation (MLE) is a very
common statistical method for estimating parameters. While most deep learning models in
this book do not use MLE or other statistical point estimation techniques directly, exploring
how MLE works for logistic regression is insightful.
Central to MLE is the likelihood function $L : \Theta \to \mathbb{R}$. It is a function of the parameters $\theta \in \Theta$, obtained as the probability (or probability density) of the data for any given parameter value $\theta$. The idea of MLE is to choose values of $\theta$ that maximize the likelihood (function). We now see this in action for the logistic regression model.
Consider the training observations $\mathcal{D} = \bigl\{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\bigr\}$, denoted in a similar way to the way $\mathcal{D}$ is defined for the linear model in Section ??. Here each label $y^{(j)}$ is encoded via either 0 or 1, and the feature vector $x^{(j)}$ is a $p$-dimensional vector in $\mathbb{R}^p$.
A common assumption in statistics and machine learning is that all feature-label pairs $(x^{(j)}, y^{(j)})$ are identically distributed and mutually independent; this is often denoted i.i.d. With this assumption, we aim to estimate the parameters $\theta = (b, w)$. The likelihood function
is
$$L(\theta \,;\, \mathcal{D}) = \prod_{i=1}^{n} \mathbb{P}\bigl(Y = y^{(i)} \mid X = x^{(i)} \,;\, \theta\bigr), \tag{3.7}$$
where $L(\theta \,;\, \mathcal{D})$ is viewed as a function of $\theta$ given the observed sample $\mathcal{D}$. The product is due to the independence assumption arising as part of the i.i.d. assumption. In the context of the underlying logistic model, using (3.1) and (3.5), since the labels are either 0 or 1, the likelihood evaluates as
$$L(\theta \,;\, \mathcal{D}) = \prod_{i=1}^{n} \phi\bigl(x^{(i)}\bigr)^{y^{(i)}} \bigl(1 - \phi(x^{(i)})\bigr)^{1 - y^{(i)}}, \qquad \text{where} \quad \phi(x) = \sigma_{\text{Sig}}(b + w^\top x). \tag{3.8}$$
The maximum likelihood estimate is defined as a value of the parameter $\theta$ which maximizes the likelihood function. That is, the MLE is the value which maximizes the probability of the observations $\mathcal{D}$ assuming the logistic model.
With (3.8) available, one may proceed to optimize the likelihood directly. However, in this case of logistic regression, and in many other cases where MLE is applied to i.i.d. data, it is more convenient to maximize the logarithm of the likelihood, called the log-likelihood, denoted via $\ell(\theta \,;\, \mathcal{D}) = \log\bigl(L(\theta \,;\, \mathcal{D})\bigr)$. This is equivalent to maximizing the likelihood since the $\log$ function is monotonically increasing. For logistic regression, the log-likelihood expression is,
$$\ell(\theta \,;\, \mathcal{D}) = \sum_{i=1}^{n} \Bigl[ y^{(i)} \log\bigl(\phi(x^{(i)})\bigr) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - \phi(x^{(i)})\bigr) \Bigr], \tag{3.9}$$
and the maximum likelihood estimate (MLE) is represented via,
$$\hat{\theta}_{\text{MLE}} := \operatorname*{argmax}_{\theta \in \mathbb{R} \times \mathbb{R}^p} \ell(\theta \,;\, \mathcal{D}) = \operatorname*{argmin}_{\theta \in \mathbb{R} \times \mathbb{R}^p} \, -\frac{1}{n}\, \ell(\theta \,;\, \mathcal{D}), \tag{3.10}$$
since maximizing $\ell(\theta \,;\, \mathcal{D})$ is the same as minimizing $-\ell(\theta \,;\, \mathcal{D})$, and the positive factor $1/n$ does not change the optimization, yet is useful in the presentation that follows.
One can show that the function $-\frac{1}{n}\ell(\cdot \,;\, \mathcal{D})$ is convex and hence a global minimum exists. Aspects of optimization and convexity are further overviewed in Section ??. However, in contrast to the linear model where an explicit analytic solution as in (??) exists, in the case of logistic regression there is no analytic solution for the minimizer, and hence optimization algorithms are needed to find the MLE (3.10).
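As a sketch of such a numerical search, the snippet below generates synthetic data and minimizes the scaled negative log-likelihood from (3.10) with Newton's method. The gradient and Hessian formulas used are the standard ones for logistic regression (the gradient is derived later in this chapter; the Hessian is stated here without derivation), and the data-generating parameters are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 2
X = rng.normal(size=(n, p))
b_true, w_true = 0.5, np.array([1.0, -2.0])     # illustrative true parameters
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(b_true + X @ w_true))))

Xb = np.hstack([np.ones((n, 1)), X])            # prepend the constant-1 feature

theta = np.zeros(p + 1)                         # theta = (b, w_1, ..., w_p)
for _ in range(25):                             # Newton's method converges quickly
    y_hat = 1.0 / (1.0 + np.exp(-(Xb @ theta)))
    grad = Xb.T @ (y_hat - y) / n               # gradient of -(1/n) * log-likelihood
    W = y_hat * (1.0 - y_hat)                   # Bernoulli variances
    hessian = (Xb.T * W) @ Xb / n               # Hessian of the same loss
    theta -= np.linalg.solve(hessian, grad)

print(theta)  # numerical MLE, close to (0.5, 1.0, -2.0) up to sampling error
```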
The Binary Cross-Entropy Loss
We have already claimed throughout Chapter ?? that learning almost always involves optimization. The MLE-based paradigm for logistic regression certainly reinforces this claim. We now reposition logistic regression MLE in terms of minimization of a loss function, following similar lines to the learning of the linear model of Section ??. This setup continues straight into the more involved deep learning models that follow.
Recall the general loss function formulation which we first presented for linear models in (??), where the loss is $C(\theta \,;\, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} C_i(\theta)$. In that context of linear models the individual loss is $C_i(\theta) = C_i\bigl(\theta \,;\, y^{(i)}, \hat{y}^{(i)}\bigr) = \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^2$. Logistic regression learning follows similar
lines except that $C_i(\theta)$ is not the quadratic loss. To see this, revisit the minimization form of (3.10) together with the log-likelihood expression (3.9). The scaled negative log-likelihood that needs to be minimized can then be represented as a loss function via
$$C(\theta \,;\, \mathcal{D}) = -\frac{1}{n} \sum_{i=1}^{n} \Bigl[ y^{(i)} \log\bigl(\hat{y}^{(i)}\bigr) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - \hat{y}^{(i)}\bigr) \Bigr], \tag{3.11}$$
where $\hat{y}^{(i)} = \phi\bigl(x^{(i)}\bigr) = \sigma_{\text{Sig}}\bigl(b + w^\top x^{(i)}\bigr)$. This then implies that for logistic regression the loss for each data sample is $C_i(\theta) = \text{CE}_{\text{binary}}\bigl(y^{(i)}, \hat{y}^{(i)}\bigr)$ where
$$\text{CE}_{\text{binary}}(y, \hat{y}) = -\bigl( y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \bigr), \tag{3.12}$$
is called the binary cross entropy (estimate) applied to observation $y$ and prediction $\hat{y}$.
In general the phrase “cross entropy” and specifically “entropy” is rooted in information theory. Appendix ?? outlines relationships between cross entropy and related quantities. However, these relationships are not critical for an understanding of deep learning. In terms of relating (3.11) to the probabilistic meaning of cross entropy, see first the definition of the cross entropy for two probability distributions in (??) and then the specialisation to distributions with binary outcomes in (??).
We also mention that while here we arrived at the binary cross entropy as a by-product of maximum likelihood estimation for logistic regression, in general, binary cross entropy has become the default loss function for more complex binary classification deep learning models, as surveyed in Chapter ?? and onwards.
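As a minimal sketch, the binary cross entropy (3.12) can be coded directly. The clipping constant below is a common numerical safeguard against taking $\log 0$; it is an implementation choice, not part of the definition:

```python
import numpy as np

def ce_binary(y, y_hat, eps=1e-12):
    """Binary cross entropy (3.12); y_hat is clipped away from 0 and 1 for stability."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# A confident correct prediction incurs low loss; a confident wrong one, high loss.
print(ce_binary(1.0, 0.99))  # -log(0.99), approximately 0.01
print(ce_binary(1.0, 0.01))  # -log(0.01), approximately 4.61
```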
Predicted Probabilities and Parameter Interpretability
For any observed or postulated feature vector $x \in \mathbb{R}^p$, the output of logistic regression is $\hat{y} = \phi(x)$. This is a probability which can be interpreted as in the left hand side of (3.1). Hence at its core, the logistic regression model (3.5) yields probabilities as outputs. Below we see how these probabilities can be used for classification, yet first let us consider prediction of probabilities or proportions.
Recall the breast cancer example presented in Section ?? where we overviewed concepts of binary classification. In that example, based on the Wisconsin breast cancer dataset, we used a training set to create two logistic regression models. The first model used the smoothness_worst feature as the single coordinate of $x$ and the second model used all possible 30 features. We now continue to use this dataset, this time using all $n = 569$ observations for statistical inference and setting a single feature model based on the area_mean feature, and a two feature model based on area_mean and texture_mean.
With the estimated two feature model based on estimated parameters $\hat{\theta} = (\hat{b}, \hat{w}_1, \hat{w}_2)$, the predicted probability for an observation $x \in \mathbb{R}^2$ is
$$\hat{y} = \sigma_{\text{Sig}}\bigl(\hat{b} + \hat{w}_1 x_{\text{area\_mean}} + \hat{w}_2 x_{\text{texture\_mean}}\bigr),$$
where for clarity we subscript coordinates of the feature vector $x$ with the feature name. Using (3.3) this can also be represented as,
$$\log \frac{\hat{y}}{1 - \hat{y}} = \hat{b} + \hat{w}_1 x_{\text{area\_mean}} + \hat{w}_2 x_{\text{texture\_mean}}. \tag{3.13}$$
The representation in (3.13) is appealing since it endows the estimated parameters $\hat{b}$, $\hat{w}_1$, and $\hat{w}_2$ with a concrete interpretation. First observe that the log odds is linearly described by the features area_mean and texture_mean. This means that it is estimated that a unit increase in area_mean will see the log odds increase by $\hat{w}_1$, a unit increase in texture_mean will see the log odds increase by $\hat{w}_2$, and similarly, when both features are at zero, the log odds is at the value $\hat{b}$. This interpretation of parameters is clearly not limited to a model with two features, as it works for any number of features.
In practice, it is more convenient to interpret the odds, given by
$$\frac{\hat{y}}{1 - \hat{y}} = e^{\hat{b} + \hat{w}_1 x_{\text{area\_mean}} + \hat{w}_2 x_{\text{texture\_mean}}}.$$
Specifically, an increase of area_mean by one unit implies a multiplicative increase of the odds by a factor of $e^{\hat{w}_1}$, and similarly for texture_mean by a factor of $e^{\hat{w}_2}$. With this view it is common to consider the odds ratio, where the multiplicative factor $e^{\hat{w}_i}$ is interpreted as the ratio between the odds of a model where the feature is at some level $x_i$ and the odds where it is at level $x_i + 1$. This is especially useful when features are binary, but also in general. Further note that when $e^{\hat{w}_i} > 1$ we say the feature has an increasing effect on the probability of the outcome being positive, and in the opposite direction, a feature with $e^{\hat{w}_i} < 1$ yields a decrease in this probability when the feature is increased.
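For illustration, the multiplicative odds factors can be computed from weight estimates. The values below are hypothetical stand-ins, not the actually fitted coefficients for this dataset:

```python
import numpy as np

# Hypothetical estimated weights (illustrative values, not fitted to the data).
w_hat = {"area_mean": 0.012, "texture_mean": 0.25}

# A unit increase in feature i multiplies the odds by exp(w_hat_i).
for name, w in w_hat.items():
    factor = np.exp(w)
    direction = "increases" if factor > 1 else "decreases"
    print(f"+1 in {name} multiplies the odds by {factor:.3f} ({direction} the probability)")
```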
Such parameter interpretation carries over from linear models to logistic regression models, as well as to other statistical models. This interpretability property often makes such models extremely attractive in biostatistics and similar fields. It means that the estimated parameters of the model are not only useful for their predictive ability, but also for reasoning about the relationships between variables.
As we progress beyond this chapter to more complicated deep neural networks, direct interpretability of parameter values is often lost. In such non-interpretable cases, the parameter estimate $\hat{\theta}$ is not a useful learning outcome on its own, but is rather only useful for the tasks of the model. Nevertheless we mention that an active area of machine learning and deep learning research is to seek interpretable models.
In Figure 3.1 we present the actual fit of the single feature model and the two feature model for the Wisconsin breast cancer observations. The response $\hat{y}$ is the probability of malignant lumps, whereas the data involves observations with $y = 1$ (positive, i.e., malignant) and $y = 0$ (negative, i.e., benign). In (a) we see the sigmoid function applied to $\hat{b} + \hat{w}_1 x_1$ for the estimated single feature model directly, and we also plot all the observations with a slight random jitter on the y axis so that they can be visualized easily, where we cut off high outliers. In (b) the sigmoid function is applied to $\hat{b} + \hat{w}_1 x_1 + \hat{w}_2 x_2$, yielding a surface, where we omit plotting the actual observations. In both the univariate and multivariate cases it is evident that the models present a monotonic relationship between each of the variables and the predicted probability. This property holds for any number of features.
Figure 3.1: Logistic regression models for probability prediction fit to the Wisconsin breast cancer dataset. (a) A $p = 1$ model with the feature area_mean ($x_1$) and a confidence band. (b) A $p = 2$ model based on the features area_mean ($x_1$) and texture_mean ($x_2$).
Logistic Regression for Classification is a Linear Classifier
In addition to the application of probability prediction as described above, logistic regression models can also be naturally used for binary classification. We have already seen how to convert such models into binary classifiers. This was first seen in Section ?? where we introduced the threshold based classifier in (??) with the label prediction $\hat{Y}$ set to 1 (positive) if $\hat{y} > \tau$ and 0 (negative) otherwise.
As an illustration, consider the two feature breast cancer prediction model with a response surface as in Figure 3.1 (b). By setting $\tau = 0.5$ for this model we get a classifier as illustrated in Figure 3.2. The red region corresponds to potential feature vectors that would be classified as negative (benign) and the blue region corresponds to positive (malignant). It is seen that many, but not all, of the training samples are correctly classified.
Interestingly, the classification boundary appears like a straight line, or more precisely a hyperplane in the feature space. One may then wonder if this is a property of logistic regression. We now show that in fact, logistic regression based binary classifiers are linear classifiers. Such linear classifiers separate the feature space via a single hyperplane, $H(x) = \breve{b} + \breve{w}^\top x$, where $\breve{b} \in \mathbb{R}$ and $\breve{w} \in \mathbb{R}^p$ are the parameters of the hyperplane. Since every hyperplane cuts Euclidean space into two half-spaces, a natural way to use a hyperplane for classification of a feature vector $x \in \mathbb{R}^p$ is,
$$\hat{Y} = \begin{cases} 0 \ (\text{negative}), & \text{if } H(x) \le 0, \\ 1 \ (\text{positive}), & \text{if } H(x) > 0. \end{cases} \tag{3.14}$$
This means that if $x$ falls in one of the half-spaces the classification is negative and in the other it is positive. The distinction on the boundary itself is arbitrary.
Figure 3.2: Binary classification of malignant lumps (blue dots, $y = 1$) and benign lumps (red dots, $y = 0$) using a logistic model with two features, area_mean ($x_1$) and texture_mean ($x_2$).
To see that logistic regression is a linear classifier, consider estimated model parameters $\hat{\theta} = (\hat{b}, \hat{w})$ and probability prediction $\hat{y}$. The positive classification then occurs if $\hat{y} > \tau$, i.e., $\sigma_{\text{Sig}}(\hat{b} + \hat{w}^\top x) > \tau$. We can then apply the $\text{Logit}(\cdot)$ function (3.2) to this inequality. Since $\text{Logit}(\cdot)$ is a monotonically increasing function and is the inverse of the sigmoid function, we obtain,
$$\hat{b} + \hat{w}^\top x > \log\Bigl(\frac{\tau}{1 - \tau}\Bigr), \qquad \text{or} \qquad \hat{b} + \log\bigl(\tau^{-1} - 1\bigr) + \hat{w}^\top x > 0.$$
We thus see that the hyperplane parameters associated with logistic regression classification are $\breve{b} = \hat{b} + \log(\tau^{-1} - 1)$ and $\breve{w} = \hat{w}$. This shows that logistic regression with a threshold rule yields a linear classifier. Note in fact that by the same argument, any model of the form $\hat{y} = \sigma(\hat{b} + \hat{w}^\top x)$, where $\sigma(\cdot)$ is some bijection (invertible function) from $\mathbb{R}$ to $\mathbb{R}$, yields a linear classifier. This naturally includes the linear model based binary classification outlined in Section ??. It also includes the probit model mentioned earlier in this section, as well as any shallow neural network for binary classification that has a strictly monotonic activation function, a concept introduced in the next section. Other very common linear classifiers, including support vector machine models, are not surveyed in this book.
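The equivalence between the threshold rule and the hyperplane rule (3.14) can be verified numerically. The parameter values below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)
b_hat, w_hat, tau = -0.3, np.array([1.2, -0.7]), 0.25  # illustrative values

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Hyperplane parameters derived in the text: b_breve = b_hat + log(1/tau - 1).
b_breve, w_breve = b_hat + np.log(1.0 / tau - 1.0), w_hat

# Both decision rules agree on every point of a random cloud of feature vectors.
X = rng.normal(size=(1000, 2))
threshold_rule = sigmoid(b_hat + X @ w_hat) > tau
hyperplane_rule = (b_breve + X @ w_breve) > 0
assert np.array_equal(threshold_rule, hyperplane_rule)
```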
Note also that one often transforms classifiers with linear decision boundaries into more
expressive classifiers via feature engineering. In Section 3.4 we explore how feature engineering
based transformations of the features may yield non-linear decision boundaries for the models
studied in this chapter.
3.2 Logistic Regression as a Shallow Neural Network
We now position the logistic regression model as a deep learning model. However, as we
shortly explain, it is not really “deep” but is rather “shallow” since it does not have hidden
layers. This is the first instance in this book where we explicitly consider deep learning
models in some mathematical detail. The general (fully connected) deep learning model is
outlined in Chapter
??
and our presentation here serves as a shallow introduction. Note that
the linear model of Section
??
can also be positioned as a deep learning model; we shed light
on this too. Further, the multinomial regression model of the next section is a close relative
of logistic regression, and as we show in that section, it is also a shallow neural network.
Logistic Regression is an Artificial Neuron
Let us first represent the logistic regression model (3.5) as
$$\hat{y} = \underbrace{\sigma\bigl(\,\overbrace{b + w^\top x}^{z}\,\bigr)}_{a}. \tag{3.15}$$
Observe that in (3.15), we omit the subscript from $\sigma_{\text{Sig}}(\cdot)$ used in (3.5). We call $\sigma(\cdot)$ a scalar activation function, which in the case of logistic regression needs to be $\sigma_{\text{Sig}}(\cdot)$, but in other cases can be a different function. Section ?? is devoted to specific forms of such activation functions beyond the sigmoid function. At this point, let us just mention a trivial alternative, the identity activation function $\sigma(z) = z$. With this identity activation function, the model in (3.15) is clearly just the linear model $\hat{y} = b + w^\top x$.
Figure 3.3: Logistic regression represented with neural network terminology as a shallow neural network. The inputs $x_1, \ldots, x_p$ are combined with the weights $w_1, \ldots, w_p$ and bias $b$ into the affine transformation $z = b + \sum_{i=1}^{p} w_i x_i$, followed by the activation $\sigma(z) = 1/(1 + e^{-z}) \in (0, 1)$, yielding the output $\hat{y} = \sigma(b + w^\top x)$. The gray box represents an artificial neuron composed of an affine transformation to create $z$ and an activation $\sigma(z)$.
The form of (3.15) represents what we may call a single layer of a deep learning model, a shallow neural network, or simply an artificial neuron. In this case the vector input $x \in \mathbb{R}^p$ is transformed to a scalar output $\hat{y} \in \mathbb{R}$. However, in general, deep learning models (as well as shallow neural networks) allow for vector outputs. The next section presents such a case.
Observe that the artificial neuron is composed of an affine transformation $z = b + w^\top x$ followed by a (generally) non-linear transformation $a = \sigma(z)$. This notation of using $z$ for the result of the affine transformation and $a$ for the result of the non-linear transformation is common in deep learning models and heavily used in Chapter ??. Note that the actual definition of an artificial neuron is sometimes taken as the combination of $z$ and $a$, sometimes
just as the non-linear result $a$, and sometimes as the computation mechanism defined by the right hand side of (3.15). Figure 3.3 summarizes the components of the artificial neuron, focusing on the specific $\sigma(\cdot) = \sigma_{\text{Sig}}(\cdot)$ activation function.
A “deeper” deep learning model would have “hidden layers” based on compositions of constructs similar to (3.15). This would involve multiple $z$ and $a$ values computed along the way. Thus when there is a single layer as in (3.15), the neural network is called shallow, and otherwise it is called deep. Logistic regression is the simplest non-linear scalar output shallow neural network that one can consider.
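The decomposition of the neuron into $z$ and $a$ can be sketched in code. The class below is an illustrative construction (not a library API), with made-up parameter values:

```python
import numpy as np

class Neuron:
    """A minimal artificial neuron: affine transformation z, then activation a = sigma(z)."""
    def __init__(self, b, w, sigma):
        self.b, self.w, self.sigma = b, w, sigma

    def __call__(self, x):
        z = self.b + self.w @ x   # affine transformation, as in (3.15)
        a = self.sigma(z)         # (generally) non-linear activation
        return a

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
neuron = Neuron(b=-1.0, w=np.array([0.5, 2.0]), sigma=sig)
print(neuron(np.array([1.0, 0.25])))    # 0.5, since z = 0 for these values

# With the identity activation, the neuron reduces to the linear model b + w^T x.
linear = Neuron(b=-1.0, w=np.array([0.5, 2.0]), sigma=lambda z: z)
print(linear(np.array([1.0, 0.25])))    # 0.0
```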
Training Logistic Regression
We have already seen in (3.10) how maximum likelihood estimation positions parameter fitting of logistic regression as an optimization problem. We then saw that maximization of the likelihood is identical to minimization of the loss $C(\theta \,;\, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} C_i(\theta)$, with $C_i(\theta)$ defined via the cross entropy cost (3.12). Importantly, when considered as a deep learning model or a machine learning model, one sometimes ignores the maximum likelihood interpretation of logistic regression and starts off with the cross entropy loss as an engineered loss function which requires minimization.
When statistical software packages are used to estimate parameters of logistic regression,
this optimization is typically tackled using second-order methods; see Section
??
for an
overview of such techniques. In general, such methods make use of the Hessian matrix of
the loss function or approximations of it; see Appendix ?? for a review.
In contrast to statistical practices, in deep learning and machine learning culture, one often
considers the number of features
p
to be very large in which case standard second-order
optimization methods tend to struggle computationally. Thus when treated as a deep learning
model, the default technique used for logistic regression training is gradient descent. Gradient
descent was already introduced in Section
??
in the context of the linear model, and variants
of gradient based learning are studied in detail in the next chapter as well; see sections
??
and ??.
General deep learning models do not have explicit expressions for the gradient of $C(\theta \,;\, \mathcal{D})$, and certainly not for the Hessian matrix. Hence learning such models requires computational techniques for gradient evaluation. The most common technique is the backpropagation algorithm, described in Section ??, which is a form of automatic differentiation, overviewed in Section ??. However, in the case of logistic regression, like the linear model, there are explicit expressions both for the gradient and the Hessian. We see these now.
In the case of logistic regression, the gradient of $C(\theta \,;\, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} C_i(\theta)$ with respect to $\theta = (b, w) \in \mathbb{R} \times \mathbb{R}^p$ is a vector in $\mathbb{R}^d$ with $d = p + 1$. It is denoted as $\nabla C(\theta)$ and can be represented as the average of the gradients of each $C_i(\theta)$. Namely,
$$\nabla C(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla C_i(\theta). \tag{3.16}$$
This general relationship between the gradient of the total loss function $\nabla C(\theta)$ and the gradients of the loss for each observation $\nabla C_i(\theta)$ is common throughout deep learning. In the case of logistic regression we have that $C_i(\theta) = \text{CE}_{\text{binary}}\bigl(y^{(i)}, \hat{y}^{(i)}\bigr)$ with $\text{CE}_{\text{binary}}(\cdot, \cdot)$
from (3.12). We thus require expressions for the gradient of this binary cross entropy loss with observation label $y^{(i)}$ and predicted value $\hat{y}^{(i)} = \sigma_{\text{Sig}}\bigl(b + w^\top x^{(i)}\bigr)$. That is,
$$\nabla C_i(\theta) = -\nabla \Bigl[ y^{(i)} \log\bigl(\sigma_{\text{Sig}}(b + w^\top x^{(i)})\bigr) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - \sigma_{\text{Sig}}(b + w^\top x^{(i)})\bigr) \Bigr],$$
where the gradient is with respect to the vector $\theta = (b, w)$. Now using basic differentiation rules and the structure of $\sigma_{\text{Sig}}(\cdot)$, we obtain,
$$\frac{\partial C_i}{\partial b} = \sigma_{\text{Sig}}\bigl(b + w^\top x^{(i)}\bigr) - y^{(i)}, \qquad \frac{\partial C_i}{\partial w_j} = \Bigl( \sigma_{\text{Sig}}\bigl(b + w^\top x^{(i)}\bigr) - y^{(i)} \Bigr) x_j^{(i)} \quad \text{for } j = 1, \ldots, p.
$$
Thus, in combining these components, we can represent the $d = p + 1$ dimensional gradient vector as,
$$\nabla C_i(\theta) = \Bigl( \sigma_{\text{Sig}}\bigl(w^\top x^{(i)} + b\bigr) - y^{(i)} \Bigr) \begin{bmatrix} 1 \\ x^{(i)} \end{bmatrix}, \tag{3.17}$$
where the first scalar expression on the right hand side of (3.17) is the prediction difference for observation $i$, and the second expression is the vector of features for observation $i$, including the constant 1 feature.
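Putting (3.16) and (3.17) together yields a complete gradient descent training loop. The following sketch uses synthetic data, with an arbitrary learning rate and iteration count:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 1000, 3
X = rng.normal(size=(n, p))
theta_true = np.array([0.5, 1.0, -1.5, 2.0])    # (b, w_1, w_2, w_3), illustrative
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(theta_true[0] + X @ theta_true[1:]))))

Xb = np.hstack([np.ones((n, 1)), X])            # prepend the constant-1 feature

theta = np.zeros(p + 1)
eta = 0.5                                       # learning rate (an arbitrary choice)
for _ in range(3000):
    y_hat = 1.0 / (1.0 + np.exp(-(Xb @ theta))) # predictions via (3.5)
    grad = Xb.T @ (y_hat - y) / n               # average (3.16) of the per-sample gradients (3.17)
    theta -= eta * grad

print(theta)  # approaches theta_true, up to sampling error
```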
Figure 3.4: The loss landscape of logistic regression for a synthetic dataset. (a) Using the squared loss $C_i(\theta) = \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^2$. (b) Using the binary cross entropy loss $C_i(\theta) = \text{CE}_{\text{binary}}\bigl(y^{(i)}, \hat{y}^{(i)}\bigr)$.
Some Benefits of Cross Entropy Loss
If one considers the problem of logistic regression training purely based on an optimization approach and not based on a maximum likelihood approach, then the cross entropy loss can potentially be replaced by other loss functions, such as for example the quadratic cost. However, it turns out that using the cross entropy loss generally yields desirable loss landscapes. As an illustration consider Figure 3.4, based on logistic regression with synthetic data of a single feature ($p = 1$ and $d = 2$). The parameters to be optimized are the scalars $b$ and $w$. In (a) we use the squared error loss and in (b) we use the cross entropy loss. It is clear from
this image that, at least in this case, the cross entropy loss landscape is more manageable to navigate for an optimization algorithm like gradient descent. In fact, in the case of logistic regression, cross entropy always yields a convex loss landscape (further details are in the next chapter), while other losses such as the squared error loss generally yield non-convex loss landscapes, often presenting multiple local minima as well as saddle points.
Importantly, when considering classification (or probability prediction) problems, deep learning models that are more complex than logistic regression are still often easier to optimize using the cross entropy loss than the squared error loss or other losses. In such more complex models, the neat mathematical convexity property that logistic regression enjoys with cross entropy is lost, and multiple local minima can exist. Nevertheless, computational and research experience has shown that in general using cross entropy is preferable.
As an illustration of gradient descent applied to a concrete example, we return to the Wisconsin breast cancer dataset, now splitting the observations as we did in Chapter ?? into $n_{\text{train}} = 456$ and $n_{\text{test}} = 113$. We use a model with $p = 10$ features ($d = 11$) and learn the parameters via gradient descent with some arbitrary initialization. In doing so, we obtain trajectories as in Figure 3.5.
Figure 3.5: Training logistic regression via gradient descent for the Wisconsin breast cancer dataset with an 80-20 train-test split (we can treat the test set as a validation set). (a) Cross entropy loss over iterations, for the train and test sets. (b) Performance over iterations using the $F_1$ score when $\tau = 0.5$.
3.3 Multi-class Problems with Softmax
The multinomial regression model, as it is known in statistics, or softmax regression, as it is known in machine learning,² is the generalization of logistic regression from the binary response case to the case of $K > 2$ classes. Now the response random variable $Y$ takes on values $1, 2, \ldots, K$. The feature vector remains $X$, just like in logistic regression.
² In some machine learning fields, this is also called softmax logistic regression, multinomial logistic regression, or multi-class logistic regression.
We denote the training observations via D = {(x^(1), y^(1)), …, (x^(n), y^(n))}, where each label y^(j) is one of K class values {1, …, K}. The purpose of the model is to predict class probability vectors, or if used for classification, to predict a class in a multi-class setting.
The Model

Just like logistic regression predicts two classes and uses ϕ(x) for the probability of the positive response, in multinomial regression the predicted response is the probability vector

$$\phi(x) = \big(\phi_1(x), \ldots, \phi_K(x)\big), \qquad (3.18)$$

where,

$$\phi_k(x) = P(Y = k \mid X = x) \quad \text{for } k = 1, \ldots, K. \qquad (3.19)$$

To compare (3.19) with (3.1), observe that (3.1) is like (3.19) with K = 2 where ϕ(x) = ϕ_1(x) and 1 − ϕ(x) = ϕ_2(x).
The name “multinomial regression” stems from the multinomial distribution, which is a probability distribution over vectors of length K consisting of non-negative integers that sum up to some specified integer N. Specifically a random vector (U_1, …, U_K) follows a multinomial distribution with parameters N (positive integer) and ϕ = (ϕ_1, …, ϕ_K) (probability vector) if,

$$P(U_1 = u_1, \ldots, U_K = u_K) = \frac{N!}{u_1! \cdots u_K!} \prod_{k=1}^{K} \phi_k^{u_k} \quad \text{for } \sum_{k=1}^{K} u_k = N,$$

with the probability at 0 when $\sum_{k=1}^{K} u_k \neq N$. A specific case of the multinomial distribution with N = 1 is called a categorical distribution. In this case the random vector (U_1, …, U_K) is like a one-hot encoded vector since exactly one coordinate is 1 and the rest are 0s.
A statistical assumption in multinomial regression is that the one-hot encoded response, given the features vector X = x, follows a categorical distribution. The random vector of the one-hot encoded response of Y is denoted (Y_1, …, Y_K) where Y_k = 1{Y = k}, i.e., the kth coordinate equals 1 if Y = k and equals 0 otherwise. Now the categorical distribution assumption is,

$$P(Y_1 = u_1, \ldots, Y_K = u_K \mid X = x) = \prod_{k=1}^{K} \phi_k(x)^{u_k}. \qquad (3.20)$$
The key model assumption in multinomial regression³ is in the way in which the probability vector ϕ(x) depends on the features vector x. There are two possible parameterizations, which we refer to as the statistical parameterization and the machine learning parameterization. The former presents a unique (identifiable) model while the latter is slightly simpler but has some redundant parameters.
Starting with the statistical parameterization, it is assumed that,

$$\phi_k(x) = \frac{e^{b_k + w_{(k)}^\top x}}{1 + \sum_{j=1}^{K-1} e^{b_j + w_{(j)}^\top x}} \quad \text{for } k = 1, \ldots, K-1, \qquad (3.21)$$

³Like logistic regression, this assumption is rooted in the theory of generalized linear models (GLM), a topic we defer to the notes and references at the end of the chapter.
and further,

$$\phi_K(x) = 1 - \sum_{j=1}^{K-1} \phi_j(x). \qquad (3.22)$$

This directly generalizes the sigmoidal relationship of logistic regression (3.5) from K = 2 to K ≥ 2. With this statistical parameterization, the parameters of multinomial regression are

$$\theta = (b_1, \ldots, b_{K-1}, w_{(1)}, \ldots, w_{(K-1)}) \in \underbrace{\mathbb{R} \times \ldots \times \mathbb{R}}_{K-1 \text{ times}} \times \underbrace{\mathbb{R}^p \times \ldots \times \mathbb{R}^p}_{K-1 \text{ times}} := \Theta, \qquad (3.23)$$
and hence the number of parameters is d = (K − 1)(p + 1). The scalar parameters b_1, …, b_{K−1} are bias (intercept) parameters and each of the vector parameters w_(1), …, w_(K−1) is a weight vector (regression parameter). With this parameterization, we may also view the final bias b_K and final weight vector w_(K) as the scalar 0 and the vector 0 respectively. Such a restriction on b_K and w_(K) allows us to combine (3.21) and (3.22) into,

$$\phi_k(x) = \frac{e^{b_k + w_{(k)}^\top x}}{\sum_{j=1}^{K} e^{b_j + w_{(j)}^\top x}} \quad \text{for } k = 1, \ldots, K. \qquad (3.24)$$
Moving onto the machine learning parameterization, the last class also has free parameters b_K and w_(K) like the other classes. In this case the parameter space is Θ = ℝ^K × (ℝ^p)^K and thus d = K(p + 1). Now again the representation (3.24) is valid, yet unlike the statistical parameterization, the last term in the summation in the denominator is not restricted to be 1. In this sense the machine learning parameterization is simpler; however, when estimating θ with maximum likelihood estimation there is never a unique θ that maximizes the likelihood (or minimizes the loss).
The Softmax Function and Multinomial Regression as a Shallow Neural Network

A common function in deep learning, especially when considering classification problems, is the ℝ^K → ℝ^K softmax activation function. For z = (z_1, …, z_K) ∈ ℝ^K, it is defined as,

$$S_{\text{Softmax}}(z) = \frac{1}{\sum_{i=1}^{K} e^{z_i}} \begin{pmatrix} e^{z_1} \\ \vdots \\ e^{z_K} \end{pmatrix}. \qquad (3.25)$$
Softmax and its derivative expressions are further discussed in Section ??. At this point let us just point out that for any vector z ∈ ℝ^K, the result of S_Softmax(z) is a probability vector.
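As a quick sketch of (3.25), the function can be implemented in a few lines; subtracting max(z) before exponentiating is a standard numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    # Softmax of (3.25): exponentiate and normalize. The max-shift avoids
    # overflow for large coordinates without changing the output.
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
# p is a probability vector: non-negative entries that sum to 1,
# with larger coordinates of z mapped to larger probabilities.
```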
Now with S_Softmax(·) defined we can revisit the multinomial regression model (3.24) and represent it as the softmax of a vector of length K that has b_k + w_(k)^⊤ x at the kth coordinate. More succinctly we have,

$$\hat{y} = \underbrace{S_{\text{Softmax}}\big(\overbrace{b + Wx}^{z \,\in\, \mathbb{R}^K}\big)}_{a \,\in\, \mathbb{R}^K}, \qquad (3.26)$$
where ŷ is the prediction of ϕ(x) as in (3.18), and the K dimensional vector b and the K × p dimensional matrix W are

$$b = \begin{pmatrix} b_1 \\ \vdots \\ b_K \end{pmatrix}, \quad \text{and} \quad W = \begin{pmatrix} \; w_{(1)}^\top \; \\ \; w_{(2)}^\top \; \\ \vdots \\ \; w_{(K)}^\top \; \end{pmatrix}. \qquad (3.27)$$
Compare (3.26) with (3.15) of logistic regression. In the logistic regression case, z is a scalar and so is a. In contrast, in the multinomial regression case the affine transformation converts a vector x ∈ ℝ^p into a vector z ∈ ℝ^K and then the softmax activation retains the same dimension to arrive at a.
Note also that (3.26) is valid both in the statistical parameterization and the machine learning parameterization. In the former, the last bias b_K and the last row w_(K) of (3.27) are zeros, whereas in the latter these are free variables.
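The forward pass (3.26) can be sketched numerically as follows; the dimensions K = 3 and p = 4 and the random parameters are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, p = 3, 4
b = rng.normal(size=K)        # bias vector in R^K
W = rng.normal(size=(K, p))   # weight matrix, K x p
x = rng.normal(size=p)        # input feature vector in R^p

z = b + W @ x                 # affine transformation, z in R^K
e = np.exp(z - z.max())       # softmax with the usual max-shift
y_hat = e / e.sum()           # prediction of the probability vector phi(x)
```

The output `y_hat` has the same dimension K as z, in contrast to the scalar output of logistic regression.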
[Figure: a network diagram with inputs x_1, …, x_p feeding K units; unit k computes z_k = b_k + w_(k)^⊤ x and, via the softmax, outputs a_k = e^{z_k} / Σ_{i=1}^K e^{z_i} = φ̂_k(x).]

Figure 3.6: Multinomial regression as a neural network model with output ŷ which is a probability vector. Note that each a_i is a function of all of z_1, …, z_K due to the softmax operation.
This representation positions multinomial regression as a shallow neural network similarly
to the way logistic regression is a shallow neural network. The difference is that multinomial
regression has vector outputs while logistic regression has a scalar output. Figure 3.6
illustrates the multinomial regression model as a neural network. Here each circle can again
be viewed as an “artificial neuron”; however, note that the softmax activation affects all
neurons together via the normalization in the denominator of
(3.25)
. Hence the activation
value of each neuron is not independent of the activation values of the other neurons.
Likelihood and Cross Entropy

Now that we understand the multinomial regression model, both from a statistical perspective using the probabilistic interpretation (3.20) and as a deep learning model using (3.26), let us consider maximum likelihood estimation, and equivalently loss function minimization.

To obtain an estimate θ̂ using maximum likelihood estimation, we follow a procedure analogous to the one used for logistic regression. Specifically, the likelihood of the model for the data D is defined as in (3.7). Now using the probability law of the categorical distribution, (3.20), we obtain the likelihood,

$$L(\theta\,;\,\mathcal{D}) = \prod_{i=1}^{n} \prod_{k=1}^{K} \phi_k(x^{(i)})^{\mathbb{1}\{y^{(i)} = k\}} \quad \text{for } \theta \in \Theta,$$

where we use the parameter space Θ as in the statistical parameterization⁴ (3.23). Now considering the log-likelihood by applying log(·) to L(θ ; D) we obtain,

$$\begin{aligned}
\ell(\theta\,;\,\mathcal{D}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}\{y^{(i)} = k\} \log \phi_k(x^{(i)}) \\
&= \sum_{i=1}^{n} \log \phi_{y^{(i)}}(x^{(i)}) \\
&= \sum_{i=1}^{n} \bigg( b_{y^{(i)}} + w_{(y^{(i)})}^\top x^{(i)} - \log \sum_{j=1}^{K} e^{\,b_j + w_{(j)}^\top x^{(i)}} \bigg).
\end{aligned} \qquad (3.28)$$

In the second step, we use the subscript y^(i) because except for the index k where y^(i) = k, all other summands of the internal sum of the first row are zero. The third step follows from (3.24).
Now similarly to our development of logistic regression MLE in (3.10), we can represent the problem as a minimization problem and define the estimator via,

$$\hat{\theta}_{\text{MLE}} = \operatorname*{argmin}_{\theta \in \Theta} \; -\frac{1}{n}\, \ell(\theta\,;\,\mathcal{D}).$$
Further, we can also view this problem as a loss minimization problem, where as before the loss is of the form C(θ ; D) = (1/n) Σ_{i=1}^n C_i(θ), and C_i(θ) is the loss for an individual observation. Based on the expressions of the log-likelihood (3.28) and using ŷ^(i) for the vector ϕ(x^(i)), the loss for an individual observation is,

$$C_i(\theta) = -\sum_{k=1}^{K} \mathbb{1}\{y^{(i)} = k\} \log \hat{y}^{(i)}_k = -\log \hat{y}^{(i)}_{y^{(i)}}. \qquad (3.29)$$
⁴Alternatively one may opt for the machine learning parameterization with the bigger parameter space. In this case the MLE is never unique.
Here, the first and second equalities arise from the first and the second lines of (3.28) respectively. The expressions on the right hand side of (3.29) are in fact called the categorical cross entropy. More precisely, for a label y ∈ {1, …, K} and a probability vector ŷ of length K, we define the categorical cross entropy as,

$$\text{CE}_{\text{categorical}}(y, \hat{y}) = -\sum_{k=1}^{K} \mathbb{1}\{y = k\} \log \hat{y}_k = -\log \hat{y}_y, \qquad (3.30)$$

where in the final expression, ŷ_y means taking the element at index y from the vector ŷ. We thus see that

$$C_i(\theta) = \text{CE}_{\text{categorical}}\big(y^{(i)}, \hat{y}^{(i)}\big). \qquad (3.31)$$
We have already seen the binary cross entropy in (3.12). Similar to this, the categorical cross entropy (3.30) is the empirical estimate of the cross entropy of two probability distributions appearing in (??) of Appendix ??. We note here that in many contexts of deep learning, one just uses the phrase “cross entropy” without the prefix “binary” or “categorical”.⁵ Then the distinction between CE_binary(·, ·) and CE_categorical(·, ·) is based on context and the type of arguments y and ŷ. In the former y is a binary outcome in {0, 1} and ŷ is a scalar probability in [0, 1]. In the latter, y is a multi-class choice in {1, …, K} and ŷ is a probability vector of length K in [0, 1]^K. The binary cross entropy is a special case of the categorical cross entropy and CE_binary(y, ŷ) with y ∈ {0, 1} and ŷ ∈ [0, 1] can be represented as CE_categorical(2 − y, [ŷ, 1 − ŷ]).
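This relationship between the two cross entropies can be checked numerically. The helper names `ce_categorical` and `ce_binary` below are illustrative, not library functions.

```python
import numpy as np

def ce_categorical(y, y_hat):
    # Categorical cross entropy (3.30): y is a class in {1, ..., K}
    # (1-indexed as in the text), y_hat a probability vector of length K.
    return -np.log(y_hat[y - 1])

def ce_binary(y, p):
    # Binary cross entropy (3.12): y in {0, 1}, p a scalar probability.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Check CE_binary(y, p) == CE_categorical(2 - y, [p, 1 - p]) for both labels.
p = 0.8
for y in (0, 1):
    assert np.isclose(ce_binary(y, p),
                      ce_categorical(2 - y, np.array([p, 1 - p])))
```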
It may be useful to also see other notational forms for the loss of an individual observation C_i(θ). Making use of (3.26) we obtain

$$C_i(\theta) = \text{CE}_{\text{categorical}}\big(y^{(i)}, \, S_{\text{Softmax}}(b + W x^{(i)})\big), \qquad (3.32)$$

where we parameterize θ to be composed of the vector b and the matrix W. Namely, Θ = ℝ^K × ℝ^{K×p}. Alternatively,

$$C_i(\theta) = -\big(b_{y^{(i)}} + w_{(y^{(i)})}^\top x^{(i)}\big) + \log \sum_{j=1}^{K} e^{\,b_j + w_{(j)}^\top x^{(i)}}. \qquad (3.33)$$
Compare (3.33) to the last line of (3.28).
Derivatives and Learning

Like logistic regression, inference for multinomial regression can be efficiently carried out using second-order methods. Nevertheless, when multinomial regression is viewed as a deep learning model with very large p, one often uses gradient descent (first-order optimization) in place of second-order methods. We now present the derivative expressions of C_i(θ). These expressions allow one to use gradient descent just as in the case of logistic regression.
Using either the statistical parameterization with d = (K − 1)(p + 1) parameters or the machine learning parameterization with d = K(p + 1) parameters, we have that θ = (b, W) consists of both a vector b and a matrix W. Thus in dealing with the gradient of C_i(θ) with respect to θ, denoted as ∇C_i(θ), one needs to either vectorize θ into a vector of length d, or
⁵In practice, when deep learning systems implement the binary and categorical cross entropy, log(u) in (3.12) or (3.30) is replaced with log(u + ε) where ε is a small fixed parameter. This allows one to seamlessly include the probability u = 0 as an input.
alternatively, to make use of the derivative of a real valued function with respect to a matrix, as defined in Appendix ??, (??). Our presentation uses both variants, and it is a prelude to further derivative expressions involving matrix valued parameters, used in Chapter ?? and the chapters that follow.

Let us first consider the derivative of (3.33) with respect to the individual scalar elements of θ = (b, W). Specifically, for k = 1, …, K we obtain,
$$\frac{\partial C_i}{\partial b_k} = \frac{e^{\,b_k + w_{(k)}^\top x^{(i)}}}{\sum_{j=1}^{K} e^{\,b_j + w_{(j)}^\top x^{(i)}}} - \mathbb{1}\{y^{(i)} = k\},$$

$$\frac{\partial C_i}{\partial w_{k,\ell}} = \Bigg( \frac{e^{\,b_k + w_{(k)}^\top x^{(i)}}}{\sum_{j=1}^{K} e^{\,b_j + w_{(j)}^\top x^{(i)}}} - \mathbb{1}\{y^{(i)} = k\} \Bigg) x^{(i)}_\ell, \quad \text{for } \ell = 1, \ldots, p.$$
These derivatives can then be placed in a d dimensional vector⁶ to form ∇C_i(θ).
Now observe that the expressions above involve the softmax function where the ratio of the exponent with the sum of exponents in each expression is the kth coordinate of S_Softmax(b + W x^(i)). This hints at simpler expressions and indeed we may use unit vector notation (e_j) in place of the indicator functions 1{·} to obtain,

$$\frac{\partial C_i}{\partial b} = S_{\text{Softmax}}(b + W x^{(i)}) - e_{y^{(i)}},$$

$$\frac{\partial C_i}{\partial W} = \big( S_{\text{Softmax}}(b + W x^{(i)}) - e_{y^{(i)}} \big) \, x^{(i)\top}.$$
With such a representation, the first expression, which is a derivative with respect to a vector, is in fact composed of the coordinates associated with the bias vector b of the gradient ∇C_i(θ). Further, the second expression is a derivative of a real valued function with respect to a matrix, which we represent using the notation of (??) in Appendix ??, namely the matrix with (k, ℓ)th element ∂C_i/∂w_{k,ℓ}.
In general, when one implements gradient based optimization as in Step 3 of Algorithm ?? of Chapter ??, or using one of the multiple variants presented in Chapter ??, these derivatives need to be used with the right shape dimensions taken into account.
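The gradient expressions above (softmax output minus one-hot label, and that same vector times x^⊤) can be sketched as a full training loop. This is a minimal illustration on synthetic data; note that classes are 0-indexed here, unlike the 1-indexed labels in the text, and the learning rate is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 300, 4, 3
X = rng.normal(size=(n, p))
W_true = rng.normal(size=(K, p))
y = np.argmax(X @ W_true.T, axis=1)      # labels driven by the features

def mean_ce(b, W):
    # Mean categorical cross entropy (3.29) over the training set.
    Z = b + X @ W.T
    Z = Z - Z.max(axis=1, keepdims=True)
    logp = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    return -logp[np.arange(n), y].mean()

b, W = np.zeros(K), np.zeros((K, p))
alpha = 0.5
losses = [mean_ce(b, W)]
for _ in range(200):
    Z = b + X @ W.T                       # logits, n x K
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    Y_hat = E / E.sum(axis=1, keepdims=True)   # row-wise softmax
    R = Y_hat
    R[np.arange(n), y] -= 1.0             # softmax minus one-hot, n x K
    b -= alpha * R.mean(axis=0)           # dC/db averaged over observations
    W -= alpha * R.T @ X / n              # dC/dW averaged over observations
    losses.append(mean_ce(b, W))
```

At the zero initialization the loss equals log K (the uniform prediction), and gradient descent on this convex problem drives it down from there.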
We also mention that it is sometimes common to subsume the bias term expressions within the weight parameter expressions by augmenting the feature vector to be of length p + 1 where the first feature is the constant 1. This practice was already used in the linear regression treatment of Section ??, and could have also been employed in the logistic regression treatment of Section 3.1. However we mention that from a deep learning perspective⁷ the weight parameters in W often receive a different treatment than the bias parameters in b and thus keeping b separate from W notationally is instructive.
⁶In fact, when one seeks a Hessian expression for C_i(θ), using such a d dimensional vector layout is the standard approach. We skip presentation of such a Hessian expression here, but note that (in contrast to more complicated deep learning models) it is tractable, and is often used in second-order methods for multinomial regression.
⁷This is made clearer in chapters ?? and ??, for example, in the contexts of dropout and convolutional neural networks.
Note that like logistic regression, the loss minimization problem of multinomial regression is
convex and thus with proper hyper-parameter choices (e.g., learning rate), a global minimum
can in principle always be reached. We establish the convexity of this optimization problem
in the next chapter.
Classification with Multinomial Regression Yields Convex Polytope Decision Regions
Recall that with multi-class classification, our goal is to provide a prediction Ŷ for the label in {1, …, K} associated with the input feature vector. In Section ?? we introduced concepts of binary classification and then via the example of the linear model in Section ?? we saw strategies for adapting binary classifiers to a multi-class classifier. This was via the one vs. rest and one vs. one approaches. In contrast to these approaches, the multinomial model provides a direct solution for multi-class classification since the output ŷ is already a probability vector over {1, …, K}. This approach is also common in more complicated deep learning models presented in the chapters that follow.
In general, an output vector ŷ, which is a probability vector, can be used to create a classifier by choosing the class that has the highest probability (and breaking ties arbitrarily). Namely,

$$\hat{Y} = \operatorname*{argmax}_{k \in \{1, \ldots, K\}} \hat{y}_k. \qquad (3.34)$$
This approach is called a maximum a posteriori probability (MAP) decision rule since it simply chooses Ŷ as the class that is most probable. It is the most common decision rule when using deep learning models for classification. As a side note, for binary classification, a MAP decision agrees with binary classification threshold prediction (??) with a threshold of τ = 0.5, and similarly to the fine-tuning of the parameter τ, in the multi-class case we may also deviate from MAP decisions and adjust (3.34) if needed.
[Figure: decision regions in the (x_1, x_2) unit square for classes 1–4.]

Figure 3.7: Multinomial regression for multi-class classification applied to synthetic data. In this example the training accuracy is 15/17.
As an example, consider Figure 3.7 that represents a MAP based rule applied to the output of a multinomial model trained on n = 17 synthetic data observations with p = 2 features and K = 4 classes. Like the binary classification example of Figure 3.2, this multi-class classification example illustrates the output of Ŷ(x) for any x ∈ ℝ^p. Similarly to the logistic regression case, we observe that the decision boundaries are straight lines. This is not a coincidence but rather a property of multinomial regression.
Denote by C_k the set of input features vectors x ∈ ℝ^p such that Ŷ(x) = k. We now see that C_k is an intersection of half spaces, i.e., it is a convex polytope. To see this, consider some arbitrary point x ∈ ℝ^p and fix some class label k ∈ {1, …, K}. We can now compare ŷ_k(x) and ŷ_j(x) for all other labels j. Specifically, since the denominator of the softmax expression (3.24) is independent of the index, ŷ_k(x) ≥ ŷ_j(x) if and only if

$$e^{\,b_k + w_{(k)}^\top x} \ge e^{\,b_j + w_{(j)}^\top x}, \quad \text{or} \quad (b_k - b_j) + (w_{(k)} - w_{(j)})^\top x \ge 0.$$
Thus by defining a hyperplane H_kj with H_kj(x) = (b_k − b_j) + (w_(k) − w_(j))^⊤ x, we see that ŷ_k(x) ≥ ŷ_j(x) holds if and only if H_kj(x) ≥ 0. Now Ŷ(x) = k only if ŷ_k(x) ≥ ŷ_j(x) for all other j. Hence C_k is an intersection of the K − 1 half spaces defined by the hyperplanes H_kj for j ≠ k.
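The equivalence between the MAP rule (3.34) and the half-space conditions H_kj(x) ≥ 0 can be checked numerically; the parameters below are arbitrary, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
K, p = 4, 2
b = rng.normal(size=K)
W = rng.normal(size=(K, p))

for _ in range(100):
    x = rng.normal(size=p)
    z = b + W @ x
    e = np.exp(z - z.max())
    y_hat = e / e.sum()
    k = int(np.argmax(y_hat))      # the MAP class of (3.34)
    # Every pairwise half-space condition H_kj(x) >= 0 holds for the MAP class.
    assert all((b[k] - b[j]) + (W[k] - W[j]) @ x >= 0 for j in range(K))
```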
As a concrete example we return to MNIST digit classification explored in Section ??. In that section we used the one vs. rest and one vs. one strategies to build classifiers based on linear models. See Table ?? where we presented confusion matrices for these classifiers when trained on the 60,000 MNIST train images and tested on the 10,000 test images. We now report the performance of multinomial regression on this dataset as well as the application of one vs. rest and one vs. one with logistic regression.⁸ A summary is in Table 3.1. Our purpose with this comparison is not to claim which method is best since in practice all these methods are significantly beaten by convolutional neural networks as presented in Chapter ??. We rather aim to highlight that a direct multi-class strategy such as multinomial regression requires training and application of a single model while the other approaches require multiple models, and are not significantly superior.
Table 3.1: Different approaches for creating an MNIST digit classifier. It is evident that in general one vs. one classifiers outperform one vs. rest classifiers, yet have many more parameters. Further, on the same type of classification scheme, logistic regression generally outperforms linear regression. Finally observe that multinomial regression, with only a single model and a low number of parameters, generally performs almost as well as the top scheme (logistic regression with one vs. one).

| Strategy and model type          | Number of models to train | Total number of parameters | Test accuracy |
| Linear regression one vs. rest   | 10                        | 7,850                      | 0.8603        |
| Linear regression one vs. one    | 45                        | 35,325                     | 0.9297        |
| Logistic regression one vs. rest | 10                        | 7,850                      | 0.9174        |
| Logistic regression one vs. one  | 45                        | 35,325                     | 0.9321        |
| Multinomial regression           | 1                         | 7,850 or 7,065             | 0.9221        |
⁸The linear regression based classifiers use the pseudo-inverse and hence the accuracy on the test set of size 10,000 is exact (up to numerical error). The logistic and multinomial classifiers were trained with gradient based learning with an ADAM algorithm with a learning rate of 0.02, mini-batches of size 2000, and 200 epochs; see Chapter ??. With this gradient based learning there is room for error with the accuracy as it depends on the optimization run. Hence the best achievable accuracy with the non-linear classifiers can potentially be slightly better.
3.4 Beyond Linear Decision Boundaries
Recall figures 3.1, 3.2, and 3.7. Logistic regression naturally yields a sigmoidal relationship between the input features and the response probability. In a classification setting this translates to linear (affine) decision boundaries. Similarly, multinomial regression used for classification also yields straight line boundaries via convex polytope decision regions. One may then ask whether these shallow neural network models can create other forms of response curves or decision boundaries. We now show that this is indeed possible via feature engineering following a similar spirit to the way feature engineering was introduced in Section ??.
Enhancing the Sigmoidal Response
Recall first the linear regression example illustrated in Figure ?? of Chapter ??. In display (b) of that figure we saw how linear regression with an engineered quadratic feature yields a curve that better fits the data. A similar type of idea may also be applied in the case of logistic regression.
As an example consider a dataset D where each observation is for a different geographic location and where the feature vector has a single coordinate which is the level of precipitation at that location (in millimeters/month). For each observation, the associated y^(i) ∈ {0, 1} determines the absence (0) or presence (1) of a certain species. Observations from such a dataset⁹ are presented in Figure 3.8. At locations that are neither too dry nor too wet, the species tends to be present, whereas when the precipitation is very low or very high the species tends to be absent.
Such a relationship between the precipitation and the probability of presence of the species cannot be captured by a sigmoidal curve since σ_Sig(b + wx) is a monotonic function of x. In fact, when fitting a sigmoidal curve to this data, the response, plotted via the red curve of Figure 3.8, tends to be almost flat in the region of the observations because in this case ŵ_1 is close to zero.
Nevertheless, if we wish to use logistic regression for this problem we may do so via feature engineering. Similarly to the housing price example of Section ?? we introduce a new feature x_2 = x_1². With the addition of this new feature, the model (as a function of the single feature x = x_1) becomes,

$$\phi(x) = \frac{1}{1 + e^{-(b + w_1 x + w_2 x^2)}}.$$
Following from basic properties of the parabola b + w_1 x + w_2 x², if w_2 < 0 then we have that ϕ(x) → 0 as x → −∞ or x → ∞ and further ϕ(x) is maximized at x = −w_1/(2w_2), which is the maximal point of the parabola. Similarly, if w_2 > 0 the shape of ϕ(x) is reversed and ϕ(x) has a minimum point at the minimum point of the parabola. In both cases, ϕ(x) is symmetric about x = −w_1/(2w_2).
Fitting logistic regression to this feature engineered model yields a negative value for ŵ_2 and this results in the blue curve of Figure 3.8. We thus see that feature engineering in the context of logistic regression allows us to extend the form of the response beyond the sigmoid function.
⁹This is a synthetic dataset motivated by ecological applications.
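The non-monotone response of the quadratic-feature model can be sketched as follows. The parameter values are illustrative only, chosen so the response peaks at x = 50; they are not the fitted values behind Figure 3.8.

```python
import numpy as np

def phi(x, b, w1, w2):
    # Quadratic-feature logistic model: sigmoid of the parabola b + w1 x + w2 x^2.
    return 1 / (1 + np.exp(-(b + w1 * x + w2 * x**2)))

b, w1, w2 = -6.0, 0.24, -0.0024   # hypothetical values with w2 < 0
x = np.linspace(0, 100, 201)
y = phi(x, b, w1, w2)
# With w2 < 0 the response vanishes at the extremes and is maximized at the
# vertex of the parabola, x = -w1 / (2 * w2) = 50, where it equals 0.5 here.
```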
[Figure: scatter of y against x_1 with two fitted response curves.]

Figure 3.8: Logistic regression fit with prediction ŷ for a feature engineered model with the feature x_2 = x_1². The response curve is plotted in blue together with confidence bounds. The red curve is a sigmoidal function fit to the single feature x = x_1.
The General Setup of Polynomial Feature Engineering
As in the example above, in general it is quite common to use powers for feature engineering and this makes the linear combination of the engineered features a polynomial. In examples with a small number of features such as the p = 1 example above, we can tweak feature engineering visually. However, when p is large, other performance based methods are needed.

When there are initially p input features we can automate the creation of more features by choosing each new feature as a power product or monomial of the form x_1^{k_1} x_2^{k_2} · … · x_p^{k_p} for some non-negative integers k_1, k_2, …, k_p. With such a process it is common to limit the degree by a constant r via k_1 + k_2 + … + k_p ≤ r.
For example when r = 2 the set of engineered features is,

$$\tilde{x} = (x_1, x_2, \ldots, x_p, \; x_1^2, x_2^2, \ldots, x_p^2, \; x_1 x_2, x_1 x_3, \ldots, x_{p-1} x_p).$$
In this case x̃ ∈ ℝ^{p(p+3)/2} and thus d = 1 + p(p + 3)/2 for logistic regression.¹⁰ So if for example there are initially p = 1,000 features then there are about half a million engineered features with d = 501,501. One may of course cull the number of engineered features to reduce the number of learned parameters. This can sometimes be done via regularization introduced in Section ??.
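The enumeration of monomial features can be automated as sketched below; `monomial_indices` and `expand` are hypothetical helper names, not library functions.

```python
from itertools import combinations_with_replacement

def monomial_indices(p, r):
    # All monomials x_{i1} * ... * x_{id} of total degree 1..r over p features,
    # each represented as a sorted tuple of (repeatable) feature indices.
    terms = []
    for degree in range(1, r + 1):
        terms.extend(combinations_with_replacement(range(p), degree))
    return terms

def expand(x, r):
    # Evaluate every monomial feature on the input vector x.
    feats = []
    for idx in monomial_indices(len(x), r):
        prod = 1.0
        for i in idx:
            prod *= x[i]
        feats.append(prod)
    return feats

# For r = 2 the expanded vector has p(p+3)/2 coordinates, matching the text.
p = 4
assert len(monomial_indices(p, 2)) == p * (p + 3) // 2
```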
¹⁰This also holds for any model where the number of parameters is one plus the number of features, such as for example linear regression.
Let us also go beyond r = 2 to higher degrees of the monomial features. From basic combinatorics, we have that the number of non-negative integer solutions to k_1 + … + k_p = ℓ is (ℓ + p − 1)! / ((p − 1)! ℓ!) and thus when using a polynomial feature scheme with degree up to r we have for logistic regression

$$d = \sum_{\ell=0}^{r} \frac{(\ell + p - 1)!}{(p-1)! \, \ell!},$$
and for multinomial regression, d is K or K − 1 times this number with the machine learning or statistical parameterizations respectively.
Using Stirling’s approximation of factorials we have that when r is significantly smaller than p then d ≈ p^r / r! for logistic regression. As an example with p = 1,000 input features and setting r = 4 we have approximately 4.16 × 10^{10} parameters, or an exact number of 42,084,793,751. This is over 40 billion parameters to learn!
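The exact count can be verified in a few lines; `num_params` is a hypothetical helper implementing the sum above with Python's `math.comb`.

```python
from math import comb

def num_params(p, r):
    # d = sum over degrees 0..r of C(l + p - 1, l), the number of monomials
    # of degree l in p variables (degree 0 contributes the intercept).
    return sum(comb(l + p - 1, l) for l in range(r + 1))

# Sanity checks against the counts quoted in the text.
assert num_params(2, 2) == 6                       # d = 1 + p(p+3)/2 for p = 2
assert num_params(1000, 4) == 42_084_793_751       # the p = 1,000, r = 4 example
```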
Having over 40 billion parameters in a model is indeed borderline astronomical and as of today still infeasible. Thus for non-small p, going beyond r = 2 is rarely used in practice. In fact, in Section ?? of Chapter ?? we argue that with deeper neural networks one may sometimes get more expressivity without creating such a large number of parameters.
Versatile Classification Boundaries
To further explore feature engineering let us now consider classification problems. We have seen above that both logistic regression and multinomial regression yield decision boundaries defined by hyperplanes which we may generally denote via H(x). Each such hyperplane has parameters b̆ ∈ ℝ and w̆ ∈ ℝ^p which depend on the estimated model parameters in a simple manner. Decision rules for classification with given input x ∈ ℝ^p are then made based on whether H(x) ≥ 0 or not (where the case of multinomial regression involves multiple such hyperplanes and comparisons).
Now in a feature engineered scenario the hyperplane parameters are adapted to be of larger dimension and in place of H(x) we evaluate H̃(x̃) with parameters b̆̃ and w̆̃. As an example consider a scenario with p = 2 features where we carry out polynomial feature engineering as above with r = 2. Then the number of parameters is now d = 6 and the hyperplane comparison is then,

$$\breve{\tilde{b}} + \breve{\tilde{w}}_1 x_1 + \breve{\tilde{w}}_2 x_2 + \breve{\tilde{w}}_3 x_1^2 + \breve{\tilde{w}}_4 x_2^2 + \breve{\tilde{w}}_5 x_1 x_2 \ge 0. \qquad (3.35)$$
Now here the set of input feature vectors (x_1, x_2) ∈ ℝ² that satisfies the above inequality is no longer a half space but rather represents a curved subset of ℝ². We thus see that such feature engineering enables non-linear decision boundaries.
As an example, consider Figure 3.9 based on synthetic data that requires binary classification with two input features. In (a) we see that for this type of data, logistic regression without feature engineering is not suitable. The linear decision boundary is not able to capture the pattern in the data. In (b) we expand the set of features and get a decision boundary of the form (3.35). It appears that going from d = 3 learned parameters to d = 6 learned parameters is worthwhile since the pattern in the data is well captured in (b).
[Figure: four panels of decision boundaries.]

Figure 3.9: Decision boundaries for binary classification with an expanded set of features for synthetic data with p = 2 input features. (a) No feature engineering (r = 1, d = 3). (b) Quadratic (r = 2, d = 6). (c) Quartic (r = 4, d = 15). (d) r = 8, d = 45.
We may continue with higher orders in an attempt to gain more expressivity. However, as we see in (c) and (d), for this data, the higher order models yield obscure classification decision boundaries. In this example these higher orders certainly appear to overfit. Such overfitting would probably lead to high generalization error (recall the discussion in Section ??). Moreover, the new set of features could become highly correlated and that can cause difficulty in inference of parameters when taking a statistical approach.
As another visual example, return to the synthetic multi-class classification example illustrated in Figure 3.7. We expand the set of features for this example and plot the decision regions in Figure 3.10. As this is just a synthetic example, our purpose is to show that by increasing the number of engineered features we can get more curvature in the decision boundaries. In (a) we consider quadratic features and in (b) we consider an extreme case of r = 8 which has 180 parameters with the machine learning parameterization.
[Figure: two panels of decision regions in the (x_1, x_2) unit square for classes 1–4.]

Figure 3.10: Decision boundaries with an expanded set of features for the multinomial regression model (K = 4). (a) r = 2, d = 24. (b) r = 8, d = 180.
3.5 Shallow Autoencoders
So far in this chapter we explored logistic and multinomial regression. Both of these models are shallow neural networks, which do not involve hidden layers, for supervised learning where the data is labelled. They are special cases of “deep learning models”. The same goes for the linear regression model of Section ?? (it was made evident that linear regression is also a deep learning model in Section 3.2). Thus, all the key supervised learning models that we discussed up to this point are simple neural networks that fall under the umbrella of deep learning.

We now devote this last section of this chapter to simple neural networks that are used for unsupervised learning where the unlabelled data is of the form D = {x^(1), …, x^(n)}. Namely we introduce autoencoder architectures and focus primarily on shallow versions of these architectures. Recall that in Section ?? we presented an overview of the concept of unsupervised learning in the context of machine learning activities, and in Section ?? we surveyed basic unsupervised learning techniques including principal components analysis (PCA). As we see in this current section, PCA can be cast as a special case of an autoencoder and this positions PCA as a form of (simple) neural network based learning as well.
Autoencoder Principles
Before we explore several varied applications of autoencoders, let us focus on their basic architecture. Consider Figure 3.11 which presents a schematic of an autoencoder with a single hidden layer. The input $x \in \mathbb{R}^p$ is transformed into a bottleneck, also called the code, which is some $\tilde{x} \in \mathbb{R}^m$ and is the hidden layer of the model. Then the bottleneck is further transformed into the output $\hat{x} \in \mathbb{R}^p$. The part of the autoencoder that transforms the input into the bottleneck is called the encoder and the part of the autoencoder that transforms the bottleneck to the output is called the decoder. Both the encoder and the decoder have parameters that are to be learned.
Figure 3.11: An autoencoder architecture with an encoder, decoder, and a bottleneck in between which is the single hidden layer of this neural network.

Note that many other deep learning models in this book will also have hidden layers. In fact, representing and computing gradients for the parameters of such layers is the focus of Chapter ?? and stands at the heart of deep learning. In this autoencoder example, the single hidden layer is also the bottleneck of the autoencoder and thus we informally consider this shallow autoencoder as “simple”. Other “deeper” autoencoders may have multiple hidden layers of which one should be treated as the bottleneck or code.
Interestingly, for input $x$, once parameters are trained, we generally expect the autoencoder to generate output $\hat{x}$ that is as similar to the input $x$ as possible. This property of autoencoders, after which they are named, may at first seem obscure since it means that the autoencoder is a form of an identity function. However, this architecture is actually very useful for a variety of applications, some of which are detailed in this section and others appearing as parts of more complicated models such as for example sequence models which we discuss in Chapter ??.
For now, as the most basic application, consider the activity of data reduction already surveyed in Section ?? in the context of PCA and SVD (singular value decomposition). For this, assume that the dimension of the bottleneck $m$ is significantly smaller than the input and output dimension $p$ (in other non-data reduction applications we may also have $m \geq p$). For example, return to the case of MNIST digits (initially introduced in Section ??) where $p = 784$. For our example here, assume we have an autoencoder with $m = 30$.
If indeed a trained autoencoder yields $x \approx \hat{x}$ then it means that we have an immediate data reduction method. With the trained encoder we are able to convert digit images, each of size $28 \times 28 = 784$, into much smaller vectors, each of size 30. With the trained decoder we are able to convert back and get an approximation of the original image. This choice of $m$ implies a rather remarkable compression factor of about 26.

Figure 3.12: Reconstruction of the test set of MNIST (first row) using various types of autoencoders. Rows from top to bottom: MNIST test set, reconstruction with PCA, reconstruction with a shallow autoencoder, and reconstruction with a deep autoencoder.
Figure 3.12 presents the compression/decompression effect of several types of autoencoders with $m = 30$. The first row presents untouched MNIST images. The second row presents the effect of reducing the images via PCA (a shallow linear autoencoder) and then reverting back to the image. The third row presents the effect of a shallow non-linear autoencoder. Finally, the last row presents the effect with a richer autoencoder that has several hidden layers (a deep autoencoder).

There are multiple other applications for autoencoders which we soon survey. However, let us first formulate these types of models more precisely.
Single Layer Autoencoders
As already mentioned above, we may view an autoencoder as a function $f_\theta : \mathbb{R}^p \to \mathbb{R}^p$, where $\theta$ are the trainable parameters of this function. These parameters $\theta$ ideally influence the function’s operation such that $f_\theta(x) \approx x$ where $x$ is an arbitrary observation from either the seen or unseen data. The approximate equality, “$\approx$”, can be considered informally as closeness of two vectors.
In practice when faced with training data $\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}$, we train the autoencoder (learn the parameters $\theta$) such that $C(\theta \,;\, \mathcal{D}) = \frac{1}{n}\sum_{i=1}^{n} C_i(\theta)$ is minimized. This is a similar loss function setup to that used in supervised contexts such as linear regression, logistic regression, multinomial regression, or the deep learning models that are in the chapters that follow. For autoencoders, we construct the loss for an individual observation, $C_i(\theta)$, as some distance penalty measure between the input observation $x^{(i)}$ and the output $\hat{x}^{(i)} = f_\theta(x^{(i)})$. Contrast this with supervised learning where we compare the observed label and the predicted label. That is, with autoencoders the target of the model is the input, in contrast to a label $y^{(i)}$ in supervised learning.
The most straightforward choice for the distance penalty in $C_i(\theta)$ is the square of the Euclidean distance, namely,
$$C_i(\theta) = \big\| x^{(i)} - f_\theta(x^{(i)}) \big\|^2 \quad \text{and thus} \quad C(\theta \,;\, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - f_\theta(x^{(i)}) \big\|^2. \tag{3.36}$$
With this cost structure, learning the parameters, $\theta$, of an autoencoder based on data $\mathcal{D}$ is the process of minimizing $C(\theta \,;\, \mathcal{D})$.

In general, autoencoders may have architectures similar to the fully connected deep neural networks that we study in Chapter ?? as well as to extensions in the chapters that follow. This may include multiple hidden layers, convolutional layers, and other constructs. At this point, to understand the key concepts, let us revert to Figure 3.11 and consider autoencoders composed of the same components of that figure.
Specifically, we decompose $f_\theta(\cdot)$ to be a composition of the encoder function denoted via $f^{[1]}_{\theta^{[1]}}(\cdot)$ and the decoder function denoted via $f^{[2]}_{\theta^{[2]}}(\cdot)$, where $\theta^{[1]}$ are the parameters of the encoder and $\theta^{[2]}$ are the parameters of the decoder.¹¹ That is,
$$\hat{x} = f_\theta(x) = f^{[2]}_{\theta^{[2]}} \circ f^{[1]}_{\theta^{[1]}}(x) = f^{[2]}_{\theta^{[2]}}\big( f^{[1]}_{\theta^{[1]}}(x) \big),$$
where the notation using the square bracketed superscripts for the functions and parameters is in agreement with the notation we use for layers of deep neural networks in Chapter ?? and onwards.
In general, one may construct all kinds of structures for the encoder, decoder, and their parameters. In our case, we consider a specific single layer neural network structure. We define,
$$f^{[1]}_{\theta^{[1]}}(u) = S^{[1]}\big(b^{[1]} + W^{[1]} u\big) \quad \text{for } u \in \mathbb{R}^p \quad \text{(Encoder)}$$
$$f^{[2]}_{\theta^{[2]}}(u) = S^{[2]}\big(b^{[2]} + W^{[2]} u\big) \quad \text{for } u \in \mathbb{R}^m \quad \text{(Decoder)}, \tag{3.37}$$
where the notation is somewhat similar to (3.26). Specifically for $\ell = 1, 2$, $b^{[\ell]}$ are vectors, $W^{[\ell]}$ are matrices, and $S^{[\ell]}(\cdot)$ are vector activation functions with $S^{[1]} : \mathbb{R}^m \to \mathbb{R}^m$ and $S^{[2]} : \mathbb{R}^p \to \mathbb{R}^p$. Before describing the exact details of these functions, let us focus on the autoencoder parameters.
The encoder parameters $\theta^{[1]}$ are composed of the bias $b^{[1]} \in \mathbb{R}^m$ and weight matrix $W^{[1]} \in \mathbb{R}^{m \times p}$, and the decoder parameters $\theta^{[2]}$ are composed of the bias $b^{[2]} \in \mathbb{R}^p$ and weight matrix $W^{[2]} \in \mathbb{R}^{p \times m}$. Thus the complete list of parameters for the autoencoder is,
$$\theta = \big(b^{[1]}, W^{[1]}, b^{[2]}, W^{[2]}\big). \tag{3.38}$$
Returning to the vector activation functions $S^{[1]}(\cdot)$ and $S^{[2]}(\cdot)$, we construct these based on scalar activation functions $\sigma^{[\ell]} : \mathbb{R} \to \mathbb{R}$ for $\ell = 1, 2$ such as the sigmoid function (3.4), the identity function $\sigma(u) = u$, or one of many other variants (see also Section ??). Specifically, we set $S^{[\ell]}(z)$ to be the element-wise application of $\sigma^{[\ell]}(\cdot)$ on each of the coordinates of $z$. Namely,
$$S^{[\ell]}(z) = \begin{bmatrix} \sigma^{[\ell]}(z_1) \\ \vdots \\ \sigma^{[\ell]}(z_r) \end{bmatrix}. \tag{3.39}$$
The choice of the type of scalar activation function may depend on the domain of the input $x$ since the output of the model aims to reconstruct the input. For example, use of the sigmoid function for $\sigma^{[2]}(\cdot)$ restricts the output to be in the range $[0, 1]$ and this will clearly be unsuitable in cases where the input is not limited to this range.
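To make (3.37) and (3.39) concrete, here is a minimal NumPy sketch of the encoder and decoder. The choice of a sigmoid $\sigma^{[1]}$ with an identity $\sigma^{[2]}$, and the dimensions $p = 5$, $m = 2$, are our own illustrative assumptions, not a prescription.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoder(x, W1, b1):
    # f^[1]: R^p -> R^m, element-wise sigmoid activation as in (3.39)
    return sigmoid(b1 + W1 @ x)

def decoder(x_tilde, W2, b2):
    # f^[2]: R^m -> R^p, identity activation (suits unbounded inputs)
    return b2 + W2 @ x_tilde

p, m = 5, 2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(m, p)), np.zeros(m)   # theta^[1]
W2, b2 = rng.normal(size=(p, m)), np.zeros(p)   # theta^[2]

x = rng.normal(size=p)
x_tilde = encoder(x, W1, b1)       # the code, in R^m
x_hat = decoder(x_tilde, W2, b2)   # the reconstruction, in R^p
```

Note how the sigmoid encoder confines each bottleneck unit to $(0, 1)$, while the identity decoder leaves the output unconstrained.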
¹¹ An alternative way to denote these parameters would have been using $\phi$ (in place of $\theta^{[1]}$) for the encoder and $\theta$ (in place of $\theta^{[2]}$) for the decoder. This is the notation used in variational autoencoders in Chapter ??.
With this notation in place, it may also be useful to see the individual representation of the bottleneck units $\tilde{x}_1, \ldots, \tilde{x}_m$, and the outputs $\hat{x}_1, \ldots, \hat{x}_p$. Specifically, with $a_k = \tilde{x}_k$,
$$\tilde{x}_i = \sigma^{[1]}\Big(b^{[1]}_i + \sum_{k=1}^{p} w^{[1]}_{i,k} x_k\Big), \quad \text{for } i = 1, \ldots, m,$$
$$\hat{x}_j = \sigma^{[2]}\Big(b^{[2]}_j + \sum_{k=1}^{m} w^{[2]}_{j,k} a_k\Big), \quad \text{for } j = 1, \ldots, p.$$
This set of equations is the first non-shallow (single hidden layer) neural network which we see in the book. Note also that we often use the notation $a_k = \tilde{x}_k$ as it is an ‘activation’ of a unit or neuron within the encoder.
Also, it may be useful to see the loss function representation as,
$$C(\theta \,;\, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \Big\| x^{(i)} - \underbrace{S^{[2]}\big(W^{[2]} S^{[1]}(W^{[1]} x^{(i)} + b^{[1]}) + b^{[2]}\big)}_{f_\theta(x^{(i)})} \Big\|^2. \tag{3.40}$$
With this loss function, for given data $\mathcal{D}$, the learned autoencoder parameters $\hat{\theta}$ are given by a solution to the optimization problem $\min_\theta C(\theta \,;\, \mathcal{D})$.
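As a sketch of this optimization problem, the following toy NumPy snippet (our own setup, not a practical training method) evaluates (3.40) with a sigmoid $\sigma^{[1]}$ and identity $\sigma^{[2]}$, and decreases it with gradient descent, approximating the gradient by central finite differences; real implementations use backpropagation as discussed in Chapter ??.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def C(theta, X, p, m):
    # Unpack the flat parameter vector into (b1, W1, b2, W2) as in (3.38).
    b1 = theta[:m]
    W1 = theta[m:m + m * p].reshape(m, p)
    b2 = theta[m + m * p:m + m * p + p]
    W2 = theta[m + m * p + p:].reshape(p, m)
    # The loss (3.40): average squared reconstruction error over the data.
    X_hat = (b2[:, None] + W2 @ sigmoid(b1[:, None] + W1 @ X.T)).T
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

rng = np.random.default_rng(1)
p, m, n = 4, 2, 50
X = rng.normal(size=(n, p))
theta = 0.1 * rng.normal(size=m + m * p + p + p * m)

# Gradient descent with finite-difference gradients (illustration only).
eta, eps = 0.02, 1e-6
losses = [C(theta, X, p, m)]
for _ in range(40):
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = eps
        grad[k] = (C(theta + e, X, p, m) - C(theta - e, X, p, m)) / (2 * eps)
    theta -= eta * grad
    losses.append(C(theta, X, p, m))
```

The recorded `losses` trace the descent of $C(\theta\,;\,\mathcal{D})$ over the iterations.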
PCA is an Autoencoder
It turns out that autoencoders generalize principal component analysis (PCA), introduced in Section ??. We overview the details by seeing that PCA is essentially a shallow autoencoder with identity activation functions $\sigma^{[\ell]}(u) = u$ for $\ell = 1, 2$, also known as a linear autoencoder. Specifically, we now summarize how PCA yields one possible solution to the learning optimization problem for linear autoencoders. Note that some of the mathematical details below may be skipped on a first reading without loss of continuity. The key outcome of this subsection is that PCA and linear autoencoders are essentially the same.
Consider the loss (3.40) with identity activation functions. In this case the vector activation functions $S^{[\ell]}(\cdot)$ of (3.39) are each vector identity functions and (3.40) reduces to,
$$C(\theta \,;\, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - W^{[2]} W^{[1]} x^{(i)} - W^{[2]} b^{[1]} - b^{[2]} \big\|^2. \tag{3.41}$$
Now it can be shown¹² that by considering the de-meaned data, we can formulate the objective without the bias vectors $b^{[1]}$ and $b^{[2]}$ in (3.41) and focus on optimizing the loss function,
$$C\big(W^{[1]}, W^{[2]} \,;\, \mathcal{D}_{\text{de-meaned}}\big) = \frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - W^{[2]} W^{[1]} x^{(i)} \big\|^2. \tag{3.42}$$
Here when the data is denoted $\mathcal{D}_{\text{de-meaned}}$, we reuse the notation of the data samples $x^{(i)}$, taking them now as de-meaned feature vectors (see also (??) in Chapter ??). It can be shown that optimization of this new loss function (3.42) is equivalent to optimization of
¹² This is shown by considering the derivative with respect to $b^{[2]}$ as a first step and then reorganizing the objective (3.41) with the expression of $b^{[2]}$ that sets this derivative to 0. See also the notes and references at the end of the chapter.
(3.41). Specifically, if minimizing (3.42) and obtaining minimizers $\hat{W}^{[1]}$ and $\hat{W}^{[2]}$, then $\hat{b}^{[1]}$ can be set to any value and $\hat{b}^{[2]}$ has a specific expression. These bias values, together with $\hat{W}^{[1]}$ and $\hat{W}^{[2]}$, minimize (3.41).
Now it is possible to go one step further in reducing the parameter space. In fact, it can be shown that for the optimum of (3.42) we have $\hat{W}^{[1]} = \big(\hat{W}^{[2]}\big)^{+}$. This is the pseudoinverse as introduced in Section ??. Specifically when $\hat{W}^{[2]}$ is full column rank the pseudoinverse can be represented as,
$$\big(\hat{W}^{[2]}\big)^{+} = \big(\hat{W}^{[2]\top} \hat{W}^{[2]}\big)^{-1} \hat{W}^{[2]\top},$$
and thus the matrix in (3.42) is,
$$\hat{W}^{[2]} \hat{W}^{[1]} = \hat{W}^{[2]} \big(\hat{W}^{[2]}\big)^{+} = \hat{W}^{[2]} \big(\hat{W}^{[2]\top} \hat{W}^{[2]}\big)^{-1} \hat{W}^{[2]\top}.$$
This is the $p \times p$ projection matrix which projects vectors of length $p$ onto the $m$ dimensional column space of $W^{[2]}$. Now using the QR decomposition¹³ this projection matrix may be represented as $V V^\top$ where the $p \times m$ matrix $V$ has orthonormal columns. This means that an equivalent optimization problem to the problem of minimizing (3.42) is
$$\min_{\substack{V \in \mathbb{R}^{p \times m} \\ V^\top V = I_m}} \;\; \frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - V V^\top x^{(i)} \big\|^2, \quad \text{where } x^{(i)} \in \mathcal{D}_{\text{de-meaned}}, \tag{3.43}$$
with $I_m$ denoting the $m \times m$ identity matrix. This constrained optimization problem limits the search space to the space of $p \times m$ matrices that have orthonormal columns. Any minimizer $V$ of (3.43) can then be used as a minimizer of the losses (3.41) or (3.42) by setting,
$$\hat{W}^{[1]} = V^\top, \quad \text{and} \quad \hat{W}^{[2]} = V.$$
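This identity between PCA and the optimal linear autoencoder can be checked numerically. The following sketch (with dimensions of our own choosing) builds $V$ from the SVD of a de-meaned data matrix, sets $\hat{W}^{[1]} = V^\top$ and $\hat{W}^{[2]} = V$, and compares the reconstruction error against a random orthonormal basis.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 200, 5, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated features
X = X - X.mean(axis=0)                                 # de-meaned, as in (3.42)

# V: the first m right singular vectors of X (the PCA loadings).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:m].T                     # p x m, orthonormal columns

W1_hat = V.T                     # linear encoder, m x p
W2_hat = V                       # linear decoder, p x m

codes = X @ W1_hat.T             # each row is V^T x^(i), as in (3.44)
X_hat = codes @ W2_hat.T         # each row is V V^T x^(i), as in (3.45)

def recon_error(U):
    # The objective (3.43) for a p x m matrix U with orthonormal columns.
    return np.mean(np.sum((X - X @ U @ U.T) ** 2, axis=1))

# A random orthonormal competitor via QR.
Q, _ = np.linalg.qr(rng.normal(size=(p, m)))
```

The comparison `recon_error(V) <= recon_error(Q)` reflects the Eckart-Young-Mirsky optimality discussed below: no other orthonormal basis attains a smaller value of (3.43).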
Now let us see the connection to PCA. Recall (??) representing the encoded lower dimensional PCA data as $\widetilde{X} = X V$ where the $n \times p$ matrix $X$ is a (de-meaned) data matrix as in Section ??, the $n \times m$ matrix $\widetilde{X}$ is the reduced data matrix, and the matrix $V$ has columns that are singular vectors from the SVD decomposition of $X$. Hence the projection of each data point $x^{(i)}$ into this lower dimensional space is given by
$$\tilde{x}^{(i)} = V^\top x^{(i)}, \tag{3.44}$$
with $V$ the $p \times m$ matrix of columns $v_1, \ldots, v_m$ that are an orthonormal basis of a reduced subspace of $\mathbb{R}^p$. Further, with PCA, the matrix $V$ can also be used to reconstruct the data as a decoder. Specifically, the decoded data points in the original space can be obtained via
$$\hat{x}^{(i)} = V \tilde{x}^{(i)}. \tag{3.45}$$
Now piecing together the encoder in (3.44) and the decoder in (3.45), the reconstruction error is,
$$\frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - \hat{x}^{(i)} \big\|^2 = \frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - V V^\top x^{(i)} \big\|^2. \tag{3.46}$$
We see that (3.46) agrees with the objective in (3.43) and further the matrix $V$ of PCA agrees with the constraint $V^\top V = I_m$. Now from the Eckart-Young-Mirsky theorem appearing in (??) of Section ?? as well as the relationship between SVD and PCA captured in (??) we have that $V$ of PCA is indeed one of the optimal solutions¹⁴ of (3.43). Hence in summary, PCA is a (shallow) linear autoencoder.

¹³ See the notes and references of Chapter ?? for suggested linear algebra background reading.
Figure 3.13: Encoding and decoding of synthetic $p = 2$ data with a linear autoencoder (PCA) in red and a non-linear shallow autoencoder (blue). The reconstruction of PCA falls on a hyperplane while the non-linear autoencoder projects onto a manifold that is not a hyperplane.
In practice, one would not use gradient based optimization to learn the parameters of linear autoencoders, but rather employ algorithms from numerical linear algebra. Further, the specific basis vectors obtained via PCA are informative since they order vectors according to their variance contributions. In contrast, optimizing (3.43) without considering PCA yields an arbitrary $V$ (with orthonormal columns). Nevertheless, as we see now, our positioning of PCA as a special case of the (non-linear) autoencoder is insightful.
Autoencoders as a Form of Non-linear PCA
As we discussed above, encoding and decoding with PCA projects the $p$ dimensional feature vector $x$ onto an $m$ dimensional subspace. When $p = 2$ and $m = 1$ this can be viewed as a projection of points from the plane onto a line, when $p = 3$ and $m = 2$ this is a projection of points from three dimensional space onto some plane, and similarly in more realistic higher dimensions of $p$, we project onto a linear subspace of dimension $m$. As such, the bottleneck of the linear encoder (PCA) encodes the location of the points on this projected space.
Linear subspace projection is sometimes a sufficient data reduction technique and at other times is not. In such cases there are multiple other forms of non-linear PCA where points are projected onto manifolds that are generally curved. Since autoencoders generalize PCA, they present us with one such rich class of non-linear PCA models.

¹⁴ There is an infinite number of solutions.
As an illustration consider Figure 3.13 where we consider synthetic data with $p = 2$ which we wish to encode with $m = 1$. This means that the bottleneck layer, or the code, represents a value on the real line for each data point. If we use PCA (red) then this encoding translates to a location on a linear subspace of $\mathbb{R}^2$. However, if we modify the identity activation functions in the autoencoder to be non-linear (blue), then the projection is on a manifold which is generally curved. In this example the non-linear activation functions are taken as tanh functions; see also Section ??.
As a further example, consider using an autoencoder on MNIST where $p = 28 \times 28 = 784$ and we use $m = 2$. We encode this via PCA, a shallow non-linear autoencoder, and a deep autoencoder that has hidden layers. In Figure 3.14 we present scatter plots of the codes for various types of autoencoders for both the training and test sets. That is, the autoencoders are trained on the training set and the codes presented are both for the training set data and for the test set data. While the training and testing does not directly involve the labels (the digits $0$–$9$), in our visualization we color the code points based on the labels. This allows us to see how different labels are generally encoded onto different regions of the code space. Recall also Figure ?? (b) which is of a similar nature.
Figure 3.14: Autoencoders applied to MNIST with each of the 10 digits color coded. (a)–(c) are
for the training set and (d)–(f) are for the test set. In the first column with (a) and (d) we use PCA.
In the middle column with (b) and (e) we use a shallow non-linear autoencoder. In the last column
with (c) and (f) we use a deep autoencoder.
Keeping in mind that one application of such data reduction is to help separate the data, it
is evident that as model complexity increases (moving right along the displays of the figure),
somewhat better separation occurs in the data. In particular, compare (d) based on the test set using PCA, and (f) based on the test set using the deep autoencoder. Refer also to Figure 3.12 which illustrates the reconstruction effect for various types of autoencoders (here $m = 30$). In terms of reconstruction, it is also evident in this case that more complex models exhibit better reconstruction ability.
Applications and Architectures
We have already seen the archetypical autoencoder application, namely data reduction. Yet there are many more applications and associated architectures of autoencoders. We now discuss some of these. We first consider various ways in which data reduction can be employed to help with additional machine learning activities. We then discuss de-noising which is a different application from data reduction. We close the chapter with ways in which interpolation in the latent space of the bottleneck is useful. The discussion here is just a brief summary and more information is suggested in the notes and references at the end of the chapter.
In terms of uses of data reduction, let us consider both supervised and unsupervised learning. In the supervised setting, whenever a dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$ is used where the dimension of $x^{(i)}$ is $p$, one may attempt to reduce the dimension of the input features to $m < p$ and obtain a dataset $\widetilde{\mathcal{D}} = \{(\tilde{x}^{(1)}, y^{(1)}), \ldots, (\tilde{x}^{(n)}, y^{(n)})\}$. With such a dataset, training a supervised learning model to predict $y$ in terms of $\tilde{x} \in \mathbb{R}^m$ may sometimes yield much better performance than using the original $x \in \mathbb{R}^p$. This process requires training an autoencoder as well as training of the model. Then in production with unseen data, for any $x$ we first use the trained encoder to obtain $\tilde{x}$. We then use the trained model on $\tilde{x}$ to obtain $\hat{y}$. Note that with such an activity we do not use the decoder per se.
To take this application of data reduction even further, consider a situation with $y^{(i)} \in \mathbb{R}^q$ where $q$ is not small (high dimensional labels such as segmentation maps on images for example). In this case one may encode both the feature vector and the label, each with their own autoencoder, to obtain encoded data of the form $\widetilde{\mathcal{D}} = \{(\tilde{x}^{(1)}, \tilde{y}^{(1)}), \ldots, (\tilde{x}^{(n)}, \tilde{y}^{(n)})\}$, say with $\tilde{x}^{(i)} \in \mathbb{R}^{m_p}$, $\tilde{y}^{(i)} \in \mathbb{R}^{m_q}$, $m_p < p$, and $m_q < q$. One may then train a supervised model using this $\widetilde{\mathcal{D}}$. Such a model predicts $\hat{\tilde{y}} \in \mathbb{R}^{m_q}$ for each encoded feature vector $\tilde{x} \in \mathbb{R}^{m_p}$. With this setup, when the model is used in production, one first observes a feature vector $x$, then uses the feature vector encoder to create $\tilde{x}$, then uses the model to predict $\hat{\tilde{y}}$, and finally uses the label decoder to obtain $\hat{y}$. Hence with such an activity, in production the encoder of the feature vector and the decoder of the label are used.
In terms of unsupervised learning, there are many secondary applications of autoencoder based data reduction as well. As one example, assume that we wish to cluster images. A naive approach may be to treat each image as a vector and then use an algorithm such as K-means (see Section ??) to cluster the vectors. This approach has many drawbacks since the distance between images is based on the exact locations (indices) of pixels within an image. For example, two images of the same object with slightly different camera locations would generally be “far” when comparing the Euclidean distance between the associated vectors. A more suitable approach is to use an autoencoder for data reduction of the image to a low dimension and then to perform clustering on the reduced dimensional codes. See for example Figure 3.14 (f). Here $m = 2$ and with such encoding, clustering the vectors on the two dimensional code space will generally work well.
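The clustering-on-codes idea can be sketched as follows, with a linear (PCA) encoder in place of a trained autoencoder and a few hand-rolled iterations of K-means (Lloyd's algorithm); the synthetic "images" and all dimensions are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy "images": three groups in p = 50 dimensions whose differences live
# in an m = 2 dimensional subspace.
n_per, p, m, K = 60, 50, 2, 3
basis = rng.normal(size=(m, p))
group_means = ([0.0, 4.0], [4.0, 0.0], [4.0, 4.0])
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(n_per, m)) @ basis
               for c in group_means])

# Linear encoder (PCA) standing in for a trained autoencoder encoder.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
codes = Xc @ Vt[:m].T            # n x m matrix of codes

# K-means (Lloyd's algorithm) on the low dimensional codes,
# initialized with one point from each group for simplicity.
centers = codes[[0, n_per, 2 * n_per]].copy()
for _ in range(20):
    d = np.linalg.norm(codes[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    centers = np.stack([codes[labels == k].mean(axis=0) for k in range(K)])
```

Clustering the 50-dimensional vectors directly would mix the group structure with pixel-level noise; on the 2-dimensional codes the groups are well separated and the clusters align with them.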
Figure 3.15: A denoising autoencoder where during training partially destroyed (noised) data is fed into the input. The production system is the trained autoencoder.
We now move away from data reduction and consider an architecture called the denoising autoencoder which is illustrated in Figure 3.15. This model learns to remove noise during the reconstruction step for noisy input data. It takes in partially corrupted input and learns to recover the original denoised input. It relies on the hypothesis that high-level representations are relatively stable and robust to corruption of the input entries, and that the model is able to extract characteristics that are useful for the representation of the input distribution.
Figure 3.16: Interpolation of images with $\lambda = 1/2$. The left and right images of $x^{(i)}$ and $x^{(j)}$ are raw images. The top image is the naive interpolation $x^{\text{naive}}_\lambda$ and the bottom image is obtained via interpolation using an autoencoder, $x^{\text{encoder}}_\lambda$.
In terms of architectures, denoising autoencoders exhibit a similar architecture to the usual autoencoder as they involve an encoder $f^{[1]}_{\theta^{[1]}}(\cdot)$ and decoder $f^{[2]}_{\theta^{[2]}}(\cdot)$. However a key difference is that during the learning process the model is trained on noisy samples and the loss function is modified. First the initial input $x$ is corrupted into $\breve{x}$ via some predefined stochastic mapping performing the partial destruction. In practice the main corruption mappings include adding
Gaussian noise, masking noise (a fraction of the input chosen at random for each example is
forced to 0), or salt-and-pepper noise (a fraction of the input chosen at random for each
example is set to its minimum or maximum value with uniform probability).
Now with the corrupted input $\breve{x}$, the autoencoder encodes and decodes $\breve{x}$ in the usual way yielding a reconstructed $\hat{\breve{x}}$ via,
$$f_\theta(\breve{x}) = f^{[2]}_{\theta^{[2]}}\big(f^{[1]}_{\theta^{[1]}}(\breve{x})\big) =: \hat{\breve{x}}.$$
However, instead of using a loss like (3.36) which would compare $\breve{x}$ and $\hat{\breve{x}}$, we modify the loss function for each observation $i$ to be $C_i(\theta) = \|x^{(i)} - \hat{\breve{x}}^{(i)}\|^2$. Hence during training, we seek parameters $\theta$ that try to remove the noise as much as possible. When used in production on an unseen data sample $x$, the data corruption step is clearly not employed. Instead, at this point the autoencoder output $f_\theta(x)$ is conditioned to generate outputs $\hat{x}$ that are denoised.
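The structure of the denoising loss can be sketched as follows; here masking noise is the chosen corruption, and an untrained identity placeholder stands in for $f_\theta$ since training itself is beyond this snippet.

```python
import numpy as np

rng = np.random.default_rng(5)

def mask_corrupt(x, frac, rng):
    # Masking noise: a random fraction of the entries is forced to 0.
    x_breve = x.copy()
    idx = rng.choice(x.size, size=int(frac * x.size), replace=False)
    x_breve[idx] = 0.0
    return x_breve

# Placeholder for a trained autoencoder f_theta; here just the identity,
# so the loss below isolates the effect of the corruption itself.
f_theta = lambda x: x

x = rng.normal(size=20)                # a clean observation x^(i)
x_breve = mask_corrupt(x, 0.3, rng)    # corrupted input fed to the model
x_hat_breve = f_theta(x_breve)         # reconstruction of the corrupted input

# The denoising loss compares against the CLEAN input, not the corrupted one:
C_i = np.sum((x - x_hat_breve) ** 2)
```

With a trained denoiser in place of the identity, minimizing $C_i(\theta)$ over the parameters pushes $f_\theta(\breve{x})$ toward the clean $x$ rather than toward $\breve{x}$.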
As a third general application of autoencoders, let us consider interpolation on the latent space. Take $x^{(i)}$ and $x^{(j)}$ from $\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}$ and consider the convex combination
$$x^{\text{naive}}_\lambda = \lambda x^{(i)} + (1 - \lambda) x^{(j)},$$
for some $\lambda \in [0, 1]$. Ideally such an $x^{\text{naive}}_\lambda$ may serve as a weighted average between the two observations where $\lambda$ clearly captures which of the observations has more weight. However with most data, such arithmetic on the associated feature vectors is too naive and often meaningless for similar reasons to those discussed above in the context of K-means clustering of images. For example, in the top image of Figure 3.16 we see such interpolation with $\lambda = 1/2$.
When considering the latent space representation of the images it is often possible to create a much more meaningful interpolation between the images. An example is in the bottom image of Figure 3.16. To carry out this interpolation we train an autoencoder and then encode $x^{(i)}$ and $x^{(j)}$ to obtain $\tilde{x}^{(i)}$ and $\tilde{x}^{(j)}$. We then interpolate on the codes to obtain $\tilde{x}_\lambda$, and finally decode $\tilde{x}_\lambda$ to obtain an interpolated image. That is, using the notation of (3.37) and omitting parameter subscripts we have,
$$x^{\text{encoder}}_\lambda = f^{[2]}\big(\lambda f^{[1]}(x^{(i)}) + (1 - \lambda) f^{[1]}(x^{(j)})\big).$$
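The two interpolation schemes can be contrasted with a small sketch; for simplicity we use a linear (PCA) encoder/decoder pair in place of a trained non-linear autoencoder, so the $f^{[1]}$ and $f^{[2]}$ below are our own affine stand-ins.

```python
import numpy as np

rng = np.random.default_rng(6)

n, p, m = 100, 8, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated data
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
V = Vt[:m].T

f1 = lambda x: V.T @ (x - mu)   # affine encoder f^[1]
f2 = lambda z: mu + V @ z       # affine decoder f^[2]

lam = 0.5
x_i, x_j = X[0], X[1]

# Naive interpolation directly on the observations:
x_naive = lam * x_i + (1 - lam) * x_j

# Interpolation on the codes, then decoding:
x_enc = f2(lam * f1(x_i) + (1 - lam) * f1(x_j))
```

For this affine pair the two schemes differ only by a projection, $x^{\text{encoder}}_\lambda = \mu + VV^\top(x^{\text{naive}}_\lambda - \mu)$; with a non-linear autoencoder, interpolating on the codes instead traverses the learned manifold, which is what produces the more meaningful bottom image of Figure 3.16.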
This property of latent space representations, and the ability to interpolate with these representations, also plays a role in the context of generative models discussed in Chapter ??. At this point let us mention that one potential application of such interpolation is for design purposes, say in art or architecture, where one chooses two samples as a starting point and then uses interpolation to see other samples lying “in between”.
Notes and References
A comprehensive book on applied logistic and multinomial regression in the context of statistics
is [
19
] where one can find out about confidence intervals, hypothesis tests, and other aspects of
inference. The statistical origins of logistic regression are most probably due to Chester Ittner Bliss
whom developed the related probit regression model in the 1930’s; see [
5
]. Probit was initially used
for bioassay studies. See also [
12
] for an historical account where the development of the logistic
function (the sigmoid function) as a solution to logistic differential equations is presented. In the
context of machine learning, logistic regression may be viewed as one example of a probabilistic
generative model, see for example Section 4.2 of [4].
These days, in the context of deep learning, logistic regression is treated as the simplest general
non-linear (shallow) neural network. However, the original simplest (non-linear) neural network is
not actually logistic regression but rather the perceptron; see [
22
]. That model, created by Rosenblatt
in 1958, differs from logistic regression in that the activation function is the ; see
(??)
in Chapter
??
.
With this activation function, gradient based optimization is not possible. Yet Rosenblatt introduced
the which finds a classifier in finite time when the data is linearly separable; see also [20].
The phrase “softmax” for (3.25) started to be used in the machine learning community at around 1990; see [8]. The presentation of both logistic regression and multinomial regression as specific cases of generalized linear models is described in [13]. We also mention several variants of these models. These include Dirichlet regression which is used to analyse continuous proportions and compositional response data; see [14], [16], and [17]. Also ordinal regression is relevant for predicting an ordinal response variable. Examples of ordinal models are the ordered logit and ordered probit models. A comprehensive reference for analysing categorical variables is [1] and for ordinal variables see [2]. See also chapters 10 and 12 of [18]. A related term from the world of machine learning is ranking learning, also known as learning to rank. This field builds on ordinal regression; see for example [23]. See [10] as a general reference for inference using maximum likelihood estimation in the context of biostatistics as well as the classic theoretical statistics book [11]. Further, see [6] for additional computational aspects including optimization of multinomial regression using second-order methods.
Our focus on autoencoders in this chapter is mostly on the most popular architecture where the hidden layer has fewer units than the input. This specific architecture is called undercomplete as opposed to the overcomplete case which has more hidden units than the input. We mention that autoencoders in general, and specifically overcomplete architectures, go hand in hand with regularisation, achieving sparse representations of the code; see for example chapter 14 of [15]. A general overview of autoencoder applications and architectures is in [9]. Specifically, different variants of autoencoders have recently emerged including sparse autoencoders, contractive autoencoders, robust autoencoders, and adversarial autoencoders. One particular class explored in Chapter ?? is variational autoencoders. For relationships between PCA and autoencoders see [21] as well as the more classic works [3] and [7].
Bibliography
[1] A. Agresti. Categorical Data Analysis. John Wiley & Sons, 2003.
[2] A. Agresti. Analysis of Ordinal Categorical Data. John Wiley & Sons, 2010.
[3] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 1989.
[4] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[5] C. I. Bliss. The method of probits. Science, 1934.
[6] D. Böhning. Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 1992.
[7] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 1988.
[8] J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing. 1990.
[9] D. Charte, F. Charte, S. García, M. J. del Jesus, and F. Herrera. A practical tutorial on autoencoders for nonlinear feature fusion: Taxonomy, models, software and guidelines. Information Fusion, 2018.
[10] D. Commenges and H. Jacqmin-Gadda. Dynamical Biostatistical Models. CRC Press, 2015.
[11] D. R. Cox and D. V. Hinkley. Theoretical Statistics. CRC Press, 1979.
[12] J. S. Cramer. The origins of logistic regression. Tinbergen Institute Working Paper No. 2002-119/4, 2002.
[13] A. J. Dobson and A. G. Barnett. An Introduction to Generalized Linear Models. Chapman and Hall/CRC, 2018.
[14] J. C. Douma and J. T. Weedon. Analysing continuous proportions in ecology and evolution: A practical introduction to beta and Dirichlet regression. Methods in Ecology and Evolution, 2019.
[15] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[16] R. Gueorguieva, R. Rosenheck, and D. Zelterman. Dirichlet component regression and its applications to psychiatric data. Computational Statistics & Data Analysis, 2008.
[17] R. H. Hijazi and R. W. Jernigan. Modelling compositional data using Dirichlet regression models. Journal of Applied Probability & Statistics, 2009.
[18] J. M. Hilbe. Logistic Regression Models. Chapman and Hall/CRC, 2009.
[19] D. W. Hosmer Jr, S. Lemeshow, and R. X. Sturdivant. Applied Logistic Regression. John Wiley & Sons, 2013.
[20] A. B. Novikoff. On convergence proofs for perceptrons. Technical report, Stanford Research Inst. Menlo Park CA, 1963.
[21] E. Plaut. From principal subspaces to principal components with linear autoencoders. arXiv:1804.10253, 2018.
[22] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
[23] A. Shashua and A. Levin. Ranking with large margin principle: Two approaches. Advances in Neural Information Processing Systems, 2002.