Mathematical Engineering of Deep Learning

Chapter 2

Benoit Liquet, Sarat Moka and Yoni Nazarathy

February 28, 2024

Contents

Preface

1 Introduction
  1.1 The Age of Deep Learning
  1.2 A Taste of Tasks and Architectures
  1.3 Key Ingredients of Deep Learning
  1.4 DATA, Data, data!
  1.5 Deep Learning as a Mathematical Engineering Discipline
  1.6 Notation and Mathematical Background
  Notes and References

2 Principles of Machine Learning
  2.1 Key Activities of Machine Learning
  2.2 Supervised Learning
  2.3 Linear Models at Our Core
  2.4 Iterative Optimization Based Learning
  2.5 Generalization, Regularization, and Validation
  2.6 A Taste of Unsupervised Learning
  Notes and References

3 Simple Neural Networks
  3.1 Logistic Regression in Statistics
  3.2 Logistic Regression as a Shallow Neural Network
  3.3 Multi-class Problems with Softmax
  3.4 Beyond Linear Decision Boundaries
  3.5 Shallow Autoencoders
  Notes and References

4 Optimization Algorithms
  4.1 Formulation of Optimization
  4.2 Optimization in the Context of Deep Learning
  4.3 Adaptive Optimization with ADAM
  4.4 Automatic Differentiation
  4.5 Additional Techniques for First-Order Methods
  4.6 Concepts of Second-Order Methods
  Notes and References

5 Feedforward Deep Networks
  5.1 The General Fully Connected Architecture
  5.2 The Expressive Power of Neural Networks
  5.3 Activation Function Alternatives
  5.4 The Backpropagation Algorithm
  5.5 Weight Initialization
  5.6 Batch Normalization
  5.7 Mitigating Overfitting with Dropout and Regularization
  Notes and References

6 Convolutional Neural Networks
  6.1 Overview of Convolutional Neural Networks
  6.2 The Convolution Operation
  6.3 Building a Convolutional Layer
  6.4 Building a Convolutional Neural Network
  6.5 Inception, ResNets, and Other Landmark Architectures
  6.6 Beyond Classification
  Notes and References

7 Sequence Models
  7.1 Overview of Models and Activities for Sequence Data
  7.2 Basic Recurrent Neural Networks
  7.3 Generalizations and Modifications to RNNs
  7.4 Encoders Decoders and the Attention Mechanism
  7.5 Transformers
  Notes and References

8 Specialized Architectures and Paradigms
  8.1 Generative Modelling Principles
  8.2 Diffusion Models
  8.3 Generative Adversarial Networks
  8.4 Reinforcement Learning
  8.5 Graph Neural Networks
  Notes and References

Epilogue

A Some Multivariable Calculus
  A.1 Vectors and Functions in R^n
  A.2 Derivatives
  A.3 The Multivariable Chain Rule
  A.4 Taylor's Theorem

B Cross Entropy and Other Expectations with Logarithms
  B.1 Divergences and Entropies
  B.2 Computations for Multivariate Normal Distributions

Bibliography

Index

2 Principles of Machine Learning

At its core, deep learning is a class of machine learning models and methods. Hence, to

understand deep learning, one must have at least a basic understanding of machine learning

principles. There are dozens of general machine learning methods and models that one can

cover and our purpose here is certainly not to present a detailed account of all of these

methods. Instead, we take a path that presents a general overview of machine learning, and

then focuses mostly on linear models which are the most elementary neural networks out

there. In the process we explore gradient based learning for the ﬁrst time, a topic that plays

a key role in the chapters that follow.

In Section 2.1 we present an overview of the key activities of machine learning including

supervised learning, unsupervised learning, and variants of these. In Section 2.2 we explore

key elements of supervised learning. In Section 2.3 we introduce linear models. These form

the basis for many other models as well as for the deep learning networks of this book.

We then explore the basic gradient descent algorithm in Section 2.4. Linear models can

often be trained without gradient descent, yet exploring gradient descent in the context

of linear models is a useful warmup for the chapters that follow. In Section 2.5 we discuss

generalization ability, overﬁtting, and introduce techniques of regularization. We further

discuss the training process including splitting of the data and cross validation methods.

Most of this book deals with supervised learning methods; however, understanding basic

techniques from unsupervised learning is also important. Hence, in Section 2.6, we take

a brief look at unsupervised methods including K-means clustering, principal component

analysis (PCA), and also touch on the singular value decomposition (SVD).

2.1 Key Activities of Machine Learning

The world of machine learning intersects heavily with both the worlds of statistics and

computer science. In statistics data and randomness are key. In contrast, in computer science,

algorithms and computation are the focus. Machine learning borrows from both worlds

and is about the combination of data and algorithms. It is all about training mathematical

models on a computer in order to classify data, predict outcomes, estimate relationships,

summarize data, control complex processes, and more.

We now present and characterize a few key activities of machine learning. These activities

are carried out when training, calibrating, adjusting, or designing machine learning models

and algorithms. Many of these activities are loosely called learning while other activities

involve prediction or decision making.

Any activity of machine learning can be described as an interaction between the following entities: data, models, algorithms, and the real world. By data we mean both collected data, such as the $(x, y)$ feature-label pairs described in the previous chapter, and data that is generated as output of models or algorithms. By models we mean mathematical objects stored and implemented on a computer, together with the parameters that specify these objects. By algorithms we mean the procedures for creating models, procedures for creating output datasets, as well as procedures for using the models themselves for prediction or related tasks. Finally, by the real world we refer to scenarios that support generation of data, annotation of data, as well as usage of output data for decision making and control.

Figure 2.1: Supervised learning. Activities include training a model and prediction in production. [Diagram labels: Training Data, Learning Algorithm, Model (Prediction Algorithm), Real World, input $x$, prediction $\hat{y}$.]

Machine learning activities are often dichotomized into two broad categories, supervised learning and unsupervised learning. With supervised learning, data is assumed to be available as $(x, y)$ pairs where each feature vector $x$ is labeled via $y$. In the case of unsupervised learning, one only observes data points $x$ and tries to find relationships between the various elements, variables, or coordinates of $x$. To understand this terminology consider the learning of babies or toddlers, which only involves the exploration of input sensory data without any indication of what is what. This is unsupervised learning since toddlers are typically not told explicitly “this does this” and “that does that”. Then later on, for example during school, they engage in supervised learning since language and text are used to present the learners with examples $x$ and their outcomes $y$.

A key activity in supervised learning is the usage of data to learn/train models for prediction. See Figure 2.1. This prediction is called classification in case the labels $y$ are from a finite discrete set and it is called regression in case the labels $y$ are continuous variables. There are also other cases of prediction where the labels $y$ are vectors, images, or similar. A related activity is to use the trained models for prediction when presented with unlabelled data from the real world; this is illustrated on the right of Figure 2.1. Both the training of models and the usage of models for prediction involve the execution of algorithms. Sometimes the trained model is called an “algorithm” as well since it may be integrated as part of bigger systems that use it. Most of this book focuses on supervised learning and we begin in the next section, Section 2.2, by overviewing key concepts of supervised learning.


With unsupervised learning there are other activities beyond prediction, regression, and

classification. See Figure 2.2. One important activity is clustering, which focuses on finding

groups of similar data samples. The output of algorithms that perform clustering is typically

not considered a model but rather a modified dataset that incorporates the clustering

information. Another key unsupervised learning activity is to carry out data reduction where

high dimensional vectors are transformed into lower dimensional vectors that still encode

some of the key relationships between variables. While unsupervised learning is not the focus

of this book, several unsupervised learning algorithms are overviewed in Section 2.6.

Figure 2.2: Some activities of unsupervised learning: Clustering (top) partitions the data. Dimension reduction (bottom) reduces the size of the features. [Diagram labels, top panel: Input Data, Clustering Algorithm, Model (Clustering Rule), Clustered Data (Cluster 1, Cluster 2, Cluster k), Real World. Bottom panel: Input Data, Data Reduction Algorithm, Model (Data Reduction Rule), Low Dimensional Data $\tilde{x}$.]

Beyond the dichotomy of machine learning into supervised and unsupervised, there are also

additional popular activities that are not directly categorized as such. One popular class of

activities is reinforcement learning introduced in Chapter 8. Here a temporal component

is key and an agent is trained to carry out tasks in a dynamic environment. An additional

class of activities is generative modelling also introduced in Chapter 8 in the context of

variational autoencoders, diﬀusion models, and generative adversarial networks. Here models

are trained to create artiﬁcial datasets with characteristics (or a distribution) similar to

the input dataset. An additional suite of activities is transfer learning. Transfer learning

is all about taking models that have been trained for one domain and adapting them to


other domains with new data. Related is active learning where the learning process is not

static but is rather informed by the performance of the model on unseen data. This is very

closely related to semi-supervised learning where, like supervised learning, there are both

feature vectors $x$ and labels $y$, yet only a subset of the feature vectors have accompanying

labels. The learning process tries to use all of the available data. Finally, self-supervised

learning, brieﬂy discussed in the context of deep learning natural language processing in

Chapter 7, creates models where sequences of data are used to self-predict the future or

missing elements of the sequences. This is useful for language models and related tasks.

Data: Seen, Unseen, Training, and Test

Data is a central part of machine learning. In considering data it is important to distinguish

between seen data and unseen data. Seen data is the data available for learning, namely for

training of models, model selection, parameter tuning, and testing of models. Unseen data

is essentially unlimited since it is all data from the real world that is not available while

learning takes place but is later available when the model is used. This can be data from the

future, or data that was not collected or labelled with the seen data.

Needless to say, for machine learning to work well, the nature of the seen data should be

similar to that of the unseen data. The underlying assumption of machine learning is that

the seen data used to create models is generated by underlying processes of the real world

that are similar to the processes generating unseen data. Practically, one needs to carry out

data collection and labelling so that this resemblance between the seen and unseen data is

maintained.

A common practice in the world of machine learning is to split the seen data into training

data and testing data. These are sometimes called the training set and test set; an additional

name for the test set is the hold out set. The key idea with such a split is to use the training

data for learning and to use the testing data for mimicking a scenario of unseen data. As

described in Section 1.4, some popular example datasets come with such a predefined split.

In other cases, it is up to the machine learning engineer to split the data randomly according

to some predetermined proportions. Examples follow in this chapter.
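As a minimal sketch of such a random split, the following assumes the seen data is indexed by 0, ..., n-1 and that an 80-20 proportion is wanted. The function name `train_test_split_indices`, the seed, and the sample size are hypothetical choices made only for illustration.

```python
import numpy as np

def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Randomly partition the indices 0..n-1 into train and test index arrays."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n)             # random ordering of all sample indices
    n_test = int(round(test_fraction * n))    # size of the held out test set
    return shuffled[n_test:], shuffled[:n_test]

# Illustrative sample size (an assumption, not a dataset from the text).
train_idx, test_idx = train_test_split_indices(n=1000, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

Splitting the indices rather than the data keeps each feature vector paired with its label: the same index arrays can be applied to both.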

Since the purpose of the train-test data split is to mimic the unseen data with the test set, one

should not recalibrate, adjust, or tune models on the training set while testing repeatedly on

the test set. Carrying out such a repetitive use of the test set would invalidate its resemblance

to unseen data. For this reason one sometimes performs an additional split of the training

data by removing a chunk out of the training data and calling it the validation set. More on

this practice and other alternatives such as K-fold cross validation is in Section 2.5.

The practice of splitting data into the train set and test set is very popular in machine

learning so long as a large number of samples is available. However, in some cases there is

not enough data to be able to separate a test set and thus one wishes to use all available

data for model ﬁtting. This is sometimes the case when working with experimental data and

biomedical data. In such cases, statistical inference approaches for evaluating the quality of

the model ﬁt make heavier use of the model at hand. Some approaches for comparing models

are likelihood based and include performance measures such as the Akaike information

criterion (AIC) or Bayesian information criterion (BIC). The world of model ﬁtting in a

statistical context is vast and we do not focus on such methods further in this book.


Data Preprocessing

Raw data often requires preprocessing before it can be used for training, prediction, or as

input to other machine learning models. Although a full description of data processing steps

and practical aspects of data processing is beyond our scope, one important activity that we

cover is standardization of the data, also sometimes called normalization of the data. This

involves subtraction of the mean of each feature and division by the standard deviation of

the feature.

Assume the values for some feature $i$ are $x_i^{(1)}, \ldots, x_i^{(n)}$ where $n$ is the number of data samples. The sample mean $\bar{x}_i$ and sample variance $s_i^2$ of the feature are respectively computed as,

$$\bar{x}_i = \frac{1}{n} \sum_{j=1}^{n} x_i^{(j)} \qquad \text{and} \qquad s_i^2 = \frac{1}{n} \sum_{j=1}^{n} \big(x_i^{(j)} - \bar{x}_i\big)^2. \tag{2.1}$$

Further, the sample standard deviation is the square root of the sample variance and is denoted via $s_i$. With these basic descriptive statistics of the feature available we may standardize the data samples of each feature $i = 1, \ldots, p$ to obtain standardized samples,

$$z_i^{(j)} = \frac{x_i^{(j)} - \bar{x}_i}{s_i} \qquad \text{for } j = 1, \ldots, n. \tag{2.2}$$

Now the standardized data for feature $i$, $z_i^{(1)}, \ldots, z_i^{(n)}$, has a sample mean of exactly 0 and a sample standard deviation of exactly 1. In the case the data samples of the feature are distributed according to a normal distribution, then most standardized samples would lie in the range $[-3, 3]$. Even if the data is not normally distributed, the standardized samples will still lie in the vicinity of this range and are centered about 0.

Such standardization is useful as it places the dynamic range of the model inputs on a

uniform scale and thus improves the numerical stability of algorithms. It also allows us to

use similar models for diﬀerent datasets that may, without standardization, have completely

diﬀerent dynamic ranges. In Section 2.4 we discuss how such standardization can also help

optimization performance.
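The standardization step can be sketched in a few lines. This is an illustrative sketch, assuming the data is held in an n by p NumPy array with one column per feature. The synthetic data and the function name `standardize` are assumptions introduced here.

```python
import numpy as np

def standardize(X):
    """Standardize each column (feature) of X: subtract its mean, divide by its std."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)    # NumPy default ddof=0, i.e. denominator n in the variance
    return (X - mean) / std, mean, std

# Synthetic data: three features on very different scales (values are assumptions).
rng = np.random.default_rng(1)
X = rng.normal(loc=[5.0, -2.0, 100.0], scale=[1.0, 0.1, 25.0], size=(200, 3))

Z, mean, std = standardize(X)
print(Z.mean(axis=0))   # each entry is 0 up to floating point rounding
print(Z.std(axis=0))    # each entry is 1 up to floating point rounding
```

On a computer the "exactly 0" and "exactly 1" properties hold only up to floating point rounding, which is why the check is best done with a tolerance.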

Learning ≈ Optimization

Almost any form of a learning or model training activity involves optimization either explicitly

or implicitly. This is because learning is the process of seeking model parameters that are

“best” for some given task. In fact, all of Chapter 4 is devoted to optimization techniques in

the context of machine learning and deep learning, and a few of the sections of this current

chapter contain aspects of optimization as well.

In some cases optimization is carried out directly on some performance measure that quantifies how well the model at hand performs. This is, for example, the case when one considers the mean square error criterion for regression problems, a concept which we study in detail in this chapter. However in other cases, a loss function is engineered for the problem at hand in a way that minimization of the loss function is a proxy for minimization of the actual performance measures that are of interest. This is, for example, the case when considering classification problems and aiming to get the most accurate classifier. In such a case, optimization is typically not carried out directly on the accuracy measure but rather on a loss function such as the mean square error, or cross entropy defined in Chapter 3. In any case, the design of loss functions as part of the learning procedure is central to machine learning and deep learning and appears throughout this book.

[Footnote to the sample variance in (2.1): In a statistical context one often uses $n - 1$ in the denominator of the sample variance instead of $n$. For non-small $n$ this distinction is insignificant.]

[Footnote on the normal distribution: A few attributes of the normal distribution, also known as the Gaussian distribution, are in Appendix B.]

Note that in some cases, machine learning algorithms such as decision trees do not directly

specify an optimization procedure but rather execute a predeﬁned algorithm for ﬁtting

a model. However, even in such cases, there is typically an inherent hidden optimization

problem associated with the procedure. Hence in general we can think of “learning” as the

process of carrying out some sort of “optimization”.

2.2 Supervised Learning

We now focus on supervised learning and outline key concepts, practices, and terminology. Supervised learning is about predicting an outcome $\hat{y}$ for $y$, where the prediction is based on a vector of input features $x$. When $y$ is from a finite discrete set then the task is called classification and when $y$ is a continuous variable then the task is called regression.

We begin with overviewing basic regression in the context of linear models and feature

engineering. We then discuss aspects of binary classification which is the case when $y$ only

attains one of two possibilities. The more general multi-class classification case in which $y$

takes on multiple possibilities is presented as part of specific examples in Section 2.3. We

close this section with a high level overview of several methods and general approaches to

supervised learning.

Regression and Feature Engineering

We begin by considering a very simple univariate example where the scalar ($p = 1$) feature $x$ is the average number of rooms per dwelling and $y$ is the median value of owner-occupied homes in thousands of dollars. These variables, respectively denoted rm and medv, represent data from the well-known Boston housing dataset. A regression model $f_\theta(x)$ attempts to predict the median house price as a function of the average number of rooms.

To illustrate this concept, consider the well-known simple linear regression model where $f_\theta(x) = \beta_0 + \beta_1 x$ and the parameter vector is $\theta = (\beta_0, \beta_1)$. Notice that in this case the dimension of the parameter vector is $d = 2$. This model can also be described statistically via,

$$y = \beta_0 + \beta_1 x + \epsilon, \tag{2.3}$$

where $\epsilon$ represents the error or noise term as it models the gap between $y$ and $f_\theta(x)$. In statistical theory and practice, assumptions about the probability distribution of $\epsilon$ go a long way as they support inference outputs such as confidence bands, hypothesis tests, and more. However in practical machine learning culture, one often ignores $\epsilon$ and such statistical assumptions.

[Footnote: This dataset, originally published in [161], has 506 observations where each observation is associated with a suburb or town in the Boston, Massachusetts area. Of these observations, 16 are capped at $y = 50$ and we remove these to stay with $n = 490$ observations.]

Provided feature data $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ (rm: average number of rooms per house in a geographical area), and corresponding label data $y^{(1)}, y^{(2)}, \ldots, y^{(n)}$ (medv: median house prices in a geographical area), the training process involves finding a suitable or best $\theta = (\beta_0, \beta_1)$. In Section 2.3 we study the process for finding $\theta$ via minimization of a loss function, yet at this point let us just consider the model parameters, also known as parameter estimates $\hat{\theta}$, as an outcome of training.
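To make the idea of parameter estimates as an outcome of training concrete, here is a hedged sketch that fits a simple linear model by least squares. It assumes a squared error loss, which the text only introduces in Section 2.3. It uses synthetic data rather than the housing data, and the "true" parameter values below are arbitrary assumptions.

```python
import numpy as np

# Synthetic (x, y) pairs; the generating parameters are assumptions for illustration.
TRUE_BETA0, TRUE_BETA1 = -30.0, 8.0
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(4.0, 9.0, size=n)
y = TRUE_BETA0 + TRUE_BETA1 * x + rng.normal(0.0, 2.0, size=n)

# Design matrix with a column of ones so that the intercept beta_0 is estimated too.
A = np.column_stack([np.ones(n), x])
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)   # least squares estimate
beta0_hat, beta1_hat = theta_hat

def f_hat(x_new):
    """Prediction of the fitted model for a new (unseen) feature value."""
    return beta0_hat + beta1_hat * x_new

print(beta0_hat, beta1_hat, f_hat(6.0))
```

The estimates land near the generating values but not on them, since the noise term perturbs the fit.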

Figure 2.3: Examples of elementary linear models. (a) Median house prices per locality (medv) as a function of average number of rooms per dwelling (rm) is described via a simple linear (affine) relationship. (b) House prices as a function of lower status of the population in % (lstat) is not described well with a linear relationship (red), but by introducing an additional quadratic engineered feature it is described well via a three parameter linear model resulting in a quadratic fit (blue). [Panel (a) axes: average number of rooms per dwelling (rm) against house prices in thousands of dollars (medv). Panel (b) axes: lower status of the population in % (lstat) against house prices in thousands of dollars (medv).]

Figure 2.3 (a) presents a scatter of the $(x^{(i)}, y^{(i)})$ pairs. The parameters estimated for this model are $\hat{\beta}_0 = -30.01$ and $\hat{\beta}_1 = 8.27$. The figure also includes a plot of the fit or estimated model $f_{\hat{\theta}}(x)$ as a red line. Clearly, with such a model, any unseen (new) observation $x$ can be used to make a prediction $\hat{y} = f_{\hat{\theta}}(x)$. If one is willing to make statistical assumptions about the error $\epsilon$ and probabilities of error then an extra benefit of the model is the confidence bands, presented as the gray shaded area around the red line. Most of the modelling described in this book uses very complex models that do not take such a statistical approach and hence built-in inference outputs such as these confidence bands are often not available.

We also mention that the estimated parameters in such a model have an interpretation. For example $\hat{\beta}_1 = 8.27$ indicates that increasing the average number of rooms by one room implies an average rise in median price of \$8,270. Many types of statistical models have the

beneﬁt of interpretable parameters, yet in the world of machine learning where models are

often very complex, parameter interpretation is an exception rather than the rule.

As a follow up example arising from the same dataset, consider the relationship between the variable $x$ taken as the percentage of the population that is of a low social economic status (lstat) and the variable $y$ taken again as medv. A simple linear model fit to this will yield parameter estimates $\hat{\theta}$ since in almost any case one can fit any model to data. However

the model may not always be suitable for the data or process at hand. For example, in


Figure 2.3 (b), a scatter plot of the data is presented and it is apparent that the downward

sloping linear model fit in red does not do a good job in describing the relationship between $x$ and $y$. Observe also that confidence bands for this simple linear model may look deceptively appealing (tight gray bands around the red line) especially if one was only to look at these and not the actual scatter plot of the data. The pitfall is that these confidence bands are computed under the assumption that the model fits the underlying process and data well, a case that does not hold here. Such a phenomenon is often loosely called model misspecification

and is one of the risks that one undertakes (and needs to mitigate) when using statistical

inference techniques.

An alternative to the simple linear model is to seek a richer relationship such as,

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon. \tag{2.4}$$

One way of describing this relationship is via the function $f_\theta(x)$ defined for the three dimensional $\theta$ via, $f_\theta(x) = \beta_0 + \beta_1 x + \beta_2 x^2$ where $x$ is still a scalar ($p = 1$). An alternative description of the same model is to consider the squaring of the lstat variable as a new engineered feature and thus now consider $x$ as a $p = 2$ dimensional vector with $x_1$ being the original feature and $x_2 = x_1^2$ the new engineered feature. In this case the model function is linear (affine), $f_\theta(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, and the fact that $x_2$ is the square of $x_1$ is considered a feature engineering aspect and not a model function aspect. In practice, in this case, both approaches are identical and yield the same $\hat{\theta}$. Figure 2.3 (b) presents the fit and corresponding error bands of this quadratic model in blue.
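The equivalence of the two descriptions can be checked numerically. The sketch below assumes a least squares fit and uses synthetic data whose generating coefficients are arbitrary assumptions, not estimates from the housing data. It fits the quadratic once as a polynomial in a scalar feature and once as a linear model on the engineered feature vector.

```python
import numpy as np

# Synthetic data with a curved relationship (all numbers are assumptions).
rng = np.random.default_rng(2)
n = 300
x = rng.uniform(2.0, 35.0, size=n)
y = 40.0 - 2.0 * x + 0.04 * x**2 + rng.normal(0.0, 3.0, size=n)

# View 1: scalar feature x with a model function that is quadratic in x.
# np.polyfit returns the highest degree first, so reverse to (beta0, beta1, beta2).
theta_poly = np.polyfit(x, y, deg=2)[::-1]

# View 2: engineered features (x1, x2) = (x, x^2) with a model linear in the features.
X = np.column_stack([np.ones(n), x, x**2])
theta_lin, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_poly)
print(theta_lin)
```

Both views solve the same least squares problem, so the two parameter vectors agree up to numerical precision.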

We mention that linear models for regression, which are the workhorse of classical statistics,

can be extended in many ways. The notes at the end of this chapter point at some extensions

such as Generalized Linear Models (GLM), mixed models, and more. Note also that non-linear

relationships in a regression context could be explored using smoothing techniques.

Popular techniques in this framework are the Generalized Additive Model (GAM), the Locally

regression. We also mention that generally when one considers feature engineering, one very

important aspect is dealing with interaction terms. This means creating new engineered

features that are based on the products of other features.

The world of machine learning has adopted these models, often removing statistical assumptions

(e.g. about the noise $\epsilon$) while introducing additional non-linearities and mechanisms

that yield very expressive models. The deep neural networks that we cover in this book

include one such rich class of examples. We also mention that while the numerical house

price context that we presented here appears to be simple and low dimensional, regression

problems can often involve extremely high dimensional input feature vectors. For example,

any regression problem where the input data is an image is of this nature. A concrete

example of such a case is using images of a human face to predict the age of the person.

Binary Classiﬁcation

Moving on from regression problems where $y$ is continuous, we now consider binary classification where $y$ attains one of two values, which are sometimes referred to as positive and negative. There are dozens of machine learning methods for binary classification and our purpose here is not to explore how these methods work. Instead we wish to illustrate how their performance is quantified.


Our exposition relies on a logistic regression based classifier where positive samples are encoded via $y = 1$ and negative samples are encoded via $y = 0$. Note that, in other scenarios positive and negative samples are sometimes encoded via $y = +1$ and $y = -1$, respectively. Logistic regression is explored in depth in Sections 3.1 and 3.2 of Chapter 3. For now, we can treat such a classifier as being based on a function $f_\theta(x)$ where $x$ is the feature vector. The output of $f_\theta(x)$ is a number in the continuous range $[0, 1]$ indicating the probability that $x$ matches a positive label. Hence, the higher the value of $f_\theta(x)$, the more likely it is that the label associated with $x$ is $y = 1$.

With the model $f_\theta(x)$ at hand, a classifier can be constructed via a decision rule based on a threshold $\tau$, with the predicted output being,

$$\widehat{Y} = \begin{cases} 0 \ (\text{negative}), & \text{if } \hat{y} \le \tau, \\ 1 \ (\text{positive}), & \text{if } \hat{y} > \tau, \end{cases} \qquad \text{where} \quad \hat{y} = f_\theta(x). \tag{2.5}$$

In many cases one selects the threshold at $\tau = 0.5$. However as we see below, $\tau$ can often be adjusted. Also note the notational difference between $\widehat{Y}$ and $\hat{y}$. Throughout the book, we use the former to signify the actual predicted label for classification problems whereas we use the latter to denote the output of the model which is usually (continuous) numerical in nature.
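The decision rule (2.5) amounts to a single comparison against the threshold. As a small sketch, the function name `classify` and the model outputs below are made-up values for illustration, not outputs of a trained model.

```python
import numpy as np

def classify(y_hat, tau=0.5):
    """Decision rule (2.5): predict 1 (positive) if y_hat > tau, else 0 (negative)."""
    return (np.asarray(y_hat) > tau).astype(int)

# Hypothetical model outputs in [0, 1].
y_hat = np.array([0.05, 0.40, 0.50, 0.51, 0.97])

print(classify(y_hat))            # [0 0 0 1 1]
print(classify(y_hat, tau=0.3))   # [0 1 1 1 1]
```

Note that an output exactly equal to the threshold is classified as negative, matching the non-strict inequality in the first case of (2.5). Lowering the threshold turns more samples positive.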

As an example we consider breast cancer prediction where the label, or outcome variable $y$ has $y = 1$ in case of malignant lumps and $y = 0$ in case of benign lumps. A popular dataset in this context is called the Wisconsin Breast Cancer Dataset and is based on clinical data released in the early 1990s. We make further use of this dataset in Section 3.1 where we dive into logistic regression. The feature vector $x$ is of dimension $p = 30$ and is composed of continuous variables such as radius_mean, texture_mean, etc., each potentially affecting the probability of malignancy. The data has $n = 569$ observations and here we use the 80-20 splitting strategy where we split it randomly between training data and testing data to have $n_{\text{train}} = 456 \approx 0.8 \times n$ training observations and $n_{\text{test}} = 113 \approx 0.2 \times n$ testing observations. Note that in some of the sections below, we simply use $n$ to denote $n_{\text{train}}$ when not considering the test set explicitly.

We may train different forms of logistic regression for this data where we take $x$ to be some subset or transformation of the original feature vector. At one extreme we can take $x$ to be a single variable, and at another extreme $x$ can be the full feature vector or even have additional engineered features similar to the home pricing example above.

Here for simplicity we consider two logistic regression models. For the first model we use only a single feature in $x$ which is smoothness_worst where the actual physical meaning of this variable is not critical for our understanding at this point. This model is denoted $f_{\theta;\, p=1}(x)$. In the second model we consider all 30 features in the dataset. This model is denoted $f_{\theta;\, p=30}(x)$. For each of these models we use the $n_{\text{train}}$ observations to obtain an estimated parameter vector where in the case of $p = 1$, the estimated parameter vector $\hat{\theta}$ is of dimension $d = 2$ and in the case of $p = 30$ the estimated parameter vector $\hat{\theta}$ is of dimension $d = 31$. Details on the actual meaning of the models and parameters of logistic regression are in Chapter 3. With these models at hand, at first let us fix $\tau$ and consider several performance measures of the classifiers defined via (2.5).

Footnote (on the Wisconsin Breast Cancer Dataset): See https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(Diagnostic) for more information and relevant references.


The standard way to compute a binary classification performance measure is to evaluate the classifier on the test set. This then allows us to consider the predictions $\widehat{\mathcal{Y}}^{(i)}$ and compare them to the test set labels $y^{(i)}$, for $i = 1, \ldots, n_{\text{test}}$. Observations where $\widehat{\mathcal{Y}}^{(i)} = y^{(i)}$ are counted as either True Positive (TP) or True Negative (TN), depending on the value of $y^{(i)}$ being 1 or 0 respectively. Similarly, observations where $\widehat{\mathcal{Y}}^{(i)} \neq y^{(i)}$ are counted as either False Positive (FP) or False Negative (FN), depending on the value of $y^{(i)}$ being 0 or 1 respectively. These four counts, TP, TN, FP, and FN total up to $n_{\text{test}}$ and are customarily summarized in the 2 by 2 confusion matrix:

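| | Decision: Decide 0 | Decision: Decide 1 | Row ratio |
|---|---|---|---|
| Reality: Label 0 | True Negative (TN) | False Positive (FP) | Specificity = TN / (TN + FP) |
| Reality: Label 1 | False Negative (FN) | True Positive (TP) | Sensitivity/Recall = TP / (TP + FN) |
| Column ratio | Negative Predictive Value (NPV) = TN / (TN + FN) | Precision/Positive Predictive Value (PPV) = TP / (TP + FP) | |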

Observe the ratios at the margins of the matrix that present various performance indicators,

namely sensitivity (also known as recall), speciﬁcity, precision (also known as positive

predictive value), and negative predictive value. In general, we would like all of these ratios

to be as close to 1 as possible, however as we describe below, there are often tradeoffs.

An additional natural performance measure is the accuracy. It is simply deﬁned as the

proportion of correctly classiﬁed samples, namely,

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{n_{\text{test}}}. \qquad (2.6)$$

This is often the first performance measure one considers, yet for unbalanced data it can be an extremely misleading measure. Indeed, a degenerate classifier that always predicts the most abundant class will have an accuracy equal to the proportion of that class. For example in binary classification if the positive class constitutes only 5% of the samples, then the degenerate classifier which always predicts $\widehat{\mathcal{Y}} = 0$ will have an accuracy of 95%. Then, even when considering a non-degenerate classifier, one observes an accuracy that is typically not worse than 95%, and if for example the accuracy is 99% it is still not an indication that the classifier works well.
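The 5% example can be checked numerically. The following is a minimal sketch, assuming NumPy and a made-up label vector of 100 samples (not the breast cancer data); it shows that the degenerate classifier reaches 95% accuracy while detecting none of the positive samples.

```python
import numpy as np

# Made-up labels: 5 positive samples and 95 negative samples (5% positive).
y_true = np.array([1] * 5 + [0] * 95)

# Degenerate classifier: always predict the most abundant class, 0.
y_pred = np.zeros_like(y_true)

# Proportion of correctly classified samples.
accuracy = np.mean(y_pred == y_true)

# Proportion of positive samples that were detected (sensitivity/recall).
recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The high accuracy here says nothing about the ability to detect the positive class, which is why the ratios at the margins of the confusion matrix are examined as well.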

A more robust analysis of performance considers the competing objectives of sensitivity and specificity, where a high sensitivity (or recall) value indicates a good ability of the classifier to detect positive samples and a high specificity value indicates a good ability to detect negative samples. In the case of a threshold based classifier as in (2.5), varying $\tau$ alters the confusion matrix and thus the sensitivity and specificity values are modified as well.

A parametric curve based on $\tau$, known as the Receiver Operating Characteristic (ROC) curve, is often used to visualize this tradeoff between sensitivity and specificity where the x-axis plots one minus the specificity, also known as the false positive rate, and the y-axis plots the sensitivity. Observe from (2.5) and the formulas at the margins of the confusion matrix, that as $\tau \to 0$ the number of false negatives vanishes and hence the sensitivity approaches 1. Similarly as $\tau \to 1$ the number of false positives vanishes and hence the specificity approaches 1. More generally, as we vary $\tau$ between 0 and 1 a tradeoff emerges as captured in the ROC curve. This allows us to tune the threshold $\tau$ for balancing sensitivity and specificity, depending on the problem at hand. Figure 2.4 presents (smoothed) ROC curves for the breast cancer example where we compare the ROC curves for the $f_{\theta;\, p=1}(\cdot)$ and $f_{\theta;\, p=30}(\cdot)$ models, as well as a "coin flip" model (chance line) and the "ideal" model.

Footnote (on the accuracy formula (2.6)): Note that the formula here is for binary classification, yet this performance measure is also used for multi-class classification where the numerator TP+TN needs to be replaced by the total number of correctly classified samples.

[Figure 2.4: plot with the false positive rate (0.0 to 1.0) on the x-axis and the sensitivity on the y-axis; curves labeled Perfectly Separable, Univariate, Full model, and Chance Line.]

Figure 2.4: Receiver operating characteristic (ROC) curves for the breast cancer data. One model is a univariate model, and the other is a full model. A chance line (guessing model) and a perfectly separable line (ideal model) are also plotted. For each model, the ROC captures the tradeoff between the sensitivity and the false positive rate (one minus specificity).

Receiver operating characteristic curves allow us to assess the quality of models taking all possible threshold parameters into account. A related measure that summarizes the quality of a curve in a single number is the area under the curve (AUC) measure. For a classifier with an ROC curve that achieves perfect sensitivity under any level of specificity this measure is at 1 and corresponds to the perfectly separable green curve in Figure 2.4. However for classifiers that just choose a random class, this measure is at 0.5, corresponding to the red chance line in Figure 2.4. In the case of the breast cancer data we see that on the test set the AUC for the $f_{\theta;\, p=1}(\cdot)$ model is 0.70 and for the $f_{\theta;\, p=30}(\cdot)$ model it is at 0.92. This may give an indication that the additional features in the richer model help obtain a better predictor.
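To make the threshold sweep behind the ROC curve and the AUC concrete, here is a minimal sketch assuming NumPy. The function names `roc_points` and `auc`, the toy labels and scores, and the convention that a sample is classified as positive when its score is at least the threshold are all assumptions made for illustration; the area is approximated with the trapezoidal rule.

```python
import numpy as np

def roc_points(y_true, scores):
    """Sweep the threshold tau from high to low and collect ROC points.

    Assumption: an observation is classified as positive when score >= tau.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    # Start above every score (nothing predicted positive), then visit each score.
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    fpr, tpr = [], []
    for tau in thresholds:
        predicted_positive = scores >= tau
        tpr.append(np.sum(predicted_positive & (y_true == 1)) / n_pos)  # sensitivity
        fpr.append(np.sum(predicted_positive & (y_true == 0)) / n_neg)  # 1 - specificity
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Made-up scores: one negative sample (score 0.6) outranks one positive sample (0.4).
y_true = [0, 1, 0, 1]
scores = [0.1, 0.4, 0.6, 0.9]
fpr, tpr = roc_points(y_true, scores)
area = auc(fpr, tpr)

# Perfectly separable toy case: every positive scores above every negative.
area_separable = auc(*roc_points([0, 0, 1, 1], scores))
```

In the first toy case three of the four (positive, negative) score pairs are ordered correctly, which matches the area of 0.75, while the separable case gives an area of 1.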

Let us now fix the threshold at $\tau = 0.5$ and compare a few more performance measures. The test accuracy in this case is 0.73 for the $f_{\theta;\, p=1}(\cdot)$ model and 0.89 for the $f_{\theta;\, p=30}(\cdot)$ model. However, since the number of positive samples in the test set is 40 (out of $n_{\text{test}} = 113$), we see that this dataset is somewhat unbalanced and hence accuracy is not a good measure of performance. In such cases, machine learning practice typically focuses on both precision and recall (sensitivity). Note that this could have alternatively been a focus on specificity and sensitivity (recall), but in machine learning the precision-recall pair is more popular. Precision, similarly to specificity, approaches 1 as the number of false positives (FP) approaches 0. However precision is based on the number of true positives (TP), while specificity is based on the number of true negatives (TN).


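For the $f_{\theta;\, p=1}(\cdot)$ model with $\tau = 0.5$ we have,

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = 0.70, \qquad \text{and} \qquad \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = 0.4,$$

and for the $f_{\theta;\, p=30}(\cdot)$ model with $\tau = 0.5$ we have Precision $= 0.82$ and Recall $= 0.9$.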

A popular way to consider both precision and recall is by averaging them using the harmonic mean of the two values. This is called the $F_1$ score and is computed as follows:

$$F_1 = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}} = 2\, \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \qquad (2.7)$$

In our example for the $f_{\theta;\, p=1}(\cdot)$ model with $\tau = 0.5$ we have, $F_1 = 50.8\%$ and for the $f_{\theta;\, p=30}(\cdot)$ model with $\tau = 0.5$ we have $F_1 = 85.7\%$. Note that one may also use $F_1$ scores to calibrate the threshold $\tau$. Sometimes one uses a generalization of $F_1$ called the $F_\beta$ score where $\beta$ determines how much more important recall is in comparison to precision. However, in general, if there is not a clear reason to price false positives and false negatives differently, then using the $F_1$ score as a single measure of performance is sensible.
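The counts and ratios discussed above can be computed directly from label and prediction vectors. The following is a minimal sketch assuming NumPy; the helper name `binary_metrics` and the toy vectors are hypothetical and are not taken from the breast cancer example.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and the derived ratios for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "npv": tn / (tn + fn),
        # Harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
    }

# Toy labels and predictions, made up for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

Here the accuracy is 0.7 while the recall is only 0.5, so the $F_1$ score of $4/7 \approx 0.57$ gives a more sober summary. The sketch does not guard against empty denominators, for example when no sample is predicted positive.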

In general, cases of unbalanced data should be treated with caution not just in terms of

performance measures and threshold calibration, but also in terms of inference. There are

multiple techniques for handling unbalanced data, some of which include over-sampling or

under-sampling to balance the data. One of the more popular techniques is called synthetic

minority oversampling technique (SMOTE). See the notes and references for further details.

Approaches and Algorithms for Supervised Learning

We cannot cover all aspects of supervised learning in a single section, a single chapter, or

even in a single book. Yet now, after getting a taste of supervised learning in the context of

regression and classiﬁcation, let us discuss a few general approaches for supervised learning.

Towards that end we first distinguish between discriminative models and generative models. Most of the models in this book are discriminative. This means that when viewed through a probabilistic lens (even though we mostly do not do that), they are based on learning aspects of the distribution $P(y \mid x)$, i.e. the conditional distribution of the label $y$ given the feature vector $x$. This is the case for linear models, logistic regression, general neural networks, and multiple additional models and algorithms that are not covered in this book.

In contrast, generative models involve learning the joint distribution $P(x, y)$. As a byproduct, knowledge of $P(x, y)$ also means knowing the marginal distributions $P(x)$ and $P(y)$ as well as the conditional distributions $P(y \mid x)$ and $P(x \mid y)$. Hence generative models consider all of the data relationships as being learned, not just $P(y \mid x)$. In this book, the most prominent appearance of generative models is in the context of variational autoencoders, diffusion models, and generative adversarial networks, appearing in Chapter 8. Another elementary generative type of model that we do not cover is the famous naive Bayes classifier, most notably known for early success of e-mail spam detection applications. One more common type of generative model is linear discriminant analysis (LDA) used in many experimental statistical contexts.


Naive Bayes classifiers are based on certain independence assumptions for $P(x, y)$. We assume that given the label $y$, all features $x_1, \ldots, x_p$ are mutually independent. This (naive) independence assumption then allows us to represent the likelihood function of the sample easily and in turn this enables efficient generative learning. LDA-based classifiers are also generative models (even though the name "discriminant" might be misleading). These classifiers fit a multivariate normal model to the data to carry out classification based on linear boundaries. Further details of both of these classifier models and algorithms are beyond our scope.

In terms of discriminative models, the linear models, logistic regression models, and more

general deep learning models in this book are very common examples. Linear models and

logistic regression models are simple deep neural networks. Linear models are studied in detail

in this chapter. Logistic regression models and generalizations are the focus of Chapter 3,

general fully connected deep learning models are the focus of Chapter 5, and other specialized

deep learning models are in chapters 6 and 7. Beyond these deep learning models, other types

of popular machine learning models that we do not cover in the book include support vector

machines (SVM), decision trees, and their generalizations, which include random forests

as well as gradient boosted trees. There are also additional elementary models often used

for instruction of machine learning such as the class of K-nearest neighbours classiﬁcation

models. Indeed, the world of machine learning is vast with ideas and algorithms for creating

both discriminative and generative models. The notes and references at the end of this

chapter point at key resources.

2.3 Linear Models at Our Core

In this section, we focus on linear models which are the basis for many other models including

deep neural networks. This is the ﬁrst model in the book where we explicitly use a loss

function for learning. The basic principles of the linear model and the associated loss function

alternatives extend to more advanced models covered in the sequel. Similarly, other concepts

that we cover here in the context of the linear model, such as the treatment of categorical

variables and aspects of multi-class classiﬁcation, are also relevant for more advanced models.

For the linear model, let us consider a feature vector $x = (x_1, \ldots, x_p) \in \mathbb{R}^p$ and the output/response variable $y \in \mathbb{R}$. The linear model links the output $y$ to the features $x$ through

$$y = b + w^\top x + \epsilon, \qquad (2.8)$$

where the scalar parameter $b$ is called the intercept or bias, the vector parameter $w$ is called the regression parameter or weight vector, and the $\epsilon$ term represents the noise or error. This is a generalization of the simple linear regression model (2.3) allowing $x$ to be a vector and we now denote $\beta_0$ via $b$. To facilitate the presentation of key concepts of linear models we often use a more compact representation of the model,

$$y = \begin{bmatrix} 1 & x_1 & \cdots & x_p \end{bmatrix} \begin{bmatrix} b \\ w_1 \\ \vdots \\ w_p \end{bmatrix} + \epsilon = \theta^\top \tilde{x} + \epsilon, \qquad (2.9)$$

where $\theta = (b, w_1, \ldots, w_p)$ encapsulates both $b$ and $w$ and the feature vector $x$ is extended to $\tilde{x}$ via a constant unit in its first position. The dimension of $\theta$ is the number of parameters and in this case it is $d = p + 1$.

Footnote (on the likelihood function mentioned for naive Bayes classifiers): The likelihood function is a basic statistical concept that we survey in Chapter 3 in the context of logistic regression.

In order to use the linear model for prediction of unseen data we have to learn the model. This means to find appropriate values for the parameters in $\theta$ based on training data such that the model performs well in prediction. Such a suitable learned parameter is further denoted via $\hat{\theta}$ and with such an estimate at hand, a prediction for a new data point $x^\star \in \mathbb{R}^p$ is given by,

$$\hat{y}(x^\star) = \hat{b} + \hat{w}_1 x_1^\star + \ldots + \hat{w}_p x_p^\star = \hat{\theta}^\top \tilde{x}^\star.$$

Learning the Linear Model

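Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$ composed of a collection of $n$ samples. For such data it is convenient to define the $n \times d$ dimensional design matrix $X$ for the features, and the corresponding output response vector $y$, via,

$$X = \begin{bmatrix} | & | & & | \\ \mathbf{1} & x_{(1)} & \cdots & x_{(p)} \\ | & | & & | \end{bmatrix} \quad \text{with} \quad x_{(i)} = \begin{bmatrix} x_i^{(1)} \\ \vdots \\ x_i^{(n)} \end{bmatrix}, \quad \text{and} \quad y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}. \qquad (2.10)$$

Here $\mathbf{1}$ is the column of $n$ ones and $x_{(i)}$ collects the values of feature $i$ over all $n$ samples.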

Using this notation we can express the linear model for all the samples of the training set via

$$y = X\theta + \epsilon,$$

with $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ representing a vector of noise. From this representation given a learned parameter vector $\hat{\theta}$, we can further define the predicted output vector of the model for the input training data via,

$$\hat{y} = X\hat{\theta}, \quad \text{where} \quad \hat{y} = \begin{bmatrix} \hat{y}^{(1)} \\ \vdots \\ \hat{y}^{(n)} \end{bmatrix}.$$

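A suitable value for $\hat{\theta}$ will yield $\hat{y} \approx y$. This closeness is captured via a loss function,

$$C(\theta\,; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} C_i(\theta), \qquad (2.11)$$

where $C_i(\theta) := C_i(\theta\,; y^{(i)}, \hat{y}^{(i)})$ is the loss for the $i$-th data sample. Specifically $\hat{\theta}$ is typically chosen so that the loss function is minimal at the point $\theta = \hat{\theta}$.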

For the linear model, the most popular loss function is the square loss function also called quadratic loss where the loss for each data sample is

$$C_i(\theta) = (y^{(i)} - \hat{y}^{(i)})^2. \qquad (2.12)$$

This loss penalizes each difference $y^{(i)} - \hat{y}^{(i)}$, also known as the error or residual for sample $i$, quadratically. In this case, the loss for the entire training data can be represented in terms of the $L_2$ norm $\|\cdot\|$ of the corresponding error vector $e = y - \hat{y}$,

$$C(\theta\,; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2 = \frac{1}{n} \|y - \hat{y}\|^2 = \frac{1}{n} \|e\|^2.$$

Figure 2.5: Squared loss visualisation for one input feature $p = 1$. The sum of the area of the squares is the loss.

With this notation, by treating the learning of $\hat{\theta}$ as an optimization problem and observing that the objective can be manipulated via monotonic transformations (such as positive scaling or taking a square root, which do not change the minimizer), we can now represent the learned parameter vector as,

$$\hat{\theta} = \underset{\theta \in \mathbb{R}^d}{\operatorname{argmin}}\ \|y - X\theta\|^2 = \underset{\theta \in \mathbb{R}^d}{\operatorname{argmin}}\ \|y - X\theta\|. \qquad (2.13)$$

The search for $\hat{\theta}$ that optimizes (2.13) is known as the least squares problem. Figure 2.5 presents a visual representation of the squared loss in the case of a single input feature ($p = 1$ and $d = 2$). In this case we seek a line specified by $b$ and $w_1$ such that the sum of the squares (total area of blue boxes in the figure) is minimized.


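The least squares solution can be easily derived by first computing the gradient of $\|y - X\theta\|^2$ with respect to $\theta$ using vector and matrix differentiation rules (see Appendix A) as

$$\frac{\partial \|y - X\theta\|^2}{\partial \theta} = \frac{\partial\, (y - X\theta)^\top (y - X\theta)}{\partial \theta} = \frac{\partial \left( y^\top y - 2 y^\top X\theta + \theta^\top X^\top X\theta \right)}{\partial \theta} = -2X^\top y + 2X^\top X\theta. \qquad (2.14)$$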

Then, by setting the gradient to 0, we get the normal equations,

$$X^\top X\theta = X^\top y, \qquad (2.15)$$

which describe vectors $\theta$ that obtain a zero gradient, and, in this case, it can also be shown that they are global minima of the objective (see further discussion in Chapter 4 about global and local minima).

The normal equations have a unique solution when the $d \times d$ matrix $X^\top X$, also known as the Gram matrix of $X$, is invertible. In this case we can represent the estimator as $\hat{\theta} = (X^\top X)^{-1} X^\top y$, or, by setting $X^\dagger = (X^\top X)^{-1} X^\top$, we have,

$$\hat{\theta} = X^\dagger y, \qquad (2.16)$$

where $X^\dagger$ is called the Moore-Penrose pseudo inverse of $X$.
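As a numerical check of the normal equations and the pseudo-inverse form, here is a minimal sketch assuming NumPy and synthetic data. The parameter vector `theta_true`, the noise level, and the new data point `x_new` are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
features = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), features])      # n x d design matrix with d = p + 1
theta_true = np.array([1.0, 2.0, -1.0, 0.5])     # hypothetical (b, w_1, w_2, w_3)
y = X @ theta_true + 0.1 * rng.normal(size=n)    # linear model with Gaussian noise

# Solve the normal equations (X^T X) theta = X^T y; X^T X is invertible here.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent pseudo-inverse form: theta = X^dagger y.
theta_pinv = np.linalg.pinv(X) @ y

# The gradient -2 X^T y + 2 X^T X theta should vanish at the solution.
gradient = -2 * X.T @ y + 2 * X.T @ X @ theta_normal

# Prediction for a new point: prepend the constant 1 and take the inner product.
x_new = np.array([0.2, -1.0, 0.7])
y_new = theta_normal @ np.concatenate(([1.0], x_new))
```

Solving the linear system with `np.linalg.solve` avoids forming $(X^\top X)^{-1}$ explicitly, which is a common numerical preference rather than something the derivation requires.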

In fact, the Moore-Penrose pseudo-inverse, $X^\dagger$, can be represented in different ways. An alternative form to $(X^\top X)^{-1} X^\top$ is based on the singular value decomposition (SVD) of $X$. Here, $X = U \Sigma V^\top$ where $U$ is an $n \times n$ orthogonal matrix, $\Sigma$ is an $n \times d$ matrix with non-zero elements only on the main diagonal, and $V$ is a $d \times d$ orthogonal matrix. Using the SVD we can represent the Moore-Penrose pseudo-inverse as

$$X^\dagger = V \Sigma^{+} U^\top, \qquad (2.17)$$

where $\Sigma^{+}$ is the $d \times n$ matrix that contains the reciprocals of the non-zero (main diagonal) elements of $\Sigma$, and has 0 values elsewhere. This SVD-based representation holds whether or not $X^\top X$ is singular. Hence (2.17) can be viewed as the more general representation of the pseudo-inverse. Note that $X^\top X$ is non-singular if and only if the matrix $X$ is a full column rank matrix (i.e., the columns of $X$ are linearly independent).

Note that, if $X$ is not full column rank (i.e. $X^\top X$ is singular), then there is not a unique solution to the normal equations (2.15). However, the solution given via (2.16), using the SVD form (2.17), has a minimal norm for $\theta$ out of all possible solutions.
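The minimal norm property can be illustrated on a small rank deficient design matrix. The following is a minimal sketch assuming NumPy; the helper `pinv_via_svd`, its tolerance `tol` for deciding which singular values count as non-zero, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pinv_via_svd(X, tol=1e-10):
    """Pseudo-inverse V Sigma^+ U^T built from the full SVD X = U Sigma V^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=True)
    n, d = X.shape
    Sigma_plus = np.zeros((d, n))
    for k, sigma in enumerate(s):
        if sigma > tol:                      # invert only the non-zero singular values
            Sigma_plus[k, k] = 1.0 / sigma
    return Vt.T @ Sigma_plus @ U.T

rng = np.random.default_rng(1)
x1 = rng.normal(size=20)
# The third column is exactly twice the second, so X is not full column rank.
X = np.column_stack([np.ones(20), x1, 2.0 * x1])
y = 3.0 + x1 + 0.1 * rng.normal(size=20)

theta_min = pinv_via_svd(X) @ y

# Adding a vector from the null space of X gives another solution of the normal equations.
null_vec = np.array([0.0, 2.0, -1.0])        # X @ null_vec = 2*x1 - 2*x1 = 0
theta_other = theta_min + 0.7 * null_vec

norm_min = np.linalg.norm(theta_min)
norm_other = np.linalg.norm(theta_other)
```

Both `theta_min` and `theta_other` satisfy the normal equations and give identical predictions on the training data, yet `theta_min` has the smaller norm.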

In the context of high dimensional data when the number of features $p$ is greater than the number of samples $n$, the design matrix $X$ is never full column rank. This issue also appears when some of the features are linear combinations of the others. Even if $X$ is mathematically full column rank, in some situations there is (strong) multicollinearity among the features, meaning that some of the features are approximately linear combinations of the others. This yields an $X^\top X$ matrix that is ill-conditioned and difficult to invert. In all these cases, the SVD-based representation (2.17) is useful for using in (2.16) to obtain a solution for (2.15).

Footnote (on the SVD): Note that the form of SVD presented here is sometimes called the full SVD. A different form called the reduced SVD is used in Section 2.6 in the context of PCA.

Other Loss Functions

A first appealing result for the choice of the squared error loss function (2.12) is the closed form solution (2.16) for (2.15). Also, when it is assumed that $y$ is measured with uncorrelated Gaussian noise $\epsilon$, using the squared error loss function (2.12) is equivalent to using a solution derived by the maximum likelihood estimation method. This is a technique widely used in statistics for parameter estimation which is further discussed in Section 3.1 in the context of logistic regression.

A second popular loss function for the linear model is the absolute error loss,

$$C_i(\theta) = |y^{(i)} - \hat{y}^{(i)}|. \qquad (2.18)$$

It is known to be more robust to outliers, however, even in the case of the linear model, there is no closed-form solution for estimating the parameters. From a statistical point of view, minimizing the absolute error loss is equivalent to maximum likelihood estimation when assuming a Laplace distribution for the error noise $\epsilon$. The Laplace distribution has heavier tails than the Gaussian distribution, see Figure 2.6 (b). With these tails, large noise values, i.e. outliers, are more probable than with the Gaussian noise and hence the loss (2.18) is typically more robust to extreme values.

A third alternative, which is a hybrid between the absolute error loss and squared error loss, is the Huber error loss. It is parameterized by a hyper-parameter $\delta$ and represented via,

$$C_i(\theta) = \begin{cases} \frac{1}{2} (y^{(i)} - \hat{y}^{(i)})^2, & \text{if } |y^{(i)} - \hat{y}^{(i)}| < \delta, \\[4pt] \delta \left( |y^{(i)} - \hat{y}^{(i)}| - \frac{1}{2}\delta \right), & \text{otherwise.} \end{cases} \qquad (2.19)$$

This loss function penalizes small errors quadratically and deals with outliers by penalizing larger errors similarly to the absolute error loss. Figure 2.6 (a) provides a visual representation of these three loss functions. Even if it appears to be an appealing tradeoff between the absolute error loss and squared error loss, the Huber loss has the disadvantage of having to calibrate the arbitrary extra hyper-parameter $\delta$ and, like the absolute error loss, it does not have a closed form solution.
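The three per-sample losses can be compared numerically. The following is a minimal sketch assuming NumPy; it uses the common form of the Huber loss with a factor of one half in the quadratic branch, and the function names, the default `delta=1.0`, and the residual values are assumptions made for illustration.

```python
import numpy as np

def squared_loss(y, y_hat):
    """Square (quadratic) loss for each sample."""
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    """Absolute error loss for each sample."""
    return np.abs(y - y_hat)

def huber_loss(y, y_hat, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = np.abs(y - y_hat)
    return np.where(r < delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

# Made-up residuals, from small errors to an outlier.
y = np.array([0.1, 0.5, 2.0, 10.0])
y_hat = np.zeros_like(y)

sq = squared_loss(y, y_hat)      # the outlier contributes 100
ab = absolute_loss(y, y_hat)     # the outlier contributes 10
hu = huber_loss(y, y_hat)        # the outlier contributes 9.5
```

The outlier with residual 10 dominates the squared loss, while under the absolute and Huber losses its contribution grows only linearly.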

Footnote (on the absolute error loss): Note that the use of the absolute error loss (2.18) in (2.11) is sometimes called the $L_1$ loss as it is related to the $L_1$ norm. Similarly the use of the square loss (2.12) is sometimes called the $L_2$ loss.

Footnote (on the Laplace distribution): This is a probability distribution over $\mathbb{R}$ with density function in the variable $u$, proportional to $e^{-|u - \mu| / b}$ where $\mu \in \mathbb{R}$ is the mean and $b > 0$ is a scaling parameter.

Figure 2.6: Loss function and error distribution alternatives. (a) Squared, absolute, and Huber loss functions. (b) Gaussian (normal) and Laplace error distributions.

Categorical Input Features

We have so far considered numerical input features. We now describe methods for dealing with categorical input features. The methods we present are useful for linear models as well as almost any machine learning and deep learning model.

Before describing the general method of using one-hot encoding, let us highlight two cases that sometimes receive special treatment. One such case is when the categorical feature is binary. For example, assume a feature that only takes on two values, red and blue. In such a case, it can be encoded via 0 and 1 respectively and used as a numerical variable. A second special case is when the categorical feature is an ordinal variable, which may be interpreted as or converted to a numerical value. For example if the feature is a "user satisfaction rating" with values low, medium, and high, it may be transformed to numerical values 0, 1, and 2. Note however that this practice is sometimes problematic since different spacings between the assigned numerical values would yield different interpretations of the features and in general yield different models. For example assigning numerical values of 0, 1, and 4 would indicate a bigger gap between medium and high than between low and medium.

Moving on to the general case of non-binary, nominal, categorical features one can use one-hot encoding, a method that we present now. Denote the number of possible values that the feature attains via $L$ and here for simplicity assume the feature values are $1, \ldots, L$. The idea is to create $L$ binary features in place of the categorical feature where, if the categorical feature is $z$, we construct an $L$-dimensional vector $\tilde{z} = (\mathbf{1}\{z = 1\}, \ldots, \mathbf{1}\{z = L\})$ with $\mathbf{1}\{\cdot\}$ denoting the indicator function taking on 0 or 1. That is, $\tilde{z} = e_z$ where $e_z$ is the unit vector with 1 in the position $z$ and 0 elsewhere; see the unit vector notation defined in Section 1.6. Thus, each one-hot encoded categorical feature is expanded into such a vector and, with this encoding, the new transformed total number of features is,

$$\tilde{p} = p_{\text{num}} + \sum_{i \text{ categorical}} L_i,$$

where $p_{\text{num}}$ is the number of numerical features and $L_i$ is the number of possible values for categorical feature $i$.

To see how this one-hot encoding affects the design matrix, assume for simplicity that in addition to the $p_{\text{num}}$ numerical features there is a single categorical feature with $L$ levels. In this case the $n \times (1 + p_{\text{num}} + L)$ dimensional design matrix is,

$$X = \begin{bmatrix} | & | & & | & | & & | \\ \mathbf{1} & x_{(1)} & \cdots & x_{(p_{\text{num}})} & \tilde{z}_{(1)} & \cdots & \tilde{z}_{(L)} \\ | & | & & | & | & & | \end{bmatrix} \quad \text{with} \quad \tilde{z}_{(j)} = \begin{bmatrix} \tilde{z}_j^{(1)} \\ \vdots \\ \tilde{z}_j^{(n)} \end{bmatrix}.$$

Here $\tilde{z}_{(j)}$ for $j = 1, \ldots, L$ is the vector of indicator variables that marks which observations are at level $j$ for the categorical feature. In statistics, the new $L$ columns $\tilde{z}_{(j)}$ are called indicator or dummy variables. However, in statistics the practice is to include only $L - 1$ dummy variables instead of $L$. One reason for this is that when using $L$ dummy variables the design matrix $X$ will never be a full column rank matrix since the sum of the $L$ dummy columns is equal to the first column of 1s. Thus, traditional statistical practice only includes $L - 1$ dummy variables in the model and the remainder is considered as the reference level. An alternative is to remove the bias term from the model (meaning drop the first column of $X$) and then keep all $L$ dummy variables.
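The rank deficiency caused by keeping all $L$ dummy variables together with the bias column can be verified directly. The following is a minimal sketch assuming NumPy; the helper `one_hot`, the categorical values `z`, and the numerical feature `x_num` are made up for illustration.

```python
import numpy as np

def one_hot(z, L):
    """Map category values 1..L to the unit vectors e_z (rows of the identity)."""
    z = np.asarray(z)
    return np.eye(L)[z - 1]

z = np.array([1, 3, 2, 3, 1, 2])                      # categorical feature, L = 3 levels
x_num = np.array([0.5, 1.2, -0.3, 2.0, 0.1, 1.7])     # one numerical feature
Z = one_hot(z, 3)                                     # n x L matrix of dummy columns

X_full = np.column_stack([np.ones(6), x_num, Z])          # bias + all L dummies
X_ref = np.column_stack([np.ones(6), x_num, Z[:, 1:]])    # bias + L - 1 dummies
X_nobias = np.column_stack([x_num, Z])                    # no bias + all L dummies

rank_full = np.linalg.matrix_rank(X_full)       # 4, although X_full has 5 columns
rank_ref = np.linalg.matrix_rank(X_ref)         # 4, equal to its number of columns
rank_nobias = np.linalg.matrix_rank(X_nobias)   # 4, equal to its number of columns
```

Dropping one dummy column (level 1 acts as the reference level here) or dropping the bias column both restore full column rank.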

Multi-class Classification

Now that we understand the linear model as well as ways of dealing with categorical variables,

let us consider an application of the linear model for multi-class classiﬁcation. We note that

linear models are generally far from the state of the art when it comes to their application

for classiﬁcation problems, yet seeing the linear model applied to classiﬁcation is instructive.

In classification problems each of the labels $y$ takes on one of a finite number of values. The number of possible values is denoted via $K$. When $K = 2$ it is a binary classification problem but generally for $K > 2$ it is a multi-class classification problem. Notationally it is convenient to denote the set of label indices as $\{1, \ldots, K\}$ and consider some bijection between these indices and the actual label values e.g. banana, dog, etc. As a concrete example we consider the MNIST digits dataset where the label values are the digits $0, 1, \ldots, 9$ and notationally we use the label indices $\{1, \ldots, K = 10\}$. Here the obvious bijection shifts by 1.

A general scheme for multi-class classiﬁcation is introduced in Section 3.3 in the context of

multinomial (softmax) regression, and is further employed with other deep learning models

in the chapters that follow. An alternative, which we introduce now in the context of linear

models, is based on the fusion of multiple binary classiﬁcation models into a multi-class

classiﬁer. For this we introduce two general methods, namely one vs. rest (also known as

one vs. all) and one vs. one.

Both of these methods assume we have trained binary classifiers for sub-problems. With the one vs. rest strategy, we assume the availability of models for binary classification $f_{\theta_i}(\cdot)$ for $i = 1, \ldots, K$ where the $i$-th model can discriminate between the label index $i$, treated as positive, and any other label index, treated as negative. With the one vs. one strategy we have $K(K - 1)$ binary classifiers denoted $f_{\theta_{i,j}}(\cdot)$ for all $i, j = 1, \ldots, K$ such that $i \neq j$. Here the $(i, j)$ classifier discriminates between the label being of index $i$ (positive) or index $j$ (negative).

The output range obtained by $f_{\theta_i}(\cdot)$ or $f_{\theta_{i,j}}(\cdot)$ is generally a value on the real number line. Positive outputs indicate positive while negative outputs indicate negative. The farther the model output is from 0 the stronger the confidence of the classification decision. A cutoff in similar nature to (2.5) is to apply the $\operatorname{sign}(\cdot)$ function to the model output and conclude either positive in case of $+1$ or negative in case of $-1$.

Footnote (on the $K(K-1)$ one vs. one classifiers): In practice only half of this number of classifiers is needed because the classifier for $(i, j)$ can be reverted to the classifier $(j, i)$.

Footnote (on the model outputs): In case the classifiers were trained with the 1, 0 encoding as in the case of logistic regression it is easy to transform them.


Now the one vs. rest or one vs. one strategies carry out prediction via,

$$\widehat{\mathcal{Y}} = \begin{cases} \underset{i = 1, \ldots, K}{\operatorname{argmax}}\ f_{\theta_i}(x) & \text{in case of one vs. rest,} \\[6pt] \underset{i = 1, \ldots, K}{\operatorname{argmax}}\ \sum_{j \neq i} \operatorname{sign}\big( f_{\theta_{i,j}}(x) \big) & \text{in case of one vs. one.} \end{cases}$$

The idea of the one vs. rest strategy is to pick the label index which is most probable among the $K$ classification models where each model focuses on a different label index. The idea of the one vs. one classifier is to pick the label $i$ that when compared to the other $K - 1$ labels, was chosen most often. This is achieved via comparison of a summation of $\operatorname{sign}\big(f_{\theta_{i,j}}(x)\big)$ for all other labels $j$. In both cases, one needs to supply rules for handling ties in the argmax for the final decision, yet these details are generally insignificant.

We proceed with an example of using both strategies for the MNIST dataset using a linear model. The crux in creating the supporting binary classifiers is to set the label vector $y$ used in (2.16) to have values of $+1$ for samples that are positive and values of $-1$ for samples that are negative.

For example, when learning the $f_{\theta_3}(\cdot)$ classifier we consider all digit images in the original dataset with the label value 2 as having $y = +1$ and otherwise $y = -1$. Out of the 60,000 MNIST training samples there are 5,958 training samples that satisfy $y = +1$ and then for $y = -1$ there are 54,042 samples. Obtaining the parameters for this classifier using (2.16) uses the design matrix $X$ as in (2.10) which is of dimension $60{,}000 \times 785$ where $785 = 1 + 28 \times 28$. Each row of $X$ corresponds to a different image and each of the columns 2 to 785 corresponds to a different pixel.

Similarly, when learning the $f_{\theta_{3,8}}(\cdot)$ classifier (this compares the digit 2 and the digit 7) used in one vs. one, we set $+1$ for all 5,958 training samples that have a label value of 2 and set $-1$ for all 6,265 samples that have a label value of 7. Here the design matrix $X$ is of dimension $12{,}223 \times 785$ since $5{,}958 + 6{,}265 = 12{,}223$.

To obtain the predictors $f_{\theta_i}$ and $f_{\theta_{i,j}}$ using (2.16) we compute the pseudo inverse of the respective design matrices using (for example) numerical procedures for (2.17). Note that the $f_{\theta_i}$ classifiers require only a single pseudo inverse for all $i$ while $f_{\theta_{i,j}}$ has a different design matrix for each $(i, j)$ pair and hence requires its own pseudo inverse.

Footnote (on the label value 2): Here 2 is the label value that matches label index 3 when using label indexing $\{1, \ldots, K = 10\}$.
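The two strategies can be sketched end to end with least squares binary classifiers trained on $\pm 1$ labels. The following is a minimal illustration assuming NumPy and synthetic two-dimensional data with $K = 3$ well separated classes rather than MNIST; all function names and the data generating choices are hypothetical.

```python
import numpy as np

def with_bias(features):
    """Prepend the constant column of 1s to form the design matrix."""
    return np.column_stack([np.ones(len(features)), features])

def train_one_vs_rest(X, labels, K):
    X_pinv = np.linalg.pinv(X)  # a single pseudo-inverse serves all K classifiers
    return [X_pinv @ np.where(labels == i, 1.0, -1.0) for i in range(1, K + 1)]

def train_one_vs_one(X, labels, K):
    thetas = {}
    for i in range(1, K + 1):
        for j in range(i + 1, K + 1):
            mask = (labels == i) | (labels == j)      # keep only the two classes
            y_pm = np.where(labels[mask] == i, 1.0, -1.0)
            theta = np.linalg.pinv(X[mask]) @ y_pm    # pair-specific pseudo-inverse
            thetas[(i, j)] = theta
            thetas[(j, i)] = -theta                   # the (j, i) classifier is the reverse
    return thetas

def predict_one_vs_rest(thetas, X):
    scores = np.column_stack([X @ theta for theta in thetas])
    return np.argmax(scores, axis=1) + 1              # ties go to the smallest index

def predict_one_vs_one(thetas, X, K):
    votes = np.zeros((len(X), K))
    for (i, _), theta in thetas.items():
        votes[:, i - 1] += np.sign(X @ theta)         # sum of signs over all j != i
    return np.argmax(votes, axis=1) + 1

def make_blobs(rng, per_class):
    """Three Gaussian clusters with label indices 1, 2, 3 (synthetic data)."""
    centers = np.array([[0.0, 4.0], [-4.0, -3.0], [4.0, -3.0]])
    features = np.vstack([c + 0.5 * rng.normal(size=(per_class, 2)) for c in centers])
    labels = np.repeat(np.arange(1, 4), per_class)
    return with_bias(features), labels

rng = np.random.default_rng(0)
X_train, labels_train = make_blobs(rng, 100)
X_test, labels_test = make_blobs(rng, 50)

ovr = train_one_vs_rest(X_train, labels_train, K=3)
ovo = train_one_vs_one(X_train, labels_train, K=3)
acc_ovr = np.mean(predict_one_vs_rest(ovr, X_test) == labels_test)
acc_ovo = np.mean(predict_one_vs_one(ovo, X_test, K=3) == labels_test)
```

Because the least squares solution is linear in the label vector, negating the labels negates the parameters, which is why only half of the one vs. one classifiers need to be trained in this sketch.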


Table 2.1: Confusion matrices for the MNIST digit test set using linear classifiers trained on the training set. (a) One vs. rest achieves an accuracy of 0.8603. (b) One vs. one achieves an accuracy of 0.9297.

Decision

0123456789

Reality

0 944 0 18 4 0 23 18 5 14 15

1 01107 54 17 22 18 10 40 46 11

2 128132363916112

3 2226880172063017

4 231558812422262780

5 710175659170401

6 14 5 42 9 10 23 875 1 15 1

7 21222121408841277

8 71437221139707594

9 10512441705020801

(a)

Decision

0123456789

Reality

0 961 0 9 9 2 7 6 1 7 6

1 01120 18 1 4 5 5 16 17 5

2 139361863121781

3 1312926130032311

4 0110293185111030

5 6152018001913612

6 841017179080100

7 311074219551021

8 022221315218403

9 000523502313920

(b)

Now after training on the 60,000 MNIST training samples and evaluating performance on the 10,000 testing samples, we obtain an accuracy of 0.8603 using one vs. rest and an accuracy of 0.9297 using one vs. one. As MNIST is generally a balanced dataset, the use of accuracy to evaluate performance is sensible, and the level of accuracy obtained is impressive since the linear model is very simple and training these models using the pseudo-inverse computation only takes a few seconds at most. However, to get industrial grade performance one requires more advanced models such as the convolutional neural networks of Chapter 6. As mentioned in the previous chapter, state of the art performance for MNIST is at an accuracy of over 99.8%.

When evaluating multi-class classifiers it is common to use a confusion matrix similar to the $2 \times 2$ confusion matrix presented in Section 2.2 for the case of binary classification. Table 2.1 presents the confusion matrices for both one vs. rest and one vs. one. It is insightful to pick out the entries where non-negligible misclassification occurs. For example with the one vs. rest classifier multiple real digits of 7 were classified as 9. This occurred 77 times. Similarly in the one vs. one case there were 36 misclassifications of the digit 5 as the digit 8.


2.4 Iterative Optimization Based Learning

Linear models coupled with quadratic loss (2.12) are gifted with a closed form solution for the parameter estimate as appearing in (2.16). This solution, which is based on the pseudo-inverse (2.17), is well studied in the field of numerical linear algebra and is typically suited for efficient numerical evaluation, a topic that we do not cover here any further. For example, in the section above, the pseudo-inverse computation of the $60{,}000 \times 785$ design matrix $X$ used for MNIST digit classification using the one vs. rest method can be evaluated in about a second on a modern laptop. Nevertheless, there are multiple scenarios where this closed form solution may not be used. This is the case when we use an alternative loss function to the quadratic loss, such as (2.18) or (2.19). Problems with using the explicit optimal quadratic solution may also arise when the number of features $p$ is very large. In such cases and others, more generic methods for optimization are needed.

We now present a general iterative optimization method for obtaining this solution. It can be used in such scenarios where the pseudo-inverse based computation does not work, yet more importantly it serves to illustrate how gradient based optimization interplays with machine learning. Indeed the more complex deep learning models on which this book focuses use this type of method, and its generalizations, almost exclusively. We note that there are other methods used as well and Chapter 4 focuses entirely on optimization. Our purpose here is simply to introduce the essence of the most basic technique, namely gradient descent.

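Algorithm 2.1: Gradient descent

Input: Dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$, objective function $C(\cdot) = C(\cdot\,; \mathcal{D})$, and initial parameter vector $\theta_{\text{init}}$
Output: Approximately optimal $\theta$

$\theta \leftarrow \theta_{\text{init}}$
repeat
    Compute the gradient $\nabla C(\theta)$
    $\theta \leftarrow \theta - \alpha \nabla C(\theta)$
until $\theta$ satisfies a termination condition
return $\theta$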

This gradient descent procedure, presented in Algorithm 2.1, executes over iterations indexed by $t = 0, 1, 2, \ldots$ and works by taking small steps in the direction opposite to the gradient. That is, it traverses 'downhill', each time trying to descend in the steepest direction. In its simplest form, step sizes are controlled by a fixed $\alpha > 0$ called the learning rate. After some initialization with the vector $\theta^{(0)} = \theta_{\text{init}}$, in each iteration $t$, the next vector $\theta^{(t+1)}$ is obtained via,

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla C(\theta^{(t)}). \qquad (2.20)$$

The algorithm repeats (2.20) where the key object that requires computation at each iteration $t$ is the gradient of the loss function $\nabla C(\theta^{(t)})$. For complex deep learning models this computation is one of the core components of a deep learning framework and is carried out using the backpropagation algorithm studied in Chapter 5. However, for simpler models such as the linear model or logistic regression type models of Chapter 3, we have explicit expressions for the gradient.


In the case of the linear model with quadratic loss, we have already computed the gradient in (2.14) and we can represent it as,

$$\nabla C(\theta) = \frac{2}{n} X^\top (X\theta - y). \qquad (2.21)$$

In general the algorithm iterates over $t$ until $\theta^{(t+1)}$ and $\theta^{(t)}$ are close as measured by some stopping criterion such as,

$$\|\theta^{(t)} - \theta^{(t+1)}\| < \varepsilon, \qquad (2.22)$$

with some fixed $\varepsilon > 0$. Other stopping criteria and variants of this method are studied in detail in Chapter 4. The final $\theta^{(t+1)}$ is used as the output $\theta$ and if $\alpha$ and $\varepsilon$ are well chosen the algorithm output may closely approximate the optimal $\theta$.
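As a minimal sketch of Algorithm 2.1 for the linear model, the following Python code may help make the update rule and the stopping criterion concrete. It assumes the mean squared loss $C(\theta) = n^{-1}\|y - X\theta\|^2$, so that the gradient is $(2/n) X^\top (X\theta - y)$. The function name `gradient_descent`, the iteration budget `max_iter`, and the synthetic data are illustrative assumptions and are not taken from the text.

```python
import numpy as np

def gradient_descent(X, y, alpha, eps=1e-5, max_iter=100_000):
    """Gradient descent for the linear model with quadratic loss.

    Assumes the gradient (2/n) X^T (X theta - y) and stops once
    ||theta_new - theta|| < eps, in the spirit of (2.20) and (2.22).
    """
    n, d = X.shape
    theta = np.zeros(d)                              # theta_init at the origin
    for t in range(1, max_iter + 1):
        grad = (2.0 / n) * X.T @ (X @ theta - y)     # explicit gradient
        theta_new = theta - alpha * grad             # update step
        if np.linalg.norm(theta_new - theta) < eps:  # stopping criterion
            return theta_new, t
        theta = theta_new
    return theta, max_iter                           # budget exhausted

# Hypothetical synthetic data: y = 1 + 2x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])            # design matrix with intercept column
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(200)

theta_gd, iters = gradient_descent(X, y, alpha=0.1)
theta_pinv = np.linalg.pinv(X) @ y                   # closed form via the pseudo-inverse
print(iters, theta_gd, theta_pinv)
```

With a suitable learning rate the iterative output should land close to the pseudo-inverse solution, which is the comparison printed at the end.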

Figure 2.7: A contour plot of the loss function for a simple linear regression problem. The optimal point in green is reached when starting gradient descent at the origin (black point) with $\alpha = 0.0235$. However with the learning rate slightly higher at $\alpha = 0.024$ gradient descent does not converge.

One of the main difficulties with the application of gradient descent is choosing the learning rate $\alpha$. As an illustration, Figure 2.7 presents a contour plot of the squared error loss function associated with simple linear regression similar in nature to the Boston housing price data example of Figure 2.3 (a) where we optimize to find $\theta = (\beta_0, \beta_1)$. When running with $\varepsilon = 10^{-5}$ and starting $\theta_{\text{init}}$ at the origin, we see that if $\alpha = 0.01$ the algorithm terminates near the optimal parameters in 3,109 iterations. If $\alpha = 0.0235$ the algorithm terminates near the optimal parameters quicker with 1,446 iterations yet follows a more jagged path. Finally, when running with the learning rate that is just slightly higher at $\alpha = 0.024$, the algorithm diverges and does not terminate at all. The plot in this divergent case is only for the first 300 iterations where the growing oscillations are still in the vicinity of the optimum. With further iterations the values quickly diverge.


This simple example illustrates that the value of the learning rate $\alpha$ is crucial. Different

values imply drastically diﬀerent behaviours. A more thorough investigation of gradient

descent and its generalizations is in Chapter 4. Interestingly, for linear models more explicit

results are available, as we present now.

Learning Rate Analysis for Linear Models

In general there is not a simple analytical way to determine a suitable learning rate $\alpha$. Chapter 4 presents adaptive generalizations of gradient descent. Yet, universally there are no closed form recipes. Nevertheless, when it comes to the special case of the linear model, analysis of the dynamics of gradient descent is analytically attractive and we may explicitly describe the range of $\alpha$ for which convergence takes place. This description is not necessarily of direct practical use. However, it gives insight into the nature of gradient descent.

Consider the linear model with design matrix $X$ as in (2.10) and denote $\lambda_{\max}$ as the maximal eigenvalue of the Gram matrix $X^\top X$. We can now show that gradient descent converges to a solution of the normal equations (2.15) as long as

$$\alpha < \frac{n}{\lambda_{\max}}. \qquad (2.23)$$

Note that for the data used in Figure 2.7, $\lambda_{\max} = 20{,}670.33$ and $n = 490$. Hence in this case the algorithm converges for $\alpha$ in the range $(0, 0.02371)$ and this bound is in agreement with the examples of Figure 2.7 where the first two paths have $\alpha$ in this range and the third path with $\alpha = 0.024$ diverges.

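To see (2.23) use the gradient expression (2.21) and the gradient update rule of (2.20) to obtain the recursion,

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \frac{2}{n} X^\top (X\theta^{(t)} - y) = \underbrace{\Big(I - \frac{2\alpha}{n} X^\top X\Big)}_{A}\, \theta^{(t)} + \underbrace{\frac{2\alpha}{n} X^\top y}_{b},$$

or in short $\theta^{(t+1)} = A\theta^{(t)} + b$. Such a recursion is known as an affine discrete time linear dynamical system and equilibrium points of such a system, denoted $\theta^{*}$, satisfy $\theta^{*} = A\theta^{*} + b$ or $(I - A)\theta^{*} = b$. In our case, using $A = I - 2\alpha n^{-1} X^\top X$ and $b = 2\alpha n^{-1} X^\top y$, it is evident that such equilibrium points are solutions of the normal equations (2.15). For simplicity let us assume here that $X$ is full column rank and hence there is a unique $\theta^{*}$.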
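The step from the equilibrium condition to the normal equations can be written out in one line. This is a sketch under the assumption that the recursion has $A = I - 2\alpha n^{-1} X^\top X$ and $b = 2\alpha n^{-1} X^\top y$, with the equilibrium point written as $\theta^{*}$ (a notational choice made here):

```latex
(I - A)\,\theta^{*} = b
\;\Longleftrightarrow\;
\frac{2\alpha}{n}\, X^\top X\, \theta^{*} = \frac{2\alpha}{n}\, X^\top y
\;\Longleftrightarrow\;
X^\top X\, \theta^{*} = X^\top y ,
\qquad \text{for any } \alpha > 0 .
```

In particular the equilibrium does not depend on the learning rate; only whether and how fast the iterates approach it does.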

It follows from linear systems theory that the spectral radius of the matrix $A$ determines the convergence or non-convergence of $\theta^{(t)}$ to $\theta^{*}$. Specifically, if the spectral radius of the matrix $A$ is less than unity, then $\theta^{(t)}$ converges to $\theta^{*}$ for any initial $\theta^{(0)}$ and, if the spectral radius is greater than unity, then for almost every initial $\theta^{(0)}$ the sequence diverges (the border case of the spectral radius being 1 is indeterminate). Hence, putting aside the border case of a spectral radius of 1, we see that convergence occurs if and only if the eigenvalues of $A$ are in $(-1, 1)$.

Here "convergence" formally means that for any $\tilde{\varepsilon} > 0$ there is an $\varepsilon > 0$ of (2.22) where the algorithm terminates in a finite number of iterations with $\|\theta^{*} - \theta^{(t+1)}\| < \tilde{\varepsilon}$ and $\theta^{*}$ is a minimizer of the optimization problem.

The spectral radius of a square matrix is the largest of all magnitudes of the eigenvalues.


Now, since $X^\top X$ is a symmetric matrix, the eigenvalues of $X^\top X$ are real and since $X^\top X$ is positive semidefinite (this is a property of any Gram matrix) and, by the full column rank assumption, nonsingular, the eigenvalues lie in the range $(0, \lambda_{\max}]$ with $\lambda_{\max} > 0$. Further, the eigenvalues of $-2\alpha n^{-1} X^\top X$ lie in the range $[-2\alpha n^{-1}\lambda_{\max}, 0)$ and the eigenvalues of $I - 2\alpha n^{-1} X^\top X$ lie in the range $[1 - 2\alpha n^{-1}\lambda_{\max}, 1)$. Thus the critical inequality that ensures that the spectral radius of $A$ is less than 1 is

$$-1 < 1 - \frac{2\alpha \lambda_{\max}}{n},$$

which is equivalent to (2.23).
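The bound can also be checked numerically. The sketch below assumes the matrix $A = I - 2\alpha n^{-1} X^\top X$ from the recursion above and uses hypothetical synthetic data. It computes the spectral radius of $A$ for a learning rate just below and just above $n/\lambda_{\max}$, and then runs a fixed number of gradient descent steps to see whether the iterates approach the solution of the normal equations. The helper names `spectral_radius` and `distance_after` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(3.0, 2.0, size=n)                 # hypothetical feature values
X = np.column_stack([np.ones(n), x])             # design matrix with intercept
y = 1.0 + 0.5 * x + rng.standard_normal(n)       # hypothetical labels

lam_max = np.linalg.eigvalsh(X.T @ X).max()      # largest eigenvalue of the Gram matrix
alpha_bound = n / lam_max                        # right-hand side of (2.23)
theta_star = np.linalg.pinv(X) @ y               # solution of the normal equations

def spectral_radius(alpha):
    """Largest eigenvalue magnitude of A = I - (2 alpha / n) X^T X."""
    A = np.eye(X.shape[1]) - (2.0 * alpha / n) * (X.T @ X)
    return np.abs(np.linalg.eigvalsh(A)).max()

def distance_after(alpha, steps=2000):
    """Distance to theta_star after a fixed number of gradient descent steps."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta = theta - alpha * (2.0 / n) * X.T @ (X @ theta - y)
    return np.linalg.norm(theta - theta_star)

for factor in (0.99, 1.01):
    alpha = factor * alpha_bound
    print(factor, spectral_radius(alpha), distance_after(alpha))
```

Just below the bound the spectral radius is under 1 and the distance shrinks toward zero. Just above it the spectral radius exceeds 1 and the distance grows with the number of steps.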

The Loss Landscape and Standardization of Inputs

In Section 2.1 we presented standardization of inputs via (2.2) where the sample mean and sample variance are computed via (2.1). We now show that such standardization also affects the loss landscape, often yielding improvement in the execution of gradient descent. As above, our analysis is in the realm of linear models where explicit analysis is possible.

To illustrate this concept, we consider a simple case of a linear model with two input features $x_1$ and $x_2$ where,

$$y = w_1 x_1 + w_2 x_2 + \epsilon.$$

This is similar to the simple regression models of Section 2.2 yet is without an intercept term. Here, with $n$ data samples, the $n \times 2$ design matrix $X$ has columns composed of the vectors $(x_i^{(1)}, \ldots, x_i^{(n)})$ for $i = 1, 2$; the labels vector is $y = (y^{(1)}, \ldots, y^{(n)})$; and the model parameters are $\theta = (w_1, w_2)$. The square loss function, where for simplicity we omit the $1/n$ term in (2.11), is

$$C(\theta\,; X, y) := \|y - X\theta\|^2 = (y - X\theta)^\top (y - X\theta) = \theta^\top X^\top X \theta - 2 y^\top X \theta + y^\top y.$$

Figure 2.8: Contour levels of a loss function for an example with $d = 2$ parameters. (a) The loss as a function of $(w_1, w_2)$ for the original data has accentuated elliptical contours. (b) The loss as a function of the parameters, $(\tilde{w}_1, \tilde{w}_2)$, associated with standardized data, yields much less accentuated contours.


A contour plot for some arbitrary (not standardized) data is illustrated in Figure 2.8 (a). It is evident that the contour levels of the loss function are ellipsoids. For some given loss level $C$, it can be shown that the lengths of the principal axes of the ellipse are $2\sqrt{c/\lambda_i}$ for $i = 1, 2$, where $\lambda_1 \ge \lambda_2$ are the eigenvalues of the Gram matrix $X^\top X$ and $c = C - y^\top y + y^\top X (X^\top X)^{-1} X^\top y$. With this, the elongation of the ellipse, which is the ratio of these lengths, is,

$$\sqrt{\frac{\lambda_1}{\lambda_2}}.$$

Now, when standardizing the inputs, each of the columns of $X$ is transformed separately via (2.2). In such a case, the loss function associated with the standardized data is $C(\theta\,; Z, y)$ where we denote by $Z$ the design matrix of the standardized data. It can now be shown that

$$Z^\top Z = n \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}, \quad \text{where} \quad \rho = \frac{1}{n} \sum_{j=1}^{n} z_1^{(j)} z_2^{(j)}$$

is also known as the sample correlation between the two features. Note that $\rho \in [-1, 1]$.

With such a normalization and an explicit form for the Gram matrix $Z^\top Z$, we can compute the eigenvalues of $n^{-1} Z^\top Z$ to be $\lambda_1 = 1 + |\rho|$ and $\lambda_2 = 1 - |\rho|$, where the common factor $n$ does not affect ratios of eigenvalues. Thus, the elongation of the ellipsoid is

$$\sqrt{\frac{\lambda_1}{\lambda_2}} = \sqrt{\frac{1 + |\rho|}{1 - |\rho|}}.$$

It is evident that in cases where the correlation between the features is low, the elongation is close to 1 and this implies that the loss landscape of the standardized data is much more similar to Figure 2.8 (b).
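The effect of standardization on the elongation can be checked on a small example. The sketch below takes the elongation to be the square root of the ratio of the largest to the smallest eigenvalue of the Gram matrix, and assumes standardization with the sample standard deviation computed using the $1/n$ convention; whether (2.1) uses $1/n$ or $1/(n-1)$ is an assumption here. The data and the helper name `elongation` are hypothetical.

```python
import numpy as np

def elongation(M):
    """sqrt(lambda_max / lambda_min) for the Gram matrix M^T M."""
    lam = np.linalg.eigvalsh(M.T @ M)            # eigenvalues in ascending order
    return np.sqrt(lam[-1] / lam[0])

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(50.0, 10.0, size=n)              # hypothetical feature on a large scale
x2 = 0.01 * x1 + rng.normal(0.0, 1.0, size=n)    # weakly correlated feature on a small scale
X = np.column_stack([x1, x2])

# Standardize each column: subtract the sample mean, divide by the sample std (1/n convention)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
rho = np.mean(Z[:, 0] * Z[:, 1])                 # sample correlation of the two features

formula = np.sqrt((1 + abs(rho)) / (1 - abs(rho)))
print(elongation(X), elongation(Z), formula)
```

For data of this kind the raw design matrix yields strongly elongated contours, while the standardized one yields an elongation that matches the expression in terms of $\rho$ and is close to 1 when the correlation is weak.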

Now, in terms of gradient descent, taking steps in a loss landscape such as Figure 2.8 (b) is generally more efficient than using the loss landscape in Figure 2.8 (a). Hence it is expected that as long as $\rho$ is not close to 1 or $-1$, carrying out standardization will help gradient descent converge faster. Note that, while this analysis is carried out on a simplistic $p = 2$, $d = 2$ linear model with quadratic loss, the principle often applies to more complicated loss landscapes as well. Further note that with standardization of the features, or any other transformation, one also has to encode the standardization transformation as part of the deployed model, since the model is now for the standardized features $z$ instead of $x$.

2.5 Generalization, Regularization, and Validation

The data available while learning and calibrating a model is called the seen data and future

data is called unseen data. These concepts were introduced in Section 2.1. Our purpose of

ﬁtting a model based on seen data is that it will ultimately work well for unseen data, a

property known as generalization ability. With this view, when seeking models that generalize

well, there are two competing negative attributes that one needs to balance, underﬁtting

and overﬁtting. Underﬁtting is a case when the model is too simple and fails to capture the

complexity of the underlying data. On the other hand, overfitting is a case where the model is so specialized to the training data that it performs poorly on unseen examples that differ even slightly from the training data. The theme of model selection in machine learning


and statistics deals with the calibration of underﬁtting and overﬁtting to yield models that

generalize well.

Model selection, or the quest for optimal generalization ability, is one of the hardest problems

in machine learning primarily because the unseen data is not available. For this, one needs

to judiciously budget the seen data by splitting it into the training set, the test set, and also

carry out validation in one of several ways that we outline in this section. In quantifying

generalization ability there are several plots and measures that one can use. These include

the quantification of model bias, model variance, and the bias and variance tradeoff. We

present these in this section.

Some classes of models are by construction designed to enable calibration of underﬁtting,

overﬁtting, or the bias–variance tradeoﬀ. One general technique for this is called regularization

which in one common form, includes the introduction of additional terms to the model’s loss

function. We present a taste of regularization techniques here and then in Section 5.7 we

focus on regularization in the context of deep neural network models.

In terms of notation, throughout this book we use $\mathcal{D}$ to denote data with $n$ samples. This sometimes means only the training set and in other cases means all of the seen data. When we focus on training specific types of models such as in Section 2.3, the symbol $\mathcal{D}$ is treated as the data allocated specifically for training and hence we assume there are $n$ samples for training and potentially other samples for testing and/or validation that we do not account for. In other cases $\mathcal{D}$ is treated as all of the available seen data, part of which may be used for testing via a testing or hold-out set which we denote via $\mathcal{D}_{\text{test}}$ with $n_{\text{test}}$ samples.

Performance on Unseen Data

We have already introduced several examples of performance metrics in Section 2.2. These include accuracy (2.6), the $F_1$ score (2.7), mean square error in the case of regression, and others. In some cases one wishes to maximize the performance metric whereas in other cases one wishes to minimize it. Note that the loss function used in model training is in some instances directly related or equal to the performance metric, and, in other instances, it is different.

It is notationally convenient to relate a performance function to the performance metric. We denote the performance function via $P(\cdot, \cdot)$ and it penalizes differences between a single predicted label $\hat{y}$ and the actual label $y$. For example when the mean square error performance metric is used then the performance function is $P(\hat{y}, y) = (\hat{y} - y)^2$. As another example, if the accuracy performance metric is used in classification then the performance function is $P(\widehat{Y}, y) = \mathbf{1}\{\widehat{Y} \neq y\}$, where $\mathbf{1}\{\cdot\}$ is the indicator function and $y$ is taken as the actual label. Note that we construct the performance function such that small values are desirable.

When we train a model and create a predictor either for regression or classification, we use the data $\mathcal{D}$ and based on the model obtain a predictor denoted by $\hat{y}(\cdot\,; \mathcal{D})$. Now, for some data pair $(x, y)$, the value $\hat{y}(x\,; \mathcal{D})$ is the prediction of $y$ and the performance function evaluated for the prediction of this data pair is $P\big(\hat{y}(x\,; \mathcal{D}), y\big)$.

In this section, to avoid notational confusion between $\hat{y}$ and $\widehat{Y}$, we use the notation $\hat{y}$ for both cases.


As outlined in Section 2.1, we ensure that the nature of the seen data is similar to that of unseen data and with this, the underlying modelling assumption is that both seen and unseen data are generated by the same underlying processes. Hence, for both theoretical and empirical analysis, unless we know otherwise, we assume that the probability distribution of each data sample $(x^{(i)}, y^{(i)})$ is the same for all $i = 1, \ldots, n$ and is further the same as the distribution of each unseen data sample $(x^{\star}, y^{\star})$. That is, we assume there is an underlying probability space for the observations and we vaguely denote the joint probabilities of the features and label via $\mathbb{P}(x, y)$. Our usage of probabilistic statements here is only via expected values where we denote the expectation operator via $\mathbb{E}[\cdot]$ and often use a subscript for the expectation to denote the objects that are treated as random.

With this notation, the expected value of the performance of the trained model for unseen data points $(x^{\star}, y^{\star})$ is denoted via,

$$E_{\text{unseen}} = \mathbb{E}_{(x^{\star}, y^{\star})}\Big[P\big(\hat{y}(x^{\star}; \mathcal{D}_{\text{train}}), y^{\star}\big)\Big], \qquad (2.24)$$

where $\mathcal{D}_{\text{train}}$ is the training data. This quantity is called the generalization performance or generalization error. It may be viewed as an average over all possible unseen data points and hence $E_{\text{unseen}}$ evaluates how well the predictor or model generalizes. With a given training dataset $\mathcal{D}_{\text{train}}$, our aim is to build a model that yields the smallest possible $E_{\text{unseen}}$.

Unfortunately, since it is based on unseen data, $E_{\text{unseen}}$ is a theoretical construct and since we do not know the probability law $\mathbb{P}(x, y)$ exactly, we cannot compute $E_{\text{unseen}}$. However, as a first attempt, we can approximate the expectation by averaging over available training data. That is,

$$E_{\text{train}} = \frac{1}{n_{\text{train}}} \sum_{(x,y) \in \mathcal{D}_{\text{train}}} P\big(\hat{y}(x\,; \mathcal{D}_{\text{train}}), y\big), \qquad (2.25)$$

where $n_{\text{train}}$ is the number of observations in $\mathcal{D}_{\text{train}}$. It turns out that $E_{\text{train}}$ is typically a poor estimator of $E_{\text{unseen}}$ because the same training observations that were used to create the predictor are also used to evaluate the predictor performance. That is, the learned parameters of the model $\theta$ that are used to construct $\hat{y}(\cdot)$ depend on $\mathcal{D}_{\text{train}}$. Hence while $E_{\text{train}}$ does present us with some insight about the ability of our model to reproduce the data that has been learned, it lacks the ability to estimate performance on unseen data.

In order to get a better estimate of $E_{\text{unseen}}$, it is preferable to average over data that has not been used for training the model, namely over the test set. In an ideal situation where we use the test set only once and do not calibrate and adjust the model based on the test set, the test set observations are completely independent of the model. In such a case, the estimator,

$$E_{\text{test}} = \frac{1}{n_{\text{test}}} \sum_{(x,y) \in \mathcal{D}_{\text{test}}} P\big(\hat{y}(x\,; \mathcal{D}_{\text{train}}), y\big), \qquad (2.26)$$

is a good estimator of $E_{\text{unseen}}$, especially for significantly large $n_{\text{test}}$. Specifically under the assumption that the unseen data and the test set have the same distribution, the expected value of $E_{\text{test}}$ is exactly $E_{\text{unseen}}$ making it a statistically unbiased estimator of performance. Further it is statistically consistent in the sense that if we are able to allocate more testing data and $n_{\text{test}} \to \infty$ then $E_{\text{test}} \to E_{\text{unseen}}$. This is simply a consequence of the law of large numbers.
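To make the contrast between the training average (2.25) and the test average (2.26) concrete, here is a small sketch with the squared error performance function. It fits a least squares linear model on a deliberately small training set and evaluates the same fitted model on a large held-out set. The data generating process, the sizes, and the names `sample` and `mean_performance` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_train, n_test = 15, 25, 10_000
theta_true = rng.standard_normal(d)               # hypothetical underlying reality

def sample(n):
    """Draw n feature vectors and noisy labels from the same probability law."""
    X = rng.standard_normal((n, d))
    y = X @ theta_true + rng.standard_normal(n)   # unit variance noise
    return X, y

X_train, y_train = sample(n_train)
X_test, y_test = sample(n_test)                   # held out, never used for fitting

theta_hat = np.linalg.pinv(X_train) @ y_train     # least squares fit on the training set only

def mean_performance(X, y):
    """Average of the squared error performance function over a dataset."""
    return np.mean((X @ theta_hat - y) ** 2)

E_train = mean_performance(X_train, y_train)      # average over the training set
E_test = mean_performance(X_test, y_test)         # average over the test set
print(E_train, E_test)
```

With few training samples relative to the number of parameters, the training average is typically well below the test average, which illustrates why the former is an optimistic estimate of generalization performance.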

Note that these desirable statistical properties are only for a fixed $\mathcal{D}_{\text{train}}$.

Formally the convergence $E_{\text{test}} \to E_{\text{unseen}}$ may be seen as convergence in probability in one form or almost sure convergence in a different form. We do not focus on such subtleties here.


The straightforward statistical properties of unbiasedness and consistency enjoyed by $E_{\text{test}}$ make the practice of holding out a test set for performance evaluation attractive. However, setting aside a test set is costly as we effectively 'throw away' $n_{\text{test}}$ observations and do not use them for improving the model. For this reason, it is often tempting in practice to iteratively evaluate (2.26) while adjusting model settings or hyper-parameters. This frowned upon practice breaks the independence between $\mathcal{D}_{\text{test}}$ and the model at which point the desirable statistical properties of $E_{\text{test}}$ are lost. Hence, as an alternative, we use a validation set or some other method as described below.

In addition to using an independent test set as in (2.26), other alternatives for estimation

of performance also exist, which include using K-fold cross validation. This is a topic we

describe below in the context of validation and hyper-parameter optimization, yet it may

also be used for purposes of performance evaluation.

Model Choice, Underﬁtting, and Overﬁtting

The generalization performance in (2.24) is specific to a fixed single training dataset $\mathcal{D}_{\text{train}}$. However, for a given problem, when considering which type of model to use and what hyper-parameters to choose, it is often useful to think about the expectation over all possible training datasets. For this we define the expected generalization performance,

$$\widetilde{E}_{\text{unseen}} = \mathbb{E}_{\mathcal{D}_{\text{train}}}\big[E_{\text{unseen}}\big] = \mathbb{E}_{\mathcal{D}_{\text{train}},\,(x^{\star}, y^{\star})}\Big[P\big(\hat{y}(x^{\star}; \mathcal{D}_{\text{train}}), y^{\star}\big)\Big]. \qquad (2.27)$$

It represents the average of $E_{\text{unseen}}$ over all possible datasets $\mathcal{D}_{\text{train}}$ of a given size from the same probability law $\mathbb{P}(x, y)$ where we keep in mind that each dataset potentially yields a different representation of the model.

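A similar quantity for the training set is

$$\widetilde{E}_{\text{train}} = \mathbb{E}_{\mathcal{D}_{\text{train}}}\Big[\frac{1}{n_{\text{train}}} \sum_{(x,y) \in \mathcal{D}_{\text{train}}} P\big(\hat{y}(x\,; \mathcal{D}_{\text{train}}), y\big)\Big]. \qquad (2.28)$$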

Keep in mind that $\widetilde{E}_{\text{unseen}}$ and $\widetilde{E}_{\text{train}}$ are functions of the type of model used, the hyper-parameters, and the training dataset size. With such relationships present, the machine learning engineer can in principle ponder about the theoretical shape of $\widetilde{E}_{\text{unseen}}$ and $\widetilde{E}_{\text{train}}$ and seek a model that appears best. In this respect the generalization gap, defined as $\widetilde{E}_{\text{unseen}} - \widetilde{E}_{\text{train}}$, is also important.

The combination of $\widetilde{E}_{\text{unseen}}$, $\widetilde{E}_{\text{train}}$, and the generalization gap, based on estimates, allows one to seek a balance between underfitting and overfitting. There are multiple suggestions on "best practice" for using the available data to estimate $\widetilde{E}_{\text{unseen}}$, $\widetilde{E}_{\text{train}}$, and the generalization gap, and to select the best model. A thorough discussion of such best practices is beyond our scope. The notes and references at the end of this chapter link to further reading. Instead, let us consider the schematic Figure 2.9, which presents typical behavior of $\widetilde{E}_{\text{unseen}}$ and $\widetilde{E}_{\text{train}}$ as a function of model complexity.

Generally, as model complexity increases, expected training performance, $\widetilde{E}_{\text{train}}$, improves (decreases) since complex structured models can explain the training data better. At high extremes this is overfitting. Similarly an opposite phenomenon is that models with low complexity are not able to describe the data well. The tradeoff between these two regimes is obtained at the minimum of $\widetilde{E}_{\text{unseen}}$ marked by the vertical dashed line.

Figure 2.9: Typical behaviour of expected generalization performance and expected training performance as a function of model complexity. (Schematic of error versus model complexity showing the $\widetilde{E}_{\text{unseen}}$ and $\widetilde{E}_{\text{train}}$ curves, the underfitting and overfitting regimes, and the optimal model.)

In practice, unless presented with an infinite pool of data, one is not able to evaluate $\widetilde{E}_{\text{unseen}}$ and $\widetilde{E}_{\text{train}}$ directly and one is certainly not able to evaluate these quantities over all possibilities of models, hyper-parameters, and sample sizes. Nevertheless, much of the practice of model selection revolves around getting a feel for the dependence of $\widetilde{E}_{\text{unseen}}$ and $\widetilde{E}_{\text{train}}$ on model choice, hyper-parameters, and sample size. This is typically done using very limited measurements from one or several training and validation executions.

Typical practice is to monitor empirical estimates of these quantities as a function of model complexity, hyper-parameter choice, or sample size. The most basic practice is evaluation of $E_{\text{train}}$ (2.25) together with a validation performance measurement that is of similar nature to $E_{\text{test}}$ (2.26), such as the validation performance or K-fold cross validation performance which are defined in the sequel. See (2.37) and (2.38) below.

As one simple illustrative example capturing the tradeoffs of model complexity, let us consider linear models with polynomial features applied to synthetic univariate ($p = 1$) data. The model

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_k x^k + \epsilon \qquad (2.29)$$

is denoted by $M_k$ where $k$ is the order of the polynomial. Hence $M_0$ is the constant model, $M_1$ is the simple linear model, $M_2$ is the quadratic model, and so on. A quadratic model of this nature was used in (2.4) of Section 2.2. In this framework, model complexity corresponds to the degree of the polynomial model.

Now taking one possible realization of $\mathcal{D}_{\text{train}}$, in Figure 2.10 (a) we use this family of models to fit data of size $n_{\text{train}} = 10$. With this single realization we clearly see underfitting behaviour for models $M_0$ and $M_1$. In contrast, model $M_9$ appears to overfit the observed data. Between these two extremes, model $M_3$ looks like an appropriate representation of the observed data.

Figure 2.10: Increasing model complexity illustrated via linear models with polynomial features where $k$, the order of the polynomial, captures the complexity. (a) Fitting several models to a single realization with $n = 10$ data-points (panels show polynomial fits with $k = 0$, $k = 1$, $k = 3$, and $k = 9$). (b) The training performance in red and simulation estimates of the generalization performance in black (mean square error versus model order $k$).

In Figure 2.10 (b) the red curve presents $E_{\text{train}}$ for this dataset. It is obvious that as $k$ increases training fit improves. Further, in this hypothetical example since we know the underlying process with probability law $\mathbb{P}(x, y)$ used for purposes of simulation of synthetic data, we may sample as many $(x^{\star}, y^{\star})$ pairs as we wish, to obtain a reliable estimate of $E_{\text{unseen}}$. This curve is plotted in black where in this case we use 10,000 repetitions for each $k$, each time with the fixed model based on our single available dataset, $\mathcal{D}_{\text{train}}$. This Monte Carlo simulation makes it clear that when $k = 9$ or $k = 8$ there is overfitting and when

2 there is underﬁtting. In practice plots exactly like Figure 2.10 (b) cannot be

produced because we do not know $\mathbb{P}(x, y)$. Instead one can resort to estimates based on

cross validation to obtain curves similar to the black curve in Figure 2.10 (b).
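An experiment in the spirit of Figure 2.10 (b) can be sketched in a few lines. The code below fits polynomial models of order $k = 0, \ldots, 9$ to a single training set of 10 points and compares the training mean square error with a Monte Carlo estimate of the error on freshly sampled pairs. The true relationship `f`, the noise level, and the sampling scheme are hypothetical, since the text does not specify the process used to generate its synthetic data.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    return np.sin(2 * np.pi * x)                  # hypothetical true relationship

def sample(n):
    """Draw n inputs uniformly on [0, 1] with noisy labels."""
    x = rng.uniform(0.0, 1.0, size=n)
    return x, f(x) + 0.3 * rng.standard_normal(n)

x_train, y_train = sample(10)                     # the single available training set
x_new, y_new = sample(10_000)                     # fresh pairs standing in for unseen data

train_mse, unseen_mse = [], []
for k in range(10):
    V = np.vander(x_train, k + 1, increasing=True)      # columns 1, x, ..., x^k
    beta = np.linalg.lstsq(V, y_train, rcond=None)[0]   # least squares fit of the order k model
    train_mse.append(np.mean((V @ beta - y_train) ** 2))
    V_new = np.vander(x_new, k + 1, increasing=True)
    unseen_mse.append(np.mean((V_new @ beta - y_new) ** 2))

for k in range(10):
    print(k, train_mse[k], unseen_mse[k])
```

The training error cannot increase as $k$ grows because the models are nested, whereas the estimate on fresh data typically falls and then rises again for the highest orders. In practice the fresh samples are not available, which is why cross validation is used instead.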

We also mention that, while we stated that key elements that aﬀect expected performance

are the model type, hyper-parameters, and sample size, in the world of deep learning there

is also an additional major factor, training time. For deep learning models, since the number

of parameters in the model is often huge, letting the model train for longer is similar to

using a more complex model as presented in Figure 2.9.

Bias and Variance Decomposition

A related view to the analysis of expected generalization performance and the generalization gap is the so-called bias and variance decomposition. It focuses on the expected generalization performance in production, $\widetilde{E}_{\text{unseen}}$, and decomposes it into a sum of terms related to model bias, model variance, and the noise magnitude. With this decomposition, underfitting is said to be a situation with high model bias and overfitting is said to be a situation with high model variance. Using this terminology, balancing model bias and model variance is equivalent to balancing underfitting and overfitting respectively. This is known as the bias and variance tradeoff.


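The bias and variance decomposition is mathematically elegant in the special case of the square error performance function $P(\hat{y}, y) = (\hat{y} - y)^2$ and a specifically assumed underlying random reality

$$y = f(x) + \epsilon, \quad \text{with } \mathbb{E}[\epsilon] = 0, \text{ and } \epsilon \text{ is independent of } x. \qquad (2.30)$$

Here $x$ is a vector of features and $y$ is a scalar real valued label. Further $\mathbb{E}[\epsilon^2]$ is the variance of the noise term and is called the inherent noise. In this setting, for some unseen feature vector $x^{\star}$, the predictor trained on data $\mathcal{D}$ is $\hat{y}(x^{\star}; \mathcal{D})$, which we also denote via $\hat{f}(x^{\star}; \mathcal{D})$ since it estimates $f(x^{\star})$. Hence the expected generalization performance of (2.27) becomes

$$\widetilde{E}_{\text{unseen}} = \mathbb{E}_{\mathcal{D}, x^{\star}, \epsilon}\Big[\big(\hat{f}(x^{\star}; \mathcal{D}) - (f(x^{\star}) + \epsilon)\big)^2\Big]. \qquad (2.31)$$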

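Now, a standard algebraic manipulation common in statistics is to add and subtract $\mathbb{E}_{\mathcal{D}, x^{\star}}[\hat{f}(x^{\star}; \mathcal{D})]$ inside (2.31), expand the expression, apply the external expectation operator, and then cancel out terms that have zero expectation (resulting from $\mathbb{E}[\epsilon] = 0$ and the fact that $\epsilon$ and $x^{\star}$ are independent). This manipulation transforms (2.31) to the bias-variance-noise decomposition equation,

$$\widetilde{E}_{\text{unseen}} = \underbrace{\Big(\mathbb{E}\big[\hat{f}(x^{\star}; \mathcal{D})\big] - \mathbb{E}\big[f(x^{\star})\big]\Big)^2}_{\text{Bias squared of } \hat{f}(\cdot)} + \underbrace{\mathrm{Var}\big(\hat{f}(x^{\star}; \mathcal{D})\big)}_{\text{Variance of } \hat{f}(\cdot)} + \underbrace{\mathbb{E}[\epsilon^2]}_{\text{Inherent noise}}. \qquad (2.32)$$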

Here the first term is the square of the bias, the second term is the variance taking into consideration variability both from $\mathcal{D}$ used for training and $x^{\star}$, and the third term is the inherent noise. The expectations and variances in the bias and variance terms are with respect to the training dataset $\mathcal{D}$ and the arbitrary unseen feature vector $x^{\star}$. The main takeaway from (2.32) is that if we ignore the inherent noise, the loss of the model has two key components, model bias (technically it is the model bias squared), and model variance.

The model bias is a measure of how a typical (expectation over all possible data samples) model $\hat{f}(\cdot\,; \mathcal{D})$ misspecifies the correct relationship $f(\cdot)$. Model classes with high bias have that $\hat{f}(\cdot\,; \mathcal{D})$ does not accurately predict $f(\cdot)$. That is, high bias generally implies underfitting. Similarly, model classes with low model bias are detailed descriptions of reality since the expected difference in the bias term is near zero.

The model variance is a measure of the variability of the model class $\hat{f}(\cdot\,; \mathcal{D})$ with respect to the random sample $\mathcal{D}$ and the distribution of $x^{\star}$ as implicitly implied by the probability law of the data $\mathbb{P}(x, y)$. Model classes with high model variance are often overfit (to the training data) and do not generalize (to unseen data) well. Similarly, model classes with low model variance are much more robust to the training data and generalize to the unseen data much better.

Similar analysis to the derivation that leads to (2.32) can also be attempted for performance functions other than square error, and model structures other than (2.30). With such other settings, the mathematical elegance of (2.32) is often lost. Nevertheless, the concepts of model bias, model variance, and the bias and variance tradeoff still persist. For example in a classification setting we may compare the accuracy obtained on the training set to that obtained on a validation set. If there is a high discrepancy where the training accuracy is much higher than the validation accuracy, then there is probably a variance problem indicating that the model is overfitting.
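The decomposition can be illustrated by simulation. The sketch below makes several simplifying assumptions that are not in the text: the unseen feature value `x_star` is held fixed rather than random, the training inputs are a fixed grid so that only the labels vary between training sets, and the true relationship `f` and noise level `sigma` are hypothetical. Under these assumptions the mean squared error at `x_star` should be close to the sum of squared bias, variance, and inherent noise.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, n, k, x_star, reps = 0.3, 10, 3, 0.25, 50_000    # hypothetical settings

def f(x):
    return np.sin(2 * np.pi * x)                  # hypothetical true relationship

x = np.linspace(0.0, 1.0, n)                      # fixed training inputs (simplifying assumption)
V = np.vander(x, k + 1, increasing=True)          # polynomial features 1, x, ..., x^k
v_star = x_star ** np.arange(k + 1)               # features of the unseen point
h = np.linalg.pinv(V).T @ v_star                  # least squares prediction at x_star equals h @ y

# Each row of Y holds the labels of one simulated training set
Y = f(x) + sigma * rng.standard_normal((reps, n))
preds = Y @ h                                     # fitted prediction at x_star for every training set
y_star = f(x_star) + sigma * rng.standard_normal(reps)  # noisy unseen labels

mse = np.mean((preds - y_star) ** 2)              # Monte Carlo estimate of the expected squared error
bias_sq = (preds.mean() - f(x_star)) ** 2         # squared bias at x_star
variance = preds.var()                            # variance of the prediction over training sets
noise = sigma ** 2                                # inherent noise
print(mse, bias_sq + variance + noise)
```

Changing the order `k` in this sketch shifts weight between the bias and variance terms, which is one way to observe the tradeoff numerically.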


Addition of Regularization Terms

One natural way to control model variance is to induce or force model parameters to remain within some confined subset of the parameter space. This is called regularization. At the extreme case where all model parameters are 0, the model variance vanishes as well. In less extreme cases where there is only some constraint on model parameters, model variance is still controlled. Such decreases in model variance may imply an increase of model bias. Nevertheless, the ultimate goal of optimizing the expected performance loss typically merits such adjustments.

A common way to keep model parameters at bay is to augment the optimization objective $\min_{\theta} C(\theta\,;\mathcal{D})$ with an additional regularization term $R_\lambda(\theta)$. The revised objective is then,

$$\min_{\theta}\; C(\theta\,;\mathcal{D}) + R_\lambda(\theta). \tag{2.33}$$

The regularization term $R_\lambda(\theta)$ depends on a regularization parameter $\lambda$, which is often a scalar in the range $[0, \infty)$ but also sometimes a vector. This hyper-parameter allows us to optimize the bias and variance tradeoff.

A common general regularization technique called elastic net has regularization parameter $\lambda = (\lambda_1, \lambda_2)$ and,

$$R_\lambda(\theta) = \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2 \quad \text{with} \quad \|\theta\|_1 = \sum_{i=1}^{d} |\theta_i| \quad \text{and} \quad \|\theta\|_2^2 = \sum_{i=1}^{d} \theta_i^2, \tag{2.34}$$

where $d$ is the dimension of the parameter space. Hence the values of $\lambda_1$ and $\lambda_2$ determine what kind of penalty the objective function will pay for high values of $\theta_i$. Clearly, with $\lambda_1 = \lambda_2 = 0$ the original objective is unmodified. In contrast, as $\lambda_1 \to \infty$ or $\lambda_2 \to \infty$ the estimates $\theta_i \to 0$ and any information in the data $\mathcal{D}$ is fully ignored. Indeed, as $\lambda_1, \lambda_2$ grow, the model bias grows while model variance is decreased and overfitting is mitigated. With regularization there is often a magical 'sweet spot' for $\lambda$ where the objective (2.33) does a good job at fitting the model.

Particular cases of elastic net are the classic ridge regression, also called Tikhonov regularization, and LASSO, standing for least absolute shrinkage and selection operator. In the former $\lambda_1 = 0$ and only $\lambda_2$ is used, and in the latter $\lambda_2 = 0$ and only $\lambda_1$ is used. One of the benefits of LASSO, also present in the more general elastic net case, is that the $\|\theta\|_1$ loss allows the algorithm to remove variables from the model by "zeroing out" their $\theta_i$ values completely. Hence LASSO is very useful as a model selection technique.
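To make the penalty in (2.34) concrete, the following is a minimal sketch in Python with NumPy. The function name `elastic_net_penalty` and the numerical values are illustrative assumptions and do not come from the text. The sketch only evaluates the regularization term; it does not fit a model.

```python
import numpy as np

def elastic_net_penalty(theta, lam1, lam2):
    """Evaluate lam1 * ||theta||_1 + lam2 * ||theta||_2^2 (hypothetical helper)."""
    theta = np.asarray(theta, dtype=float)
    return lam1 * np.sum(np.abs(theta)) + lam2 * np.sum(theta ** 2)

# Illustrative parameter vector (an assumption, not from the text).
theta = np.array([1.0, -2.0, 0.0])

# Both terms active: general elastic net.
penalty_elastic = elastic_net_penalty(theta, lam1=0.5, lam2=0.1)
# LASSO special case: only the 1-norm term is used.
penalty_lasso = elastic_net_penalty(theta, lam1=0.5, lam2=0.0)
# Ridge special case: only the squared 2-norm term is used.
penalty_ridge = elastic_net_penalty(theta, lam1=0.0, lam2=0.1)
```

With these values the elastic net penalty is the sum of the LASSO and ridge penalties, which reflects the additive structure of (2.34).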

The case of ridge regression is slightly simpler to analyze than LASSO and it fits well within the framework of linear models presented in Section 2.3. We thus present the details now. For ridge regression the data fitting problem can be represented as,

$$\min_{\theta \in \mathbb{R}^d} \|y - X\theta\|^2 + \lambda \|\theta\|^2, \tag{2.35}$$

Footnote: Note that in cases such as linear regression or deep neural networks where there is a constant term ($\beta_0$ for example), the parameters for the constant term are typically not regularized and hence the norms are taken only on the other parameters.

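where the design matrix $X$ is as in (2.10) and we now consider $\lambda$ as a scalar in the range $[0, \infty)$ (previously, in (2.34), it was denoted as $\lambda_2$). Compare (2.35) with the original least squares objective (2.13). Now by manipulating the $\|\cdot\|^2$ expressions, the problem can be recast as

$$\min_{\theta \in \mathbb{R}^d} \|\tilde{y} - \tilde{X}_\lambda \theta\|^2 \quad \text{with} \quad \tilde{y} = \begin{bmatrix} y \\ 0 \end{bmatrix}, \quad \tilde{X}_\lambda = \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix},$$

where $I$ is the $d \times d$ identity matrix and $0$ is the zero vector in $\mathbb{R}^d$. The pseudo-inverse associated with $\tilde{X}_\lambda$ is

$$\tilde{X}_\lambda^\dagger = (\tilde{X}_\lambda^\top \tilde{X}_\lambda)^{-1} \tilde{X}_\lambda^\top = (X^\top X + \lambda I)^{-1} \begin{bmatrix} X^\top & \sqrt{\lambda}\, I \end{bmatrix}.$$

Hence, returning to (2.16), the parameter estimate for ridge regression is

$$\hat{\theta} = (X^\top X + \lambda I)^{-1} X^\top y. \tag{2.36}$$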

As an aside, note that for any $\lambda > 0$ the matrix $X^\top X + \lambda I$ is not singular even if $X^\top X$ is singular. Also, as $\lambda \to 0$ it can be shown that the pseudo-inverse $\tilde{X}_\lambda^\dagger$ of the augmented matrix converges to the SVD based pseudo-inverse (2.17) associated with $X$.
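The equivalence between the closed form (2.36) and the augmented least squares formulation can be checked numerically. The following is a minimal sketch in Python with NumPy on synthetic data; the data, the variable names, and the value of `lam` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(n, d))                       # synthetic design matrix
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=n)
lam = 0.7                                         # illustrative regularization parameter

# Closed form ridge estimate: (X^T X + lam I)^{-1} X^T y.
theta_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Augmented least squares: stack sqrt(lam) I under X and zeros under y.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
y_aug = np.concatenate([y, np.zeros(d)])
theta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

# As lam approaches 0 the ridge estimate approaches the pseudo-inverse solution.
theta_small_lam = np.linalg.solve(X.T @ X + 1e-10 * np.eye(d), X.T @ y)
theta_pinv = np.linalg.pinv(X) @ y

# Even with a duplicated column (singular X^T X), lam > 0 gives a solvable system.
X_sing = np.hstack([X, X[:, :1]])
theta_sing = np.linalg.solve(X_sing.T @ X_sing + lam * np.eye(d + 1), X_sing.T @ y)
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a common numerical choice; it is not prescribed by the text.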

We note that while for linear models the results are very elegant, in other cases closed form solutions such as (2.36) do not exist. Still, many machine learning loss functions can be augmented with a regularization term. We revisit these concepts in Section 5.7 in the context of deep learning, where other regularization methods are presented as well. Also note that the regularization parameter $\lambda$ is a prime example of a hyper-parameter that one would like to calibrate during learning. This specific parameter serves as a good lever for optimizing the bias and variance tradeoff. We now discuss the general topic of hyper-parameter optimization.

Hyper-parameter Calibration and Cross Validation

As alluded to above, calibrating the model choice and the hyper-parameters while reusing the test set for performance evaluation is bad practice since it pollutes the test set performance estimator of (2.26). For this reason it is common to further split the training data $\mathcal{D}_{\text{train}}$ in one of several ways while experimenting with model configurations and hyper-parameters. With such an approach $\mathcal{D}_{\text{test}}$ is reserved only for final performance evaluation before rolling out the model to production. Such use of the training data, where some parts of the data are used for training parameters and the other parts are used for checking performance and tuning hyper-parameters, is generally called cross validation.

There are multiple common cross validation techniques with many variants used in practice. Here we present only two main approaches, the train-validate split approach and K-fold cross validation. The train-validate split approach is common in situations where the total number of datapoints $n$ is large. The K-fold cross validation approach is useful when data is limited.

The train-validate split approach simply implies that the original data with $n$ samples is first split into training and testing as before, and then the training data is further split into two subsets where the first is (confusingly) again called the training set and the latter is the validation set. Hence, considering all of the available data $\mathcal{D}$, with this approach,

$$\mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{validate}} \cup \mathcal{D}_{\text{test}},$$

where the unions are of disjoint sets.

Footnote: In practice we often do not regularize the intercept term, and this requires adjusting the identity matrix $I$ in $\tilde{X}_\lambda$.

When considering all of the data, this approach is also called the train-validate-test split approach. If, for example, we use an 80-20 rule for both splits, and assuming divisibility holds, then $n_{\text{test}} = 0.2 \times n$, $n_{\text{train}} = 0.64 \times n$, and $n_{\text{validation}} = 0.16 \times n$.

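As an example, assume the model is fixed yet regularized with elastic net as presented above. Hence the hyper-parameters in question are $\lambda = (\lambda_1, \lambda_2)$ and the choice of these needs to be tuned. The approach is then to fit the estimator on $\mathcal{D}_{\text{train}}$ over a grid of such hyper-parameters, retraining from scratch for every $\lambda$. We then choose

$$\lambda^* = \operatorname*{argmax}_{\lambda}\; \hat{P}_{\text{validation}}(\lambda) \quad \text{with} \quad \hat{P}_{\text{validation}}(\lambda) = \frac{1}{n_{\text{validation}}} \sum_{(x,y) \in \mathcal{D}_{\text{validation}}} P\big(\hat{y}(x\,;\mathcal{D}_{\text{train}}, \lambda),\, y\big), \tag{2.37}$$

where $P(\cdot,\cdot)$ denotes the performance function underlying the estimator (2.26), and where we can see that the predictor $\hat{y}$ depends on the hyper-parameter. With the optimal pair $\lambda^*$ selected, the model with this $\lambda^*$ is evaluated on the test set once via (2.26) before being rolled out to production. Note that this approach has many variants used in practice.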
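The train-validate-test workflow can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions: the data is synthetic, ridge regression with a scalar $\lambda$ is used in place of elastic net for simplicity, and the performance function is taken to be the negative mean squared error so that larger values are better, in line with the argmax in (2.37). The helper names `ridge_fit` and `performance` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.normal(size=(n, d))                               # synthetic features
theta_true = np.concatenate([[2.0, -1.0, 0.5], np.zeros(d - 3)])
y = X @ theta_true + rng.normal(size=n)

# 80-20 rule applied twice: first test, then validation out of the remainder.
perm = rng.permutation(n)
n_test = round(0.2 * n)
n_validation = round(0.2 * (n - n_test))
test_idx = perm[:n_test]
val_idx = perm[n_test:n_test + n_validation]
train_idx = perm[n_test + n_validation:]

def ridge_fit(X, y, lam):
    """Ridge estimate (X^T X + lam I)^{-1} X^T y (hypothetical helper)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def performance(theta, X, y):
    """Negative mean squared error, so that higher is better (an assumption)."""
    return -np.mean((X @ theta - y) ** 2)

# Retrain from scratch for every lambda in the grid; score on the validation set.
lam_grid = [0.01, 0.1, 1.0, 10.0, 100.0]
val_scores = [
    performance(ridge_fit(X[train_idx], y[train_idx], lam), X[val_idx], y[val_idx])
    for lam in lam_grid
]
best = int(np.argmax(val_scores))
lam_star = lam_grid[best]

# The test set is used exactly once, for the selected hyper-parameter.
theta_star = ridge_fit(X[train_idx], y[train_idx], lam_star)
test_score = performance(theta_star, X[test_idx], y[test_idx])
```

With $n = 200$ the three index sets have sizes $128$, $32$, and $40$, matching the proportions $0.64$, $0.16$, and $0.2$ of the 80-20 rule.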

Figure 2.11: K-fold cross validation. For each $k = 1, \ldots, K$ the data is split into training data and validation data differently. This yields $K$ estimates for performance and these estimates can be averaged.

In the case of limited observations a train-validate-test split may be too wasteful of data and an alternative approach is K-fold cross validation, as illustrated in Figure 2.11. This approach may be used on all of the data $\mathcal{D}$ or only on the training data after a train-test split is performed. Here for simplicity we apply it to some dataset $\mathcal{D}$. The approach is useful for model selection, hyper-parameter optimization, and performance evaluation.

The value $K$ of this approach, which determines the number of data chunks or repetitions, is a static configuration parameter with a typical value being $K = 5$ or $K = 10$. The approach is to split $\mathcal{D}$ into $K$ equally sized data chunks, each denoted $\mathcal{D}_k$, with

$$\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup \ldots \cup \mathcal{D}_K,$$

where again the unions are of disjoint sets. Then for each $k = 1, \ldots, K$ we fix a training set to be composed of all of the observations except for $\mathcal{D}_k$ and the validation set (may also be called a test set) to be $\mathcal{D}_k$. That is, denoting set difference with '\', we set,

$$\mathcal{D}^{(k)}_{\text{train}} = \mathcal{D} \setminus \mathcal{D}_k, \quad \text{and} \quad \mathcal{D}^{(k)}_{\text{validation}} = \mathcal{D}_k, \quad \text{for } k = 1, \ldots, K.$$

We may now retrain and evaluate the model separately for each data chunk $k$, where each time we use $\mathcal{D}^{(k)}_{\text{train}}$ as the training data and $\mathcal{D}^{(k)}_{\text{validation}}$ as the validation (or testing) data. That is, if for example $K = 10$ and originally $\mathcal{D}$ has $n$ observations, then for each $k$ we have $n_{\text{train}} = 0.9 \times n$ and $n_{\text{validation}} = 0.1 \times n$ (again assuming $n$ is properly divisible).

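With the model trained separately for each repetition $k$, we can now estimate performance via,

$$\hat{P} = \frac{1}{K} \sum_{k=1}^{K} \hat{P}^{(k)}_{\text{validation}}, \tag{2.38}$$

with,

$$\hat{P}^{(k)}_{\text{validation}} = \frac{1}{n_{\text{validation}}} \sum_{(x,y) \in \mathcal{D}^{(k)}_{\text{validation}}} P\big(\hat{y}^{(k)}(x\,;\mathcal{D}^{(k)}_{\text{train}}),\, y\big),$$

where $P(\cdot,\cdot)$ is the performance function and $\hat{y}^{(k)}$ is the predictor trained for repetition $k$.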

Once again, if needed, hyper-parameter optimization may take place by treating the performance estimate (2.38) as a function of the hyper-parameter in question. Also, as mentioned above, in situations where the total number of observations is low and no hyper-parameters are being tuned, K-fold cross validation may serve as an alternative to general performance evaluation using a train-test split. Again, as with the train-validate-test split approach, there are multiple variations of K-fold cross validation, with the exact method used in practice often depending on the specific situation encountered.
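The K-fold procedure can be sketched as follows in Python with NumPy. This is a minimal illustration, assuming synthetic data, a ridge regression model, and negative mean squared error as the performance function; the helper names `ridge_fit` and `k_fold_performance` are hypothetical and not from the text.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate (X^T X + lam I)^{-1} X^T y (hypothetical helper)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def k_fold_performance(X, y, lam, K, rng):
    """Average validation performance over K folds, in the spirit of (2.38)."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)   # K disjoint chunks of indices
    fold_scores = []
    for k in range(K):
        val_idx = folds[k]                          # chunk k is the validation set
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        theta_k = ridge_fit(X[train_idx], y[train_idx], lam)   # retrain per fold
        # Negative mean squared error as performance (an assumption).
        fold_scores.append(-np.mean((X[val_idx] @ theta_k - y[val_idx]) ** 2))
    return float(np.mean(fold_scores)), fold_scores

rng = np.random.default_rng(2)
n, d, K = 100, 5, 10
X = rng.normal(size=(n, d))                         # synthetic features
y = X @ np.arange(1.0, d + 1.0) + rng.normal(size=n)
cv_score, fold_scores = k_fold_performance(X, y, lam=1.0, K=K, rng=rng)
```

For hyper-parameter tuning one would call `k_fold_performance` for each candidate value of `lam` and select the value with the highest averaged score.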

2.6 A Taste of Unsupervised Learning

Now that we have explored key aspects of supervised learning in the sections above, let us get a taste for the basics of unsupervised learning. In this context the data is unlabelled and is denoted via $\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}$. Here we assume that each sample or observation $x^{(j)}$ is a $p$ dimensional vector of features in Euclidean space. Observe that there are no labels $y^{(j)}$.

We briefly introduce two popular unsupervised learning methods, one for clustering and one for data reduction. These are respectively the K-means algorithm and the framework of principal component analysis (PCA). In exploring PCA we also take a slightly deeper look at the linear algebraic concept of singular value decomposition (SVD), already used in Section 2.3 in the context of pseudo-inverse representation. Here we see how SVD has applications to data compression, a notion sometimes used in more complex deep learning models. Further references to more advanced unsupervised learning methods are in the notes and references at the end of the chapter.

K-means Clustering

The machine learning activity of clustering allows us to identify meaningful groups, or clusters, among the data points and find representative centers of these clusters. The aim is that the samples within each cluster are more closely related to one another than samples from different clusters.

Formally, for the dataset $\mathcal{D}$, clustering is the act of associating a cluster $\ell$ with each observation, where $\ell$ comes from a small finite set, $\{1, \ldots, K\}$. That is, a clustering algorithm works on the data $\mathcal{D}$ and outputs a function $c(\cdot)$ which maps individual data points to the label values $\{1, \ldots, K\}$. The clustered data (algorithm output) is then a collection of clusters denoted via,

$$C_\ell = \big\{ x^{(j)} \;|\; c(x^{(j)}) = \ell,\; j \in \{1, \ldots, n\} \big\}, \quad \text{for } \ell = 1, \ldots, K.$$

A clustering algorithm attempts to choose the clusters such that the elements of each $C_\ell$ are as homogeneous as possible.

The K-means algorithm is one very basic, yet powerful heuristic algorithm. With K-means, as with several other types of clustering algorithms, we pre-specify a number $K$, determining the number of clusters that we wish to find. Hence $K$ may be treated as a hyper-parameter. As the algorithm seeks the function $c(\cdot)$, or alternatively the partition $C_1, \ldots, C_K$, it also seeks representative centers (also known as centroids) of the clusters, denoted by $J_1, \ldots, J_K$, each an element of $\mathbb{R}^p$. One may view the ideal aim of K-means as minimization of,

$$\text{Clustering loss} = \sum_{\ell=1}^{K} \sum_{x \in C_\ell} \|x - J_\ell\|^2. \tag{2.39}$$

Such a minimization is generally computationally intractable since it requires considering all possible partitions of $\mathcal{D}$ into clusters. Yet it can be approximately minimized via the K-means algorithm using a classic iterative approach. The K-means algorithm does this by separating the problem into two sub-problems or sub-tasks called mean computation and labelling. We define these now.

Mean computation: Given $c(\cdot)$, or a clustering $C_1, \ldots, C_K$ where $|C_\ell|$ denotes the number of elements in cluster $\ell$, find $J_1, \ldots, J_K$ that minimizes (2.39) via,

$$J_\ell = \frac{1}{|C_\ell|} \sum_{x \in C_\ell} x, \quad \text{for } \ell = 1, \ldots, K. \tag{2.40}$$

Here each $J_\ell$ is the vector obtained via the element-wise average over all the vectors in $C_\ell$, where each of the $p$ coordinates is averaged separately.

Labelling: Given $J_1, \ldots, J_K$ and assuming these values are fixed, find $c(\cdot)$ that minimizes (2.39) for every $x \in \mathcal{D}$. This is done by setting,

$$c(x) = \operatorname*{argmin}_{\ell \in \{1, \ldots, K\}} \|x - J_\ell\|. \tag{2.41}$$

That is, the label of each element is determined by the closest center in Euclidean space.

Figure 2.12: Workflow of the K-means algorithm on synthetic data with $K = 3$. The left column is the mean computation step and the right column is the labelling step. Each row is an iteration. Initialization is presented in the top left corner, and after three iterations the algorithm converges with the output presented in the bottom right corner.

The K-means algorithm starts with randomly initialized centers $J_1, \ldots, J_K$. A labelling step is then executed, in which these initial random centers are used to determine initial labels according to (2.41). The algorithm then alternates between the mean computation step (2.40) and the labelling step (2.41), one after the other. This is done until no more changes are made to the labels and the means. Such an iteration generally does not find the absolute minimum of the objective (2.39); however, the approximation found is often satisfactory.
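The alternation between labelling and mean computation can be sketched as follows in Python with NumPy. This is a minimal illustration and not an optimized implementation. The function name `k_means`, the choice of initializing the centers as randomly selected data points, the handling of empty clusters, and the synthetic data are all assumptions made here.

```python
import numpy as np

def k_means(X, K, rng, max_iter=100):
    """Basic K-means: alternate labelling (2.41) and mean computation (2.40)."""
    n = X.shape[0]
    # Initialization: K distinct data points chosen at random (one possible choice).
    centers = X[rng.choice(n, size=K, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Labelling step: assign each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = np.argmin(dists, axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # labels unchanged: converged
        labels = new_labels
        # Mean computation step: each center becomes the mean of its cluster.
        for l in range(K):
            members = X[labels == l]
            if len(members) > 0:                    # keep the old center if a cluster empties
                centers[l] = members.mean(axis=0)
    # Clustering loss: sum of squared distances to the assigned centers.
    loss = float(sum(np.sum((X[labels == l] - centers[l]) ** 2) for l in range(K)))
    return labels, centers, loss

# Synthetic data: three blobs in the plane (illustrative values).
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
X = np.vstack([c + 0.1 * rng.normal(size=(30, 2)) for c in blob_centers])
labels, centers, loss = k_means(X, K=3, rng=rng)
```

Since the result depends on the random initialization, one often runs such a routine several times and keeps the run with the smallest loss.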

In Figure 2.12 we illustrate the workflow of the algorithm on synthetic data where we choose $K = 3$. The top left plot is the initialization with three random means represented by the black circle, the red triangle, and the green cross. Then each row of Figure 2.12 represents one iteration of the algorithm, where the plots in the left column show the output of mean computation (except for the first row, which is initialization), and the plots in the right column show the output of labelling.

Footnote: The initial centers may be randomly selected elements of $\mathcal{D}$ or some other random set of vectors.

Note that in practice when presented with data $\mathcal{D}$, one typically first standardizes the data as presented in Section 2.1. Then the process of selecting the hyper-parameter $K$, which is external to the K-means algorithm, is carried out. One way to do so is to run K-means for increasing values of $K$ and seek a knee point or elbow when plotting (2.39) as a function of $K$. As $K$ increases the objective (2.39) generally decreases; however, beyond a certain $K$ the value of adding further clusters quickly diminishes. In some cases, such as the pixel segmentation we present below, subjective visual measures can be used to find the most appropriate $K$.

Image Segmentation with K-means

We have already briefly discussed image segmentation in Section 1.1; see Figure 1.2 in that section. As discussed, the goal of image segmentation is to label each pixel of an image with a unique class from a finite number of classes. In Chapter 6 we briefly describe a supervised approach called semantic image segmentation which uses labeled data, namely class masks in addition to the image for training. Nevertheless, in the absence of such information one may still carry out unsupervised image segmentation. One way to carry out this task is to use the K-means clustering algorithm where each pixel of the image is considered a point in $\mathbb{R}^p$ and the dimension of each point is typically $p = 3$ (red, green, and blue) for color images. This can produce impressive image segmentation without any other information except for the image.

Figure 2.13 presents the segmentation of a color image where K-means is used for grouping the pixels into $K$ different clusters. The color image in Figure 2.13 (a) is an $n = 640 \times 640 = 409{,}600$ pixel color image ($p = 3$). The segmentation consists of running the K-means algorithm, which groups similar pixels based on their attributes and assigns the attributes of the corresponding cluster center to the pixel in the image. Figure 2.13 (b) presents the result of the segmentation using $K = 6$ and Figure 2.13 (c) does so with $K = 2$.

Figure 2.13: Unsupervised image segmentation using K-means. (a) Original image. (b) $K = 6$. (c) $K = 2$.

Matrices in Unsupervised Learning

We often organize the data $\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}$ in the data matrix $X_{\text{raw}}$, similar to the design matrix (2.10), by stacking each observation vector $x^{(j)}$ in a separate row. The difference between the design matrix and $X_{\text{raw}}$ is that the latter does not have a first column of 1s. Thus $X_{\text{raw}}$ is an $n \times p$ matrix where the $i$th column has the data samples for feature $i$.

It is useful to de-mean the data by defining the centered data matrix,

$$X = X_{\text{raw}} - \mathbf{1}\, \bar{x}^\top, \tag{2.42}$$

where we (re)use the notation $X$ for the matrix previously used for the design matrix and where $\mathbf{1}$ is a column vector of 1s of length $n$. In the centering process, the $p$ dimensional vector $\bar{x}$ has coordinates $\bar{x}_i$ which are sample means of the features as defined in (2.1). That is, for each column (feature) in $X_{\text{raw}}$ we subtract the mean of the feature. Thus the new $n \times p$ matrix $X$ has features that each have a sample mean of 0.

An important matrix for such data is the $p \times p$ sample covariance matrix,

$$S = \frac{1}{n} X^\top X. \tag{2.43}$$

Written in scalar form, the $(i, j)$th element of the symmetric matrix $S$ is,

$$S_{i,j} = \frac{1}{n} \sum_{k=1}^{n} \big(x_i^{(k)} - \bar{x}_i\big)\big(x_j^{(k)} - \bar{x}_j\big),$$

and it estimates the covariance between feature $i$ and feature $j$. On the diagonal of $S$, the entry $S_{i,i}$ equals the sample variance (2.1). The off diagonal entries account for the measure and direction of linear dependence between features.

We note that the sample covariance matrix can be further normalized to a sample correlation matrix by dividing each $(i, j)$th entry by the product of the sample standard deviations, $\sqrt{S_{i,i}}\,\sqrt{S_{j,j}}$. Sample correlation matrices are important for multivariate descriptive statistics, yet we do not use them explicitly now. Our focus is rather on PCA which we introduce in terms of the de-meaned data matrix $X$ and the sample covariance matrix $S$.
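The centering of the data and the computation of the sample covariance and correlation matrices can be sketched as follows in Python with NumPy. The synthetic data and the variable names, including `X_raw` for the uncentered data matrix, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 3
# Synthetic uncentered data: each column is a feature with its own mean and scale.
X_raw = rng.normal(loc=[5.0, -2.0, 0.0], scale=[1.0, 2.0, 0.5], size=(n, p))

# Centered data matrix: subtract the vector of feature means from every row.
x_bar = X_raw.mean(axis=0)
X = X_raw - np.ones((n, 1)) @ x_bar[None, :]

# Sample covariance matrix with 1/n normalization.
S = X.T @ X / n

# Sample correlation matrix: divide entry (i, j) by the product of standard deviations.
std = np.sqrt(np.diag(S))
R = S / np.outer(std, std)
```

The matrix `S` agrees with `np.cov(X_raw, rowvar=False, bias=True)`, where `bias=True` selects the $1/n$ normalization rather than $1/(n-1)$.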

Principal Component Analysis

It is often the case that not all $p$ dimensions of the data are equally useful. This is especially the case in the presence of high dimensional data (large $p$). Moreover, many features may be either completely redundant or uninformative. These cases are referred to as correlated features or noise features respectively. In such cases and others, principal component analysis (PCA) is often employed. It is a well-known and widely used dimensionality reduction technique applicable to a wide variety of applications such as data compression, feature extraction, and visualization.

The basic idea of PCA is to project each point of $\mathcal{D}$, which has many correlated coordinates, onto fewer coordinates, called principal components, which are uncorrelated. This is done while still retaining most of the variability present in the data. In this setting, PCA offers a low-dimensional representation of the features that attempts to capture the most important information from the data. The principal components found via PCA are a new reduced set of features, indexed by $1, \ldots, m$ where $m < p$ is some specified lower dimension. For visualization we often take $m = 2$ or $m = 3$. In other applications, such as the integration of PCA as part of other machine learning procedures, $m$ is often calibrated as a hyper-parameter.

Footnote: In a statistical context one often uses $n - 1$ in the denominator instead of $n$. For non-small $n$ this distinction is insignificant. See a similar comment in relation to (2.1).

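As input, PCA uses the de-meaned data from the centered data matrix $X$ of (2.42), where we denote by $x_{(i)}$ the $i$th column of $X$ (corresponding to a vector of feature $i$ for all observations). PCA uses a linear combination of these columns to arrive at the vectors of the new features $\tilde{x}_{(1)}, \ldots, \tilde{x}_{(m)}$. This can simply be represented as

$$\tilde{x}_{(i)} = v_{i,1}\, x_{(1)} + v_{i,2}\, x_{(2)} + \ldots + v_{i,p}\, x_{(p)} \quad \text{for } i = 1, \ldots, m,$$

where each new $n$ dimensional vector, $\tilde{x}_{(i)}$, is a linear combination of the original features. The coefficients of this linear combination can be organized in the vector $v_i = (v_{i,1}, \ldots, v_{i,p})$ which is called the loading vector for $i$. Thus $\tilde{x}_{(i)} = X v_i$. This can also be represented for all the reduced features and loading vectors together via,

$$\underbrace{\begin{bmatrix} \tilde{x}_{(1)} & \cdots & \tilde{x}_{(m)} \end{bmatrix}}_{\substack{n \times m \\ \text{Reduced data}}} = \underbrace{\begin{bmatrix} x_{(1)} & \cdots & \cdots & x_{(p)} \end{bmatrix}}_{\substack{n \times p \\ \text{Original de-meaned data}}} \times \underbrace{\begin{bmatrix} v_1 & \cdots & v_m \end{bmatrix}}_{\substack{p \times m \\ \text{Matrix of loading vectors}}}. \tag{2.44}$$

Denoting the reduced data matrix by $\tilde{X}$ and the matrix of loading vectors by $V$, the relationship (2.44) reads $\tilde{X} = XV$.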

It turns out that a very useful way to represent the loading vectors $v_1, \ldots, v_m$ is by normed eigenvectors associated with eigenvalues of the sample covariance matrix $S$ as in (2.43). Specifically, since $S$ is symmetric and positive semidefinite, the eigenvalues of $S$ are real and non-negative, a fact which allows us to order them via $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p \ge 0$. We then pick the loading vector $v_i$ to be a normed eigenvector associated with $\lambda_i$, namely,

$$S v_i = \lambda_i v_i, \tag{2.45}$$

while keeping in mind that the first loading vector is associated with the highest eigenvalue; the second is associated with the second highest eigenvalue; and so forth. The symmetry of $S$ also means that its eigenvectors can be chosen to be orthogonal and hence $V$ is a matrix with orthonormal columns. In this setting we assume that at least the first $m$ eigenvalues are strictly positive, namely $\lambda_m > 0$.

In the subsection below we derive the main result to show why this choice of loading vectors based on eigenvectors is attractive. At this point let us consider a numerical example.

We return to the Wisconsin breast cancer data used in Section 2.2 where $p = 30$ and $n = 569$. To visualize this data using PCA we set $m = 2$ and compute the first two loading vectors, forming the $30 \times 2$ matrix $V$, using standard numerical procedures for eigenvalues and eigenvectors. Then by multiplying the $569 \times 30$ de-meaned data matrix $X$ by $V$ as in (2.44) we get the $569 \times 2$ matrix $\tilde{X}$ of principal components. We then plot each row, which is 2 dimensional, in Figure 2.14 (a).
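The eigenvector based computation of principal components can be sketched as follows in Python with NumPy. The sketch uses synthetic correlated data rather than the Wisconsin breast cancer data, and the dimensions and variable names are assumptions made for illustration. Note that `np.linalg.eigh` returns eigenvalues in ascending order, so they are reordered to match the decreasing convention used here.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, m = 300, 5, 2
# Synthetic correlated features: mix independent columns with a random matrix.
X_raw = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

X = X_raw - X_raw.mean(axis=0)          # de-meaned data matrix
S = X.T @ X / n                         # sample covariance matrix

# Eigen-decomposition of the symmetric matrix S, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]
V = eigvecs[:, order[:m]]               # p x m matrix of the first m loading vectors

X_tilde = X @ V                         # n x m matrix of principal components
S_tilde = X_tilde.T @ X_tilde / n       # sample covariance of the reduced data
```

The matrix `S_tilde` is diagonal (up to numerical error) with the top $m$ eigenvalues on its diagonal, which illustrates that the principal components are uncorrelated and that their sample variances are the eigenvalues.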

On their own, the two dimensional points in Figure 2.14 (a) may not be insightful. After all, the principal component coordinates pc1 and pc2 do not have any physical meaning in this context. Nevertheless, if we consider Figure 2.14 (b), where we frame this as a supervised learning problem and color the points based on the labels benign vs. malignant, a useful pattern emerges. There is quite a clear separation between the two classes and hence there is potential to classify points by separating the regions in the principal components plane. We do not discuss concrete examples of constructing a classifier in this case. We rather point out that the data following a PCA transformation with reduced dimension $m < p$ can often be used as input to a supervised learning algorithm.

Figure 2.14: Breast tumor data samples projected on the first two principal components from the PCA. (a) Unlabeled data. (b) Once the label is added to each sample, a pattern and separation between benign vs. malignant appears.

Derivation of PCA

The PCA framework tries to project the data in the directions with maximum variance. Returning to (2.44), since $\tilde{x}_{(i)} = X v_i$ we can formulate this by maximizing the sample variance of the components of $\tilde{x}_{(i)}$. Keeping in mind that $\tilde{x}_{(i)}$ is a 0 mean vector, its sample variance using (2.1) is simply $\tilde{x}_{(i)}^\top \tilde{x}_{(i)} / n$. Hence substituting $\tilde{x}_{(i)} = X v_i$, we have,

$$\text{Sample variance of component } i = \frac{1}{n}\, v_i^\top X^\top X v_i = v_i^\top S v_i,$$

where $S$ is the sample covariance of the data as in (2.43). Thus in searching for the first loading vector $v_1$ we have the optimization problem,

$$\max_{v \in \mathbb{R}^p}\; v^\top S v, \quad \text{subject to } \|v\| = 1. \tag{2.46}$$

Note the constraint which seeks a normalized direction $v$ with $\|v\| = 1$, which is equivalent to $v^\top v = 1$. This representation allows us to use Lagrange multiplier techniques where we convert this constrained problem to an unconstrained quadratic problem. The objective is then,

$$\max_{v,\,\lambda}\; v^\top S v + \lambda\,(1 - v^\top v) \qquad \text{or} \qquad \max_{v,\,\lambda}\; v^\top S v - v^\top \lambda v + \lambda, \tag{2.47}$$

with the Lagrange multiplier $\lambda$ and the constraint $1 - v^\top v = 0$. Now taking derivatives with respect to the vector $v$ (see also Appendix A) we obtain $2Sv - 2\lambda v$. Thus the first-order conditions in terms of $v$ reduce to the eigenvalue problem $Sv = \lambda v$. This means that any normalized eigenvector $v$ adheres to the first-order conditions for the optimization problem (2.46).

By multiplying the eigenvalue equation by $v^\top$ from the left we get $v^\top S v - v^\top \lambda v = 0$. As apparent from the representation on the right hand side of (2.47), this means that any normalized eigenvector $v$ yields a maximization objective which is equal to the corresponding eigenvalue $\lambda$. Hence the optimization problem is solved by choosing the maximal eigenvalue $\lambda_1$ with $v_1$ being an associated normalized eigenvector, $S v_1 = \lambda_1 v_1$, as in (2.45).

The subsequent directions $v_i$ for $i = 2, \ldots, m$ are chosen by maximizing the variance of new linear combinations which are orthogonal to previous ones. That is, the directions capture the part of variance which has not been previously captured. It can be shown that a normalized eigenvector which matches the second eigenvalue, $v_2$, maximizes the variance once the direction of $v_1$ is removed. This then continues for $i = 3, \ldots, m$, and in summary the principal components are determined via (2.45).
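The variance maximization property can be checked numerically. The following minimal sketch in Python with NumPy, on synthetic data with assumed dimensions and names, compares the sample variance $v^\top S v$ for many random unit directions against the top eigenvalue, and then repeats the comparison against the second eigenvalue after the direction of the first eigenvector is removed.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # synthetic correlated data
X = X - X.mean(axis=0)
S = X.T @ X / n

eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
lam1, v1 = eigvals[-1], eigvecs[:, -1]    # largest eigenvalue and its eigenvector
lam2 = eigvals[-2]                        # second largest eigenvalue

# Random unit directions (columns) and their sample variances v^T S v.
V_rand = rng.normal(size=(p, 1000))
V_rand /= np.linalg.norm(V_rand, axis=0)
variances = np.sum(V_rand * (S @ V_rand), axis=0)

# Remove the component along v1, renormalize, and recompute the variances.
V_orth = V_rand - np.outer(v1, v1 @ V_rand)
V_orth /= np.linalg.norm(V_orth, axis=0)
variances_orth = np.sum(V_orth * (S @ V_orth), axis=0)
```

No random direction attains a variance above `lam1`, and no direction orthogonal to `v1` attains a variance above `lam2`, in agreement with the derivation above.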

PCA Through SVD

We have already used the singular value decomposition (SVD) in Section 2.3 in the context of the Moore-Penrose pseudo-inverse. We now further revisit the construction of SVD from linear algebra and see the relationship between SVD and PCA.

Any $n \times p$ dimensional matrix $X$ of rank $r$ can be represented as

$$X = U \Delta V^\top = \sum_{i=1}^{r} \delta_i\, u_i v_i^\top, \quad \text{with} \quad \Delta = \operatorname{diag}(\delta_1, \ldots, \delta_r), \quad \text{and} \quad \delta_i > 0. \tag{2.48}$$

Here the $n \times r$ matrix $U$ and the $p \times r$ matrix $V$ both have orthonormal columns, denoted $u_i$ and $v_i$ respectively for $i = 1, \ldots, r$. These columns are called the left and right singular vectors respectively. The values $\delta_i$ in the $r \times r$ diagonal matrix $\Delta$ are called singular values and are ordered as $\delta_1 \ge \delta_2 \ge \cdots \ge \delta_r > 0$. Note that this representation of the SVD differs from the one employed near (2.17) in Section 2.3 where $U$ and $V$ were taken as square matrices and $\Delta$ was not necessarily square. The form used in Section 2.3 is sometimes called the full SVD and the form we present here is called the reduced SVD.

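Consider now $X$ again as the de-meaned data matrix (2.42). Now using its SVD representation in the sample covariance (2.43) we obtain

$$S = \frac{1}{n}\, \underbrace{V \Delta U^\top}_{X^\top}\; \underbrace{U \Delta V^\top}_{X} = \frac{1}{n}\, V \Delta^2 V^\top, \quad \text{with} \quad \Delta^2 = \operatorname{diag}(\delta_1^2, \ldots, \delta_r^2).$$

Here the fact that $U$ has orthonormal columns implies that $U^\top U$ is the $r \times r$ identity matrix and hence it cancels out. Hence,

$$S = \frac{1}{n} \sum_{i=1}^{r} \delta_i^2\, v_i v_i^\top. \tag{2.49}$$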

We can now compare to the eigenvector based representation of PCA where $V$ is the matrix of PCA loading vectors as in (2.44). Take $m = r$ and denote by $\Lambda$ the diagonal matrix with diagonal entries as the eigenvalues of $S$ in decreasing order $\lambda_1 \ge \ldots \ge \lambda_r > 0$. Now using (2.45) we have the spectral decomposition of $S$,

$$S = V \Lambda V^\top = \sum_{i=1}^{r} \lambda_i\, v_i v_i^\top. \tag{2.50}$$

We now compare (2.49) and (2.50) and see that with $\lambda_i = \delta_i^2 / n$ the loading vectors in (2.50) are the right singular vectors in (2.49). That is, the PCA matrix of loading vectors $V$ coincides with the SVD matrix of right singular vectors $V$.

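Further, to obtain the data matrix of principal components, as in (2.44), we set $\tilde{X} = XV$. Now using the SVD representation of $X$ in (2.48) and assuming $m = r$, PCA can be represented as,

$$\tilde{X} = \underbrace{U \Delta V^\top}_{X}\, V = U \Delta = \begin{bmatrix} | & & | \\ \delta_1 u_1 & \cdots & \delta_r u_r \\ | & & | \end{bmatrix}. \tag{2.51}$$

That is, each column of the reduced data matrix $\tilde{X}$ is a left singular vector $u_i$ stretched by the singular value $\delta_i$. Further, for $m < r$ we only take the first $m$ columns.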

With these relationships between PCA and SVD, numerical methods for computing the SVD of $X$ can be used for PCA. Indeed in practice, efficient and numerically robust computational methods for SVD are employed for PCA.
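The relationship between the eigenvector based and SVD based routes to PCA can be verified numerically. The following is a minimal sketch in Python with NumPy on synthetic data; the dimensions and variable names are assumptions. Eigenvectors and singular vectors are only determined up to sign, so the comparison of directions is made up to sign.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 150, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # synthetic correlated data
X = X - X.mean(axis=0)                                  # de-meaned data matrix
S = X.T @ X / n                                         # sample covariance matrix

# Route 1: eigen-decomposition of S, reordered to decreasing eigenvalues.
eigvals, eigvecs = np.linalg.eigh(S)
lam = eigvals[::-1]
V_eig = eigvecs[:, ::-1]

# Route 2: reduced SVD of X, with singular values returned in decreasing order.
U, delta, Vt = np.linalg.svd(X, full_matrices=False)
V_svd = Vt.T

# Principal components: X V equals U with column i scaled by delta_i.
X_tilde = X @ V_svd
X_tilde_from_U = U * delta
```

Here `lam` agrees with `delta ** 2 / n`, and the columns of `V_eig` and `V_svd` agree up to sign.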

SVD for Compression

The singular value decomposition can also be viewed as a means for compressing any matrix $X$. Specifically, consider the SVD representation in (2.48) with $\delta_1 \ge \delta_2 \ge \ldots \ge \delta_r$. Then a rank $m < r$ approximation of $X$ is,

$$\hat{X} = \sum_{i=1}^{m} \delta_i\, u_i v_i^\top \approx X, \quad \text{where} \quad X - \hat{X} = \sum_{i=m+1}^{r} \delta_i\, u_i v_i^\top. \tag{2.52}$$

The rank of $\hat{X}$ is $m$, and since one often uses $m$ significantly smaller than $r$, this is called a low rank approximation. For small enough $\delta_{m+1}$ the approximation error is negligible since the summation of rank one matrices $\delta_i u_i v_i^\top$ for $i = m+1, \ldots, r$ is small. Observe that the number of values used in this representation of $\hat{X}$ is $m \times (1 + n + p)$, and for small $m$ this number is generally much smaller than $n \times p$, which is the number of values in $X$. Hence this may be viewed as a compression method.

The usefulness of such low rank approximations is validated by a theoretical result called the Eckart-Young-Mirsky theorem. Here we consider the approximation-error matrix $X - \hat{X}$ and we seek to have the best rank $m$ approximation in terms of minimization of $\|X - \hat{X}\|$. The theorem works for several types of matrix norms, yet here let us focus on the Frobenius norm, denoted $\|A\|_F$ for any matrix $A$. We now have for the Frobenius norm,

$$\min_{\hat{X} \text{ of rank } m} \|X - \hat{X}\|_F^2 = \Big\| X - \sum_{i=1}^{m} \delta_i\, u_i v_i^\top \Big\|_F^2 = \sum_{i=m+1}^{r} \delta_i^2. \tag{2.53}$$

Singular value decomposition based matrix approximations such as (2.52) are useful in multiple domains including reduction of neural network model size. We do not discuss these topics specifically in this book. Instead, consider a simple visual example with a $353 \times 469$ monochrome (grayscale) image appearing at the bottom right of Figure 2.15; this is $X$. Then the other images in Figure 2.15 are $\hat{X}$ with $m = 10$, $m = 30$, and $m = 50$. As is evident, the $m = 50$ approximation appears close to the original image. The main plot in the figure is the relative approximation error as given by the right hand side of (2.53) divided by the sum of all singular values squared.

Figure 2.15: SVD for data compression: the original image alongside compressions using $m = 10$, $m = 30$, and $m = 50$ singular values. The images are presented in terms of the relative approximation error based on the Frobenius norm.

Footnote: The Frobenius norm is the square root of the sum of the squared elements of the matrix.

Note that the original image uses $353 \times 469 = 165{,}557$ values while the $m = 50$ approximation only uses $50 \times (1 + 353 + 469) = 41{,}150$ values. That is, the approximation yields $\hat{X}$ which is compressed to about 25% of the size of $X$ and looks very similar.
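The low rank approximation and the error identity of the Eckart-Young-Mirsky theorem can be sketched as follows in Python with NumPy. A synthetic matrix stands in for the image, and the function name `low_rank_approximation` and all dimensions are assumptions made for illustration.

```python
import numpy as np

def low_rank_approximation(X, m):
    """Rank m approximation from the leading m singular triplets (hypothetical helper)."""
    U, delta, Vt = np.linalg.svd(X, full_matrices=False)
    X_hat = (U[:, :m] * delta[:m]) @ Vt[:m, :]     # sum of m rank one matrices
    return X_hat, delta

rng = np.random.default_rng(7)
n, p, m = 60, 40, 5
# Synthetic matrix that is approximately of rank 8, plus small noise.
X = rng.normal(size=(n, 8)) @ rng.normal(size=(8, p)) + 0.01 * rng.normal(size=(n, p))
X_hat, delta = low_rank_approximation(X, m)

# Squared Frobenius error equals the sum of the discarded squared singular values.
squared_error = np.linalg.norm(X - X_hat, "fro") ** 2
tail = np.sum(delta[m:] ** 2)
relative_error = tail / np.sum(delta ** 2)

# Storage counts: m * (1 + n + p) values versus n * p values.
values_stored = m * (1 + n + p)
values_book_example = 50 * (1 + 353 + 469)         # the image example with m = 50
```

The variable `relative_error` corresponds to the quantity plotted in Figure 2.15, namely the discarded squared singular values divided by the sum of all squared singular values.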

Note that variants of the types of plots as in Figure 2.15 are also common when carrying

out PCA. In that context the plot is called a scree plot and it presents the percentage of

variance explained by the principal components.


Notes and References

One does not need to master all other branches of machine learning to understand deep learning; nevertheless, getting a taste for key elements of the field is useful. Beyond the basics that we presented in this chapter, one may consult several general machine learning texts. We recommend [240] for a comprehensive mathematical account of practical machine learning and we recommend the more classic [ ] as an additional resource. Further, the book [299] provides a probabilistic approach. Focusing on linear algebra, the introductory book [ ] is a good introduction to foundations such as K-means, least squares, and ridge regression. Further, [391] provides a richer context covering PCA, SVD, and many aspects of matrix algebra appearing in machine learning. Finally, for a short read which provides an overview of many practical aspects of machine learning, see [ ]. An additional recommended reference is [263].

The worlds of machine learning and statistical inference are intertwined, and methods developed in one field are often used in the other. For those with expertise in one or both of the fields it is quite easy to spot the differences between the approaches; however, for those entering these worlds afresh it may be helpful to read the survey paper “Statistical modeling: The two cultures” by Leo Breiman, [ ]. On that note, to get a feel for many statistical aspects of linear regression, see, e.g., [296] or one of many other statistical books. Note that [296] is also a good reference for understanding interaction terms, a concept that we mentioned in the chapter and did not cover. A general text that integrates methodology and algorithms with statistical inference and machine learning together with speculations of future directions is [115].

Throughout this chapter we have made reference to several aspects of statistics or machine learning that are not studied further in this book. Here are some references for each. In general, a good reference for likelihood based inference is [ ]. Specifically, the Akaike information criterion (AIC), introduced in [ ], and the Bayesian information criterion (BIC) are surveyed in [432]. A general class of models also appearing in the next chapter is generalized linear models (GLMs); these first appeared in [304] and a good contemporary applied reference is [121]. Other models are generalized additive models (GAMs), which extend generalized linear models by modelling some predictor variables with smooth functions; see [168]. In terms of non-linear regression, the LOESS method is a generalization of moving average and polynomial regression; see [ ]. Further, Nadaraya-Watson kernel regression is a non-parametric regression method in which a kernel function is exploited; see [378].

We have covered the basics of decision theory via binary classification; however, there are many more studies of these aspects. See the comprehensive survey [118] on metrics for binary classification as well as [158]. For a discussion of different uses of receiver operating characteristic (ROC) curves and different approaches for them see [ ] and [315]. The origins of the $F_1$ score can be attributed to Cornelis Joost van Rijsbergen, who introduced the effectiveness function of which the $F_1$ score is a special case; see [408]. The SMOTE method for dealing with unbalanced data is from [ ]. See also the surveys [153], [211], and [348].

We briefly mentioned the differences between discriminative and generative learning. More on the topic is in chapter 9 of [299] together with a treatment of the naive Bayes classifier and linear discriminant analysis (LDA). The area of support vector machines became extremely popular in the world of machine learning, with their height of popularity during the 1990s and the decade that followed. A complete treatment of these methods is in [240] together with associated ideas of kernel methods. Specific to this area is the concept of VC dimension (standing for Vapnik–Chervonenkis) which we did not cover here; see [409].

Decision trees are also very popular machine learning techniques; see chapter 8 of [240] for an overview. Within the study of machine learning, generic methods of boosting and bagging are prominent in the context of decision trees. Specifically see [366] and [ ]. The random forest algorithm is one such method that has been hailed as the most usable ad-hoc generic method when there is no further information about the problem; see [ ]. Gradient boosting has become very popular due to a software package called XGBoost; see [ ]. The K-nearest neighbours classification algorithm that we mention is often used as an introductory example. See for example Section 2.3 of [166].

The origins of least squares fitting are from the turn of the 19th century, initially with applications to astronomy. The first least squares publication is typically attributed to an 1809 paper by Gauss [132] although an earlier 1805 publication by Legendre publicized the concept first. An interesting historical investigation into “who invented least squares” is in [389]. Since then, least squares methods have become some of the most prominent tools in applied mathematics. The Moore-Penrose pseudo inverse was independently described in [297], [ ], and [328]. Singular value decomposition (SVD) has origins in differential geometry with the first linear algebra publication typically associated with the 1936 Eckart-Young paper [114]. To the best of our knowledge the first association between SVD and least squares is in [139]. The survey [105] may also be of interest as it contrasts different numerical methods for least squares.

Aspects of multi-collinearity are treated in many statistical contexts; see for example [281]. Regression with other loss functions such as the absolute error loss (for robustness) is covered in [302], and for a reference on the Huber error loss see [197]. See also [160] for a discussion on dealing with categorical input features by conversion to numbers.

The origins of gradient descent are attributed to Cauchy from 1847 with [ ], well before the invention of any digital electronic computer; see [253] for a historical account. The loss landscape in machine learning has been analyzed multiple times; see for example [279] for a survey, and [284] for a theoretical result in a high dimensional context.

We have only touched the tip of the iceberg in terms of model selection. See [390] for a survey as well as the book [ ]. There are also recent developments such as [ ] dealing with other approaches for balancing underfitting and overfitting or bias and variance. In general, model selection is still a very active open avenue of research. Our discussion surrounding the generalization gap follows similar lines to [263]. Our example in Figure 2.10 is inspired by a similar example in [39].

For an excellent reference dealing with the LASSO method see [167]. The original Tikhonov regularization (ridge regression) technique appeared in 1943 in [401] yet it is believed to have been invented in parallel in other contexts as well. Elastic net models are more of a speciality; see [453] and also relations to support vector machines in [452]. Generalizations, variants, and discussion of issues arising with K-fold cross validation are in [424] and [232]. See also [215] and [ ] for recent developments as well as [443] for stratified cross-validation and variants.

The accepted first reference for K-means clustering is [277] from 1967 although the method was known prior. An applied book to understand the main concepts and algorithms for cluster analysis is [226]. A recent comprehensive survey about clustering adding to what we presented here is in [120]. Further clustering approaches include hierarchical clustering; see [300]. See also an older general survey in [208]. Principal component analysis was first proposed by Pearson in [326] with initial ideas also attributed to Hotelling in [189]. A substantial book on the topic is [212]. Relationships to SVD are well explained in [391] and the first appearance of the Eckart-Young-Mirsky theorem for SVD was in [114] by Eckart and Young. It was further independently extended by Mirsky in [292]. One popular efficient method for numerical computation of SVD is the so called Golub-Reinsch algorithm first introduced in [140]. A further overview of additional SVD algorithms is in [89].

See also https://losslandscape.com/ for a visual presentation of different loss landscapes.