Mathematical Engineering
of Deep Learning
Benoit Liquet, Sarat Moka and Yoni Nazarathy
March 3, 2026
3 Simple Neural Networks
In the previous chapter we explored machine learning in general and also focused on linear models trained via gradient-based optimization. In this chapter we move up a notch and consider logistic regression and multinomial (softmax) regression models used for regression and classification. These models often play a key role in statistical modeling and are also very important from a deep learning perspective. Logistic regression and multinomial regression models are shallow neural networks, but they also involve non-linear activation functions, and hence understanding their structure and means of training is a gateway to understanding general deep learning networks.
In exploring logistic and multinomial regression, we are introduced to some of the basic mathematical engineering elements of deep learning. These include the sigmoid activation function, the cross-entropy loss, the softmax function, and the convexity properties that these models enjoy. We also use the opportunity to consider simple but non-trivial autoencoders that generalize PCA, introduced in the previous chapter.
In Section 3.1 we consider the statistical viewpoint of the logistic regression model. In the
process we see how binary cross entropy loss results from maximum likelihood estimation. In
Section 3.2 we consider the same model, only this time as a shallow neural network trained
via gradient descent. In Section 3.3 we adapt the model to multi-class classification. Here we
introduce the softmax function as well as the categorical cross entropy loss. In Section 3.4
we investigate feature engineering for such models and see how additional features extend the resulting classifiers to have non-linear decision boundaries. In Section 3.5 we move on to unsupervised learning and consider simple non-linear autoencoder models. In that section we also discuss general autoencoder concepts and describe a few applications of autoencoders. Note that a reader returning to Section 3.5 after reading Chapter 5 may also envision how the simple shallow autoencoders that we present there can be generalized to deep autoencoders.
3.1 Logistic Regression in Statistics
Logistic regression can be viewed as the simplest non-linear neural network model. However,
outside of the context of deep learning, logistic regression is a very popular statistical model.
In this section we present logistic regression via a statistical viewpoint. We show how to
estimate parameters via maximum likelihood estimation and in the process are introduced to
the (binary) cross entropy loss function common in deep learning models including logistic
regression, but also beyond.
Note that the statistical view of logistic regression also incorporates parameter uncertainty intervals, hypothesis tests, and other statistical inference aspects. Literature for such statistical inference aspects of logistic regression is provided at the end of the chapter.
The Model
Logistic regression is a model for predicting the probability that a binary response is 1. It is suitable for classification tasks, a concept described in Section ??, as well as for prediction of proportions or probabilities. From a statistical perspective, it is defined by assuming that the distribution of the binary response variable, $y$, given the features, $x$, follows a Bernoulli distribution¹ with success probability $\phi(x)$ that is defined below. That is, if we represent the random variables of the feature vector and the (scalar) response via $X$ and $Y$ respectively, then
$$\mathbb{P}(Y = 1 \mid X = x) = \phi(x) \qquad \text{and} \qquad \mathbb{P}(Y = 0 \mid X = x) = 1 - \phi(x). \tag{3.1}$$
To capture the relationship between $x$ and $\phi(x)$, the model uses the odds of $\phi(x)$, namely $\phi(x)/\bigl(1 - \phi(x)\bigr)$, via the log odds represented as the logit function,
$$\text{Logit}(u) = \log \frac{u}{1 - u}. \tag{3.2}$$
Using this notation, the logistic regression model assumes a linear (affine) relationship between the feature vector $x$ and the log odds of $\phi(x)$. Namely,
$$\text{Logit}\bigl(\phi(x)\bigr) = b + w^\top x. \tag{3.3}$$
Here for feature vector $x \in \mathbb{R}^p$, the parameter space is $\Theta = \mathbb{R} \times \mathbb{R}^p$ and the parameters $\theta \in \Theta$ are denoted via $\theta = (b, w)$. In this case, the number of parameters is $d = p + 1$, similarly to the linear regression models of Section ??. Like the linear regression model introduced in Section ??, the scalar parameter $b$ is called the intercept or bias and the vector parameter $w$ is called the regression parameter or weight vector.
The fact that (3.3) presents $\text{Logit}(\cdot)$ on the left hand side, in contrast to simply the expected response $y$ as one would have in linear regression, sets logistic regression as a non-trivial form of a generalized linear model (GLM). We do not discuss general GLMs further; nevertheless it is good to note that in GLM terminology, $\text{Logit}(\cdot)$ plays the role of a link function. See notes at the end of this chapter for details and references.
A closely related function to $\text{Logit}(\cdot)$, central to logistic regression and deep learning, is the sigmoid function, also called the logistic function. It is denoted $\sigma_{\text{Sig}}(\cdot)$ and is also discussed in Section ?? where it is plotted in Figure ??. Its expression is,
$$\sigma_{\text{Sig}}(u) = \frac{1}{1 + e^{-u}}. \tag{3.4}$$
It has domain $\mathbb{R}$, range $(0, 1)$, and is monotonically increasing. Note that the logit function and the sigmoid function are inverses. That is, $\sigma_{\text{Sig}}\bigl(\text{Logit}(u)\bigr) = \text{Logit}\bigl(\sigma_{\text{Sig}}(u)\bigr) = u$.
Applying $\sigma_{\text{Sig}}(\cdot)$ to the representation of the model in (3.3) we obtain the common representation of the logistic regression model,
$$\phi(x) = \mathbb{P}\bigl(Y = 1 \mid X = x \,;\, \theta = (b, w)\bigr) = \sigma_{\text{Sig}}(b + w^\top x) = \frac{1}{1 + e^{-(b + w^\top x)}}. \tag{3.5}$$
¹ This is a probability distribution of a random variable having only two outcomes, 0 or 1. It is a special case of the binomial distribution where the parameter for the “number of trials” is set at 1.
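As a quick numerical sketch of (3.4) and (3.5), the snippet below (with made-up illustrative parameter values, not fitted ones) checks that the logit and sigmoid functions are inverses and evaluates $\phi(x)$:

```python
import numpy as np

def sigmoid(u):
    """The sigmoid function sigma_Sig(u) = 1 / (1 + exp(-u)) of (3.4)."""
    return 1.0 / (1.0 + np.exp(-u))

def logit(u):
    """The log odds log(u / (1 - u)) of (3.2), the inverse of the sigmoid."""
    return np.log(u / (1.0 - u))

# The logit and sigmoid functions are inverses of one another.
u = np.linspace(-5, 5, 101)
assert np.allclose(logit(sigmoid(u)), u)

# The model (3.5): phi(x) = sigmoid(b + w^T x), with illustrative parameters.
b, w = -1.0, np.array([0.5, 2.0])
x = np.array([1.0, 0.25])
phi = sigmoid(b + w @ x)
print(phi)  # 0.5, since b + w^T x = 0 for these values
```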
Side Note: The Logistic Distribution
When considering the statistical logistic regression model, it is also interesting to represent the relationship between $X$ and $Y$ via a continuous latent variable, denoted via $Z$. A latent variable is an unobserved quantity. In the case of logistic regression, $Z$ allows us to use the linear model representation,
$$Z = b + w^\top X + \epsilon, \qquad \text{with} \qquad \begin{cases} Y = 1, & \text{if } Z \ge 0, \\ Y = 0, & \text{if } Z < 0. \end{cases} \tag{3.6}$$
Here the noise component $\epsilon$ follows a (standard) logistic distribution whose cumulative distribution function, $F_\epsilon(u) = \mathbb{P}(\epsilon \le u)$, is given by $F_\epsilon(u) = \sigma_{\text{Sig}}(u)$. Such a representation agrees with (3.5) since,
$$\begin{aligned}
\phi(x) = \mathbb{P}(Y = 1 \mid X = x \,;\, \theta) &= \mathbb{P}(Z \ge 0 \mid X = x \,;\, \theta) \\
&= \mathbb{P}(b + w^\top x + \epsilon \ge 0) \\
&= \mathbb{P}(b + w^\top x - \epsilon \ge 0) \\
&= \mathbb{P}(\epsilon \le b + w^\top x) = \sigma_{\text{Sig}}(b + w^\top x).
\end{aligned}$$
Note that in the step between the second line and the third line we use the fact that the (standard) logistic distribution of $\epsilon$ is symmetric about 0 and hence $\epsilon$ and $-\epsilon$ are equal in distribution.
We do not use the latent representation (3.6) any further. Yet it may be interesting to note that if one chooses to use a Gaussian distribution for $\epsilon$ in (3.6) in place of the logistic distribution, the model turns out to be a probit regression model instead of (3.3). Hence such a latent representation of the model is generally interesting.
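The latent representation (3.6) lends itself to a simple Monte Carlo check. The sketch below, with arbitrary illustrative parameters, draws logistic noise and compares the empirical frequency of $Y = 1$ with the sigmoid expression from (3.5):

```python
import numpy as np

rng = np.random.default_rng(0)
b, w = -0.5, np.array([1.5])   # illustrative parameters
x = np.array([0.8])

# Draw epsilon from a standard logistic distribution and form Z as in (3.6).
n = 200_000
eps = rng.logistic(loc=0.0, scale=1.0, size=n)
z = b + w @ x + eps
y = (z >= 0).astype(float)

# The empirical P(Y = 1 | X = x) should match sigma_Sig(b + w^T x) from (3.5).
empirical = y.mean()
theoretical = 1.0 / (1.0 + np.exp(-(b + w @ x)))
print(empirical, theoretical)  # close, up to Monte Carlo error
```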
Estimation Using the Maximum Likelihood Principle
When statistical assumptions are imposed, maximum likelihood estimation (MLE) is a very
common statistical method for estimating parameters. While most deep learning models in
this book do not use MLE or other statistical point estimation techniques directly, exploring
how MLE works for logistic regression is insightful.
Central to MLE is the likelihood function $L : \Theta \to \mathbb{R}$. It is a function of the parameters $\theta \in \Theta$, obtained as the probability (or probability density) of the data for any given parameter value $\theta$. The idea of MLE is to choose values of $\theta$ that maximize the likelihood (function). We now see this in action for the logistic regression model.
Consider the training observations $\mathcal{D} = \bigl\{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\bigr\}$, denoted in a similar way to the way $\mathcal{D}$ is defined for the linear model in Section ??. Here each label $y^{(j)}$ is encoded via either 0 or 1, and the feature vector $x^{(j)}$ is a $p$-dimensional vector in $\mathbb{R}^p$.
A common assumption in statistics and machine learning is that all feature-label pairs $(x^{(j)}, y^{(j)})$ are identically distributed and mutually independent; this is often denoted i.i.d. With this assumption, we aim to estimate the parameters $\theta = (b, w)$. The likelihood function
is
$$L(\theta \,;\, \mathcal{D}) = \prod_{i=1}^{n} \mathbb{P}\bigl(Y = y^{(i)} \mid X = x^{(i)} \,;\, \theta\bigr), \tag{3.7}$$
where $L(\theta \,;\, \mathcal{D})$ is viewed as a function of $\theta$ given the observed sample $\mathcal{D}$. The product is due to the independence assumption arising as part of the i.i.d. assumption. In the context of the underlying logistic model, using (3.1) and (3.5), since the labels are either 0 or 1, the likelihood evaluates as
$$L(\theta \,;\, \mathcal{D}) = \prod_{i=1}^{n} \phi\bigl(x^{(i)}\bigr)^{y^{(i)}} \bigl(1 - \phi(x^{(i)})\bigr)^{1 - y^{(i)}}, \qquad \text{where} \quad \phi(x) = \sigma_{\text{Sig}}(b + w^\top x). \tag{3.8}$$
The maximum likelihood estimate is defined as a value of the parameter $\theta$ which maximizes the likelihood function. That is, the MLE is the value which maximizes the probability of the observations $\mathcal{D}$ assuming the logistic model.
With (3.8) available, one may proceed to optimize the likelihood directly. However, in this case of logistic regression, and in many other cases where MLE is applied to i.i.d. data, it is more convenient to maximize the logarithm of the likelihood, called the log-likelihood, denoted via $\ell(\theta \,;\, \mathcal{D}) = \log\bigl(L(\theta \,;\, \mathcal{D})\bigr)$. This is equivalent to maximizing the likelihood since the $\log$ function is monotonically increasing. For logistic regression, the log-likelihood expression is,
$$\ell(\theta \,;\, \mathcal{D}) = \sum_{i=1}^{n} \Bigl[ y^{(i)} \log\bigl(\phi(x^{(i)})\bigr) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - \phi(x^{(i)})\bigr) \Bigr], \tag{3.9}$$
and the maximum likelihood estimate (MLE) is represented via,
$$\hat{\theta}_{\text{MLE}} := \operatorname*{argmax}_{\theta \in \mathbb{R} \times \mathbb{R}^p} \ell(\theta \,;\, \mathcal{D}) = \operatorname*{argmin}_{\theta \in \mathbb{R} \times \mathbb{R}^p} \, -\frac{1}{n}\, \ell(\theta \,;\, \mathcal{D}), \tag{3.10}$$
since maximizing $\ell(\theta \,;\, \mathcal{D})$ is the same as minimizing $-\ell(\theta \,;\, \mathcal{D})$, and the positive factor $1/n$ does not change the optimization, yet is useful in the presentation that follows.
One can show that the function $-\frac{1}{n}\ell(\cdot \,;\, \mathcal{D})$ is convex and hence a global minimum exists. Aspects of optimization and convexity are further overviewed in Section ??. However, in contrast to the linear model where an explicit analytic solution as in (??) exists, in the case of logistic regression there is no analytic solution for the minimizer, and hence optimization algorithms are needed to find the MLE (3.10).
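As a sketch of such a numerical search, the snippet below generates synthetic data and minimizes the scaled negative log-likelihood from (3.10) with Newton's method. The gradient and Hessian formulas used are the standard ones for logistic regression (the gradient is derived later in this chapter; the Hessian is stated here without derivation), and the data-generating parameters are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 2
X = rng.normal(size=(n, p))
b_true, w_true = 0.5, np.array([1.0, -2.0])     # illustrative true parameters
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(b_true + X @ w_true))))

Xb = np.hstack([np.ones((n, 1)), X])            # prepend the constant-1 feature

theta = np.zeros(p + 1)                         # theta = (b, w_1, ..., w_p)
for _ in range(25):                             # Newton's method converges quickly
    y_hat = 1.0 / (1.0 + np.exp(-(Xb @ theta)))
    grad = Xb.T @ (y_hat - y) / n               # gradient of -(1/n) * log-likelihood
    W = y_hat * (1.0 - y_hat)                   # Bernoulli variances
    hessian = (Xb.T * W) @ Xb / n               # Hessian of the same loss
    theta -= np.linalg.solve(hessian, grad)

print(theta)  # numerical MLE, close to (0.5, 1.0, -2.0) up to sampling error
```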
The Binary Cross-Entropy Loss
We have already claimed throughout Chapter ?? that learning almost always involves optimization. The MLE-based paradigm for logistic regression certainly reinforces this claim. We now reposition logistic regression MLE in terms of minimization of a loss function, following similar lines to the learning of the linear model of Section ??. This setup continues straight into the more involved deep learning models that follow.
Recall the general loss function formulation which we first presented for linear models in (??), where the loss is $C(\theta \,;\, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} C_i(\theta)$. In that context of linear models the individual loss is $C_i(\theta) = C_i\bigl(\theta \,;\, y^{(i)}, \hat{y}^{(i)}\bigr) = \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^2$. Logistic regression learning follows similar
lines except that $C_i(\theta)$ is not the quadratic loss. To see this, revisit the minimization form of (3.10) together with the log-likelihood expression (3.9). The scaled negative log-likelihood that needs to be minimized can then be represented as a loss function via
$$C(\theta \,;\, \mathcal{D}) = -\frac{1}{n} \sum_{i=1}^{n} \Bigl[ y^{(i)} \log\bigl(\hat{y}^{(i)}\bigr) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - \hat{y}^{(i)}\bigr) \Bigr], \tag{3.11}$$
where $\hat{y}^{(i)} = \phi\bigl(x^{(i)}\bigr) = \sigma_{\text{Sig}}\bigl(b + w^\top x^{(i)}\bigr)$. This then implies that for logistic regression the loss for each data sample is $C_i(\theta) = \text{CE}_{\text{binary}}\bigl(y^{(i)}, \hat{y}^{(i)}\bigr)$ where
$$\text{CE}_{\text{binary}}(y, \hat{y}) = -\bigl( y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \bigr), \tag{3.12}$$
is called the binary cross entropy (estimate) applied to observation $y$ and prediction $\hat{y}$.
In general the phrase “cross entropy” and specifically “entropy” is rooted in information theory. Appendix ?? outlines relationships between cross entropy and related quantities. However, these relationships are not critical for an understanding of deep learning. In terms of relating (3.11) to the probabilistic meaning of cross entropy, see first the definition of the cross entropy for two probability distributions in (??) and then the specialisation to distributions with binary outcomes in (??).
We also mention that while here we arrived at the binary cross entropy as a by-product of maximum likelihood estimation for logistic regression, in general, binary cross entropy has become the default loss function for more complex binary classification deep learning models, as surveyed in Chapter ?? and onwards.
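As a minimal sketch, the binary cross entropy (3.12) can be coded directly. The clipping constant below is a common numerical safeguard against taking $\log 0$; it is an implementation choice, not part of the definition:

```python
import numpy as np

def ce_binary(y, y_hat, eps=1e-12):
    """Binary cross entropy (3.12); y_hat is clipped away from 0 and 1 for stability."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# A confident correct prediction incurs low loss; a confident wrong one, high loss.
print(ce_binary(1.0, 0.99))  # -log(0.99), approximately 0.01
print(ce_binary(1.0, 0.01))  # -log(0.01), approximately 4.61
```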
Predicted Probabilities and Parameter Interpretability
For any observed or postulated feature vector $x \in \mathbb{R}^p$, the output of logistic regression is $\hat{y} = \phi(x)$. This is a probability which can be interpreted as in the left hand side of (3.1). Hence at its core, the logistic regression model (3.5) yields probabilities as outputs. Below we see how these probabilities can be used for classification, yet first let us consider prediction of probabilities or proportions.
Recall the breast cancer example presented in Section ?? where we overviewed concepts of binary classification. In that example, based on the Wisconsin breast cancer dataset, we used a training set to create two logistic regression models. The first model used the smoothness_worst feature as the single coordinate of $x$ and the second model used all possible 30 features. We now continue to use this dataset, this time using all $n = 569$ observations for statistical inference and setting a single feature model based on the area_mean feature, and a two feature model based on area_mean and texture_mean.
With the estimated two feature model based on estimated parameters $\hat{\theta} = (\hat{b}, \hat{w}_1, \hat{w}_2)$, the predicted probability for an observation $x \in \mathbb{R}^2$ is
$$\hat{y} = \sigma_{\text{Sig}}\bigl(\hat{b} + \hat{w}_1 x_{\text{area\_mean}} + \hat{w}_2 x_{\text{texture\_mean}}\bigr),$$
where for clarity we subscript coordinates of the feature vector $x$ with the feature name. Using (3.3) this can also be represented as,
$$\log \frac{\hat{y}}{1 - \hat{y}} = \hat{b} + \hat{w}_1 x_{\text{area\_mean}} + \hat{w}_2 x_{\text{texture\_mean}}. \tag{3.13}$$
The representation in (3.13) is appealing since it endows the estimated parameters $\hat{b}$, $\hat{w}_1$, and $\hat{w}_2$ with a concrete interpretation. First observe that the log odds is linearly described by the features area_mean and texture_mean. This means that it is estimated that a unit increase in area_mean will see the log odds increase by $\hat{w}_1$, a unit increase in texture_mean will see the log odds increase by $\hat{w}_2$, and similarly, when both features are at zero, the log odds is at the value $\hat{b}$. This interpretation of parameters is clearly not limited to a model with two features, as it works for any number of features.
In practice, it is more convenient to interpret the odds, given by
$$\frac{\hat{y}}{1 - \hat{y}} = e^{\hat{b} + \hat{w}_1 x_{\text{area\_mean}} + \hat{w}_2 x_{\text{texture\_mean}}}.$$
Specifically, an increase of area_mean by one unit implies a multiplicative increase of the odds by a factor of $e^{\hat{w}_1}$, and similarly for texture_mean by a factor of $e^{\hat{w}_2}$. With this view it is common to consider the odds ratio, where the multiplicative factor $e^{\hat{w}_i}$ is interpreted as the ratio between the odds of a model where the feature is at some level $x_i$ and the odds where it is at level $x_i + 1$. This is especially useful when features are binary, but also in general. Further note that when $e^{\hat{w}_i} > 1$ we say the feature has an increasing effect on the probability of the outcome being positive, and in the opposite direction, a feature with $e^{\hat{w}_i} < 1$ yields a decrease in this probability when the feature is increased.
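For illustration, the multiplicative odds factors can be computed from weight estimates. The values below are hypothetical stand-ins, not the actually fitted coefficients for this dataset:

```python
import numpy as np

# Hypothetical estimated weights (illustrative values, not fitted to the data).
w_hat = {"area_mean": 0.012, "texture_mean": 0.25}

# A unit increase in feature i multiplies the odds by exp(w_hat_i).
for name, w in w_hat.items():
    factor = np.exp(w)
    direction = "increases" if factor > 1 else "decreases"
    print(f"+1 in {name} multiplies the odds by {factor:.3f} ({direction} the probability)")
```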
Such parameter interpretation carries over from linear models to logistic regression models, as well as to other statistical models. This interpretability property often makes such models extremely attractive in biostatistics and similar fields. It means that the estimated parameters of the model are not only useful for their predictive ability, but also for reasoning about the relationships between variables.
As we progress beyond this chapter to more complicated deep neural networks, direct interpretability of parameter values is often lost. In such non-interpretable cases, the parameter estimate $\hat{\theta}$ is not a useful learning outcome on its own, but is rather only useful for the tasks of the model. Nevertheless we mention that an active area of machine learning and deep learning research is to seek interpretable models.
In Figure 3.1 we present the actual fit of the single feature model and the two feature model for the Wisconsin breast cancer observations. The response $\hat{y}$ is the probability of malignant lumps, whereas the data involves observations with $y = 1$ (positive, i.e., malignant) and $y = 0$ (negative, i.e., benign). In (a) we see the sigmoid function applied to $\hat{b} + \hat{w}_1 x_1$ for the estimated single feature model directly, and we also plot all the observations with a slight random jitter on the y axis so that they can be visualized easily, where we cut off high outliers. In (b) the sigmoid function is applied to $\hat{b} + \hat{w}_1 x_1 + \hat{w}_2 x_2$, yielding a surface, where we omit plotting the actual observations. In both the univariate and multivariate cases it is evident that the models present a monotonic relationship between each of the variables and the predicted probability. This property holds for any number of features.
Figure 3.1: Logistic regression models for probability prediction fit to the Wisconsin breast cancer dataset. (a) A $p = 1$ model with the feature area_mean ($x_1$) and a confidence band. (b) A $p = 2$ model based on the features area_mean ($x_1$) and texture_mean ($x_2$).
Logistic Regression for Classification is a Linear Classifier
In addition to the application of probability prediction as described above, logistic regression models can also be naturally used for binary classification. We have already seen how to convert such models into binary classifiers. This was first seen in Section ?? where we introduced the threshold based classifier in (??) with the label prediction $\hat{Y}$ set to 1 (positive) if $\hat{y} > \tau$ and 0 (negative) otherwise.
As an illustration, consider the two feature breast cancer prediction model with a response surface as in Figure 3.1 (b). By setting $\tau = 0.5$ for this model we get a classifier as illustrated in Figure 3.2. The red region corresponds to potential feature vectors that would be classified as negative (benign) and the blue region corresponds to positive (malignant). It is seen that many, but not all, of the training samples are correctly classified.
Interestingly, the classification boundary appears like a straight line, or more precisely a hyperplane in the feature space. One may then wonder if this is a property of logistic regression. We now show that in fact, logistic regression based binary classifiers are linear classifiers. Such linear classifiers separate the feature space via a single hyperplane, $H(x) = \breve{b} + \breve{w}^\top x$, where $\breve{b} \in \mathbb{R}$ and $\breve{w} \in \mathbb{R}^p$ are the parameters of the hyperplane. Since every hyperplane cuts Euclidean space into two half-spaces, a natural way to use a hyperplane for classification of a feature vector $x \in \mathbb{R}^p$ is,
$$\hat{Y} = \begin{cases} 0 \ (\text{negative}), & \text{if } H(x) \le 0, \\ 1 \ (\text{positive}), & \text{if } H(x) > 0. \end{cases} \tag{3.14}$$
This means that if $x$ falls in one of the half-spaces the classification is negative and in the other it is positive. The distinction on the boundary itself is arbitrary.
Figure 3.2: Binary classification of malignant lumps (blue dots, $y = 1$) and benign lumps (red dots, $y = 0$) using a logistic model with two features, area_mean ($x_1$) and texture_mean ($x_2$).
To see that logistic regression is a linear classifier, consider estimated model parameters $\hat{\theta} = (\hat{b}, \hat{w})$ and probability prediction $\hat{y}$. The positive classification then occurs if $\hat{y} > \tau$, i.e., $\sigma_{\text{Sig}}(\hat{b} + \hat{w}^\top x) > \tau$. We can then apply the $\text{Logit}(\cdot)$ function (3.2) to this inequality. Since $\text{Logit}(\cdot)$ is a monotonically increasing function and is the inverse of the sigmoid function, we obtain,
$$\hat{b} + \hat{w}^\top x > \log\Bigl(\frac{\tau}{1 - \tau}\Bigr), \qquad \text{or} \qquad \hat{b} + \log\bigl(\tau^{-1} - 1\bigr) + \hat{w}^\top x > 0.$$
We thus see that the hyperplane parameters associated with logistic regression classification are $\breve{b} = \hat{b} + \log(\tau^{-1} - 1)$ and $\breve{w} = \hat{w}$. This shows that logistic regression with a threshold rule yields a linear classifier. Note in fact that by the same argument, any model of the form $\hat{y} = \sigma(\hat{b} + \hat{w}^\top x)$, where $\sigma(\cdot)$ is some bijection (invertible function) from $\mathbb{R}$ to $\mathbb{R}$, yields a linear classifier. This naturally includes the linear model based binary classification outlined in Section ??. It also includes the probit model mentioned earlier in this section, as well as any shallow neural network for binary classification that has a strictly monotonic activation function, a concept introduced in the next section. Other very common linear classifiers, including support vector machine models, are not surveyed in this book.
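The equivalence between the threshold rule and the hyperplane rule (3.14) can be verified numerically. The parameter values below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)
b_hat, w_hat, tau = -0.3, np.array([1.2, -0.7]), 0.25  # illustrative values

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Hyperplane parameters derived in the text: b_breve = b_hat + log(1/tau - 1).
b_breve, w_breve = b_hat + np.log(1.0 / tau - 1.0), w_hat

# Both decision rules agree on every point of a random cloud of feature vectors.
X = rng.normal(size=(1000, 2))
threshold_rule = sigmoid(b_hat + X @ w_hat) > tau
hyperplane_rule = (b_breve + X @ w_breve) > 0
assert np.array_equal(threshold_rule, hyperplane_rule)
```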
Note also that one often transforms classifiers with linear decision boundaries into more
expressive classifiers via feature engineering. In Section 3.4 we explore how feature engineering
based transformations of the features may yield non-linear decision boundaries for the models
studied in this chapter.
3.2 Logistic Regression as a Shallow Neural Network
We now position the logistic regression model as a deep learning model. However, as we
shortly explain, it is not really “deep” but is rather “shallow” since it does not have hidden
layers. This is the first instance in this book where we explicitly consider deep learning
models in some mathematical detail. The general (fully connected) deep learning model is
outlined in Chapter
??
and our presentation here serves as a shallow introduction. Note that
the linear model of Section
??
can also be positioned as a deep learning model; we shed light
on this too. Further, the multinomial regression model of the next section is a close relative
of logistic regression, and as we show in that section, it is also a shallow neural network.
Logistic Regression is an Artificial Neuron
Let us first represent the logistic regression model (3.5) as
$$\hat{y} = \underbrace{\sigma\bigl(\,\overbrace{b + w^\top x}^{z}\,\bigr)}_{a}. \tag{3.15}$$
Observe that in (3.15), we omit the subscript from $\sigma_{\text{Sig}}(\cdot)$ used in (3.5). We call $\sigma(\cdot)$ a scalar activation function, which in the case of logistic regression needs to be $\sigma_{\text{Sig}}(\cdot)$, but in other cases can be a different function. Section ?? is devoted to specific forms of such activation functions beyond the sigmoid function. At this point, let us just mention a trivial alternative, the identity activation function $\sigma(z) = z$. With this identity activation function, the model in (3.15) is clearly just the linear model $\hat{y} = b + w^\top x$.
Figure 3.3: Logistic regression represented with neural network terminology as a shallow neural network. The inputs $x_1, \ldots, x_p$ are combined with the weights $w_1, \ldots, w_p$ and bias $b$ into the affine transformation $z = b + \sum_{i=1}^{p} w_i x_i$, followed by the activation $\sigma(z) = 1/(1 + e^{-z}) \in (0, 1)$, yielding the output $\hat{y} = \sigma(b + w^\top x)$. The gray box represents an artificial neuron composed of an affine transformation to create $z$ and an activation $\sigma(z)$.
The form of (3.15) represents what we may call a single layer of a deep learning model, a shallow neural network, or simply an artificial neuron. In this case the vector input $x \in \mathbb{R}^p$ is transformed to a scalar output $\hat{y} \in \mathbb{R}$. However, in general, deep learning models (as well as shallow neural networks) allow for vector outputs. The next section presents such a case.
Observe that the artificial neuron is composed of an affine transformation $z = b + w^\top x$ followed by a (generally) non-linear transformation $a = \sigma(z)$. This notation of using $z$ for the result of the affine transformation and $a$ for the result of the non-linear transformation is common in deep learning models and heavily used in Chapter ??. Note that the actual definition of an artificial neuron is sometimes taken as the combination of $z$ and $a$, sometimes
just as the non-linear result $a$, and sometimes as the computation mechanism defined by the right hand side of (3.15). Figure 3.3 summarizes the components of the artificial neuron, focusing on the specific $\sigma(\cdot) = \sigma_{\text{Sig}}(\cdot)$ activation function.
A “deeper” deep learning model would have “hidden layers” based on compositions of constructs similar to (3.15). This would involve multiple $z$ and $a$ values computed along the way. Thus when there is a single layer as in (3.15), the neural network is called shallow, and otherwise it is called deep. Logistic regression is the simplest non-linear scalar output shallow neural network that one can consider.
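The decomposition of the neuron into $z$ and $a$ can be sketched in code. The class below is an illustrative construction (not a library API), with made-up parameter values:

```python
import numpy as np

class Neuron:
    """A minimal artificial neuron: affine transformation z, then activation a = sigma(z)."""
    def __init__(self, b, w, sigma):
        self.b, self.w, self.sigma = b, w, sigma

    def __call__(self, x):
        z = self.b + self.w @ x   # affine transformation, as in (3.15)
        a = self.sigma(z)         # (generally) non-linear activation
        return a

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
neuron = Neuron(b=-1.0, w=np.array([0.5, 2.0]), sigma=sig)
print(neuron(np.array([1.0, 0.25])))    # 0.5, since z = 0 for these values

# With the identity activation, the neuron reduces to the linear model b + w^T x.
linear = Neuron(b=-1.0, w=np.array([0.5, 2.0]), sigma=lambda z: z)
print(linear(np.array([1.0, 0.25])))    # 0.0
```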
Training Logistic Regression
We have already seen in (3.10) how maximum likelihood estimation positions parameter fitting of logistic regression as an optimization problem. We then saw that maximization of the likelihood is identical to minimization of the loss $C(\theta \,;\, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} C_i(\theta)$, with $C_i(\theta)$ defined via the cross entropy cost (3.12). Importantly, when considered as a deep learning model or a machine learning model, one sometimes ignores the maximum likelihood interpretation of logistic regression and starts off with the cross entropy loss as an engineered loss function which requires minimization.
When statistical software packages are used to estimate parameters of logistic regression,
this optimization is typically tackled using second-order methods; see Section
??
for an
overview of such techniques. In general, such methods make use of the Hessian matrix of
the loss function or approximations of it; see Appendix ?? for a review.
In contrast to statistical practices, in deep learning and machine learning culture, one often
considers the number of features
p
to be very large in which case standard second-order
optimization methods tend to struggle computationally. Thus when treated as a deep learning
model, the default technique used for logistic regression training is gradient descent. Gradient
descent was already introduced in Section
??
in the context of the linear model, and variants
of gradient based learning are studied in detail in the next chapter as well; see sections
??
and ??.
General deep learning models do not have explicit expressions for the gradient of $C(\theta \,;\, \mathcal{D})$, and certainly not for the Hessian matrix. Hence learning such models requires computational techniques for gradient evaluation. The most common technique is the backpropagation algorithm, described in Section ??, which is a form of automatic differentiation, overviewed in Section ??. However, in the case of logistic regression, like the linear model, there are explicit expressions both for the gradient and the Hessian. We see these now.
In the case of logistic regression, the gradient of $C(\theta \,;\, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} C_i(\theta)$ with respect to $\theta = (b, w) \in \mathbb{R} \times \mathbb{R}^p$ is a vector in $\mathbb{R}^d$ with $d = p + 1$. It is denoted as $\nabla C(\theta)$ and can be represented as the average of the gradients of each $C_i(\theta)$. Namely,
$$\nabla C(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla C_i(\theta). \tag{3.16}$$
This general relationship between the gradient of the total loss function $\nabla C(\theta)$ and the gradients of the loss for each observation $\nabla C_i(\theta)$ is common throughout deep learning. In the case of logistic regression we have that $C_i(\theta) = \text{CE}_{\text{binary}}\bigl(y^{(i)}, \hat{y}^{(i)}\bigr)$ with $\text{CE}_{\text{binary}}(\cdot, \cdot)$
from (3.12). We thus require expressions for the gradient of this binary cross entropy loss with observation label $y^{(i)}$ and predicted value $\hat{y}^{(i)} = \sigma_{\text{Sig}}\bigl(b + w^\top x^{(i)}\bigr)$. That is,
$$\nabla C_i(\theta) = -\nabla \Bigl[ y^{(i)} \log\bigl(\sigma_{\text{Sig}}(b + w^\top x^{(i)})\bigr) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - \sigma_{\text{Sig}}(b + w^\top x^{(i)})\bigr) \Bigr],$$
where the gradient is with respect to the vector $\theta = (b, w)$. Now using basic differentiation rules and the structure of $\sigma_{\text{Sig}}(\cdot)$, we obtain,
$$\frac{\partial C_i}{\partial b} = \sigma_{\text{Sig}}\bigl(b + w^\top x^{(i)}\bigr) - y^{(i)}, \qquad \frac{\partial C_i}{\partial w_j} = \Bigl( \sigma_{\text{Sig}}\bigl(b + w^\top x^{(i)}\bigr) - y^{(i)} \Bigr) x_j^{(i)} \quad \text{for } j = 1, \ldots, p.
$$
Thus, in combining these components, we can represent the $d = p + 1$ dimensional gradient vector as,
$$\nabla C_i(\theta) = \Bigl( \sigma_{\text{Sig}}\bigl(w^\top x^{(i)} + b\bigr) - y^{(i)} \Bigr) \begin{bmatrix} 1 \\ x^{(i)} \end{bmatrix}, \tag{3.17}$$
where the first scalar expression on the right hand side of (3.17) is the prediction difference for observation $i$, and the second expression is the vector of features for observation $i$, including the constant 1 feature.
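Putting (3.16) and (3.17) together yields a complete gradient descent training loop. The following sketch uses synthetic data, with an arbitrary learning rate and iteration count:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 1000, 3
X = rng.normal(size=(n, p))
theta_true = np.array([0.5, 1.0, -1.5, 2.0])    # (b, w_1, w_2, w_3), illustrative
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(theta_true[0] + X @ theta_true[1:]))))

Xb = np.hstack([np.ones((n, 1)), X])            # prepend the constant-1 feature

theta = np.zeros(p + 1)
eta = 0.5                                       # learning rate (an arbitrary choice)
for _ in range(3000):
    y_hat = 1.0 / (1.0 + np.exp(-(Xb @ theta))) # predictions via (3.5)
    grad = Xb.T @ (y_hat - y) / n               # average (3.16) of the per-sample gradients (3.17)
    theta -= eta * grad

print(theta)  # approaches theta_true, up to sampling error
```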
Figure 3.4: The loss landscape of logistic regression for a synthetic dataset. (a) Using the squared loss $C_i(\theta) = \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^2$. (b) Using the binary cross entropy loss $C_i(\theta) = \text{CE}_{\text{binary}}\bigl(y^{(i)}, \hat{y}^{(i)}\bigr)$.
Some Benefits of Cross Entropy Loss
If one considers the problem of logistic regression training purely based on an optimization approach and not based on a maximum likelihood approach, then the cross entropy loss can potentially be replaced by other loss functions, such as for example the quadratic cost. However, it turns out that using the cross entropy loss generally yields desirable loss landscapes. As an illustration consider Figure 3.4, based on logistic regression with synthetic data of a single feature ($p = 1$ and $d = 2$). The parameters to be optimized are the scalars $b$ and $w$. In (a) we use the squared error loss and in (b) we use the cross entropy loss. It is clear from
this image that, at least in this case, the cross entropy loss landscape is more manageable to navigate for an optimization algorithm like gradient descent. In fact, in the case of logistic regression, cross entropy always yields a convex loss landscape (further details are in the next chapter), while other losses such as the squared error loss generally yield non-convex loss landscapes, often presenting multiple local minima as well as saddle points.
Importantly, when considering classification (or probability prediction) problems, deep learning models that are more complex than logistic regression are still often easier to optimize using the cross entropy loss than the squared error loss or other losses. In such more complex models, the neat mathematical convexity property that logistic regression enjoys with cross entropy is lost, and multiple local minima can exist. Nevertheless, computational and research experience has shown that in general using cross entropy is preferable.
As an illustration of gradient descent applied to a concrete example, we return to the Wisconsin breast cancer dataset, now splitting the observations as we did in Chapter ?? into $n_{\text{train}} = 456$ and $n_{\text{test}} = 113$. We use a model with $p = 10$ features ($d = 11$) and learn the parameters via gradient descent with some arbitrary initialization. In doing so, we obtain trajectories as in Figure 3.5.
Figure 3.5: Training logistic regression via gradient descent for the Wisconsin breast cancer dataset with an 80-20 train-test split (we can treat the test set as a validation set). (a) Cross entropy loss over iterations, for the train and test sets. (b) Performance over iterations using the $F_1$ score when $\tau = 0.5$.
3.3 Multi-class Problems with Softmax
The multinomial regression model, as it is known in statistics, or softmax regression, as it is known in machine learning,² is the generalization of logistic regression from the binary response case to the case of $K > 2$ classes. Now the response random variable $Y$ takes on values $1, 2, \ldots, K$. The feature vector remains $X$, just like in logistic regression.
² In some machine learning fields, this is also called softmax logistic regression, multinomial logistic regression, or multi-class logistic regression.
We denote the training observations via D = {(x^(1), y^(1)), …, (x^(n), y^(n))}, where each label y^(j) is one of K class values {1, …, K}. The purpose of the model is to predict class probability vectors, or if used for classification, to predict a class in a multi-class setting.
The Model

Just like logistic regression predicts two classes and uses ϕ(x) for the probability of the positive response, in multinomial regression the predicted response is the probability vector

$$\phi(x) = \big(\phi_1(x), \ldots, \phi_K(x)\big), \qquad (3.18)$$

where,

$$\phi_k(x) = P(Y = k \mid X = x) \quad \text{for } k = 1, \ldots, K. \qquad (3.19)$$

To compare (3.19) with (3.1), observe that (3.1) is like (3.19) with K = 2 where ϕ(x) = ϕ_1(x) and 1 − ϕ(x) = ϕ_2(x).
The name “multinomial regression” stems from the multinomial distribution, which is a probability distribution over vectors of length K consisting of non-negative integers that sum up to some specified integer N. Specifically a random vector (U_1, …, U_K) follows a multinomial distribution with parameters N (positive integer) and ϕ = (ϕ_1, …, ϕ_K) (probability vector) if,

$$P(U_1 = u_1, \ldots, U_K = u_K) = \frac{N!}{u_1! \cdots u_K!} \prod_{k=1}^{K} \phi_k^{u_k} \quad \text{for } \sum_{k=1}^{K} u_k = N,$$

with the probability at 0 when $\sum_{k=1}^{K} u_k \neq N$. A specific case of the multinomial distribution with N = 1 is called a categorical distribution. In this case the random vector (U_1, …, U_K) is like a one-hot encoded vector since exactly one coordinate is 1 and the rest are 0s.
A statistical assumption in multinomial regression is that the one-hot encoded response, given the features vector X = x, follows a categorical distribution. The random vector of the one-hot encoded response of Y is denoted (Y_1, …, Y_K) where Y_k = 1{Y = k}, i.e., the kth coordinate equals 1 if Y = k and equals 0 otherwise. Now the categorical distribution assumption is,

$$P(Y_1 = u_1, \ldots, Y_K = u_K \mid X = x) = \prod_{k=1}^{K} \phi_k(x)^{u_k}. \qquad (3.20)$$
The key model assumption in multinomial regression³ is in the way in which the probability vector ϕ(x) depends on the features vector x. There are two possible parameterizations, which we refer to as the statistical parameterization and the machine learning parameterization. The former presents a unique (identifiable) model while the latter is slightly simpler but has some redundant parameters.
Starting with the statistical parameterization, it is assumed that,

$$\phi_k(x) = \frac{e^{b_k + w_{(k)}^\top x}}{1 + \sum_{j=1}^{K-1} e^{b_j + w_{(j)}^\top x}} \quad \text{for } k = 1, \ldots, K-1, \qquad (3.21)$$

³Like logistic regression, this assumption is rooted in the theory of generalized linear models (GLM), a topic we defer to the notes and references at the end of the chapter.
and further,

$$\phi_K(x) = 1 - \sum_{j=1}^{K-1} \phi_j(x). \qquad (3.22)$$

This directly generalizes the sigmoidal relationship of logistic regression (3.5) from K = 2 to K ≥ 2. With this statistical parameterization, the parameters of multinomial regression are

$$\theta = (b_1, \ldots, b_{K-1}, w_{(1)}, \ldots, w_{(K-1)}) \in \underbrace{\mathbb{R} \times \ldots \times \mathbb{R}}_{K-1 \text{ times}} \times \underbrace{\mathbb{R}^p \times \ldots \times \mathbb{R}^p}_{K-1 \text{ times}} := \Theta, \qquad (3.23)$$
and hence the number of parameters is d = (K − 1)(p + 1). The scalar parameters b_1, …, b_{K−1} are bias (intercept) parameters and each of the vector parameters w_(1), …, w_(K−1) is a weight vector (regression parameter). With this parameterization, we may also view the final bias b_K and final weight vector w_(K) as the scalar 0 and the vector 0 respectively. Such a restriction on b_K and w_(K) allows us to combine (3.21) and (3.22) into,

$$\phi_k(x) = \frac{e^{b_k + w_{(k)}^\top x}}{\sum_{j=1}^{K} e^{b_j + w_{(j)}^\top x}} \quad \text{for } k = 1, \ldots, K. \qquad (3.24)$$
Moving onto the machine learning parameterization, the last class also has free parameters b_K and w_(K) like the other classes. In this case the parameter space is Θ = ℝ^K × (ℝ^p)^K and thus d = K(p + 1). Now again the representation (3.24) is valid, yet unlike the statistical parameterization, the last term in the summation in the denominator is not restricted to be 1. In this sense the machine learning parameterization is simpler; however, when estimating θ with maximum likelihood estimation there is never a unique θ that maximizes the likelihood (or minimizes the loss).
The Softmax Function and Multinomial Regression as a Shallow Neural Network

A common function in deep learning, especially when considering classification problems, is the ℝ^K → ℝ^K softmax activation function. For z = (z_1, …, z_K) ∈ ℝ^K, it is defined as,

$$S_{\text{Softmax}}(z) = \frac{1}{\sum_{i=1}^{K} e^{z_i}} \begin{pmatrix} e^{z_1} \\ \vdots \\ e^{z_K} \end{pmatrix}. \qquad (3.25)$$
Softmax and its derivative expressions are further discussed in Section ??. At this point let us just point out that for any vector z ∈ ℝ^K, the result of S_Softmax(z) is a probability vector.
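As a quick sketch of (3.25), the function can be implemented in a few lines; subtracting max(z) before exponentiating is a standard numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    # Softmax of (3.25): exponentiate and normalize. The max-shift avoids
    # overflow for large coordinates without changing the output.
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
# p is a probability vector: non-negative entries that sum to 1,
# with larger coordinates of z mapped to larger probabilities.
```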
Now with S_Softmax(·) defined we can revisit the multinomial regression model (3.24) and represent it as the softmax of a vector of length K that has b_k + w_(k)^⊤ x at the kth coordinate. More succinctly we have,

$$\hat{y} = \underbrace{S_{\text{Softmax}}\big(\overbrace{b + Wx}^{z \,\in\, \mathbb{R}^K}\big)}_{a \,\in\, \mathbb{R}^K}, \qquad (3.26)$$
where ŷ is the prediction of ϕ(x) as in (3.18), and the K dimensional vector b and the K × p dimensional matrix W are

$$b = \begin{pmatrix} b_1 \\ \vdots \\ b_K \end{pmatrix}, \quad \text{and} \quad W = \begin{pmatrix} \; w_{(1)}^\top \; \\ \; w_{(2)}^\top \; \\ \vdots \\ \; w_{(K)}^\top \; \end{pmatrix}. \qquad (3.27)$$
Compare (3.26) with (3.15) of logistic regression. In the logistic regression case, z is a scalar and so is a. In contrast, in the multinomial regression case the affine transformation converts a vector x ∈ ℝ^p into a vector z ∈ ℝ^K and then the softmax activation retains the same dimension to arrive at a.
Note also that (3.26) is valid both in the statistical parameterization and the machine learning parameterization. In the former, the last bias b_K and the last row w_(K) of (3.27) are zeros, whereas in the latter these are free variables.
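The forward pass (3.26) can be sketched numerically as follows; the dimensions K = 3 and p = 4 and the random parameters are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, p = 3, 4
b = rng.normal(size=K)        # bias vector in R^K
W = rng.normal(size=(K, p))   # weight matrix, K x p
x = rng.normal(size=p)        # input feature vector in R^p

z = b + W @ x                 # affine transformation, z in R^K
e = np.exp(z - z.max())       # softmax with the usual max-shift
y_hat = e / e.sum()           # prediction of the probability vector phi(x)
```

The output `y_hat` has the same dimension K as z, in contrast to the scalar output of logistic regression.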
[Figure: a network diagram with inputs x_1, …, x_p feeding K units; unit k computes z_k = b_k + w_(k)^⊤ x and, via the softmax, outputs a_k = e^{z_k} / Σ_{i=1}^K e^{z_i} = φ̂_k(x).]

Figure 3.6: Multinomial regression as a neural network model with output ŷ which is a probability vector. Note that each a_i is a function of all of z_1, …, z_K due to the softmax operation.
This representation positions multinomial regression as a shallow neural network similarly
to the way logistic regression is a shallow neural network. The difference is that multinomial
regression has vector outputs while logistic regression has a scalar output. Figure 3.6
illustrates the multinomial regression model as a neural network. Here each circle can again
be viewed as an “artificial neuron”; however, note that the softmax activation affects all
neurons together via the normalization in the denominator of
(3.25)
. Hence the activation
value of each neuron is not independent of the activation values of the other neurons.
Likelihood and Cross Entropy

Now that we understand the multinomial regression model, both from a statistical perspective using the probabilistic interpretation (3.20) and as a deep learning model using (3.26), let us consider maximum likelihood estimation, and equivalently loss function minimization.

To obtain an estimate θ̂ using maximum likelihood estimation, we follow a procedure analogous to the one used for logistic regression. Specifically, the likelihood of the model for the data D is defined as in (3.7). Now using the probability law of the categorical distribution, (3.20), we obtain the likelihood,

$$L(\theta\,;\,\mathcal{D}) = \prod_{i=1}^{n} \prod_{k=1}^{K} \phi_k(x^{(i)})^{\mathbb{1}\{y^{(i)} = k\}} \quad \text{for } \theta \in \Theta,$$

where we use the parameter space Θ as in the statistical parameterization⁴ (3.23). Now considering the log-likelihood by applying log(·) to L(θ ; D) we obtain,

$$\begin{aligned}
\ell(\theta\,;\,\mathcal{D}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}\{y^{(i)} = k\} \log \phi_k(x^{(i)}) \\
&= \sum_{i=1}^{n} \log \phi_{y^{(i)}}(x^{(i)}) \\
&= \sum_{i=1}^{n} \bigg( b_{y^{(i)}} + w_{(y^{(i)})}^\top x^{(i)} - \log \sum_{j=1}^{K} e^{\,b_j + w_{(j)}^\top x^{(i)}} \bigg).
\end{aligned} \qquad (3.28)$$

In the second step, we use the subscript y^(i) because except for the index k where y^(i) = k, all other summands of the internal sum of the first row are zero. The third step follows from (3.24).
Now similarly to our development of logistic regression MLE in (3.10), we can represent the problem as a minimization problem and define the estimator via,

$$\hat{\theta}_{\text{MLE}} = \operatorname*{argmin}_{\theta \in \Theta} \; -\frac{1}{n}\, \ell(\theta\,;\,\mathcal{D}).$$
Further, we can also view this problem as a loss minimization problem, where as before the loss is of the form C(θ ; D) = (1/n) Σ_{i=1}^n C_i(θ), and C_i(θ) is the loss for an individual observation. Based on the expressions of the log-likelihood (3.28) and using ŷ^(i) for the vector ϕ(x^(i)), the loss for an individual observation is,

$$C_i(\theta) = -\sum_{k=1}^{K} \mathbb{1}\{y^{(i)} = k\} \log \hat{y}^{(i)}_k = -\log \hat{y}^{(i)}_{y^{(i)}}. \qquad (3.29)$$
⁴Alternatively one may opt for the machine learning parameterization with the bigger parameter space. In this case the MLE is never unique.
Here, the first and second equalities arise from the first and the second lines of (3.28) respectively. The expressions on the right hand side of (3.29) are in fact called the categorical cross entropy. More precisely, for a label y ∈ {1, …, K} and a probability vector ŷ of length K, we define the categorical cross entropy as,

$$\text{CE}_{\text{categorical}}(y, \hat{y}) = -\sum_{k=1}^{K} \mathbb{1}\{y = k\} \log \hat{y}_k = -\log \hat{y}_y, \qquad (3.30)$$

where in the final expression, ŷ_y means taking the element at index y from the vector ŷ. We thus see that

$$C_i(\theta) = \text{CE}_{\text{categorical}}\big(y^{(i)}, \hat{y}^{(i)}\big). \qquad (3.31)$$
We have already seen the binary cross entropy in (3.12). Similar to this, the categorical cross entropy (3.30) is the empirical estimate of the cross entropy of two probability distributions appearing in (??) of Appendix ??. We note here that in many contexts of deep learning, one just uses the phrase “cross entropy” without the prefix “binary” or “categorical”.⁵ Then the distinction between CE_binary(·, ·) and CE_categorical(·, ·) is based on context and the type of arguments y and ŷ. In the former y is a binary outcome in {0, 1} and ŷ is a scalar probability in [0, 1]. In the latter, y is a multi-class choice in {1, …, K} and ŷ is a probability vector of length K in [0, 1]^K. The binary cross entropy is a special case of the categorical cross entropy and CE_binary(y, ŷ) with y ∈ {0, 1} and ŷ ∈ [0, 1] can be represented as CE_categorical(2 − y, [ŷ, 1 − ŷ]).
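This relationship between the two cross entropies can be checked numerically. The helper names `ce_categorical` and `ce_binary` below are illustrative, not library functions.

```python
import numpy as np

def ce_categorical(y, y_hat):
    # Categorical cross entropy (3.30): y is a class in {1, ..., K}
    # (1-indexed as in the text), y_hat a probability vector of length K.
    return -np.log(y_hat[y - 1])

def ce_binary(y, p):
    # Binary cross entropy (3.12): y in {0, 1}, p a scalar probability.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Check CE_binary(y, p) == CE_categorical(2 - y, [p, 1 - p]) for both labels.
p = 0.8
for y in (0, 1):
    assert np.isclose(ce_binary(y, p),
                      ce_categorical(2 - y, np.array([p, 1 - p])))
```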
It may be useful to also see other notational forms for the loss of an individual observation C_i(θ). Making use of (3.26) we obtain

$$C_i(\theta) = \text{CE}_{\text{categorical}}\big(y^{(i)}, \, S_{\text{Softmax}}(b + W x^{(i)})\big), \qquad (3.32)$$

where we parameterize θ to be composed of the vector b and the matrix W. Namely, Θ = ℝ^K × ℝ^{K×p}. Alternatively,

$$C_i(\theta) = -\big(b_{y^{(i)}} + w_{(y^{(i)})}^\top x^{(i)}\big) + \log \sum_{j=1}^{K} e^{\,b_j + w_{(j)}^\top x^{(i)}}. \qquad (3.33)$$
Compare (3.33) to the last line of (3.28).
Derivatives and Learning

Like logistic regression, inference for multinomial regression can be efficiently carried out using second-order methods. Nevertheless, when multinomial regression is viewed as a deep learning model with very large p, one often uses gradient descent (first-order optimization) in place of second-order methods. We now present the derivative expressions of C_i(θ). These expressions allow one to use gradient descent just as in the case of logistic regression.
Using either the statistical parameterization with d = (K − 1)(p + 1) parameters or the machine learning parameterization with d = K(p + 1) parameters, we have that θ = (b, W) consists of both a vector b and a matrix W. Thus in dealing with the gradient of C_i(θ) with respect to θ, denoted as ∇C_i(θ), one needs to either vectorize θ into a vector of length d, or
⁵In practice, when deep learning systems implement the binary and categorical cross entropy, log(u) in (3.12) or (3.30) is replaced with log(u + ε) where ε is a small fixed parameter. This allows one to seamlessly include the probability u = 0 as an input.
alternatively, to make use of the derivative of a real valued function with respect to a matrix, as defined in Appendix ??, (??). Our presentation uses both variants, and it is a prelude to further derivative expressions involving matrix valued parameters, used in Chapter ?? and the chapters that follow.

Let us first consider the derivative of (3.33) with respect to the individual scalar elements of θ = (b, W). Specifically, for k = 1, …, K we obtain,
$$\frac{\partial C_i}{\partial b_k} = \frac{e^{\,b_k + w_{(k)}^\top x^{(i)}}}{\sum_{j=1}^{K} e^{\,b_j + w_{(j)}^\top x^{(i)}}} - \mathbb{1}\{y^{(i)} = k\},$$

$$\frac{\partial C_i}{\partial w_{k,\ell}} = \Bigg( \frac{e^{\,b_k + w_{(k)}^\top x^{(i)}}}{\sum_{j=1}^{K} e^{\,b_j + w_{(j)}^\top x^{(i)}}} - \mathbb{1}\{y^{(i)} = k\} \Bigg) x^{(i)}_\ell, \quad \text{for } \ell = 1, \ldots, p.$$
These derivatives can then be placed in a d dimensional vector⁶ to form ∇C_i(θ).
Now observe that the expressions above involve the softmax function where the ratio of the exponent with the sum of exponents in each expression is the kth coordinate of S_Softmax(b + W x^(i)). This hints at simpler expressions and indeed we may use unit vector notation (e_j) in place of the indicator functions 1{·} to obtain,

$$\frac{\partial C_i}{\partial b} = S_{\text{Softmax}}(b + W x^{(i)}) - e_{y^{(i)}},$$

$$\frac{\partial C_i}{\partial W} = \big( S_{\text{Softmax}}(b + W x^{(i)}) - e_{y^{(i)}} \big) \, x^{(i)\top}.$$
With such a representation, the first expression, which is a derivative with respect to a vector, is in fact composed of the coordinates associated with the bias vector b of the gradient ∇C_i(θ). Further, the second expression is a derivative of a real valued function with respect to a matrix, which we represent using the notation of (??) in Appendix ??, namely the matrix with (k, ℓ)th element ∂C_i/∂w_{k,ℓ}.
In general, when one implements gradient based optimization as in Step 3 of Algorithm ?? of Chapter ??, or using one of the multiple variants presented in Chapter ??, these derivatives need to be used with the right shape dimensions taken into account.
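The gradient expressions above (softmax output minus one-hot label, and that same vector times x^⊤) can be sketched as a full training loop. This is a minimal illustration on synthetic data; note that classes are 0-indexed here, unlike the 1-indexed labels in the text, and the learning rate is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 300, 4, 3
X = rng.normal(size=(n, p))
W_true = rng.normal(size=(K, p))
y = np.argmax(X @ W_true.T, axis=1)      # labels driven by the features

def mean_ce(b, W):
    # Mean categorical cross entropy (3.29) over the training set.
    Z = b + X @ W.T
    Z = Z - Z.max(axis=1, keepdims=True)
    logp = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    return -logp[np.arange(n), y].mean()

b, W = np.zeros(K), np.zeros((K, p))
alpha = 0.5
losses = [mean_ce(b, W)]
for _ in range(200):
    Z = b + X @ W.T                       # logits, n x K
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    Y_hat = E / E.sum(axis=1, keepdims=True)   # row-wise softmax
    R = Y_hat
    R[np.arange(n), y] -= 1.0             # softmax minus one-hot, n x K
    b -= alpha * R.mean(axis=0)           # dC/db averaged over observations
    W -= alpha * R.T @ X / n              # dC/dW averaged over observations
    losses.append(mean_ce(b, W))
```

At the zero initialization the loss equals log K (the uniform prediction), and gradient descent on this convex problem drives it down from there.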
We also mention that it is sometimes common to subsume the bias term expressions within the weight parameter expressions by augmenting the feature vector to be of length p + 1 where the first feature is the constant 1. This practice was already used in the linear regression treatment of Section ??, and could have also been employed in the logistic regression treatment of Section 3.1. However we mention that from a deep learning perspective⁷ the weight parameters in W often receive a different treatment than the bias parameters in b and thus keeping b separate from W notationally is instructive.
⁶In fact, when one seeks a Hessian expression for C_i(θ), using such a d dimensional vector layout is the standard approach. We skip presentation of such a Hessian expression here, but note that (in contrast to more complicated deep learning models) it is tractable, and is often used in second-order methods for multinomial regression.
⁷This is made clearer in chapters ?? and ??, for example, in the contexts of dropout and convolutional neural networks.
Note that like logistic regression, the loss minimization problem of multinomial regression is
convex and thus with proper hyper-parameter choices (e.g., learning rate), a global minimum
can in principle always be reached. We establish the convexity of this optimization problem
in the next chapter.
Classification with Multinomial Regression Yields Convex Polytope Decision Regions
Recall that with multi-class classification, our goal is to provide a prediction Ŷ for the label in {1, …, K} associated with the input feature vector. In Section ?? we introduced concepts of binary classification and then via the example of the linear model in Section ?? we saw strategies for adapting binary classifiers to a multi-class classifier. This was via the one vs. rest and one vs. one approaches. In contrast to these approaches, the multinomial model provides a direct solution for multi-class classification since the output ŷ is already a probability vector over {1, …, K}. This approach is also common in more complicated deep learning models presented in the chapters that follow.
In general, an output vector ŷ, which is a probability vector, can be used to create a classifier by choosing the class that has the highest probability (and breaking ties arbitrarily). Namely,

$$\hat{Y} = \operatorname*{argmax}_{k \in \{1, \ldots, K\}} \hat{y}_k. \qquad (3.34)$$
This approach is called a maximum a posteriori probability (MAP) decision rule since it simply chooses Ŷ as the class that is most probable. It is the most common decision rule when using deep learning models for classification. As a side note, for binary classification, a MAP decision agrees with binary classification threshold prediction (??) with a threshold of τ = 0.5, and similarly to the fine-tuning of the parameter τ, in the multi-class case we may also deviate from MAP decisions and adjust (3.34) if needed.
[Figure: decision regions in the (x_1, x_2) unit square for classes 1–4.]

Figure 3.7: Multinomial regression for multi-class classification applied to synthetic data. In this example the training accuracy is 15/17.
As an example, consider Figure 3.7 that represents a MAP based rule applied to the output of a multinomial model trained on n = 17 synthetic data observations with p = 2 features and K = 4 classes. Like the binary classification example of Figure 3.2, this multi-class classification example illustrates the output of Ŷ(x) for any x ∈ ℝ^p. Similarly to the logistic regression case, we observe that the decision boundaries are straight lines. This is not a coincidence but rather a property of multinomial regression.
Denote by C_k the set of input features vectors x ∈ ℝ^p such that Ŷ(x) = k. We now see that C_k is an intersection of half spaces, i.e., it is a convex polytope. To see this, consider some arbitrary point x ∈ ℝ^p and fix some class label k ∈ {1, …, K}. We can now compare ŷ_k(x) and ŷ_j(x) for all other labels j. Specifically, since the denominator of the softmax expression (3.24) is independent of the index, ŷ_k(x) ≥ ŷ_j(x) if and only if

$$e^{\,b_k + w_{(k)}^\top x} \ge e^{\,b_j + w_{(j)}^\top x}, \quad \text{or} \quad (b_k - b_j) + (w_{(k)} - w_{(j)})^\top x \ge 0.$$
Thus by defining a hyperplane H_kj with H_kj(x) = (b_k − b_j) + (w_(k) − w_(j))^⊤ x, we see that ŷ_k(x) ≥ ŷ_j(x) holds if and only if H_kj(x) ≥ 0. Now Ŷ(x) = k only if ŷ_k(x) ≥ ŷ_j(x) for all other j. Hence C_k is an intersection of the K − 1 half spaces defined by the hyperplanes H_kj for j ≠ k.
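The equivalence between the MAP rule (3.34) and the half-space conditions H_kj(x) ≥ 0 can be checked numerically; the parameters below are arbitrary, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
K, p = 4, 2
b = rng.normal(size=K)
W = rng.normal(size=(K, p))

for _ in range(100):
    x = rng.normal(size=p)
    z = b + W @ x
    e = np.exp(z - z.max())
    y_hat = e / e.sum()
    k = int(np.argmax(y_hat))      # the MAP class of (3.34)
    # Every pairwise half-space condition H_kj(x) >= 0 holds for the MAP class.
    assert all((b[k] - b[j]) + (W[k] - W[j]) @ x >= 0 for j in range(K))
```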
As a concrete example we return to MNIST digit classification explored in Section ??. In that section we used the one vs. rest and one vs. one strategies to build classifiers based on linear models. See Table ?? where we presented confusion matrices for these classifiers when trained on the 60,000 MNIST train images and tested on the 10,000 test images. We now report the performance of multinomial regression on this dataset as well as the application of one vs. rest and one vs. one with logistic regression.⁸ A summary is in Table 3.1. Our purpose with this comparison is not to claim which method is best since in practice all these methods are significantly beaten by convolutional neural networks as presented in Chapter ??. We rather aim to highlight that a direct multi-class strategy such as multinomial regression requires training and application of a single model while the other approaches require multiple models, and are not significantly superior.
Table 3.1: Different approaches for creating an MNIST digit classifier. It is evident that in general one vs. one classifiers outperform one vs. rest classifiers, yet have many more parameters. Further, on the same type of classification scheme, logistic regression generally outperforms linear regression. Finally observe that multinomial regression, with only a single model and a low number of parameters, generally performs almost as well as the top scheme (logistic regression with one vs. one).

| Strategy and model type          | Number of models to train | Total number of parameters | Test accuracy |
| Linear regression one vs. rest   | 10                        | 7,850                      | 0.8603        |
| Linear regression one vs. one    | 45                        | 35,325                     | 0.9297        |
| Logistic regression one vs. rest | 10                        | 7,850                      | 0.9174        |
| Logistic regression one vs. one  | 45                        | 35,325                     | 0.9321        |
| Multinomial regression           | 1                         | 7,850 or 7,065             | 0.9221        |
⁸The linear regression based classifiers use the pseudo-inverse and hence the accuracy on the test set of size 10,000 is exact (up to numerical error). The logistic and multinomial classifiers were trained with gradient based learning with an ADAM algorithm with a learning rate of 0.02, mini-batches of size 2000, and 200 epochs; see Chapter ??. With this gradient based learning there is room for error with the accuracy as it depends on the optimization run. Hence the best achievable accuracy with the non-linear classifiers can potentially be slightly better.
3.4 Beyond Linear Decision Boundaries
Recall figures 3.1, 3.2, and 3.7. Logistic regression naturally yields a sigmoidal relationship between the input features and the response probability. In a classification setting this translates to linear (affine) decision boundaries. Similarly, multinomial regression used for classification also yields straight line boundaries via convex polytope decision regions. One may then ask whether these shallow neural network models can create other forms of response curves or decision boundaries. We now show that this is indeed possible via feature engineering following a similar spirit to the way feature engineering was introduced in Section ??.
Enhancing the Sigmoidal Response
Recall first the linear regression example illustrated in Figure ?? of Chapter ??. In display (b) of that figure we saw how linear regression with an engineered quadratic feature yields a curve that better fits the data. A similar type of idea may also be applied in the case of logistic regression.
As an example consider a dataset D where each observation is for a different geographic location and where the feature vector has a single coordinate which is the level of precipitation at that location (in millimeters/month). For each observation, the associated y^(i) ∈ {0, 1} determines the absence (0) or presence (1) of a certain species. Observations from such a dataset⁹ are presented in Figure 3.8. At locations that are neither too dry nor too wet, the species tends to be present, whereas when the precipitation is very low or very high the species tends to be absent.
Such a relationship between the precipitation and the probability of presence of the species cannot be captured by a sigmoidal curve since σ_Sig(b + wx) is a monotonic function of x. In fact, when fitting a sigmoidal curve to this data, the response, plotted via the red curve of Figure 3.8, tends to be almost flat in the region of the observations because in this case ŵ_1 is close to zero.
Nevertheless, if we wish to use logistic regression for this problem we may do so via feature engineering. Similarly to the housing price example of Section ?? we introduce a new feature x_2 = x_1². With the addition of this new feature, the model (as a function of the single feature x = x_1) becomes,

$$\phi(x) = \frac{1}{1 + e^{-(b + w_1 x + w_2 x^2)}}.$$
Following from basic properties of the parabola b + w_1 x + w_2 x², if w_2 < 0 then we have that ϕ(x) → 0 as x → −∞ or x → ∞ and further ϕ(x) is maximized at x = −w_1/(2w_2), which is the maximal point of the parabola. Similarly, if w_2 > 0 the shape of ϕ(x) is reversed and ϕ(x) has a minimum point at the minimum point of the parabola. In both cases, ϕ(x) is symmetric about x = −w_1/(2w_2).
Fitting logistic regression to this feature engineered model yields a negative value for ŵ_2 and this results in the blue curve of Figure 3.8. We thus see that feature engineering in the context of logistic regression allows us to extend the form of the response beyond the sigmoid function.
⁹This is a synthetic dataset motivated by ecological applications.
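The non-monotone response of the quadratic-feature model can be sketched as follows. The parameter values are illustrative only, chosen so the response peaks at x = 50; they are not the fitted values behind Figure 3.8.

```python
import numpy as np

def phi(x, b, w1, w2):
    # Quadratic-feature logistic model: sigmoid of the parabola b + w1 x + w2 x^2.
    return 1 / (1 + np.exp(-(b + w1 * x + w2 * x**2)))

b, w1, w2 = -6.0, 0.24, -0.0024   # hypothetical values with w2 < 0
x = np.linspace(0, 100, 201)
y = phi(x, b, w1, w2)
# With w2 < 0 the response vanishes at the extremes and is maximized at the
# vertex of the parabola, x = -w1 / (2 * w2) = 50, where it equals 0.5 here.
```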
[Figure: scatter of y against x_1 with two fitted response curves.]

Figure 3.8: Logistic regression fit with prediction ŷ for a feature engineered model with the feature x_2 = x_1². The response curve is plotted in blue together with confidence bounds. The red curve is a sigmoidal function fit to the single feature x = x_1.
The General Setup of Polynomial Feature Engineering
As in the example above, in general it is quite common to use powers for feature engineering and this makes the linear combination of the engineered features a polynomial. In examples with a small number of features such as the p = 1 example above, we can tweak feature engineering visually. However, when p is large, other performance based methods are needed.

When there are initially p input features we can automate the creation of more features by choosing each new feature as a power product or monomial of the form x_1^{k_1} x_2^{k_2} · … · x_p^{k_p} for some non-negative integers k_1, k_2, …, k_p. With such a process it is common to limit the degree by a constant r via k_1 + k_2 + … + k_p ≤ r.
For example when r = 2 the set of engineered features is,

$$\tilde{x} = (x_1, x_2, \ldots, x_p, \; x_1^2, x_2^2, \ldots, x_p^2, \; x_1 x_2, x_1 x_3, \ldots, x_{p-1} x_p).$$
In this case x̃ ∈ ℝ^{p(p+3)/2} and thus d = 1 + p(p + 3)/2 for logistic regression.¹⁰ So if for example there are initially p = 1,000 features then there are about half a million engineered features with d = 501,501. One may of course cull the number of engineered features to reduce the number of learned parameters. This can sometimes be done via regularization introduced in Section ??.
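The enumeration of monomial features can be automated as sketched below; `monomial_indices` and `expand` are hypothetical helper names, not library functions.

```python
from itertools import combinations_with_replacement

def monomial_indices(p, r):
    # All monomials x_{i1} * ... * x_{id} of total degree 1..r over p features,
    # each represented as a sorted tuple of (repeatable) feature indices.
    terms = []
    for degree in range(1, r + 1):
        terms.extend(combinations_with_replacement(range(p), degree))
    return terms

def expand(x, r):
    # Evaluate every monomial feature on the input vector x.
    feats = []
    for idx in monomial_indices(len(x), r):
        prod = 1.0
        for i in idx:
            prod *= x[i]
        feats.append(prod)
    return feats

# For r = 2 the expanded vector has p(p+3)/2 coordinates, matching the text.
p = 4
assert len(monomial_indices(p, 2)) == p * (p + 3) // 2
```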
¹⁰This also holds for any model where the number of parameters is one plus the number of features, such as for example linear regression.
Let us also go beyond r = 2 to higher degrees of the monomial features. From basic combinatorics, we have that the number of non-negative integer solutions to k_1 + … + k_p = ℓ is (ℓ + p − 1)! / ((p − 1)! ℓ!) and thus when using a polynomial feature scheme with degree up to r we have for logistic regression

$$d = \sum_{\ell=0}^{r} \frac{(\ell + p - 1)!}{(p-1)! \, \ell!},$$
and for multinomial regression, d is K or K − 1 times this number with the machine learning or statistical parameterizations respectively.
Using Stirling’s approximation of factorials we have that when r is significantly smaller than p then d ≈ p^r / r! for logistic regression. As an example with p = 1,000 input features and setting r = 4 we have approximately 4.16 × 10^{10} parameters, or an exact number of 42,084,793,751. This is over 40 billion parameters to learn!
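The exact count can be verified in a few lines; `num_params` is a hypothetical helper implementing the sum above with Python's `math.comb`.

```python
from math import comb

def num_params(p, r):
    # d = sum over degrees 0..r of C(l + p - 1, l), the number of monomials
    # of degree l in p variables (degree 0 contributes the intercept).
    return sum(comb(l + p - 1, l) for l in range(r + 1))

# Sanity checks against the counts quoted in the text.
assert num_params(2, 2) == 6                       # d = 1 + p(p+3)/2 for p = 2
assert num_params(1000, 4) == 42_084_793_751       # the p = 1,000, r = 4 example
```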
Having over 40 billion parameters in a model is indeed borderline astronomical and as of today still infeasible. Thus for non-small p, going beyond r = 2 is rarely used in practice. In fact, in Section ?? of Chapter ?? we argue that with deeper neural networks one may sometimes get more expressivity without creating such a large number of parameters.
Versatile Classification Boundaries
To further explore feature engineering let us now consider classification problems. We have seen above that both logistic regression and multinomial regression yield decision boundaries defined by hyperplanes which we may generally denote via H(x). Each such hyperplane has parameters b̆ ∈ ℝ and w̆ ∈ ℝ^p which depend on the estimated model parameters in a simple manner. Decision rules for classification with given input x ∈ ℝ^p are then made based on whether H(x) ≥ 0 or not (where the case of multinomial regression involves multiple such hyperplanes and comparisons).
Now in a feature engineered scenario the hyperplane parameters are adapted to be of larger dimension and in place of H(x) we evaluate H̃(x̃) with parameters b̆̃ and w̆̃. As an example consider a scenario with p = 2 features where we carry out polynomial feature engineering as above with r = 2. Then the number of parameters is now d = 6 and the hyperplane comparison is then,

$$\breve{\tilde{b}} + \breve{\tilde{w}}_1 x_1 + \breve{\tilde{w}}_2 x_2 + \breve{\tilde{w}}_3 x_1^2 + \breve{\tilde{w}}_4 x_2^2 + \breve{\tilde{w}}_5 x_1 x_2 \ge 0. \qquad (3.35)$$
Now here the set of input feature vectors (x_1, x_2) ∈ ℝ² that satisfies the above inequality is no longer a half space but rather represents a curved subset of ℝ². We thus see that such feature engineering enables non-linear decision boundaries.
As an example, consider Figure 3.9 based on synthetic data that requires binary classification with two input features. In (a) we see that for this type of data, logistic regression without feature engineering is not suitable. The linear decision boundary is not able to capture the pattern in the data. In (b) we expand the set of features and get a decision boundary of the form (3.35). It appears that going from d = 3 learned parameters to d = 6 learned parameters is worthwhile since the pattern in the data is well captured in (b).
[Figure: four panels of decision boundaries.]

Figure 3.9: Decision boundaries for binary classification with an expanded set of features for synthetic data with p = 2 input features. (a) No feature engineering (r = 1, d = 3). (b) Quadratic (r = 2, d = 6). (c) Quartic (r = 4, d = 15). (d) r = 8, d = 45.
We may continue with higher orders in an attempt to gain more expressivity. However, as we see in (c) and (d), for this data, the higher order models yield obscure classification decision boundaries. In this example these higher orders certainly appear to overfit. Such overfitting would probably lead to high generalization error (recall the discussion in Section ??). Moreover, the new set of features could become highly correlated and that can cause difficulty in inference of parameters when taking a statistical approach.
As another visual example, return to the synthetic multi-class classification example illustrated in Figure 3.7. We expand the set of features for this example and plot the decision regions in Figure 3.10. As this is just a synthetic example, our purpose is to show that by increasing the number of engineered features we can get more curvature in the decision boundaries. In (a) we consider quadratic features and in (b) we consider an extreme case of r = 8 which has 180 parameters with the machine learning parameterization.
[Figure: two panels of decision regions in the (x_1, x_2) unit square for classes 1–4.]

Figure 3.10: Decision boundaries with an expanded set of features for the multinomial regression model (K = 4). (a) r = 2, d = 24. (b) r = 8, d = 180.
3.5 Shallow Autoencoders
So far in this chapter we explored logistic and multinomial regression. Both of these models are shallow neural networks, which do not involve hidden layers, for supervised learning where the data is labelled. They are special cases of “deep learning models”. The same goes for the linear regression model of Section ?? (it was made evident that linear regression is also a deep learning model in Section 3.2). Thus, all the key supervised learning models that we discussed up to this point are simple neural networks that fall under the umbrella of deep learning.

We now devote this last section of this chapter to simple neural networks that are used for unsupervised learning where the unlabelled data is of the form D = {x^(1), …, x^(n)}. Namely we introduce autoencoder architectures and focus primarily on shallow versions of these architectures. Recall that in Section ?? we presented an overview of the concept of unsupervised learning in the context of machine learning activities, and in Section ?? we surveyed basic unsupervised learning techniques including principal components analysis (PCA). As we see in this current section, PCA can be cast as a special case of an autoencoder and this positions PCA as a form of (simple) neural network based learning as well.
Autoencoder Principles
Before we explore several varied applications of autoencoders, let us focus on their basic architecture. Consider Figure 3.11 which presents a schematic of an autoencoder with a single hidden layer. The input $x \in \mathbb{R}^p$ is transformed into a bottleneck, also called the code, which is some $\tilde{x} \in \mathbb{R}^m$ and is the hidden layer of the model. Then the bottleneck is further transformed into the output $\hat{x} \in \mathbb{R}^p$. The part of the autoencoder that transforms the input into the bottleneck is called the encoder and the part of the autoencoder that transforms the bottleneck to the output is called the decoder. Both the encoder and the decoder have parameters that are to be learned.
Figure 3.11: An autoencoder architecture with an encoder, decoder, and a bottleneck in between which is the single hidden layer of this neural network.

Note that many other deep learning models in this book will also have hidden layers. In fact, representing and computing gradients for the parameters of such layers is the focus of Chapter ?? and stands at the heart of deep learning. In this autoencoder example, the single hidden layer is also the bottleneck of the autoencoder and thus we informally consider this shallow autoencoder as “simple”. Other “deeper” autoencoders may have multiple hidden layers of which one should be treated as the bottleneck or code.
Interestingly, for input $x$, once parameters are trained, we generally expect the autoencoder to generate output $\hat{x}$ that is as similar to the input $x$ as possible. This property of autoencoders, after which they are named, may at first seem obscure since it means that the autoencoder is a form of an identity function. However, this architecture is actually very useful for a variety of applications, some of which are detailed in this section and others appearing as parts of more complicated models such as for example sequence models which we discuss in Chapter ??.
For now, as the most basic application, consider the activity of data reduction already surveyed in Section ?? in the context of PCA and SVD (singular value decomposition). For this, assume that the dimension of the bottleneck $m$ is significantly smaller than the input and output dimension $p$ (in other non-data reduction applications we may also have $m \geq p$). For example, return to the case of MNIST digits (initially introduced in Section ??) where $p = 784$. For our example here, assume we have an autoencoder with $m = 30$.
If indeed a trained autoencoder yields $x \approx \hat{x}$ then it means that we have an immediate data reduction method. With the trained encoder we are able to convert digit images, each of size $28 \times 28 = 784$, into much smaller vectors, each of size 30. With the trained decoder we are able to convert back and get an approximation of the original image. This choice of $m$ implies a rather remarkable compression factor of about 26.

Figure 3.12: Reconstruction of the test set of MNIST (first row) using various types of autoencoders. Rows from top to bottom: MNIST test set, reconstruction with PCA, reconstruction with a shallow autoencoder, and reconstruction with a deep autoencoder.
Figure 3.12 presents the compression/decompression effect of several types of autoencoders with $m = 30$. The first row presents untouched MNIST images. The second row presents the effect of reducing the images via PCA (a shallow linear autoencoder) and then reverting back to the image. The third row presents the effect of a shallow non-linear autoencoder. Finally, the last row presents the effect with a richer autoencoder that has several hidden layers (a deep autoencoder).

There are multiple other applications for autoencoders which we soon survey. However, let us first formulate these types of models more precisely.
Single Layer Autoencoders
As already mentioned above, we may view an autoencoder as a function $f_\theta : \mathbb{R}^p \to \mathbb{R}^p$, where $\theta$ are the trainable parameters of this function. These parameters $\theta$ ideally influence the function’s operation such that $f_\theta(x) \approx x$ where $x$ is an arbitrary observation from either the seen or unseen data. The approximate equality, “$\approx$”, can be considered informally as closeness of two vectors.
In practice when faced with training data $\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}$, we train the autoencoder (learn the parameters $\theta$) such that $C(\theta \,;\, \mathcal{D}) = \frac{1}{n}\sum_{i=1}^{n} C_i(\theta)$ is minimized. This is a similar loss function setup to that used in supervised contexts such as linear regression, logistic regression, multinomial regression, or the deep learning models that are in the chapters that follow. For autoencoders, we construct the loss for an individual observation, $C_i(\theta)$, as some distance penalty measure between the input observation $x^{(i)}$ and the output $\hat{x}^{(i)} = f_\theta(x^{(i)})$. Contrast this with supervised learning where we compare the observed label and the predicted label. That is, with autoencoders the target of the model is the input, in contrast to a label $y^{(i)}$ in supervised learning.
The most straightforward choice for the distance penalty in $C_i(\theta)$ is the square of the Euclidean distance, namely,
$$C_i(\theta) = \big\| x^{(i)} - f_\theta(x^{(i)}) \big\|^2 \quad \text{and thus} \quad C(\theta \,;\, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - f_\theta(x^{(i)}) \big\|^2. \tag{3.36}$$
With this cost structure, learning the parameters, $\theta$, of an autoencoder based on data $\mathcal{D}$ is the process of minimizing $C(\theta \,;\, \mathcal{D})$.

In general, autoencoders may have architectures similar to the fully connected deep neural networks that we study in Chapter ?? as well as to extensions in the chapters that follow. This may include multiple hidden layers, convolutional layers, and other constructs. At this point, to understand the key concepts, let us revert to Figure 3.11 and consider autoencoders composed of the same components of that figure.
Specifically, we decompose $f_\theta(\cdot)$ to be a composition of the encoder function denoted via $f^{[1]}_{\theta^{[1]}}(\cdot)$ and the decoder function denoted via $f^{[2]}_{\theta^{[2]}}(\cdot)$, where $\theta^{[1]}$ are the parameters of the encoder and $\theta^{[2]}$ are the parameters of the decoder.¹¹ That is,
$$\hat{x} = f_\theta(x) = f^{[2]}_{\theta^{[2]}} \circ f^{[1]}_{\theta^{[1]}}(x) = f^{[2]}_{\theta^{[2]}}\big( f^{[1]}_{\theta^{[1]}}(x) \big),$$
where the notation using the square bracketed superscripts for the functions and parameters is in agreement with the notation we use for layers of deep neural networks in Chapter ?? and onwards.
In general, one may construct all kinds of structures for the encoder, decoder, and their parameters. In our case, we consider a specific single layer neural network structure. We define,
$$f^{[1]}_{\theta^{[1]}}(u) = S^{[1]}\big(b^{[1]} + W^{[1]} u\big) \quad \text{for } u \in \mathbb{R}^p \quad \text{(Encoder)}$$
$$f^{[2]}_{\theta^{[2]}}(u) = S^{[2]}\big(b^{[2]} + W^{[2]} u\big) \quad \text{for } u \in \mathbb{R}^m \quad \text{(Decoder)}, \tag{3.37}$$
where the notation is somewhat similar to (3.26). Specifically for $\ell = 1, 2$, $b^{[\ell]}$ are vectors, $W^{[\ell]}$ are matrices, and $S^{[\ell]}(\cdot)$ are vector activation functions with $S^{[1]} : \mathbb{R}^m \to \mathbb{R}^m$ and $S^{[2]} : \mathbb{R}^p \to \mathbb{R}^p$. Before describing the exact details of these functions, let us focus on the autoencoder parameters.
The encoder parameters $\theta^{[1]}$ are composed of the bias $b^{[1]} \in \mathbb{R}^m$ and weight matrix $W^{[1]} \in \mathbb{R}^{m \times p}$, and the decoder parameters $\theta^{[2]}$ are composed of the bias $b^{[2]} \in \mathbb{R}^p$ and weight matrix $W^{[2]} \in \mathbb{R}^{p \times m}$. Thus the complete list of parameters for the autoencoder is,
$$\theta = \big(b^{[1]}, W^{[1]}, b^{[2]}, W^{[2]}\big). \tag{3.38}$$
Returning to the vector activation functions $S^{[1]}(\cdot)$ and $S^{[2]}(\cdot)$, we construct these based on scalar activation functions $\sigma^{[\ell]} : \mathbb{R} \to \mathbb{R}$ for $\ell = 1, 2$ such as the sigmoid function (3.4), the identity function $\sigma(u) = u$, or one of many other variants (see also Section ??). Specifically, we set $S^{[\ell]}(z)$ to be the element-wise application of $\sigma^{[\ell]}(\cdot)$ on each of the coordinates of $z$. Namely,
$$S^{[\ell]}(z) = \begin{bmatrix} \sigma^{[\ell]}(z_1) \\ \vdots \\ \sigma^{[\ell]}(z_r) \end{bmatrix}. \tag{3.39}$$
The choice of the type of scalar activation function may depend on the domain of the input $x$ since the output of the model aims to reconstruct the input. For example, use of the sigmoid function for $\sigma^{[2]}(\cdot)$ restricts the output to be in the range $[0, 1]$ and this will clearly be unsuitable in cases where the input is not limited to this range.
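To make (3.37) and (3.39) concrete, here is a minimal NumPy sketch of the encoder and decoder. The choice of a sigmoid $\sigma^{[1]}$ with an identity $\sigma^{[2]}$, and the dimensions $p = 5$, $m = 2$, are our own illustrative assumptions, not a prescription.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoder(x, W1, b1):
    # f^[1]: R^p -> R^m, element-wise sigmoid activation as in (3.39)
    return sigmoid(b1 + W1 @ x)

def decoder(x_tilde, W2, b2):
    # f^[2]: R^m -> R^p, identity activation (suits unbounded inputs)
    return b2 + W2 @ x_tilde

p, m = 5, 2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(m, p)), np.zeros(m)   # theta^[1]
W2, b2 = rng.normal(size=(p, m)), np.zeros(p)   # theta^[2]

x = rng.normal(size=p)
x_tilde = encoder(x, W1, b1)       # the code, in R^m
x_hat = decoder(x_tilde, W2, b2)   # the reconstruction, in R^p
```

Note how the sigmoid encoder confines each bottleneck unit to $(0, 1)$, while the identity decoder leaves the output unconstrained.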
¹¹ An alternative way to denote these parameters would have been using $\phi$ (in place of $\theta^{[1]}$) for the encoder and $\theta$ (in place of $\theta^{[2]}$) for the decoder. This is the notation used in variational autoencoders in Chapter ??.
With this notation in place, it may also be useful to see the individual representation of the bottleneck units $\tilde{x}_1, \ldots, \tilde{x}_m$, and the outputs $\hat{x}_1, \ldots, \hat{x}_p$. Specifically, with $a_k = \tilde{x}_k$,
$$\tilde{x}_i = \sigma^{[1]}\Big(b^{[1]}_i + \sum_{k=1}^{p} w^{[1]}_{i,k} x_k\Big), \quad \text{for } i = 1, \ldots, m,$$
$$\hat{x}_j = \sigma^{[2]}\Big(b^{[2]}_j + \sum_{k=1}^{m} w^{[2]}_{j,k} a_k\Big), \quad \text{for } j = 1, \ldots, p.$$
This set of equations is the first non-shallow (single hidden layer) neural network which we see in the book. Note also that we often use the notation $a_k = \tilde{x}_k$ as it is an ‘activation’ of a unit or neuron within the encoder.
Also, it may be useful to see the loss function representation as,
$$C(\theta \,;\, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \Big\| x^{(i)} - \underbrace{S^{[2]}\big(W^{[2]} S^{[1]}(W^{[1]} x^{(i)} + b^{[1]}) + b^{[2]}\big)}_{f_\theta(x^{(i)})} \Big\|^2. \tag{3.40}$$
With this loss function, for given data $\mathcal{D}$, the learned autoencoder parameters $\hat{\theta}$ are given by a solution to the optimization problem $\min_\theta C(\theta \,;\, \mathcal{D})$.
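As a sketch of this optimization problem, the following toy NumPy snippet (our own setup, not a practical training method) evaluates (3.40) with a sigmoid $\sigma^{[1]}$ and identity $\sigma^{[2]}$, and decreases it with gradient descent, approximating the gradient by central finite differences; real implementations use backpropagation as discussed in Chapter ??.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def C(theta, X, p, m):
    # Unpack the flat parameter vector into (b1, W1, b2, W2) as in (3.38).
    b1 = theta[:m]
    W1 = theta[m:m + m * p].reshape(m, p)
    b2 = theta[m + m * p:m + m * p + p]
    W2 = theta[m + m * p + p:].reshape(p, m)
    # The loss (3.40): average squared reconstruction error over the data.
    X_hat = (b2[:, None] + W2 @ sigmoid(b1[:, None] + W1 @ X.T)).T
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

rng = np.random.default_rng(1)
p, m, n = 4, 2, 50
X = rng.normal(size=(n, p))
theta = 0.1 * rng.normal(size=m + m * p + p + p * m)

# Gradient descent with finite-difference gradients (illustration only).
eta, eps = 0.02, 1e-6
losses = [C(theta, X, p, m)]
for _ in range(40):
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = eps
        grad[k] = (C(theta + e, X, p, m) - C(theta - e, X, p, m)) / (2 * eps)
    theta -= eta * grad
    losses.append(C(theta, X, p, m))
```

The recorded `losses` trace the descent of $C(\theta\,;\,\mathcal{D})$ over the iterations.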
PCA is an Autoencoder
It turns out that autoencoders generalize principal component analysis (PCA), introduced in Section ??. We overview the details by seeing that PCA is essentially a shallow autoencoder with identity activation functions $\sigma^{[\ell]}(u) = u$ for $\ell = 1, 2$, also known as a linear autoencoder. Specifically, we now summarize how PCA yields one possible solution to the learning optimization problem for linear autoencoders. Note that some of the mathematical details below may be skipped on a first reading without loss of continuity. The key outcome of this subsection is that PCA and linear autoencoders are essentially the same.
Consider the loss (3.40) with identity activation functions. In this case the vector activation functions $S^{[\ell]}(\cdot)$ of (3.39) are each vector identity functions and (3.40) reduces to,
$$C(\theta \,;\, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - W^{[2]} W^{[1]} x^{(i)} - W^{[2]} b^{[1]} - b^{[2]} \big\|^2. \tag{3.41}$$
Now it can be shown¹² that by considering the de-meaned data, we can formulate the objective without the bias vectors $b^{[1]}$ and $b^{[2]}$ in (3.41) and focus on optimizing the loss function,
$$C\big(W^{[1]}, W^{[2]} \,;\, \mathcal{D}_{\text{de-meaned}}\big) = \frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - W^{[2]} W^{[1]} x^{(i)} \big\|^2. \tag{3.42}$$
Here when the data is denoted $\mathcal{D}_{\text{de-meaned}}$, we reuse the notation of the data samples $x^{(i)}$, taking them now as de-meaned feature vectors (see also (??) in Chapter ??). It can be shown that optimization of this new loss function (3.42) is equivalent to optimization of
¹² This is shown by considering the derivative with respect to $b^{[2]}$ as a first step and then reorganizing the objective (3.41) with the expression of $b^{[2]}$ that sets this derivative to 0. See also the notes and references at the end of the chapter.
(3.41). Specifically, if minimizing (3.42) and obtaining minimizers $\hat{W}^{[1]}$ and $\hat{W}^{[2]}$, then $\hat{b}^{[1]}$ can be set to any value and $\hat{b}^{[2]}$ has a specific expression. These bias values, together with $\hat{W}^{[1]}$ and $\hat{W}^{[2]}$, minimize (3.41).
Now it is possible to go one step further in reducing the parameter space. In fact, it can be shown that for the optimum of (3.42) we have $\hat{W}^{[1]} = \big(\hat{W}^{[2]}\big)^{+}$. This is the pseudoinverse as introduced in Section ??. Specifically when $\hat{W}^{[2]}$ is full column rank the pseudoinverse can be represented as,
$$\big(\hat{W}^{[2]}\big)^{+} = \big(\hat{W}^{[2]\top} \hat{W}^{[2]}\big)^{-1} \hat{W}^{[2]\top},$$
and thus the matrix in (3.42) is,
$$\hat{W}^{[2]} \hat{W}^{[1]} = \hat{W}^{[2]} \big(\hat{W}^{[2]}\big)^{+} = \hat{W}^{[2]} \big(\hat{W}^{[2]\top} \hat{W}^{[2]}\big)^{-1} \hat{W}^{[2]\top}.$$
This is the $p \times p$ projection matrix which projects vectors of length $p$ onto the $m$ dimensional column space of $W^{[2]}$. Now using the QR decomposition¹³ this projection matrix may be represented as $V V^\top$ where the $p \times m$ matrix $V$ has orthonormal columns. This means that an equivalent optimization problem to the problem of minimizing (3.42) is
$$\min_{\substack{V \in \mathbb{R}^{p \times m} \\ V^\top V = I_m}} \;\; \frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - V V^\top x^{(i)} \big\|^2, \quad \text{where } x^{(i)} \in \mathcal{D}_{\text{de-meaned}}, \tag{3.43}$$
with $I_m$ denoting the $m \times m$ identity matrix. This constrained optimization problem limits the search space to the space of $p \times m$ matrices that have orthonormal columns. Any minimizer $V$ of (3.43) can then be used as a minimizer of the losses (3.41) or (3.42) by setting,
$$\hat{W}^{[1]} = V^\top, \quad \text{and} \quad \hat{W}^{[2]} = V.$$
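This identity between PCA and the optimal linear autoencoder can be checked numerically. The following sketch (with dimensions of our own choosing) builds $V$ from the SVD of a de-meaned data matrix, sets $\hat{W}^{[1]} = V^\top$ and $\hat{W}^{[2]} = V$, and compares the reconstruction error against a random orthonormal basis.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 200, 5, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated features
X = X - X.mean(axis=0)                                 # de-meaned, as in (3.42)

# V: the first m right singular vectors of X (the PCA loadings).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:m].T                     # p x m, orthonormal columns

W1_hat = V.T                     # linear encoder, m x p
W2_hat = V                       # linear decoder, p x m

codes = X @ W1_hat.T             # each row is V^T x^(i), as in (3.44)
X_hat = codes @ W2_hat.T         # each row is V V^T x^(i), as in (3.45)

def recon_error(U):
    # The objective (3.43) for a p x m matrix U with orthonormal columns.
    return np.mean(np.sum((X - X @ U @ U.T) ** 2, axis=1))

# A random orthonormal competitor via QR.
Q, _ = np.linalg.qr(rng.normal(size=(p, m)))
```

The comparison `recon_error(V) <= recon_error(Q)` reflects the Eckart-Young-Mirsky optimality discussed below: no other orthonormal basis attains a smaller value of (3.43).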
Now let us see the connection to PCA. Recall (??) representing the encoded lower dimensional PCA data as $\widetilde{X} = X V$ where the $n \times p$ matrix $X$ is a (de-meaned) data matrix as in Section ??, the $n \times m$ matrix $\widetilde{X}$ is the reduced data matrix, and the matrix $V$ has columns that are singular vectors from the SVD decomposition of $X$. Hence the projection of each data point $x^{(i)}$ into this lower dimensional space is given by
$$\tilde{x}^{(i)} = V^\top x^{(i)}, \tag{3.44}$$
with $V$ the $p \times m$ matrix of columns $v_1, \ldots, v_m$ that are an orthonormal basis of a reduced subspace of $\mathbb{R}^p$. Further, with PCA, the matrix $V$ can also be used to reconstruct the data as a decoder. Specifically, the decoded data points in the original space can be obtained via
$$\hat{x}^{(i)} = V \tilde{x}^{(i)}. \tag{3.45}$$
Now piecing together the encoder in (3.44) and the decoder in (3.45), the reconstruction error is,
$$\frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - \hat{x}^{(i)} \big\|^2 = \frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - V V^\top x^{(i)} \big\|^2. \tag{3.46}$$
We see that (3.46) agrees with the objective in (3.43) and further the matrix $V$ of PCA agrees with the constraint $V^\top V = I_m$. Now from the Eckart-Young-Mirsky theorem appearing in (??) of Section ?? as well as the relationship between SVD and PCA captured in (??) we have that $V$ of PCA is indeed one of the optimal solutions¹⁴ of (3.43). Hence in summary, PCA is a (shallow) linear autoencoder.

¹³ See the notes and references of Chapter ?? for suggested linear algebra background reading.
Figure 3.13: Encoding and decoding of synthetic $p = 2$ data with a linear autoencoder (PCA) in red and a non-linear shallow autoencoder (blue). The reconstruction of PCA falls on a hyperplane while the non-linear autoencoder projects onto a manifold that is not a hyperplane.
In practice, one would not use gradient based optimization to learn the parameters of linear autoencoders, but rather employ algorithms from numerical linear algebra. Further, the specific basis vectors obtained via PCA are informative since they order vectors according to their variance contributions. In contrast, optimizing (3.43) without considering PCA yields an arbitrary $V$ (with orthonormal columns). Nevertheless, as we see now, our positioning of PCA as a special case of the (non-linear) autoencoder is insightful.
Autoencoders as a Form of Non-linear PCA
As we discussed above, encoding and decoding with PCA projects the $p$ dimensional feature vector $x$ onto an $m$ dimensional subspace. When $p = 2$ and $m = 1$ this can be viewed as a projection of points from the plane onto a line, when $p = 3$ and $m = 2$ this is a projection of points from three dimensional space onto some plane, and similarly in more realistic higher dimensions of $p$, we project onto a linear subspace of dimension $m$. As such, the bottleneck of the linear encoder (PCA) encodes the location of the points on this projected space.
Linear subspace projection is sometimes a sufficient data reduction technique and at other times is not. In such cases there are multiple other forms of non-linear PCA where points are projected onto manifolds that are generally curved. Since autoencoders generalize PCA, they present us with one such rich class of non-linear PCA models.

¹⁴ There is an infinite number of solutions.
As an illustration consider Figure 3.13 where we consider synthetic data with $p = 2$ which we wish to encode with $m = 1$. This means that the bottleneck layer, or the code, represents a value on the real line for each data point. If we use PCA (red) then this encoding translates to a location on a linear subspace of $\mathbb{R}^2$. However, if we modify the identity activation functions in the autoencoder to be non-linear (blue), then the projection is on a manifold which is generally curved. In this example the non-linear activation functions are taken as tanh functions; see also Section ??.
As a further example, consider using an autoencoder on MNIST where $p = 28 \times 28 = 784$ and we use $m = 2$. We encode this via PCA, a shallow non-linear autoencoder, and a deep autoencoder that has hidden layers. In Figure 3.14 we present scatter plots of the codes for various types of autoencoders for both the training and test sets. That is, the autoencoders are trained on the training set and the codes presented are both for the training set data and for the test set data. While the training and testing does not directly involve the labels (the digits $0$–$9$), in our visualization we color the code points based on the labels. This allows us to see how different labels are generally encoded onto different regions of the code space. Recall also Figure ?? (b) which is of a similar nature.
Figure 3.14: Autoencoders applied to MNIST with each of the 10 digits color coded. (a)–(c) are
for the training set and (d)–(f) are for the test set. In the first column with (a) and (d) we use PCA.
In the middle column with (b) and (e) we use a shallow non-linear autoencoder. In the last column
with (c) and (f) we use a deep autoencoder.
Keeping in mind that one application of such data reduction is to help separate the data, it
is evident that as model complexity increases (moving right along the displays of the figure),
somewhat better separation occurs in the data. In particular, compare (d) based on the test set using PCA, and (f) based on the test set using the deep autoencoder. Refer also to Figure 3.12 which illustrates the reconstruction effect for various types of autoencoders (here $m = 30$). In terms of reconstruction, it is also evident in this case that more complex models exhibit better reconstruction ability.
Applications and Architectures
We have already seen the archetypical autoencoder application, namely data reduction. Yet there are many more applications and associated architectures of autoencoders. We now discuss some of these. We first consider various ways in which data reduction can be employed to help with additional machine learning activities. We then discuss de-noising which is a different application from data reduction. We close the chapter with ways in which interpolation in the latent space of the bottleneck is useful. The discussion here is just a brief summary and more information is suggested in the notes and references at the end of the chapter.
In terms of uses of data reduction, let us consider both supervised and unsupervised learning. In the supervised setting, whenever a dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$ is used where the dimension of $x^{(i)}$ is $p$, one may attempt to reduce the dimension of the input features to $m < p$ and obtain a dataset $\widetilde{\mathcal{D}} = \{(\tilde{x}^{(1)}, y^{(1)}), \ldots, (\tilde{x}^{(n)}, y^{(n)})\}$. With such a dataset, training a supervised learning model to predict $y$ in terms of $\tilde{x} \in \mathbb{R}^m$ may sometimes yield much better performance than using the original $x \in \mathbb{R}^p$. This process requires training an autoencoder as well as training of the model. Then in production with unseen data, for any $x$ we first use the trained encoder to obtain $\tilde{x}$. We then use the trained model on $\tilde{x}$ to obtain $\hat{y}$. Note that with such an activity we do not use the decoder per se.
To take this application of data reduction even further, consider a situation with $y^{(i)} \in \mathbb{R}^q$ where $q$ is not small (high dimensional labels such as segmentation maps on images for example). In this case one may encode both the feature vector and the label, each with their own autoencoder, to obtain encoded data of the form $\widetilde{\mathcal{D}} = \{(\tilde{x}^{(1)}, \tilde{y}^{(1)}), \ldots, (\tilde{x}^{(n)}, \tilde{y}^{(n)})\}$, say with $\tilde{x}^{(i)} \in \mathbb{R}^{m_p}$, $\tilde{y}^{(i)} \in \mathbb{R}^{m_q}$, $m_p < p$, and $m_q < q$. One may then train a supervised model using this $\widetilde{\mathcal{D}}$. Such a model predicts $\hat{\tilde{y}} \in \mathbb{R}^{m_q}$ for each encoded feature vector $\tilde{x} \in \mathbb{R}^{m_p}$. With this setup, when the model is used in production, one first observes a feature vector $x$, then uses the feature vector encoder to create $\tilde{x}$, then uses the model to predict $\hat{\tilde{y}}$, and finally uses the label decoder to obtain $\hat{y}$. Hence with such an activity, in production the encoder of the feature vector and the decoder of the label are used.
In terms of unsupervised learning, there are many secondary applications of autoencoder based data reduction as well. As one example, assume that we wish to cluster images. A naive approach may be to treat each image as a vector and then use an algorithm such as K-means (see Section ??) to cluster the vectors. This approach has many drawbacks since the distance between images is based on the exact locations (indices) of pixels within an image. For example, two images of the same object with slightly different camera locations would generally be “far” when comparing the Euclidean distance between the associated vectors. A more suitable approach is to use an autoencoder for data reduction of the image to a low dimension and then to perform clustering on the reduced dimensional codes. See for example Figure 3.14 (f). Here $m = 2$ and with such encoding, clustering the vectors on the two dimensional code space will generally work well.
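The clustering-on-codes idea can be sketched as follows, with a linear (PCA) encoder in place of a trained autoencoder and a few hand-rolled iterations of K-means (Lloyd's algorithm); the synthetic "images" and all dimensions are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy "images": three groups in p = 50 dimensions whose differences live
# in an m = 2 dimensional subspace.
n_per, p, m, K = 60, 50, 2, 3
basis = rng.normal(size=(m, p))
group_means = ([0.0, 4.0], [4.0, 0.0], [4.0, 4.0])
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(n_per, m)) @ basis
               for c in group_means])

# Linear encoder (PCA) standing in for a trained autoencoder encoder.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
codes = Xc @ Vt[:m].T            # n x m matrix of codes

# K-means (Lloyd's algorithm) on the low dimensional codes,
# initialized with one point from each group for simplicity.
centers = codes[[0, n_per, 2 * n_per]].copy()
for _ in range(20):
    d = np.linalg.norm(codes[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    centers = np.stack([codes[labels == k].mean(axis=0) for k in range(K)])
```

Clustering the 50-dimensional vectors directly would mix the group structure with pixel-level noise; on the 2-dimensional codes the groups are well separated and the clusters align with them.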
Figure 3.15: A denoising autoencoder where during training partially destroyed (noised) data is fed into the input. The production system is the trained autoencoder.
We now move away from data reduction and consider an architecture called the denoising autoencoder which is illustrated in Figure 3.15. This model learns to remove noise during the reconstruction step for noisy input data. It takes in partially corrupted input and learns to recover the original denoised input. It relies on the hypothesis that high-level representations are relatively stable and robust to corruption of the input entries, and that the model is able to extract characteristics that are useful for the representation of the input distribution.
Figure 3.16: Interpolation of images with $\lambda = 1/2$. The left and right images of $x^{(i)}$ and $x^{(j)}$ are raw images. The top image is the naive interpolation $x^{\text{naive}}_\lambda$ and the bottom image is obtained via interpolation using an autoencoder, $x^{\text{encoder}}_\lambda$.
In terms of architectures, denoising autoencoders exhibit a similar architecture to the usual autoencoder as they involve an encoder $f^{[1]}_{\theta^{[1]}}(\cdot)$ and decoder $f^{[2]}_{\theta^{[2]}}(\cdot)$. However a key difference is that during the learning process the model is trained on noisy samples and the loss function is modified. First the initial input $x$ is corrupted into $\breve{x}$ via some predefined stochastic mapping performing the partial destruction. In practice the main corruption mappings include adding
Gaussian noise, masking noise (a fraction of the input chosen at random for each example is
forced to 0), or salt-and-pepper noise (a fraction of the input chosen at random for each
example is set to its minimum or maximum value with uniform probability).
Now with the corrupted input $\breve{x}$, the autoencoder encodes and decodes $\breve{x}$ in the usual way yielding a reconstructed $\hat{\breve{x}}$ via,
$$f_\theta(\breve{x}) = f^{[2]}_{\theta^{[2]}}\big(f^{[1]}_{\theta^{[1]}}(\breve{x})\big) =: \hat{\breve{x}}.$$
However, instead of using a loss like (3.36) which would compare $\breve{x}$ and $\hat{\breve{x}}$, we modify the loss function for each observation $i$ to be $C_i(\theta) = \|x^{(i)} - \hat{\breve{x}}^{(i)}\|^2$. Hence during training, we seek parameters $\theta$ that try to remove the noise as much as possible. When used in production on an unseen data sample $x$, the data corruption step is clearly not employed. Instead, at this point the autoencoder output $f_\theta(x)$ is conditioned to generate outputs $\hat{x}$ that are denoised.
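The structure of the denoising loss can be sketched as follows; here masking noise is the chosen corruption, and an untrained identity placeholder stands in for $f_\theta$ since training itself is beyond this snippet.

```python
import numpy as np

rng = np.random.default_rng(5)

def mask_corrupt(x, frac, rng):
    # Masking noise: a random fraction of the entries is forced to 0.
    x_breve = x.copy()
    idx = rng.choice(x.size, size=int(frac * x.size), replace=False)
    x_breve[idx] = 0.0
    return x_breve

# Placeholder for a trained autoencoder f_theta; here just the identity,
# so the loss below isolates the effect of the corruption itself.
f_theta = lambda x: x

x = rng.normal(size=20)                # a clean observation x^(i)
x_breve = mask_corrupt(x, 0.3, rng)    # corrupted input fed to the model
x_hat_breve = f_theta(x_breve)         # reconstruction of the corrupted input

# The denoising loss compares against the CLEAN input, not the corrupted one:
C_i = np.sum((x - x_hat_breve) ** 2)
```

With a trained denoiser in place of the identity, minimizing $C_i(\theta)$ over the parameters pushes $f_\theta(\breve{x})$ toward the clean $x$ rather than toward $\breve{x}$.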
As a third general application of autoencoders, let us consider interpolation on the latent space. Take $x^{(i)}$ and $x^{(j)}$ from $\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}$ and consider the convex combination
$$x^{\text{naive}}_\lambda = \lambda x^{(i)} + (1 - \lambda) x^{(j)},$$
for some $\lambda \in [0, 1]$. Ideally such an $x^{\text{naive}}_\lambda$ may serve as a weighted average between the two observations where $\lambda$ clearly captures which of the observations has more weight. However with most data, such arithmetic on the associated feature vectors is too naive and often meaningless for similar reasons to those discussed above in the context of K-means clustering of images. For example, in the top image of Figure 3.16 we see such interpolation with $\lambda = 1/2$.
When considering the latent space representation of the images it is often possible to create a much more meaningful interpolation between the images. An example is in the bottom image of Figure 3.16. To carry out this interpolation we train an autoencoder and then encode $x^{(i)}$ and $x^{(j)}$ to obtain $\tilde{x}^{(i)}$ and $\tilde{x}^{(j)}$. We then interpolate on the codes to obtain $\tilde{x}_\lambda$, and finally decode $\tilde{x}_\lambda$ to obtain an interpolated image. That is, using the notation of (3.37) and omitting parameter subscripts we have,
$$x^{\text{encoder}}_\lambda = f^{[2]}\big(\lambda f^{[1]}(x^{(i)}) + (1 - \lambda) f^{[1]}(x^{(j)})\big).$$
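The two interpolation schemes can be contrasted with a small sketch; for simplicity we use a linear (PCA) encoder/decoder pair in place of a trained non-linear autoencoder, so the $f^{[1]}$ and $f^{[2]}$ below are our own affine stand-ins.

```python
import numpy as np

rng = np.random.default_rng(6)

n, p, m = 100, 8, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated data
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
V = Vt[:m].T

f1 = lambda x: V.T @ (x - mu)   # affine encoder f^[1]
f2 = lambda z: mu + V @ z       # affine decoder f^[2]

lam = 0.5
x_i, x_j = X[0], X[1]

# Naive interpolation directly on the observations:
x_naive = lam * x_i + (1 - lam) * x_j

# Interpolation on the codes, then decoding:
x_enc = f2(lam * f1(x_i) + (1 - lam) * f1(x_j))
```

For this affine pair the two schemes differ only by a projection, $x^{\text{encoder}}_\lambda = \mu + VV^\top(x^{\text{naive}}_\lambda - \mu)$; with a non-linear autoencoder, interpolating on the codes instead traverses the learned manifold, which is what produces the more meaningful bottom image of Figure 3.16.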
This property of latent space representations, and the ability to interpolate with these representations, also plays a role in the context of generative models discussed in Chapter ??. At this point let us mention that one potential application of such interpolation is for design purposes, say in art or architecture, where one chooses two samples as a starting point and then uses interpolation to see other samples lying “in between”.
Notes and References
A comprehensive book on applied logistic and multinomial regression in the context of statistics
is [
19
] where one can find out about confidence intervals, hypothesis tests, and other aspects of
inference. The statistical origins of logistic regression are most probably due to Chester Ittner Bliss
whom developed the related probit regression model in the 1930’s; see [
5
]. Probit was initially used
for bioassay studies. See also [
12
] for an historical account where the development of the logistic
function (the sigmoid function) as a solution to logistic differential equations is presented. In the
context of machine learning, logistic regression may be viewed as one example of a probabilistic
generative model, see for example Section 4.2 of [4].
These days, in the context of deep learning, logistic regression is treated as the simplest general
non-linear (shallow) neural network. However, the original simplest (non-linear) neural network is
not actually logistic regression but rather the perceptron; see [
22
]. That model, created by Rosenblatt
in 1958, differs from logistic regression in that the activation function is the ; see
(??)
in Chapter
??
.
With this activation function, gradient based optimization is not possible. Yet Rosenblatt introduced
the which finds a classifier in finite time when the data is linearly separable; see also [20].
The phrase “softmax” for (3.25) started to be used in the machine learning community at around 1990; see [8]. The presentation of both logistic regression and multinomial regression as specific cases of generalized linear models is described in [13]. We also mention several variants of these models. These include Dirichlet regression which is used to analyse continuous proportions and compositional response data; see [14], [16], and [17]. Also ordinal regression is relevant for predicting an ordinal response variable. Examples of ordinal models are the ordered logit and ordered probit models. A comprehensive reference for analysing categorical variables is [1] and for ordinal variables see [2]. See also chapters 10 and 12 of [18]. A related term from the world of machine learning is ranking learning, also known as learning to rank. This field builds on ordinal regression; see for example [23]. See [10] as a general reference for inference using maximum likelihood estimation in the context of biostatistics as well as the classic theoretical statistics book [11]. Further, see [6] for additional computational aspects including optimization of multinomial regression using second-order methods.
Our focus on autoencoders in this chapter is mostly on the most popular architecture where the hidden layer has fewer units than the input. This specific architecture is called undercomplete as opposed to the overcomplete case which has more hidden units than the input. We mention that autoencoders in general, and specifically overcomplete architectures, go hand in hand with regularisation, achieving sparse representations of the code; see for example chapter 14 of [15]. A general overview of autoencoder applications and architectures is in [9]. Specifically, different variants of autoencoders have recently emerged including sparse autoencoders, contractive autoencoders, robust autoencoders, and adversarial autoencoders. One particular class explored in Chapter ?? is variational autoencoders. For relationships between PCA and autoencoders see [21] as well as the more classic works [3] and [7].
Bibliography
[1] A. Agresti. Categorical Data Analysis. John Wiley & Sons, 2003.
[2] A. Agresti. Analysis of Ordinal Categorical Data. John Wiley & Sons, 2010.
[3] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 1989.
[4] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[5] C. I. Bliss. The method of probits. Science, 1934.
[6] D. Böhning. Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 1992.
[7] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 1988.
[8] J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing. 1990.
[9] D. Charte, F. Charte, S. García, M. J. del Jesus, and F. Herrera. A practical tutorial on autoencoders for nonlinear feature fusion: Taxonomy, models, software and guidelines. Information Fusion, 2018.
[10] D. Commenges and H. Jacqmin-Gadda. Dynamical Biostatistical Models. CRC Press, 2015.
[11] D. R. Cox and D. V. Hinkley. Theoretical Statistics. CRC Press, 1979.
[12] J. S. Cramer. The origins of logistic regression. Tinbergen Institute Working Paper No. 2002-119/4, 2002.
[13] A. J. Dobson and A. G. Barnett. An Introduction to Generalized Linear Models. Chapman and Hall/CRC, 2018.
[14] J. C. Douma and J. T. Weedon. Analysing continuous proportions in ecology and evolution: A practical introduction to beta and Dirichlet regression. Methods in Ecology and Evolution, 2019.
[15] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[16] R. Gueorguieva, R. Rosenheck, and D. Zelterman. Dirichlet component regression and its applications to psychiatric data. Computational Statistics & Data Analysis, 2008.
[17] R. H. Hijazi and R. W. Jernigan. Modelling compositional data using Dirichlet regression models. Journal of Applied Probability & Statistics, 2009.
[18] J. M. Hilbe. Logistic Regression Models. Chapman and Hall/CRC, 2009.
[19] D. W. Hosmer Jr, S. Lemeshow, and R. X. Sturdivant. Applied Logistic Regression. John Wiley & Sons, 2013.
[20] A. B. Novikoff. On convergence proofs for perceptrons. Technical report, Stanford Research Inst. Menlo Park CA, 1963.
[21] E. Plaut. From principal subspaces to principal components with linear autoencoders. arXiv:1804.10253, 2018.
[22] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
[23] A. Shashua and A. Levin. Ranking with large margin principle: Two approaches. Advances in Neural Information Processing Systems, 2002.