
3 Simple Neural Networks
Autoencoders as a Form of Non-linear PCA
As we discussed above, encoding and decoding with PCA projects the $p$-dimensional feature
vector $x$ onto an $m$-dimensional subspace. When $p = 2$ and $m = 1$ this can be viewed as a
projection of points from the plane onto a line, when $p = 3$ and $m = 2$ this is a projection of
points from three-dimensional space onto some plane, and similarly in more realistic higher
dimensions of $p$, we project onto a linear subspace of dimension $m$. As such, the bottleneck
of the linear encoder (PCA) encodes the location of the points on this projected space.
Linear subspace projection is sometimes a sufficient data reduction technique and at other
times it is not. When it is not, there are multiple forms of non-linear PCA in which points
are projected onto manifolds that are generally curved. Since autoencoders generalize PCA,
they present us with one such rich class of non-linear PCA models.
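
The linear case with $p = 2$ and $m = 1$ can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation used for the figures: the function names `fit_pca_2d`, `encode`, and `decode` and the toy data are assumptions introduced here, and the closed-form angle formula applies only to the two-dimensional case.

```python
import math

def fit_pca_2d(points):
    """Return (mean, direction) of the first principal component of 2-D points."""
    n = len(points)
    mean = (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
    # Entries of the 2x2 scatter matrix (unnormalised covariance);
    # the normalisation does not change the principal direction.
    sxx = sum((x - mean[0]) ** 2 for x, _ in points)
    syy = sum((y - mean[1]) ** 2 for _, y in points)
    sxy = sum((x - mean[0]) * (y - mean[1]) for x, y in points)
    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return mean, (math.cos(theta), math.sin(theta))

def encode(x, mean, v):
    """Code: signed position of x along the principal direction (m = 1)."""
    return (x[0] - mean[0]) * v[0] + (x[1] - mean[1]) * v[1]

def decode(z, mean, v):
    """Reconstruction: the point on the fitted line at position z."""
    return (mean[0] + z * v[0], mean[1] + z * v[1])

# Toy data (made up for illustration) lying exactly on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
mean, v = fit_pca_2d(points)
```

Every reconstruction `decode(encode(x, mean, v), mean, v)` lies on the fitted line through `mean` with direction `v`, whether or not `x` itself does. This is the sense in which the code records a location on a one-dimensional linear subspace.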
As an illustration, consider Figure 3.13, where we use synthetic data with $p = 2$ which we
wish to encode with $m = 1$. This means that the bottleneck layer, or the code, represents a
value on the real line for each data point. If we use PCA (red) then this encoding translates
to a location on a linear subspace of $\mathbb{R}^2$. However, if we modify the identity activation
functions in the autoencoder to be non-linear (blue), then the projection is on a manifold
which is generally curved. In this example the non-linear activation functions are taken as
tanh functions; see also Section 5.3.
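
The effect of swapping the identity activation for tanh can be seen from the forward pass alone. The sketch below assumes a shallow autoencoder with a single scalar code and an activation applied at both the code and the output. The weights are hypothetical and untrained, chosen only for illustration, and the helper names are not from the text.

```python
import math

def identity(u):
    return u

def encode(x, w_enc, b_enc, act):
    """Map a 2-D point x to a scalar code (the m = 1 bottleneck)."""
    return act(w_enc[0] * x[0] + w_enc[1] * x[1] + b_enc)

def decode(z, w_dec, b_dec, act):
    """Map a scalar code z back to a 2-D reconstruction."""
    return tuple(act(w * z + b) for w, b in zip(w_dec, b_dec))

def cross(p0, p1, p2):
    """Zero exactly when the three 2-D points are collinear."""
    return (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p1[1] - p0[1]) * (p2[0] - p0[0])

# Hypothetical, untrained weights chosen only for illustration.
W_ENC, B_ENC = (0.7, -0.4), 0.1
W_DEC, B_DEC = (2.0, 0.5), (0.0, 0.3)

# Decode the same three codes with identity and with tanh activations.
codes = (-0.8, 0.0, 0.8)
linear_pts = [decode(z, W_DEC, B_DEC, identity) for z in codes]
tanh_pts = [decode(z, W_DEC, B_DEC, math.tanh) for z in codes]
```

With the identity activation the three decoded points are collinear, so the reconstructions trace a line as in PCA. With tanh they are not, so the set of possible reconstructions is a curve in the plane.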
As a further example, consider using an autoencoder on MNIST where $p = 28 \times 28 = 784$
and we use $m = 2$. We encode this via PCA, a shallow non-linear autoencoder, and a deep
autoencoder that has hidden layers. In Figure 3.14 we present scatter plots of the codes for
various types of autoencoders for both the training and test sets. That is, the autoencoders
are trained on the training set and the codes presented are both for the training set and for
the test set. While the training and testing do not directly involve the labels (the
digits 0–9), in our visualization we color the code points based on the labels. This allows us
to see how different labels are generally encoded onto different regions of the code space.
Recall also Figure 2.14 (b) which is of a similar nature.
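
The labels enter only after encoding, when the code points are grouped for display. A minimal sketch of that grouping step follows. The codes and labels here are made up for illustration and are not MNIST results, and the helper names are assumptions.

```python
from collections import defaultdict

def group_codes_by_label(codes, labels):
    """Collect the 2-D code points belonging to each label."""
    groups = defaultdict(list)
    for code, label in zip(codes, labels):
        groups[label].append(code)
    return dict(groups)

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

# Made-up codes (m = 2) and labels, standing in for encoder output.
codes = [(0.1, 0.2), (0.3, 0.0), (-1.0, 1.0), (-1.2, 0.8)]
labels = [0, 0, 7, 7]

groups = group_codes_by_label(codes, labels)
centroids = {label: centroid(pts) for label, pts in groups.items()}
```

Each group can then be drawn in its own color. Well separated per-label centroids are one rough indication that different digits occupy different regions of the code space.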
Keeping in mind that one application of such data reduction is to help separate the data, it
is evident that as model complexity increases (moving right along the displays of the figure),
somewhat better separation occurs in the data. In particular, compare (d) based on the
test set using PCA, and (f) based on the test set using the deep autoencoder. Refer also
to Figure 3.12 which illustrates the reconstruction effect for various types of autoencoders
(here $m = 30$). In terms of reconstruction, it is also evident in this case that more complex
models exhibit better reconstruction ability.
Applications and Architectures
We have already seen the archetypical autoencoder application, namely data reduction.
Yet there are many more applications and associated architectures of autoencoders. We
now discuss some of these. We first consider various ways in which data reduction can be
employed to help with additional machine learning activities. We then discuss de-noising,
which is a different application from data reduction. We close the chapter with ways in which
interpolation in the latent space of the bottleneck is useful. The discussion here is just a
brief summary; more information can be found in the notes and references at the end of
the chapter.