i
i
i
i
i
i
i
i
Mathematical Engineering
of Deep Learning
Benoit Liquet, Sarat Moka and Yoni Nazarathy
March 3, 2026
i
i
i
i
i
i
i
i
i
i
i
i
i
i
i
i
8 Specialized Architectures and
Paradigms
In each of the chapters
??
,
??
, and
??
, we presented one concrete deep learning paradigm,
namely feedforward networks, convolutional neural networks, and sequence models respec-
tively. Such models are useful in their own right, yet in the world of deep learning one
often integrates them within more complex architectures for specific activities. For example
the convolutional neural networks of Chapter
??
may be inter-connected with sequence
models of Chapter
??
for applications that involve both images and text. In addition, other
specialized architectures and paradigms have also emerged where in each case, non-trivial
ideas are employed to create powerful models. In the current chapter we present such ideas
emerging from different domains, yet all using deep neural networks. Some of these domains
include generative modelling, where we focus on diffusion models and generative adversarial
networks, after an overview of variational autoencoders. Other domains are in the area of
automatic control and decision making where we present concepts of reinforcement learning.
Finally, we explore the domain of graph neural networks, an area that is proving to be ever
so useful for complex problems that can be represented with graph structures. Without
space constraints, each of these topics deserves its own chapter or a sequence of chapters,
yet within this single chapter we hope that the reader gains an overarching view.
In Section 8.1 we introduce generative modelling principles. For this we introduce principles
of variational autoencoders which are the basis for diffusion models, the topic of Section 8.2.
In Section 8.3, we describe the ideas of generative adversarial networks, which also have
some pinnings in game theory. These two variants, namely diffusion models of Section 8.2
and generative adversarial networks of Section 8.3, have become the most popular means
of generative modelling to date. We continue in Section 8.4 where we outline principles
of reinforcement learning. Towards that end we first define Markov decision processes
and discuss principles of optimal control, and then tie the ideas to deep reinforcement
learning. Note that while the application domain of reinforcement learning differs from the
generative modelling domain of the earlier sections, ideas of Markov chains used in diffusion
models, reappear in reinforcement learning of Section 8.4. Finally, graph neural networks
are introduced in Section 8.5. As we see, graph neural networks generalize the convolutional
neural networks of Chapter
??
while allowing us to have general graph structures within the
data in contrast to simple spatial connections.
8.1 Generative Modelling Principles
The field of generative modelling deals with algorithms and models for creating (generating)
data such as fake images, generated text, or similar. In this space we often think probabilis-
tically and assume that the data has an underlying probability distribution. Our goal is
then to train models that generate random yet realistic data from that distribution, with
or without explicitly capturing the form of the distribution. Generative modelling can be
1
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
applied both in the supervised case (features
x
and labels
y
) and the unsupervised case
(no labels
y
). In Chapter
??
towards the end of Section
??
we mentioned a few names of
supervised learning generative modelling approaches such as naive Bayes and others. In
contrast, now we only focus on unsupervised learning. In such a case observed data
D
is
composed of
{x
(1)
, . . . , x
(n)
}
and we assume that all
x
(i)
are distributed according to a
probability distribution, which we simply denote as
p
(
x
). Note that in general
x
(i)
is a high
dimensional object such as a high resolution color image, and hence
p
(
x
) is a complicated
distribution.
There are many generative approaches, yet our focus in this chapter is on two approaches
that have become very popular due to their ability to generate data that appears realistic.
One approach that has recently become popular is the diffusion model approach, and this
is the focus of Section 8.2. Another approach is the generative adversarial network (GAN)
approach which is the focus of Section 8.3. The ideas of the diffusion approach require
understanding of variational autoencoders. For this purpose, in the bulk of this section, we
introduce variational autoencoders, a class of generative models that is also interesting and
useful in their own right.
In Figure 8.1 we present a schematic of both the GAN approach and the diffusion approach.
In both cases, an underlying idea is to first generate random noise. The space of this noise is
typically called the latent space, and it has a very simple distribution, denoted
1
as
p
(
z
), for
example a multivariate standard normal. In both approaches, the sample
z
is processed as
input to a model whose output is a point
x
, which is approximately distributed according
to
p
(
x
) and hence “looks realistic”. In the GAN case, one often uses
z
which is of lower
dimension than
x
; for example
x
may be a 3
×
200
×
200 dimensional color image while
z
may be a vector of length 100. In contrast, in the diffusion models case, the latent variable
z
has the same dimension as the target sample x.
Both the diffusion approach and the GAN approach embody the conditional distribution
of the data given the noise, denoted as
p
(
x |z
). The GAN approach learns an approximate
algorithm for sampling from
p
(
x |z
) with a so called generator network, without explicitly
learning
p
(
x |z
). In contrast, diffusion models are probabilistic models which approximately
learn
p
(
x |z
) within a so-called decoder. One additional difference between the approaches is
that GANs generate
x
from
z
in one shot, i.e., via one application of the generator network.
In contrast diffusion models iterate over multiple steps of the decoder, starting with the
latent noise variable and eventually reaching the target output; see also Figure 8.4.
Both the diffusion decoder and the GAN generator use deep neural networks with learned
parameters. As illustrated in Figure 8.1, these parameters are learned by training the models
using the training data
D
. In both cases an auxiliary network (not illustrated in Figure 8.1)
also plays a part in training. In GANs this auxiliary network is called the discriminator and
to train the generator we train the discriminator network in parallel. For diffusion models
the auxiliary network is called an encoder and in this case, it is fixed in advance and has no
learned parameters. More details are in Section 8.3 for GANs and Section 8.2 for diffusion
models. We begin the study of generative models with variational autoencoders which lay
down the foundations for understanding diffusion models.
1
Observe that in the context of this chapter we use
p
(
·
) in multiple ways, where the actual distribution
used should be inferred from the argument of the function. For example
p
(
x
) is the distribution of the data
while p(z) is the distribution of the latent space.
2
i
i
i
i
i
i
i
i
8.1 Generative Modelling Principles
2.084
0.412
0.84
0.232
0.424
0.173
0.259
0.985
1.808
0.112
z
D
z
GAN Generator
has learned to sample from p(x|z)
x
Diusion Decoder
has learned p(x|z)
x
Figure 8.1: Generative adversarial networks (GANs) and diffusion models are different approaches
for generative models. Both use a latent space
z
and are able to sample from the conditional
distribution
p
(
x | z
) to generate
x
. For image generation, GANs typically have a smaller dimensional
z
while diffusion models use
z
of the same dimension as the image. Both cases are trained using data
D
where the resulting model in GANs is called a generator, and the resulting model in diffusion
models is called a decoder.
Variational Autoencoders
In Section
??
we introduced basics of autoencoders within the context of shallow autoen-
coders. We now focus on variational autoencoders which are probabilistic enhancements of
autoencoders. We first focus on a statistical perspective of variational autoencoders and
then see a general framework for generating data from p(x) after learning it.
3
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
Like many statistical models, a variational autoencoder is a parametric model where we
assume some parameters
θ
determine
p
(
x
) and hence denote the distribution as
p
θ
(
x
). By
learning
θ
we can learn the distribution and then also generate samples from
p
θ
(
x
). A key
principle for estimation of
θ
is maximum likelihood estimation (MLE). In Section
??
, we
briefly introduced this commonly used approach in the context of logistic regression.
With the maximum likelihood approach, similarly to
(??)
of Chapter
??
, our estimate of
θ
for the data D is
b
θ
MLE
:= argmax
θ
1
n
n
X
i=1
log p
θ
(x
(i)
), (8.1)
where the optimized quantity (barring the 1
/n
term) is the log-likelihood under the as-
sumption of independent and identically distributed elements of
D
. In practice, solving
(8.1)
requires information about the expression or form of p
θ
(x).
In variational autoencoders, as alluded to at the start of the section, an unseen latent
variable
z
plays a key role. More precisely we suppose that
p
θ
(
x
) is coupled with
z
which
has distribution
p
(
z
). The distribution
p
(
z
) is assumed to be simple and does not depend on
the parameters θ. The relationship involving z and x can then be represented as,
p
θ
(x, z) = p
θ
(x |z) p(z). (8.2)
Hence in summary, both the joint distribution
p
θ
(
x, z
) and the conditional distribution
p
θ
(x |z) are paramterized by θ, but p(z) is not parameterized by θ.
We can use (8.2) to obtain p
θ
(x) by marginalizing the joint distribution over z as,
p
θ
(x) =
Z
p
θ
(x |z) p(z)dz. (8.3)
Such a representation is helpful in expressing complex
p
θ
(
x
) using relatively simple expressions
for
p
θ
(
x |z
) and
p
(
z
). However, even in this setting, the integral
(8.3)
is intractable and
hence indirect optimization using ELBO, defined and described in the sequel, is used.
To parameterize
p
θ
(
x |z
) and describe
p
(
z
) we make use of multivariate normal distributions
where the probability density for a random vector
u
is represented via
N
(
u
;
µ,
Σ) with
µ
denoting the mean vector and Σ denoting the covariance matrix. A particular simple case is
the standard multivariate normal where
µ
= 0 (zero vector) and Σ =
I
(identity matrix). A
slightly more general case is setting Σ =
σ
2
I
, where the positive scalar
σ
2
determines the
variance shared among all coordinates.
We sometimes reparametrize
2
a covariance matrix Σ R
p×p
via
Σ = Γ Γ
, where Γ R
p×p
. (8.4)
Covariance matrices are always positive semi definite, and characterizing this constraint on
the individual entries of the matrix can be difficult. In contrast, with reparameterizations to
Γ, one may end up with an unconstrained matrix Γ which is easier to work with.
2
For example Γ in Σ = Γ Γ
can be obtained via a Cholesky factorization in which case Γ is an
unconstrained lower triangular matrix. Other factorizations such as the spectral decomposition (or singular
value decomposition) can also be used.
4
i
i
i
i
i
i
i
i
8.1 Generative Modelling Principles
Starting with
p
θ
(
x |z
) we assume that this conditional distribution is multivariate normal
where the mean vector and covariance matrix are functions of
z
that are both parameterized
by
θ
. In particular, keeping in mind that
m
is the dimension of
z
, and
p
is the dimension
3
of
x, we assume that we have learned functions,
µ
θ
: R
m
R
p
and Σ
θ
: R
m
R
p×p
, (8.5)
each parameterized by θ. With these, we have the conditional distribution set as
p
θ
(x |z) = N
x ; µ
θ
(z), Σ
θ
(z)
. (8.6)
In this case with
(8.3)
and
(8.6)
,
p
θ
(
x
) is a Gaussian mixture model
4
with mixture components
p
θ
(
x |z
) indexed by the values of
z
, and corresponding mixture weights provided by
p
(
z
). It
is well known that any distribution can essentially be closely approximated by a Gaussian
mixture model if the mixture components and mixture weights are properly selected.
5
While the general form of Gaussian mixture models is very versatile, in variational autoen-
coders we make some simplifying assumptions. First and most importantly we assume that
the distribution of the latent variable
z
is multivariate standard normal. Namely, the mixture
weights are,
p(z) = N(z ; 0, I). (8.7)
Further, it is common to reduce the complexity of the mixture components
(8.6)
and assume
that the covariance function is simply
σ
2
I
where
σ
2
is a pre-determined (not learned)
hyper-parameter. Thus, (8.6) is reduced to
p
θ
(x |z) = N
x ; µ
θ
(z), σ
2
I
, (8.8)
and now
(8.8)
together with
(8.3)
implies that
θ
parameterizes the distribution of
x
only
via the mean function
µ
θ
(
·
). In the context of deep learning, this mean function is taken
as a neural network with
θ
representing the weights and biases. However in the general
framework of variational autoencoders the mean function
µ
θ
(
·
) could be any model and not
necessarily a deep neural network.
With the model defined, suppose now that based on data
D
we manage to approximate the
maximum likelihood estimate
(8.1)
and learn
θ
for
µ
θ
(
·
). We can now use the model as a
generative model similar to the spirit of Figure 8.1. In particular to generate a new random
data sample
x
, we first generate a random latent sample
z
from the standard multivariate
normal distribution. We then compute
µ
θ
(
z
) using the learned model, and then generate
x
using
(8.8)
where the mean taken is
µ
θ
(
z
). In this sense, every random
z
yields its
own
µ
θ
(
z
) which in turn yields a random
x
. This generative process can be illustrated as
follows:
z
µ
θ
(·)
µ
θ
(z
)
p
θ
(x | z)
x
. (8.9)
Note that the diffusion models alluded to in Figure 8.1 are a variant of variational autoen-
coders and we cover the details of this powerful class of generative models the Section 8.2.
3
In case x is an image we vectorize it and assume the dimension is p.
4
A Gaussian mixture model is a probability mixture model where a collection Gaussian distributions are
“weighted” to create a new probability distribution. As with any mixture model, it is composed of mixture
components, and mixture weights, both of which appear under the integral sign as in (8.3).
5
Mathematically this property can be phrased as the fact that Gaussian mixture models are dense in the
space of probability distributions.
5
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
As already mentioned, variational autoencoders are related to the standard autoencoders
presented in Section
??
, yet standard autoencoders are deterministic while variational
autoencoders are probabilistic. It is this this difference that allows variational autoencoders
to be used as generative models while standard autoencoders on their own cannot. To see
this distinction, assume we were to try and carry out a generative procedure similar to
(8.9)
with a decoder trained on a standard autoencoder. This can be represented as follows:
z
Decoder
x
,
where again we would sample
z
from some predetermined distribution. However the standard
autoencoder on its own does not capture the distribution in the latent space where meaningful
samples of
z
are present. To see this, consider for example Figure
??
in Chapter
??
, and
observe that
z
would be some random point in the latent space and then mapped to a
random output
x
via the decoder. With this, there is no guarantee to get any realistic
x
because the model does not capture the region of interest in the latent space. Hence
standard autoencoders, while useful for other purposes as discussed in Section
??
, are not
useful as generative models on their own right.
The Encoder-Decoder Architecture for Variational Autoencoders
As alluded to above, a variational autoencoder has both an encoder and a decoder. The main
object used for generative models,
p
θ
(
x |z
), is represented via the decoder and as described
above this distribution is parameterized by the mean function
µ
θ
(
·
) as well as sometimes by
the covariance function Σ
θ
(
·
). However, for learning the parameters
θ
, we require the full
(variational) encoder-decoder architecture as illustrated in Figure 8.2.
Training
Production
x
q
φ
(z | x)
z
p
θ
(x | z
)
x
Data
sample
Encoder
Latent
variable
Decoder
Generated
data sample
Figure 8.2: The variational autoencoder architecture comprised of an encoder
q
ϕ
(
z | x
) and a
decoder p
θ
(x | z).
Like the decoder, the encoder can be a neural network parameterized by weights and biases
denoted as
ϕ
. Also like the decoder, the encoder is probabilistic in the sense that it describes
a probability distribution, denoted as
q
ϕ
(
z |x
) and known as the variational posterior of the
latent variable, that provides a distribution of the latent variable
z
given the data sample
x
.
That is, the encoder transforms an input data sample
x R
p
from
D
to a latent variable
6
i
i
i
i
i
i
i
i
8.1 Generative Modelling Principles
z R
m
. Like the decoder, the encoder has a mean function and a covariance function which
we denote respectively as,
µ
ϕ
: R
p
R
m
and Σ
ϕ
: R
p
R
m×m
, (8.10)
and the conditional distribution of the latent variable
z
given a data point
x
is assumed to
be normally distributed. Namely,
q
ϕ
(z |x) = N
z ; µ
ϕ
(x), Σ
ϕ
(x)
. (8.11)
Hence compare the encoder’s
(8.10)
and
(8.11)
with the decoder’s
(8.5)
and
(8.6)
respectively.
Taking all
x D
under consideration, the latent variable sample marginal distribution is
given by,
q
ϕ,D
(z) =
1
n
X
x∈D
q
ϕ
(z |x). (8.12)
Ideally, like
(8.7)
, we would like
q
ϕ,D
(
z
) to be a standard normal distribution in which case
it no longer depends on
ϕ
and
D
. This is then set as a training goal when jointly training
the encoder (learning
ϕ
) and decoder (learning
θ
) in a variational autoencoder. To achieve
such a goal, we aim to minimize a loss function of the general form,
C(ϕ, θ ; D) = C
PriorMatching
(ϕ) + C
Reconstruction
(ϕ, θ). (8.13)
In general this loss depends on the data
D
, yet for simplicity we omit this relationship
for the two terms on the right hand side. Minimization of the first term,
C
PriorMatching
(
ϕ
),
aims to set the latent distribution
(8.12)
to be as close as possible to a standard normal. It
is called prior matching because when treating the variational autoencoder as a Bayesian
inference model, the latent distribution can be viewed as a prior distribution. Minimization of
the second term,
C
Reconstruction
(
ϕ, θ
), aims to capture maximum likelihood estimation with
respect to both encoder and decoder parameters and hence achieve optimal reconstruction
of the output distribution. Details of implementing these loss functions are below, after we
introduce the concept of the evidence lower bound.
Relations to Maximal Likelihood and ELBO
Our aim with training variational autoencoders is that minimization of
(8.13)
will act as
a means for maximum likelihood estimation as in
(8.1)
. In our exposition here we focus
on a simple stochastic gradient descent setting and thus we consider a single datapoint
x D
. Extending to multiple datapoints, or mini-batches is straightforward. Our goal is to
optimize,
max
θR
p
log p
θ
(x). (8.14)
For this goal, let us decompose
p
θ
(
x
) where in our decomposition, we use the encoder
distribution
q
ϕ
(
z |x
) as well as distributions associated with the joint distribution of the
decoder from
(8.2)
. Namely we use the joint distribution
p
θ
(
x, z
), as well as
p
θ
(
z |x
) which
uses the decoder to describe the distribution of the latent variable
z
given
x
. This distribution
can be obtained via,
p
θ
(z |x) =
p
θ
(x, z)
p
θ
(x)
. (8.15)
7
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
Note that
p
θ
(
x, z
) and
p
θ
(
z |x
) are not operationally used (in a learning or inference
algorithms) but rather only appear theoretically for the decomposition. A key quantity in
the decomposition is called the evidence lower bound (ELBO), and is defined as,
ELBO(θ, ϕ ; x) =
Z
q
ϕ
(z |x) log
p
θ
(x, z)
q
ϕ
(z |x)
dz. (8.16)
Note that
ELBO
is an expectation of
log
p
θ
(x,Z)
q
ϕ
(Z | x)
where
Z
is distributed according to
q
ϕ
(
·|x
).
With ELBO defined, the decomposition of the log-likelihood log p
θ
(x) is,
log p
θ
(x) = ELBO(θ, ϕ ; x) + D
KL
(q
ϕ
(z|x) p
θ
(z|x)) , (8.17)
where
D
KL
(q
ϕ
(z|x) p
θ
(z|x))
denotes the KL-divergence (see
(??)
in Appendix
??
) which is
a measure of how
q
ϕ
(
z |x
) is different from
p
θ
(
z |x
). Note that the KL-divergence is always
non-negative, and equal to zero if and only if both distributions are identical. To see
(8.17)
,
observe that,
Z
q
ϕ
(z |x) log
p
θ
(x, z)
q
ϕ
(z |x)
dz
| {z }
ELBO(θ,ϕ ; x)
+
Z
q
ϕ
(z |x) log
q
ϕ
(z |x)
p
θ
(z |x)
dz
| {z }
D
KL
(q
ϕ
(z|x) p
θ
(z|x))
=
Z
q
ϕ
(z |x) log
p
θ
(x, z)
p
θ
(z |x)
dz = log p
θ
(x),
where moving from left to right, in the first step we combine the log terms and in the second
step we use (8.15).
A consequence of the decomposition
(8.17)
as well as the non-negativity of the KL-divergence
is that ELBO is a lower bound on the log-likelihood. Namely,
log p
θ
(x) ELBO(θ, ϕ ; x), (8.18)
hence the name “lower bound” in
ELBO
. Note that the log-likelihood is sometimes called
the evidence, and hence the term “evidence” in
ELBO
. Equality holds in
(8.18)
if and only if
q
ϕ
(z |x) and p
θ
(z |x) are the same.
Now let us assume that the encoder model is flexible enough to yield any mean and covariance
function as in
(8.10)
. With this assumption on the flexibility of the encoder, there exists
some
ϕ
such that
q
ϕ
(
z |x
) is equal to
p
θ
(
z |x
). Since
log p
θ
(
x
) does not depend on
ϕ
we have
from (8.18) and (8.17) that such a ϕ maximizes the evidence lower bound. Namely,
log p
θ
(x) = max
ϕR
m
ELBO(θ, ϕ ; x). (8.19)
With such a representation of
log p
θ
(
x
) for any
θ
, we can also maximize
(8.19)
over
θ
to
obtain,
max
θR
p
log p
θ
(x) = max
θR
p
, ϕR
m
ELBO(θ, ϕ ; x). (8.20)
We thus see that the maximum likelihood estimate
(8.1)
, when constrained to a single
x D
, can be obtained by maximization of
ELBO
in terms of both the encoder parameters
ϕ
and the decoder parameters
θ
. This general idea of indirect likelihood optimization via
maximization of
ELBO
(with the additional
ϕ
set of parameters for the encoder) is the crux
of training variational autoencoders. The importance of this approach is that approximate
maximization of
ELBO
is computationally feasible in contrast to direct maximum likelihood
estimation, where the integration in (8.3) is a computational barrier.
8
i
i
i
i
i
i
i
i
8.1 Generative Modelling Principles
Details of the Loss Function
We now see how minimization of the encoder-decoder loss function
(8.13)
can be constructed
for approximate maximization of ELBO. For this let us expand the expression in (8.16) as,
ELBO(θ, ϕ ; x) =
Z
q
ϕ
(z |x) log
p
θ
(x |z) p(z)
q
ϕ
(z |x)
dz
=
Z
q
ϕ
(z |x) log p
θ
(x |z)dz
|
{z }
E log p
θ
(x | Z)
Z
q
ϕ
(z |x) log
q
ϕ
(z |x)
p(z)
dz
| {z }
D
KL
(q
ϕ
(z|x) p(z))
. (8.21)
In the first equality we used the representation of the joint distribution
p
θ
(
x, z
) from
(8.2)
. Then in the second equality we expand the logarithm. The resulting first term is an
expectation of the function
log p
θ
(
x |Z
) when
Z
is distributed according to
q
ϕ
(
·|x
). For the
second term, keeping in mind that
p
(
z
) as in
(8.7)
is a parameter-free multivariate standard
normal, we have a KL-divergence that measures how close the encoder distribution is to the
standard normal. Keep in mind that this a different application of the KL-divergence to the
one used for lower bounding the evidence in (8.17).
With the
ELBO
expression present, we can now fill in the details for the terms in the loss
expression
(8.13)
, such that minimization of this loss works towards maximization of
ELBO
.
Our focus is on a stochastic gradient descent approach as in Section
??
where at any gradient
descent step, instead of considering the whole data
D
, we consider a single arbitrary
x D
.
In such a case, the loss components on the right hand side of
(8.13)
can be defined based
on a single data sample
x D
where, as shown below, for the second component we also
generate a random latent variable z
R
m
from (8.11).
The idea of minimizing the first term of
(8.13)
,
C
PriorMatching
(
ϕ
), is to drive the parameters
such that
q
ϕ,D
(
z
) of
(8.12)
is approximately a multivariate standard normal as in
(8.7)
. For
this we set,
C
PriorMatching
(ϕ) = µ
ϕ
(x)
µ
ϕ
(x) log det
Σ
ϕ
(x)
+ tr
Σ
ϕ
(x)
, (8.22)
where
det
(
·
) is the determinant of a matrix and
tr
(
·
) is the trace of a matrix. Minimization of
this expression is equivalent to minimization of the KL-divergence where the first argument
is a multivariate normal distribution with mean vector
µ
ϕ
(
x
) and covariance matrix Σ
ϕ
(
x
),
and the second argument is an
m
-dimensional standard multivariate normal distribution.
This is because,
D
KL
(q
ϕ
(z|x) p(z)) =
1
2
C
PriorMatching
(ϕ)
m
2
,
as can be verified with (??) of Appendix ??.
Moving onto the second term in
(8.13)
, using a single
x
and a single
z
, we construct this
term as,
C
Reconstruction
(ϕ, θ) =
x µ
θ
(z
)
Σ
θ
(z
)
1
x µ
θ
(z
)
+ log det
Σ
θ
(z
)
. (8.23)
This term aims to capture the
E log p
θ
(
x |Z
) term in
(8.21)
. Observe that
C
Reconstruction
(
ϕ, θ
)
is a random variable with distribution influenced by
ϕ
through the distribution of
z
. The
form of
(8.23)
arises from the log-density expression in
(??)
of Appendix
??
. We do not
have an explicit expression for the expectation
E log p
θ
(
x |Z
) term of
(8.21)
, therefore, as
9
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
a simple estimate we use a single sample
z
of the latent variable from the distribution
q
ϕ
(z |x) and rely on the approximate relationship,
E log p
θ
(x |Z) log p
θ
(x |z
), (8.24)
as a proxy for the optimization. Now considering (8.23) we have,
log p
θ
(x |z
) =
1
2
C
Reconstruction
(ϕ, θ) +
1
2
log(2π).
Hence minimization of
C
Reconstruction
(
ϕ, θ
) maximizes
log p
θ
(
x|z
), which approximates
maximization of E log p
θ
(x |Z) via (8.24).
We have thus seen that the encoder-decoder architecture, learned via stochastic gradient
descent with loss function
(8.13)
and individual terms
(8.22)
and
(8.23)
approximates
maximization of ELBO, and thus facilitates maximum likelihood estimation of θ.
The Reparameterization Trick
During training of the variational autoencoder, we need to compute the gradient of
the loss function
(8.13)
. With standard backpropagation we can compute the gradient
of the
C
PriorMatching
(
ϕ
) component in a straight forward manner. However, since the
C
Reconstruction
(
ϕ, θ
) component is based on a random sample
z
, it is not obvious how
to use backpropagation through the number random generator. Luckily, via a reparametriza-
tion of the random vector
z
as an affine function of a standard random vector
ϵ
, we can
overcome this difficulty. Specifically, if ϵ
is a standard m-dimensional multivariate normal
random vector, then we may set,
z
= µ
ϕ
(x) + Γ
ϕ
(x) ϵ
, (8.25)
where as before
µ
ϕ
(
x
) is the learned mean function of the encoder. With the reparam-
eterization
(8.25)
, the desired learned covariance function Σ
ϕ
(
x
) is decomposed using
Σ
ϕ
(
x
) = Γ
ϕ
(
x
) Γ
ϕ
(
x
)
, as in
(8.4)
. Now Γ
ϕ
(
·
) is learned in the encoder neural network in
place of Σ
ϕ
(
·
) of
(8.10)
. With this, the distribution of
z
is as required, and further there
are no longer issues with enforcing the positive definite constraint that is required for the
covariance matrix.
Importantly, backpropagation can now be carried out, because the random numbers generated
in
ϵ
are now generated independently of the parameters whose gradients we seek. In some
cases, the desired covariance matrix is diagonal. In such cases, the reparameterization is
especially simple since Γ
ϕ
(
x
) is also diagonal with each entry being the square root of the
associated diagonal entry of Σ
ϕ
(x).
8.2 Diffusion Models
Diffusion models are a class of generative models that have shown great promise in creating
both realistic looking images, as well as highly impressive surreal artwork. Figure 8.3 presents
a few examples with several paradigms of application including image generation, colorization,
style transfer, and text to image creation. The ease of use from a user’s perspective of such
platforms, and the quality of the images created is impressive.
10
i
i
i
i
i
i
i
i
8.2 Diffusion Models
(a) (b)
(c) (d)
Figure 8.3: Images generated via various diffusion architectures
6
with various types of paradigms.
(a) A 256
×
256 generated images based on a label. (b) Image-to-image generation (colorization). (c)
Applying a style (left column) to an image (middle column). (d) A text-to-image application.
11
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
The principles underlying diffusion models hinge on variational autoencoders presented in the
previous section, as well as on a generalisation of such models, called hierarchical variational
autoencoders and in particular Markovian hierarchical variational autoencoders. Diffusion
models are a special case of such models, hence in exploring diffusion models, we first study
Markovian hierarchical variational autoencoders. We then construct diffusion models as a
special case of Markovian hierarchical variational autoencoders, based on auto-regressive
Gaussian processes.
Hierarchical Variational Autoencoders
In the autoencoders of Section 8.1 the encoder acts as a noising mechanism that converts
the input
x
into a noisy latent variable
z
. On the other hand the decoder can be viewed as
a denoising mechanism for converting a noisy latent variable
z
to meaningful output. The
main idea of hierarchical variational autoencoders is to break up this process into multiple
levels. In the encoder, the input
x
is noised slightly to create
z
1
, then
z
1
is further noised to
create
z
2
up until some final level
T
with a fully noisy latent variable
z
T
. In the decoder
the reverse process takes place where each step from
z
t
to
z
t1
enacts a slight denoiseing
operation.
Hence in hierarchical variational autoencoders, instead of a single latent variable
z
, we have
a sequence of
T
levels with latent variables,
z
1
, . . . , z
T
. Such models also enforce a specific
dependence structure within this sequence and are called “hierarchical”, since in the encoder
each
z
t
can be generated based on the values of the lower-level latent variables
z
1
, . . . , z
t1
and in the decoder each latent variable
z
t
can be generated based on the values of the higher
latent variables
z
t+1
, . . . , z
T
. In such a model, one can view the input
x D
as the 0-th
level, namely the complete sequence of levels is x, z
1
, . . . , z
T
, where we can treat x as z
0
.
The most common hierarchical variational autoencoders are Markovian in which case the
sequence
x, z
1
, . . . , z
T
is a Markov chain, and the model is called a Markovian hierarchical
variational autoencoder. That is, the sequence follows the Markov property which can be
defined in multiple ways. One way is that for any
t
= 1
, . . . , T
1, given the value of
z
t
, the
sequence x, z
1
, . . . , z
t1
and the sequence z
t+1
, . . . , z
T
are independent.
Considering the decoder, a consequence of the Markov property is that the joint distribution
p
θ
(
x, z
1
, . . . , z
T
) can be described as a product of one step transition probabilities. We have,
p
θ
(x, z
1
, . . . , z
T
) = p
θ
1
(x |z
1
)p
θ
2
(z
1
|z
2
) ···p
θ
T
(z
T 1
|z
T
)p(z
T
), (8.26)
where
θ
= (
θ
1
, . . . , θ
T
) are the decoder parameters, and each
p
θ
t
(
z
t1
|z
t
) is the decoder
conditional one step transition probability from level
t
to level
t
1. Note that in this
expression,
p
(
z
T
), the distribution of level
T
, is parameter free. Since it is assumed to be
fully noisy, it is often taken as a multivariate standard normal similar to
(8.7)
from the
(non-hierarchical) variational autoencoder.
In such models, the one step transition probabilities
p
θ
t
(
·|z
t
) are generally represented
as multivariate Gaussian distributions where for level
t
, we learn neural networks for the
mean function
µ
θ
t
(
z
t
) and potentially for the covariance function Σ
θ
t
(
z
t
). Hence the learned
parameters
θ
t
describe the distribution at level
t 1
given the value at the previous level,
z
t
.
6
Image (a) is thanks to Ho, et al. [
39
]. Image (b) is thanks to Saharia, et. al. [
78
]. Image (c) is thanks to
Wang, et. al. [97]. Image (d) is thanks to Nichol, et. al. [66].
12
i
i
i
i
i
i
i
i
8.2 Diffusion Models
Observe that the Markovian hierarchical variational autoencoder joint distribution in
(8.26)
is the parallel of the joint distribution
(8.2)
for the non-hierarchical case. A potential benefit
of using
(8.26)
is that by breaking up the parameterization of the decoder into
T
smaller
steps, each with its own learned parameters, more expressive models can be achieved.
A decoder parameterized by
θ
= (
θ
1
, . . . , θ
T
) implements a generative model, similarly
to the non-hierarchical case. First
z
T
is drawn from the distribution
p
(
z
T
), and then for
each transition we compute
µ
θ
t
(
z
t
) and Σ
θ
t
(
z
t
) and then draw
z
t1
from the transition
probability
p
θ
t
(
·|z
t
) parametrized by these values. Hence while a non-hierarchical variational
autoencoder draws
z
once and then generates
x
, here we draw
z
T
, then
z
T 1
, and so forth
until attaining a generated data sample x
= z
0
.
x
z
1
q
φ
0
(z
1
|x)
p
θ
1
(x|z
1
)
z
t1
z
t
q
φ
t1
(z
t
|z
t1
)
p
θ
t
(z
t1
|z
t
)
z
t+1
q
φ
t
(z
t+1
|z
t
)
p
θ
t+1
(z
t
|z
t+1
)
z
T 1
z
T
q
φ
T 1
(z
T
|z
T 1
)
p
θ
T
(z
T 1
|z
T
)
Encoder
Decoder
Figure 8.4: A Markovian hierarchical variational autoencoder has a latent space
z
1
, z
2
, . . . , z
T
,
with
x
=
z
0
. The encoder transforms from
z
t
to
z
t+1
(adding noise) and the decoder works in the
opposite direction (removing noise).
Like non-hierarhcial variational autoencoders, in Markovian hierarchical variational autoen-
coders, the encoder serves as an aid for training the generative model (the decoder). The
encoder also incorporates a Markovian structure within levels. The encoder parameters
are
ϕ
= (
ϕ
0
, . . . , ϕ
T 1
), where this time for
t
= 0
, . . . , T
1, we have conditional one step
transition probabilities,
q
ϕ
t
(
z
t+1
|z
t
) describing the transition from level
t
to level
t
+ 1. Like
the decoder, in the encoder we can assume Gaussian transition probabilities which are this
time parameterized by µ
ϕ
t
(z
t
) and Σ
ϕ
t
(z
t
). See Figure 8.4.
Considering the encoder, it is useful to represent the conditional distribution of
z
1
, . . . , z
t
given
x
for any
t
= 1
, . . . , T
, denoted as
q
ϕ
(
z
1
, . . . , z
t
|x
). Due to the Markovian assumption
we have,
q
ϕ
(z
1
, . . . , z
t
|x) = q
ϕ
0
(z
1
|x)q
ϕ
1
(z
2
|z
1
) ···q
ϕ
t1
(z
t
|z
t1
), for t = 1, . . . , T. (8.27)
Further, the distribution of
z
t
for a given
x
is denoted as
q
ϕ
(
z
t
|x
) which may formally be
represented as a marginal distribution from q
ϕ
(z
1
, . . . , z
t
|x), and obtained via,
q
ϕ
(z
t
|x) =
Z
···
Z
q
ϕ
(z
1
, . . . , z
t
|x) dz
1
···dz
t1
. (8.28)
13
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
In the specific diffusion models below this distribution is constructed with an explicit form,
yet for now the general expression (8.28) is appropriate.
With this notation and the construction of the encoder and decoder in place, we can seek
to approximate maximum likelihood estimation of
θ
. This is done in a similar manner to
the non-hierarchical variational autoencoder case outlined in Section 8.1. Specifically we
maximize an ELBO term over both ϕ and θ.
The
ELBO
term in this case is similar to
(8.16)
yet can be represented via
z
1
, . . . , z
T
in place
of z. Namely,
ELBO(θ, ϕ ; x) =
Z
q
ϕ
(z
1
, . . . , z
T
|x) log
p
θ
(x, z
1
, . . . , z
T
)
q
ϕ
(z
1
, . . . , z
T
|x)
dz
1
···dz
T
. (8.29)
Now using the Markovian structure of
(8.26)
,
(8.27)
, and after some extensive manipulation,
the following expansion of ELBO(θ, ϕ ; x) arises:
Z
q
ϕ
0
(z
1
|x) log p
θ
1
(x |z
1
)dz
1
| {z }
E log p
θ
1
(x | Z
1
)
(Reconstruction)
Z
z
T
q
ϕ
(z
T
|x) log
q
ϕ
(z
T
|x)
p(z
T
)
dz
T
| {z }
D
KL
(q
ϕ
(z
T
|x) p(z
T
))
(Prior matching) (8.30)
T
X
t=2
Z
z
t
q
ϕ
(z
t
|x)
Z
q
ϕ
(z
t1
|z
t
, x) log
q
ϕ
(z
t1
|z
t
, x)
p
θ
t
(z
t1
|z
t
)
dz
t
.
| {z }
E D
KL
(
q
ϕ
(z
t1
|Z
t
,x) p
θ
t
(z
t1
|Z
t
)
)
(Denoising matching)
With this expansion
ELBO
is now represented in terms of three types of terms, namely a
reconstruction term, a prior matching term, and denoising matching terms. The first two
terms are also present in non-hierarchical variational autoencoders whereas the denoising
matching terms arise due to the multi-level structure of the latent space.
The reconstruction term in
(8.30)
is an expectation of the conditional log-likelihood with
respect to
Z
1
, distributed according to
q
ϕ
0
(
z
1
|x
). Maximization of this term drives maximum
likelihood estimation. The prior matching term in
(8.30)
is the KL-divergence between the
standard normal distribution
p
(
z
T
) and the encoder based last step distribution
q
ϕ
(
z
T
|x
).
Minimization of this KL-divergence attempts to enforce that
z
T
at the exit of the encoder is
distributed approximately as a standard normal.
In the additional
T
1 terms, we use
q
ϕ
(
z
t1
|z
t
, x
) which can be viewed as a denoising
distribution in the encoder. Note that in contrast to the direction of the encoder (
z
t1
to
z
t
),
this distribution is in the opposite direction. The expected KL-divergence in each denoising
matching term is between
q
ϕ
(
z
t1
|z
t
, x
) and the decoder’s one step transition
p
θ
t
(
z
t1
|z
t
).
Ideally both the encoder and the decoder are similar at level
t
and minimization of the
expected KL-divergence enforces the denoising to be as close as possible. Note that when
represented as an expectation, the expectation is with respect to the random variable
Z
t
distributed as q
θ
(z
t
|x).
14
i
i
i
i
i
i
i
i
8.2 Diffusion Models
As is evident, the
ELBO
(
θ, ϕ
;
x
) expressions rely on
q
ϕ
(
z
t
|x
) which in general needs to
be computed via
(8.27)
and
(8.28)
. It also relies on
q
ϕ
(
z
t1
|z
t
, x
), which is generally
difficult to compute. We now see, that in the special case of a diffusion model, with the
assumptions imposed, the expressions for
q
ϕ
(
z
t
|x
) and
q
ϕ
(
z
t1
|z
t
, x
) simplify and are
efficient to compute.
The process of learning parameters
θ
for Markovian hierarchical variational autoencoders is in
principle similar to learning variational autoencoders as presented in Section 8.1. Specifically,
we approximately maximize
ELBO
(
θ, ϕ
;
x
) over both
θ
and
ϕ
using estimates obtained
via samples of the latent variables. The resulting decoder with parameters
θ
emerges and
approximates maximum likelihood estimation as in the non-hierarchical case.
The Diffusion Model Assumptions
A diffusion model is a Markovian hierarchical variational autoencoder with a few specific
assumptions. All latent variables are of the same dimension of the data samples. Fur-
ther, recalling that
x
=
z
0
, the one step encoder and decoder transition probabilities are
respectively,
q(z
t+1
|z
t
) = N(z
t+1
;
p
1 β
t
z
t
, β
t
I) for t = 0, . . . , T 1, (8.31)
p
θ
t
(z
t1
|z
t
) = N(z
t1
; µ
θ
t
(z
t
), σ
2
t
I) for t = T, . . . , 1. (8.32)
Here we have two sequences of scalar hyper-parameters. The encoder hyper-parameters are
β
0
, . . . , β
T 1
, each in the range [0
,
1]. The decoder hyper-parameters are
σ
2
1
, . . . , σ
2
T
, each a
positive number. Note that the encoder has no learned parameters and thus there are no
ϕ
t
subscripts for
q
(
z
t+1
|z
t
). The decoder learned parameters,
θ
1
, . . . , θ
T
, are used for the
mean functions
µ
θ
1
(
·
)
, . . . , µ
θ
T
(
·
), but not for the one step covariance matrices that are of
the form
σ
2
1
I, . . . , σ
2
T
I
. Each of these mean functions,
µ
θ
t
(
·
), is a neural network, and hence
the model has T learned neural networks.
A key aspect of the diffusion model is the structure of the one step transition probabilities of
the encoder
(8.31)
. Since the encoder steps have a conditional mean vector and covariance
matrix of
1 β
t
z
t
and
β
t
I
respectively, they describe an auto-regressive stochastic sequence
of the form,
z
t+1
=
p
1 β
t
z
t
+
p
β
t
ϵ
t
, (8.33)
where
ϵ
t
is a multivariate standard normal vector. Conditioned on the value of
z
t
, the
expected value of
z
t+1
resulting from
(8.33)
is
1 β
t
z
t
and the covariance matrix is
β
t
I
.
Hence, this stochastic recursion then follows the next step distribution
(8.31)
. The name
“diffusion model” can be attributed to the fact that if
(8.33)
was in continuous time, then
the process is a type of a stochastic diffusion process.
A strength of the recursion
(8.33)
, is that it also enables closed form expressions of the
multi-step transition probabilities
q
ϕ
(
z
t
|x
). Such probabilities are heavily used in the
ELBO
expression
(8.30)
. For this purpose it is useful to define the products
γ
t
for
t
= 1
, . . . , T
,
with
γ
t
= (1 β
0
) ···(1 β
t1
).
15
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
With this notation, standard recursive computations of means and variances involving
geometric sums yield,
q(z
t
|x) = N(z
t
;
γ
t
x, (1 γ
t
)I) for t = 1, . . . , T. (8.34)
Hence for such an auto-regressive stochastic sequence, the distribution of
z
t
given the initial
value x has a mean vector
γ
t
x and a covariance matrix (1 γ
t
)I.
It is of further interest to have explicit expressions for
q
(
z
t1
|z
t
, x
), a probability appearing
in the denoising matching term in
(8.30)
. The diffusion model assumptions enable us to
obtain this conditional probability as well. Namely,
q(z
t1
|z
t
, x) =
q(z
t
|z
t1
, x) q(z
t1
|x)
q(z
t
|x)
=
q(z
t
|z
t1
) q(z
t1
|x)
q(z
t
|x)
= N
z
t1
;
(1 γ
t1
)
p
1 β
t1
1 γ
t
z
t
+
γ
t1
β
t1
1 γ
t
x
| {z }
Mean vector
,
β
t1
(1 γ
t1
)
1 γ
t
I
| {z }
Covariance Matrix
!
.
(8.35)
The first equality follows from Bayes’ rule of conditional probability, maintaining the
condition on
x
. In the second equality we use the Markovian structure which implies that
q
(
z
t
|z
t1
, x
) =
q
(
z
t
|z
t1
). Then to obtain the final expression we manipulate normal
densities as given by
(8.31)
and
(8.34)
. Note that this final manipulation requires a few
algebraic steps involving completion of the squares; we omit the details. Hence we have
explicit expressions for the conditional (on
z
t
and
x
) mean vector and covariance matrix of
z
t1
.
Loss Function
Training a diffusion model involves learning the parameter vectors
θ
1
, . . . , θ
T
. This is the
learning of
T
distinct neural networks together. While it is a special case of training general
Markovian hierarchical variational autoencoders, the fact that there are no learned parameters
in the encoder, and the fact that the decoder covariance functions are also without learned
parameters, eases training.
With any Markovian hierarchical variational autoencoder, we would ideally like to maximize
ELBO
represented in
(8.30)
, yet in the special case of a diffusion model, the closed form
diffusion expressions
(8.31)
,
(8.32)
,
(8.34)
, and
(8.35)
simplify the training process. Some of
the general parameters and associated probabilities in
(8.30)
are now captured by closed
form expressions and hyper-parameters in the diffusion model case.
Considering the ELBO expression
(8.30)
we see that for diffusion models we do not need a
prior matching term, since this term only depends on the encoder parameters
ϕ
, and diffusion
models do not have learned parameters for the encoder. With this, a single observation loss
function for the diffusion model can then be represented as,
C(θ
1
. . . , θ
T
; x) = C
Reconstruction
(θ
1
) +
T
X
t=2
C
DenoisingMatching
(θ
t
), (8.36)
16
i
i
i
i
i
i
i
i
8.2 Diffusion Models
where
x D
. The loss component
C
Reconstruction
(
θ
1
) relates to the reconstruction term
in
(8.30)
, and the loss components
C
DenoisingMatching
(
θ
t
) relates to the denoising matching
terms in
(8.30)
. Note that for brevity we omit the dependance on
x
in the terms on the
right hand side. These loss components are implemented as,
C
Reconstruction
(θ
1
) =
1
σ
2
1
x µ
θ
1
(z
1
)
2
, (8.37)
C
DenoisingMatching
(θ
t
) =
1
σ
2
t
(1 γ
t1
)
p
1 β
t1
1 γ
t
z
t
+
γ
t1
β
t1
1 γ
t
x µ
θ
t
(z
t
)
2
, (8.38)
for t = 2, . . . , T . Here each z
t
is generated randomly using (8.34) for given data sample x.
Minimization of the terms
(8.37)
and
(8.38)
is approximately equivalent to ELBO maxi-
mization, in a similar nature to the variational autoencoder. The reconstruction loss
(8.37)
is similar to the standard variational autoencoder reconstruction loss
(8.23)
, based on the
log-density expression
(??)
of Appendix
??
. Here we exploit the fact that there are no learned
covariance parameters. Similarly to the standard variational autoencoder of Section 8.1,
minimization of this reconstruction loss serves to approximately maximize
E log p
θ
1
(
x |Z
1
)
in (8.30).
The denoising matching loss terms
(8.38)
are based on the KL-divergence expression
(??)
of
Appendix ?? and in fact for every level t,
D
KL
q(z
t1
|z
t
, x)
p
θ
t
(z
t1
|z
t
)
=
1
2
C
DenoisingMatching
(θ
t
) + constant,
where the constant on the right hand side depends only on the (non-learnable) hyper-
parameters
β
t1
,
γ
t1
, and
σ
2
t
. From this expression, it is clear that minimization of the
KL-divergence is equivalent to minimization of the Euclidean distance between the mean
vectors of the distributions
q
(
z
t1
|z
t
, x
) and
p
θ
t
(
z
t1
|z
t
). Further, this term serves as an
estimate of
E D
KL
(q(z
t1
|Z
t
, x) p
θ
t
(z
t1
|Z
t
))
from
(8.30)
, where the mean and covariance
expressions of
(8.35)
are used for the inner integral, and the single
z
t
drawn from
q
(
z
t
|x
)
serves as an approximation of the outer integral.
This loss formulation in
(8.36)
,
(8.37)
, and
(8.38)
presents us with a simple mechanism
for learning the parameters of the neural networks
µ
θ
1
(
·
)
, . . . , µ
θ
T
(
·
) using gradient based
optimization where at each epoch we draw random
z
1
, . . . , z
T
according to
(8.34)
. Similar to
other deep learning cases, we can also integrate mini-batch based learning and other gradient
descent variations. As we show now, these loss expressions can even be further simplified
using the reparameterization trick.
The Reparameterization Trick and Simplified Loss
In
(8.25)
of Section 8.1, we saw the reparameterization trick in the context of variational
autoencoders. This allows us to carry out model training using backpropagation. A similar
reparameterization trick can be applied to diffusion models. In the context of diffusion
models, this trick simplifies the loss function
(8.36)
and often yields better numerical
stability during training. Such simplification is achieved by replacing the neural networks
µ
θ
1
(
z
1
)
, . . . , µ
θ
T
(
z
T
), which are used for sequentially denoising the latent variables to obtain
17
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
data sample
x
, with a different collection of neural networks
ˆµ
θ
1
(
z
1
)
, . . . , ˆµ
θ
T
(
z
T
). Each such
ˆµ
θ
t
(z
t
) is trained to predict a standard noise component as constructed below.
Similar to the variational autoencoder case, the reparameterization trick we use is applied
on the samples of the latent variables. In particular, based on
(8.34)
we can represent
z
t
as
z
t
=
γ
t
x +
p
1 γ
t
ϵ
t
, (8.39)
where
ϵ
t
denotes an independent multivariate standard normal vector with dimension equal
to the dimension of
x
. Using this reparameterization trick, as we show below, at a given
data sample x, the loss function can be of the form,
C(θ
1
. . . , θ
T
; x) =
T
X
t=1
C
Reparameterized
(θ
t
), (8.40)
where each C
Reparameterized
(θ
t
) is defined as,
C
Reparameterized
(θ
t
) =
β
2
t1
σ
2
t
(1 γ
t
)(1 β
t1
)
ϵ
t
ˆµ
θ
t
γ
t
x +
p
1 γ
t
ϵ
t
| {z }
z
t
2
, (8.41)
and the dependence on
x
is notationally suppressed. Before delving into the mathematical
formulation of obtaining the loss component
(8.41)
, let us focus on how the diffusion model
is trained.
When using
(8.40)
and
(8.41)
we train the neural networks
ˆµ
θ
t
(
·
) either with a single
x D
or with a mini-batch approach. Importantly, in each gradient evaluation during training, we
generate a standard multivariate normal
ϵ
t
which is mapped to
z
t
using
(8.39)
. We then
obtain the gradient of
ˆµ
θ
t
(
z
t
) at
z
t
=
z
t
and this allows us to compute the gradient of
C
Reparameterized
(θ
t
).
In the remaining part of the section, we provide a derivation of the loss expressions and then
discuss how the trained diffusion model is used in the production. With
(8.39)
, a sample of
z
t
can be obtained by first generating a sample of
ϵ
t
and then using a data sample
x
to get
z
t
. Thus, if the values of ϵ
t
and the corresponding z
t
are given, we can represent x as,
x =
r
1
γ
t
z
t
r
1 γ
t
γ
t
ϵ
t
.
Using this, we can now manipulate the mean vector expression in
(8.35)
. After some
simplification, based on the fact that γ
t
t1
= 1 β
t1
, we obtain
(1 γ
t1
)
p
1 β
t1
1 γ
t
z
t
+
γ
t1
β
t1
1 γ
t
x =
1
p
1 β
t1
z
t
β
t1
p
(1 γ
t
)(1 β
t1
)
ϵ
t
.
Consequently, the denoising loss component in (8.38) can be expressed as
C
DenoisingMatching
(θ
t
) =
1
σ
2
t
1
p
1 β
t1
z
t
β
t1
p
(1 γ
t
)(1 β
t1
)
ϵ
t
µ
θ
t
(z
t
)
2
. (8.42)
18
i
i
i
i
i
i
i
i
8.3 Generative Adversarial Networks
Hence the neural network
µ
θ
t
(
z
t
) attempts to predict
1
1β
t1
z
t
β
t1
(1γ
t
)(1β
t1
)
ϵ
t
. Now
the essence of the reparameterization trick is to use a different neural network, denoted as
ˆµ
θ
t
(
z
t
), that takes
z
t
to predict the noise
ϵ
t
. Note that only for the notational convenience
we use the same notation θ
t
to denote the parameters of new neural network as well.
With the new neural network,
ˆµ
θ
t
(
z
t
), we replace
µ
θ
t
(
z
t
) in
(8.42)
with a new random latent
variable,
1
p
1 β
t1
z
t
β
t1
p
(1 γ
t
)(1 β
t1
)
ˆµ
θ
t
(z
t
), (8.43)
where
z
t
is obtained from
ϵ
t
using
(8.39)
. Now with basic manipulation the
C
DenoisingMatching
(
θ
t
)
expression in (8.42) can be replaced by C
Reparameterized
(θ
t
) of (8.41).
With a similar argument, we can show that for
t
= 1,
C
Reconstruction
(
θ
1
) in
(8.37)
can be
replaced with
C
Reparameterized
(θ
1
) =
β
0
σ
2
1
(1 β
0
)
ϵ
1
ˆµ
θ
1
(z
1
)
2
,
As a consequence of these neural network replacements and the fact that
γ
1
= (1
β
0
), the
loss component for each
t
must be of the form
(8.41)
and hence the final loss function is
given by (8.40).
In production, once the diffusion model is trained, to generate a realistic looking sample
x
, we start at
t
=
T
with a sample
z
T
from a multivariate standard normal. Iteratively, as
we decrement
t
, for each step we generate a different standard multivariate normal
ϵ
and
compute ˆµ
θ
t
(z
t
). By using
z
t1
=
1
p
1 β
t1
z
t
β
t1
p
(1 γ
t
)(1 β
t1
)
ˆµ
θ
t
(z
t
)
| {z }
ˆz
t
+ σ
t
ϵ
, (8.44)
we obtain the (denoised) input to the next iteration, where
ˆz
t
is given by
(8.43)
. We repeat
until
x
=
z
0
yields a realistic looking generated data sample. Observe that randomness of
this sample is due to the initial random
z
T
as well as to the
ϵ
that was generated in each
iteration.
Notice from
(8.44)
that
z
t1
follows a multivariate normal distribution with the conditional
(on
z
t
) mean vector being
ˆz
t
and the covariance matrix being
σ
2
t
I
. Since with the reparam-
eterization trick,
ˆz
t
is a replacement of
µ
θ
t
(
z
t
), in production, the distribution of
z
t1
is
approximately equal to p
θ
t
(·|z
t
) which is what we aim to achieve.
8.3 Generative Adversarial Networks
Generative adversarial networks or GANs, are generative deep learning architectures that
have made great impact on the field of synthetic data generation. For example in Figure 8.5 we
can see synthetic images, all created via various GAN architectures, with several paradigms
of application. In this section our aim is to highlight the key ideas of generative adversarial
networks, focusing on mathematical principles. The basic setup of using such a generative
model to approximately create samples from the distribution
p
(
x |z
) as was illustrated in
Figure 8.1. We now present details of generative adversarial network architectures.
19
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
(a) (b) (c)
Figure 8.5: Images generated via various GAN architectures
7
with various types of paradigms.
(a) Faces with 1024
×
1024 resolution. (b) Picture to picture generation. (c) A text to image
application.
The GAN Approach to Generative Modelling
The general idea of GAN modelling is to train a neural network, called the generator and
denoted here as
f
G
θ
(
·
). This model applied to a noise vector
z
, generates a data point
x
that ideally appears similar to other data points
x D
. Training is motivated by an
adversarial game like approach, where we simultaneously train both the generator
f
G
θ
(
·
),
and another network called the discriminator, and denoted here as
f
D
ϕ
(
·
). Hence training
implies learning the generator parameters
θ
while in parallel also learning the discriminator’s
parameters ϕ.
The generator’s purpose is to create “fake” data samples and the discriminator’s purpose is
to try and distinguish between fake and real samples. As such, the discriminator is a binary
classifier, which when presented with some data point
˜x
, would ideally be able to output a
probability value near 1 if ˜x D (real) and a probability value near 0 if ˜x = x
(fake).
While in a normal classification setting, having a well performing disciriminator
f
D
ϕ
(
·
) is
desirable, in the GAN setting, the discriminator is actually used to help train the generator
f
G
θ
(
·
). Once training is complete, we would generally have that
f
D
ϕ
(
˜x
) is on average near
1
/
2 for both real
˜x D
and fake
˜x
=
x
, with
x
=
f
G
ϕ
(
z
) . As such, during training of a
GAN, we can expect to see a sequence of network parameters,
(ϕ
(1)
, θ
(1)
)
| {z }
iteration 1
, (ϕ
(2)
, θ
(2)
)
| {z }
iteration 2
, . . . , (8.45)
such that as training progresses to high iterations
t
,
θ
(t)
are the parameters of a generator
that creates “sensible fake samples”, while the discriminator parameters
ϕ
(t)
, are no longer
able to discriminate between “fake” and “real”, and then yield performance near 1/2.
Empirically it has been experienced that one needs to train both the generator and discrimi-
nator together, in order to achieve the desired behaviour. The alternative of starting with
an already trained discriminator, and then training a generator is not feasible, first because
7
Image (a) is thanks to Karras, et. al. [
48
]. Image (b) is thanks to Isola, et. al. [
43
]. Image (c) is thanks
to Kang, et. al. [47].
20
i
i
i
i
i
i
i
i
8.3 Generative Adversarial Networks
Production
Training
x
x D
f
G
θ
(·)
f
D
φ
(·)
Fake/Real
z
Figure 8.6: The GAN architecture for learning the generator
f
G
θ
(
·
). A random
z
passes through
the generator to produce
x
. The generator is trained simultaneously with the discriminator
f
D
ϕ
(
·
),
whose role is to determine if its input is fake, x
, or real, x D.
such a trained discriminator is not available without the generator trained, and secondly
because the joint training provides a loss landscape for the generator parameters
θ
that
is suitable for gradient based learning. In this sense, training a GAN, follows a dynamic
equilibrium approach which can often be posed as a mathematical game between two players.
In fact, as we present below, at the equilibrium point of this game, one can properly use
mini-max game analysis to analyze some properties of the optimization.
Figure 8.6 illustrates the general GAN architecture where in training we use both the
generator and discriminator networks, while in production only the generator network is
used for creating samples from
p
(
x |z
). Note that the latent space size of
z
, denoted
m
, is
typically in the order of several dozen to several hundred dimensional and it is typically
taken to be a multivariate standard normal distribution as in the previous generative models
of this chapter.
Training a GAN
The key to training a GAN with steps of a sequence such as
(8.45)
is to alternate between
learning
ϕ
for the discriminator, and learning
θ
for the generator. For the discriminator,
training involves improving the binary classifier
f
D
ϕ
(
·
) and for the generator, training is
adversarial since it involves seeking parameters
θ
that yield the “worst possible” generator
f
G
θ
(
·
) for a given discriminator
f
D
ϕ
(
·
). We cast these alternative goals as a minimax objective,
a notion that we now describe.
First note that as in Section 8.1 we use
D
to denote the set of data points and assume that
there are
n
such data points. In addition we assume some set of random latent variable
samples which we denote as
Z
with each
z
Z
being an
m
-dimensional random vector.
In practice, one does not need to randomly sample all of
Z
in one go, but can rather
resample from a standard multivariate normal distribution every time we seek an arbitrary
element of
Z
. Yet for simplicity of the exposition and analysis we assume that the number
of elements in Z
is the same as in D, i.e., n.
21
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
With this notation, based on
D
and
Z
, for fixed generator parameters
θ
, the objective for
the discriminator parameters
ϕ
can be taken as the classic binary cross entropy objective;
recall
(??)
of Chapter
??
. Specifically, from the discriminator’s perspective we want
x D
to be considered as a positive example and we want
f
G
θ
(
z
) for
z
Z
to be considered
as a negative example. That is, the discriminator applied to
x
should be as close to 1 as
possible, and the discriminator applied to
f
G
θ
(
z
) should be as close to 0 as possible. Now
given the generator’s parameters
θ
, together with
D
and
Z
, a binary cross entropy loss
function (scaled by a factor of 2) can be formulated as,
C
θ
(ϕ ; D, Z
) =
1
n
X
x∈D
log
f
D
ϕ
(x)
+
X
z
∈Z
log
1 f
D
ϕ
f
G
θ
(z
)

. (8.46)
Hence the objective of learning the discriminator parameters is,
min
ϕ
C
θ
(ϕ ; D, Z
). (8.47)
The generator’s learning process is adversarial as the generator wishes for
C
θ
(
ϕ
;
D, Z
)
to be as high as possible. Specifically, from the generator’s point of view, we seek to find
parameters
θ
that are the worst possible parameters (maximization of
C
θ
) for the optimal
choice of the discriminator. That is, we seek to solve
max
θ
min
ϕ
C
θ
(ϕ ; D, Z
)
. (8.48)
The joint objective
(8.48)
is generally called a minimax objective as it captures the competing
goals of both players in the game.
To implement
(8.48)
we use an iterative algorithm that goes through iterates as in
(8.45)
and in each iteration alternates between
ϕ
and
θ
, with multiple steps within the iteration for
each. Specifically, at iteration
t
we first carry out mini-batch based gradient descent steps
on the loss function
(8.46)
where
θ
=
θ
(t1)
is fixed and we improve
ϕ
, starting with
ϕ
(t1)
in the iteration. The result of these mini-batch gradient descent steps for iteration
t
is
ϕ
(t)
.
In the second part of iteration
t
, we seek to maximize
(8.46)
over
θ
while keeping
ϕ
(t)
fixed.
Now since the first term in
(8.46)
does not depend on
θ
, this maximization is equivalent to
gradient descent steps (minimization) for,
min
θ
1
n
X
z
∈Z
log
1 f
D
ϕ
(t)
f
G
θ
(z
)

. (8.49)
Such iterative gradient based learning, ideally approximates a solution of
(8.48)
. In practice
one often tunes the number of mini-batch discriminator steps and the number of
z
samples
for the generator steps, carried out per iteration
t
. One problem that sometimes arises
during training is called mode collapse and it is a situation where the generator is stuck at a
point in the parameter space of
θ
where it generates samples that do not cover the breadth
of the data
D
. In general, when carrying out diagnostics of GAN training, we expect the
performance of the discriminator to converge to about 1/2.
22
i
i
i
i
i
i
i
i
8.3 Generative Adversarial Networks
Minimization of the Jensen-Shannon Divergence
The training procedure outlined above can be cast in a theoretical setting. Motivated by
the discriminator loss
(8.46)
we can describe a theoretical construct which abstracts what
a GAN training procedure actually optimizes. Specifically, we now shift to probabilistic
thinking, similar to the notions of Section 8.1 in the context of variational autoencoders and
Section 8.2 in the context of diffusion models.
For this let us define three probability distributions used in the analysis. First, the distribution
of the data
D
is assumed to be captured by
p
D
(
·
). Let us assume this is a probability
distribution covering all of
R
p
. Then the distribution of each of the latent variables
Z
is assumed be captured by
p
Z
(
·
). Let us assume that this a probability distribution over
R
m
which typically is multivariate standard Gaussian and hence covers the whole latent
space. Finally, the distribution of the output of the generator is captured by
p
G
(
·
). This last
distribution can in theory be obtained by applying the generator
f
G
θ
(
·
) as a transformation
on the random variables in
Z
. Like the distribution of the data, this is a distribution over
R
p
and we assume it also covers the whole space.
Ideally we would like a GAN to learn the generator parameters
θ
such that
p
G
=
p
D
, i.e.,
that the distribution of the generator is the same as the distribution of the data. To capture
such a goal, it turns out that in the context of GANs, a good measure of the distance between
the two distributions is the Jenson-Shannon divergence, defined in
(??)
of Appendix
??
.
Specifically, let us denote,
J
θ
= JSD(p
G
p
D
) =
D
KL
(p
D
p
M
) + D
KL
(p
G
p
M
)
2
, where p
M
(u) =
1
2
p
G
(u)+p
D
(u)
,
and
D
KL
(
··
) is the KL-divergence. Hence for generator parameters
θ
, the divergence value
J
θ
captures how far off our generator is from the actual data. In the ideal situation of
p
G
=
p
D
we have that
J
θ
= 0, otherwise it is greater than 0. Note that upon representing
the KL-divergences as integrals and manipulating the 2 constant we have,
J
θ
=
1
2
Z
R
p
p
D
(u) log
p
D
(u)
p
D
(u) + p
G
(u)
du + p
G
(u) log
p
G
(u)
p
D
(u) + p
G
(u)
du + log 2. (8.50)
Consider now the discriminator loss
(8.46)
and assume that
n
. In this case, due to
laws of large numbers, the loss can be rephrased in terms of expectations and expressed as,
C(ϕ, θ ; p
D
, p
Z
) = E
p
D
log
f
D
ϕ
(X)
E
p
Z
log
1 f
D
ϕ
f
G
θ
(Z)

, (8.51)
where the first expectation is of the random variable X distributed according to p
D
(·) and
the second expectation is with respect to the latent random variable
Z
distributed according
to p
Z
(·). This expected loss can further be manipulated as,
C(ϕ, θ ; p
D
, p
Z
) = E
p
D
log
f
D
ϕ
(X)
E
p
G
log
1 f
D
ϕ
(W )
=
Z
R
p
p
D
(u) log
f
D
ϕ
(u)
du
Z
R
p
p
G
(u) log
1 f
D
ϕ
(u)
du
=
Z
R
p
p
D
(u) log
f
D
ϕ
(u)
du + p
G
(u) log
1 f
D
ϕ
(u)
du,
where in the first equation we set
W
=
f
G
θ
(
Z
) which is a random variable distributed
according to
p
G
(
·
). Hence the second expectation in the first equation is with respect to the
23
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
distribution
p
G
(
·
). Moving from the first equation to the second step we simply represent
the expectations as integrals and then since both integrations are on the same domain we
can move from the second step to the third step.
Now let us identify a pattern inside the final integral of the form
a log
(
y
)+
b log
(1
y
), where
a
=
p
D
(
u
),
b
=
p
G
(
u
), and
y
=
f
D
ϕ
(
u
). It is easy to check that for
a, b >
0, the function
g(y) = a log(y) + b log(1 y) is maximized at y = a/(a + b). Hence, for any y,
a log(y) + b log(1 y) a log(
a
a + b
) + b log(
b
a + b
) (8.52)
with the inequality being an equality at y = a/(a + b).
We can now use
(8.52)
to bound the integrand in the final expression for
C
(
ϕ, θ
;
p
D
, p
Z
)
as follows,
Z
R
p
p
D
(u) log
p
D
(u)
p
D
(u) + p
G
(u)
du + p
G
(u) log
p
G
(u)
p
D
(u) + p
G
(u)
du C(ϕ, θ ; p
D
, p
Z
).
Or, using the expression for J
θ
in (8.50) we have
2 log 2 2J
θ
C(ϕ, θ ; p
D
, p
Z
), (8.53)
where at the best theoretical generator parameters,
θ
, we have that
J
θ
= 0 and the inequality
is an equality with C = 2 log 2 1.386.
Let us note some parallels between this Jensen-Shannon based analysis for GANS and the
ELBO
analysis of Section 8.1 for variational autoencoders. In the variational autoencoder
case our inherent implicit goal is maximum likelihood estimation of
θ
and since we cannot
achieve such estimation in a computationally efficient manner, we resort to (approximate)
optimization of the lower bound,
ELBO
. Nevertheless, a theoretical inequality such as
(8.18)
,
and the additional optimization over the encoder’s
ϕ
in
(8.20)
ensures that once we have
optimized
ELBO
both in terms of
θ
and
ϕ
, then the desired maximum likelihood estimation
is also achieved.
In our GAN case, the inequality
(8.53)
helps justify a similar idea. If we apply a minimax
style optimization on it, as in (8.48) then we obtain,
2 log 2 = max
θ
min
ϕ
C(ϕ, θ ; p
D
, p
Z
)
,
which is the best possible value.
Variations to the Objective Function
While the ideas of GANs presented up to now are useful in their own right, there is much
room for architectural improvements. On the one hand one may consider different structures
for
f
D
ϕ
(
·
) and
f
G
θ
(
·
), such as for example setting the discriminator and generator neural
networks to be convolutional networks. On the other hand, one may revise the objective
function
(8.46)
as well as its expected value formulation
(8.51)
, to yield better training
performance. We now focus on this approach, where variations of the objective function are
considered.
24
i
i
i
i
i
i
i
i
8.3 Generative Adversarial Networks
p
D
(·)
p
G
(·)
f
D
φ
(t)
(·)
Figure 8.7: A schematic of a potential scenario during early phases of GAN training. The data
distribution
p
D
(
·
) and the generator distribution
p
G
(
·
) are far while the discriminator
f
D
ϕ
(t)
(
·
) is
already capable of separating the two distributions well. This reduces the “signal” for generator
training.
In general, problems may occur in initial training iterations when generator parameter values,
θ
(t)
, are nearly arbitrary. In such a case there is often extreme separation between
p
D
(
·
) and
p
G
(
·
), because the latter distribution is nearly arbitrary. Such a difficult situation requires a
“signal” from the discriminator that will force learning to improve subsequent
θ
(t)
values.
However, an objective such as
(8.46)
, does not always cater for such learning. In particular,
see Figure 8.7, where for simplicity we assume the data is one dimensional and we plot
data points together with
p
D
(
·
) as well as the generator distribution
p
G
(
·
) associated with
some arbitrary
θ
(1)
. In such a case, after only a few iterations, the discriminator parameters
ϕ
(t)
may quickly be trained to separate between the two distributions. The problem with
this is that at that point, when considering gradients for minimization of the generator, as
in
(8.49)
, there will often be a “lack of signal” (or nearly zero gradients) for learning the
generator parameters
θ
(t)
. This situation is quite common in practice, and often prohibits
effective learning of GANs.
As a consequence of such scenarios, multiple alternative objective formulations have been
proposed, some of which yield much better training performance. We now outline a few
of these ideas where we first describe a simple modification called non-saturating GAN
(NS-GAN). We then describe a deeper idea using Wasserstein distances which yields a
framework called Wasserstein GAN (W-GAN), and finally we discuss improvements to
W-GAN, which involve regularization concepts.
The first idea of non-saturating GAN (NS-GAN) is simply to modify the generator objective
(8.49) to use log(u) instead of log(1 u). Namely, the NS-GAN generator objective is,
min
θ
1
n
X
z
∈Z
log
f
D
ϕ
(t)
f
G
θ
(z
)

, (8.54)
and NS-GAN retains the same discriminator objective
(8.46)
. The motivation for such
a modification is simply the fact that at
u
0, the derivative of
log
(1
u
) (as in the
original formulation) is approximately
1, while the derivative of derivative of
log
(
u
) (as
in NS-GAN) is nearly negative infinite. This may yield an improvement for initial phases of
training and is particularly useful for cases when
p
D
(
·
) is highly separated from
p
G
(
·
) and
the discriminator parameters already yield good separation as in Figure 8.7. In this case,
f
D
ϕ
(t)
f
G
θ
(t)
(
z
)
0 and thus the NS-GAN generator objective
(8.54)
yields much higher
magnitude gradients than the original generator objective
(8.49)
, due to the nearly negative
infinite derivative.
For both the original GAN and NS-GAN, as training progresses and the expected value of
f
D
ϕ
(t)
f
G
θ
(
z
)
approaches 1
/
2, the objectives
(8.54)
and
(8.49)
behave similarly. Hence in
principle, the NS-GAN modification should not hamper with training. However, while the
25
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
simple idea of NS-GAN may appear like a good improvement, in practice it has shown to
yield unstable gradient updates. We chose to present NS-GAN here, merely as simple idea
where modification of the objective can potentially yield training performance improvement.
A more fundamental way to improve GANs is based on the Wasserstein distance. This is
a concept used to determine the “distance” between two probability distributions, with a
similar goal of JS-divergence, yet with a completely different approach that yields different
properties of the distance metric.
Formally when presented with two probability distributions, in our case,
p
D
(
·
), and
p
G
(
·
),
the Wasserstein distance, W(·, ·), may be represented as
W (p
D
, p
G
) = inf
p
Π
Π(p
D
,p
G
)
E
(x,˜x)p
Π
x ˜x. (8.55)
Let us unpack
(8.55)
by assuming that our data and generator are each in
R
p
. Here Π
(p
D
, p
G
)
is the space of all probability distributions over
R
2p
where the first
p
coordinates are for the
data and the following
p
coordinates are for the generator. This space of distributions is
defined such that each element distribution
p
Π
Π
(p
D
, p
G
)
has a marginal distribution of
the first
p
coordinates of
p
Π
that agrees with
p
D
, and likewise the marginal distribution of the
remaining
p
coordinates of
p
Π
agrees with
p
G
. That is, for every distribution
p
Π
Π
(p
D
, p
G
)
,
p
D
(x
1
, . . . , x
p
) =
R
···
R
p
Π
(x
1
, . . . , x
p
, u
1
, . . . , u
p
) du
1
···du
p
,
p
G
(˜x
1
, . . . , ˜x
p
) =
R
···
R
p
Π
(u
1
, . . . , u
p
, ˜x
1
, . . . , ˜x
p
) du
1
···du
p
.
(8.56)
The infimum in
(8.55)
acts “like a minimum” and formally seeks the greatest lower bound.
This minimization is over the expectation of the Euclidean norm
x ˜x
with
x
and
˜x
each
elements of
R
p
. Thus the pair (
x, ˜x
)
R
2p
is distributed according to
p
Π
and hence
x
is
(marginally) distributed according to
p
D
and
˜x
is (marginally) distributed according to
p
G
.
For clarity, note that the expectation in (8.55) can be represented as,
Ex ˜x =
Z
······
Z
v
u
u
t
p
X
i=1
(x
i
˜x
i
)
2
p
Π
(x
1
, . . . , x
p
, ˜x
1
, . . . , ˜x
p
) dx
1
···dx
p
d˜x
1
···d˜x
p
.
(8.57)
As an extreme example, consider the case where
p
D
=
p
G
. Then one particular
p
Π
Π
(p
D
, p
G
)
is a distribution that only has support on elements
u
1
, . . . , u
p
, u
p+1
, . . . , u
2p
where
(
u
1
, . . . , u
p
) = (
u
p+1
, . . . , u
2p
), and the density over each such element is
p
D
(
u
1
, . . . , u
p
) which
is the same as
p
G
(
u
p+1
, . . . , u
2p
). In this case, using
p
Π
in
(8.57)
we get 0 because probability
mass (or density) is only concentrated on points (
x, ˜x
)
R
2p
where
x
=
˜x
. Hence this
p
Π
minimizes
(8.57)
, and hence the Wasserstein distance is 0. However, when
p
D
=
p
G
, one can
show that the Wasserstein distance is strictly positive.
In general, the joint distribution
p
Π
captures an “earth moving” plan which describes how
to shift mass from one distribution
p
D
to the other
p
G
. To get a feel for this interpretation,
let us grossly simplify the situation and assume that
p
D
and
p
G
are discrete one dimensional
(
p
= 1) distributions on a finite domain with values, such as 1, 2, and 3. In this case, the
joint distribution
p
Π
is on the 3
×
3 grid, capturing the probability mass over the 9 possible
joint locations. In this simplistic case, the marginal distribution assumptions
(8.56)
can be
26
i
i
i
i
i
i
i
i
8.3 Generative Adversarial Networks
written as,
p
D
(x) =
P
3
˜x=1
p
Π
(x, ˜x) for x = 1, 2, 3,
p
G
(˜x) =
P
3
x=1
p
Π
(x, ˜x) for ˜x = 1, 2, 3,
(8.58)
where p
Π
(·, ·), p
D
(·), and p
G
(·) are probability masses.
Now we may interpret
p
Π
(
x, ˜x
) as a “plan” of how much mass (or earth) to shift from location
x
to location
˜x
. For this, the marginal distribution assumptions
(8.58)
serve as constraints
where the first constraints imply that mass is shifted properly out of
p
D
and the second
constraints imply that all resulting mass ends up creating
p
G
. The Wasserstein distance is
then the expectation of
x ˜x
for the optimal plan, or optimal joint distribution. Note
that in such a discrete finite case, one may actually formulate and solve this optimization
problem using linear programming. While such an approach is not important in deep learning
practice, ideas of duality theory from linear programming play a role as we describe below.
One important attribute of the Wasserstein distance is that it is sensitive to differences
in the “location” of the distributions and captures such differences in a much better way
than the JS-divergence (based on the KL-divergence). As an illustration let us focus on
a case with
p
= 1 where we can easily plot
p
D
and
p
G
as marginals. Consider Figure 8.8
where we plot a red pair (
p
D
, p
G
) with the distributions concentrated at quite far away
locations, and a green pair (
p
D
, p
G
) with the distributions closer together. We cannot exactly
see the Wasserstein distance because it requires minimization over all possible Π
(p
D
, p
G
)
distributions. However, in this example any possible
p
Π
Π
(p
D
, p
G
)
is concentrated around
the respective red or green cloud. Hence we roughly see the Wasserstein distance as marked
in the figure, representing the difference between the center of the cloud and the diagonal
x
=
˜x
line. Importantly, the distance for the red pair is much higher than that of the green
pair. Now in contrast, if we were to consider the JS-divergence, then in both the red and
the green pair cases, the divergence would be very close
8
to 0.
An important result for the Wasserstein distance, called the Kantorovich-Rubinstein duality
theorem, provides us with a dual representation of the minimization problem in
(8.55)
. This
result implies the following representation of W (p
D
, p
G
), for any K > 0,
W (p
D
, p
G
) =
1
K
sup
h
L
K
n
E
xp
D
h(x) E
˜xp
G
h(˜x)
o
. (8.59)
Let us unpack
(8.59)
. The supremum in
(8.59)
acts “like a maximum” and formally seeks the
lowest upper bound. Continuing to assume that
p
D
and
p
G
are distributions over
R
p
, this
maximization searches over functions
h
:
R
p
R
that also satisfy the
K
-Lipschitz property,
denoted by h
L
K. A function h(·) is K-Lipschitz if
h(u) h(v) Ku v, for all u, v R
p
.
This implies that the function does not ascend or descend in any direction with steepness
greater than
K
. Note that in stating the Kantorovich-Rubinstein duality theorem using
K = 1 is common, yet for our GAN purposes keeping K free is preferable.
The elegance and applicability of
(8.59)
is that instead of searching over all possible joint
probability distributions (earth moving plans) as in
(8.55)
, we now only need to search over
8
In fact, one needs to assume that the support of the
p
D
and
p
G
distributions overlaps for the JS-
divergence to be properly defined.
27
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
p
D
p
G
W (p
D
, p
G
)
W (p
D
, p
G
)
Figure 8.8: Two pairs of distributions where in the red pair, the distributions are farther away
than in the green pair. While we cannot exactly see the Wasserstein distance in this case, we get an
approximate feeling for it using the distances marked with vertical lines between the centers of the
point clouds and the diagonal line.
all
K
-Lipschitz functions. It turns out, that in the context of generative adversarial networks,
the latter formulation is much easier to implement. We omit further details of the duality
derivation between (8.55) and (8.59), and now show how to use (8.59) in a GAN setting.
Since
p
G
is determined by the generator
f
G
θ
(
·
), with a Wasserstein distance approach our
overall goal in a generative setting is to find generator parameters
θ
that minimize
W
(
p
D
, p
G
).
Replacing the supremum in
(8.59)
by a maximum, and dropping the 1
/K
constant term, we
have and overall objective,
min
θ
max
h
L
K
n
E
xp
D
h(x) E
˜xp
G
h(˜x)
| {z }
Depends on θ
o
. (8.60)
The key idea of W-GAN is now to treat the function
h
(
·
) as a discriminator, even though it is
no longer a binary classifier. Following the previous notation, we still denote this discriminator
as
f
D
ϕ
(
·
) with parameters
ϕ
, even though now the architecture of this discriminator allows it
to output any potential real value, not just in [0
,
1]. With this,
(8.60)
can be approximated
with the objective,
min
θ
max
ϕΦ
K
n
1
n
X
x∈D
f
D
ϕ
(x)
1
n
X
z
∈Z
f
D
ϕ
f
G
θ
(z
)
o
, (8.61)
where now Φ
K
is some space of discriminator parameters that approximately enforces a
K
-Lipstchitz condition on
f
D
ϕ
(
·
). That is, with every
ϕ
Φ
K
we have that
f
D
ϕ
(
·
) is
K
-
28
i
i
i
i
i
i
i
i
8.3 Generative Adversarial Networks
Lipstchitz. Hence the discriminator objective, posed as a minimization problem constrained
on Φ
k
(and dropping the 1/n terms) is,
min
ϕΦ
K
n
X
x∈D
f
D
ϕ
(x) +
X
z
∈Z
f
D
ϕ
f
G
θ
(z
)
o
. (8.62)
In training a W-GAN we iterate between discriminator and generator steps, and hence in the
discriminator steps we carry out mini-batch gradient descent steps for
(8.62)
. The constraint
of keeping
ϕ
Φ
K
is discussed below. The outcome of such a discriminator iteration is
ϕ
(t)
and then in carrying out generator steps for (8.61) we carry out iterations for
min
θ
X
z
∈Z
f
D
ϕ
(t)
f
G
θ
(z
)
. (8.63)
Compare the W-GAN discriminator problem
(8.62)
with the original GAN discriminator
problem
(8.47)
, and similarly compare the W-GAN generator problem
(8.63)
with the original
GAN generator problem
(8.49)
(similarly we may compare to NS-GAN with generator
problem as in
(8.54)
). Quite surprisingly, if we ignore the Φ
K
constraint, the differences are
only very minor where the W-GAN does not use logarithms. In practice, W-GAN generally
overcomes the problems illustrated in Figure 8.7 by enhancing the algorithm with a much
better discrimination signal.
We still need to specify how to enforce the Φ
K
,
K
-Lipstchitz constraint in
(8.62)
. For this
there are multiple techniques. The most basic technique is called weight clipping where we
constrain the weights for the neural network
f
D
ϕ
(
·
) to lie in some range, e.g., [
0
.
05
,
0
.
05].
For almost all standard deep learning architectures, this will mean that
f
D
ϕ
(
·
) will be
K
-
Lipstchitz for some
K
that depends on the architecture of the network and on the clipping
half-range 0.05.
A different approach which in practice has generally shown better performance is gradient
penalty. This approach relies on the following property of
(8.59)
. If a function
h
:
R
p
R
attains the supremum in
(8.59)
, then this function has with probability one, a unit gradient
norm on any random point
x
η
that is a convex combination between a point drawn from
p
D
and a point drawn from
p
G
. That is, if we take
x
as a random point from
p
D
and
x
as a random point from
p
G
(for some generator), then for any
η
[0
,
1], the point
x
η
= ηx + (1 η)x
satisfies,
∥∇h
(x
η
) = 1, with probability 1. (8.64)
We omit the proof, yet we use this property algorithmically to approximately enforce the
constraint
ϕ
Φ
K
. To do so, the gradient penalty approach uses
(8.64)
as an optimization
constraint in place of
ϕ
Φ
K
. This constraint is then integrated in the objective using an
approximate Lagrange multiplier approach.
Specifically for some tunable λ > 0, we modify the discriminator objective (8.62) to
min
ϕ
n
X
x∈D
f
D
ϕ
(x) +
X
z
∈Z
f
D
ϕ
f
G
θ
(z
)
+ λ
X
x
η
∈D
η
Z
∥∇f
D
ϕ
(x
η
) 1
2
o
, (8.65)
where the set of random points
D
η
Z
is a set of points based on pairs
x D
and
z
Z
,
where each
x
has as single matching
z
, and for each pair we generate a uniformly random
η [0, 1] and set x
η
= ηx + (1 η)f
G
θ
(z
).
29
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
In practice the gradient penalty approach works very well and out of all of the GAN objective
variations that we presented, it is the most common approach used. In a typical algorithm
that implements mini-batch gradient descent steps for (8.65), we would sample a subset of
the data
D
and a matching sized sample of
Z
followed by a sample of
D
η
Z
with each
element determined by a uniform η.
Beyond Data Generation with GANs
Up to now we considered the application of generative adversarial networks for generative
modelling where an unlabelled dataset
D
=
{x
(1)
, . . . , x
(n)
}
is used to train a generator
f
G
θ
(
·
) and then this generator can create samples similar to those in
D
. This is already useful
for a variety of applications, one of which is data augmentation, where we can then use
additional generated (fake) samples to train other models. Yet, there are many more tasks
where generative adversarial networks can be employed. We now outline a few paradigms
that extend the basic GAN architecture and allow us to handle additional tasks. We outline
the conditional generation paradigm, the image to image paradigm, and the style transfer
paradigm.
With the conditional generation paradigm the resulting generator from training is not just
a function of the latent space, but is also a function of additional variables. Namely we
can describe the generator as
f
G
θ
(
z, w
) where
z
is still a random latent variable and the
newly introduced
w
contains some additional information. One such example of conditional
generation, called a conditional generative adversarial network (C-GAN) is where the data is
labeled and is of the form
D
=
{
(
x
(1)
, y
(1)
)
. . . ,
(
x
(n)
, y
(n)
)
}
. Here we can use the additional
information
w
as some attribute vector that potentially encodes the particular label. Hence
for example in a case of the MNIST digits dataset, our GAN
f
G
θ
(
z, w
) will create random
digits, where
w
may indicate which digit to create using one-hot encoding. The basic way in
which such a GAN is trained, is by also supplying the attribute vector to the discriminator
during training. We omit further details.
A slightly different form of conditional generation is also described in the auxiliary classifier
generative adversarial network (AC-GAN) architecture. Here the generator is similar to
the C-GAN case where
w
is specifically a class of the desired sample to generate, yet in
training, the discriminator has two outputs, one of which is the binary classification fake/real
as before, and another is a probability vector determining which class is detected. Such a
training process, often results in a better architecture for conditional generation than the
original C-GAN.
Ideas of conditional generation can be extended in multiple ways, where one notable way is
the interpretable representation learning generative adversarial network (Info-GAN) approach.
Here we are back to unlabeled data such as
D
=
{x
(1)
, . . . , x
(n)
}
, and we use the GAN
training process to separate some elements of
w
“out of
z
”, where typically
w
is of smaller
dimension than
z
. The resulting generator,
f
G
θ
(
z, w
), then uses
w
for tweaking specific
attributes of the image, where these attributes are not controlled apriori, but are rather
discovered during the learning process. As a concrete example, again in the case of MNIST
digits, we may have
w R
4
where after the training process it turns out that the first
coordinate of
w
controls the stroke widths of the digits, the second coordinate controls the
slant of the digit, and other two coordinates of
w
either have some interpretation or not.
The beauty of Info-GAN is in the self discovery of these properties. The general idea of
training in such a framework is that the discriminator outputs additional variables that are
30
i
i
i
i
i
i
i
i
8.3 Generative Adversarial Networks
meant to match the inputs
w
. The analysis involves mutual information with basic ideas
from information theory, and is beyond our scope.
With the image to image paradigm, or more generally data to data paradigm our goal is
not to generate random general images or data examples, but rather to be able to enhance
images or data. Assume a training dataset such as
D
=
{
(
x
(1)
in
, x
(1)
out
)
, . . . ,
(
x
(n)
in
, x
(n)
out
)
}
, where
each
x
(i)
in
has lower information content than the corresponding
x
(i)
out
. For example
x
(i)
in
may
be a black and white image of a portion a street map, while
x
(i)
out
may be a color satellite
image of the same geographic location.
9
The goal of an image to image generator, say
˜
G
,
is to operate on input images similar to
x
(i)
in
in structure, and create the corresponding
enhanced output image. Then when presented with a new input image
˜x
in
, we would have
that the trained generator applied to it,
f
˜
G
θ
(
˜x
in
), is the enhanced image. Hence for example
in the geographic case, we can generate fake satellite images, based on street map images.
Other applications include the color enhancement of images or films, and almost any domain
where datasets such as D = {(x
(1)
in
, x
(1)
out
), . . . , (x
(n)
in
, x
(n)
out
)} appear.
The basic training architecture in the image to image paradigm is based on a discriminator
that distinguishes between real pairs (
x
(i)
in
, x
(i)
out
), and fake pairs (
x
(i)
in
, x
(i)
out
), where in the
first pair,
x
(i)
out
matches
x
(i)
in
from the data, while the second pair,
x
(i)
out
is generated. That is,
the discriminator, is the function
f
˜
D
ϕ
(
˜x
in
, ˜x
out
) that tries to determine if the pair it is fed
with is completely real or if the second image fed to it is generated. In this sense, the image
to image paradigm is similar to the C-GAN paradigm. Key differences include the fact, that
in the image to image case, one would often want a complicated generator network such as
a U-Net.
10
In deep learning, style transfer, sometimes called neural style transfer, is an area broader
than generative adversarial networks, mostly focusing on image data. The general theme of
style transfer is to modify input images such that they appear to resemble a certain style.
Examples that have been popularized include the recreation of arbitrary images using the
distinctive drawing style of Vincent van Gogh. Within the context of generative adversarial
networks, the style transfer paradigm uses GANs for style transfer and for similar image
modifications.
A notable GAN architecture specifically focused on style transfer is the style-GAN. Here,
multiple new ideas are introduced in the context of the generator network, specifically for
image data using convolutional architectures. One of the notable features of StyleGAN is its
ability to generate high-resolution images with remarkable detail and diversity. Style-GAN
has been widely used in applications such as image synthesis, artistic expression, and creating
realistic faces with customizable attributes.
Several of the key ideas of Style-GAN are specific to convolutional architectures and include
upsampling, as well as adaptive instance normalization which is somewhat similar to batch
normalization of Section
??
and layer normalization, used in Section
??
. We do not focus on
these ideas here but rather comment on the general structure of the generator which differs
from generators used in other paradigms. The Style-GAN generator is composed of two
distinct functions, a mapping network and a synthesis network, which we denote as
f
G
θ
m
(
z
1
)
and f
G
θ
s
(z
2
, w) respectively.
9
For this specific example and other examples of image to image generation, it is not hard to create such
training data.
10
See the notes and references in Chapter ?? for background on the U-Net architecture.
31
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
The mapping network’s role is to convert a latent space noise vector
z
1
into a more meaningful
representation
w
=
f
G
θ
m
(
z
1
) which is an intermediate variable that is later amenable to
manipulation. The synthesis network’s role is somewhat similar to the Info-GAN generator
where
z
2
serves as noise, while
w
captures style information. One way to use a trained
generator without any style modifications is via,
x
= f
G
θ
s
z
2
, f
G
θ
m
(z
1
)
| {z }
w
,
where
z
1
and
z
2
are noise vectors, and
x
is the resulting random output image. However,
in more advanced applications we may keep one latent noise vector fixed, while perturbing
the other noise vector, and/or varying the intermediate variable
w
. Variations of this form
enable Style-GAN to create meaningful modifications of images.
8.4 Reinforcement Learning
The topic of reinforcement learning deals with controlling dynamic processes over time. This
is different than most tasks presented in the book, which on their own are typically applied
at one instant of time. For example, classification is based on a static input
x
at some time,
to determine an output
ˆy
relevant for that time. Yet the classification task does not care
about previous or future classification decisions or the time at which it is carried out. In
contrast, with reinforcement learning, we have the task of making decisions as time evolves,
where our decisions often affect the evolution of the system and thus our decisions need to
take planning ahead into consideration.
Typical applications of reinforcement learning include decisions for automatic pilots, robot
control, playing strategy games, financial management, and other applications that involve
decisions over a time horizon. Indeed, in the second decade of this century, the deep
learning variant of reinforcement learning, namely deep reinforcement learning, made big
headlines with the game of Go, where the world’s best Go players were eventually beaten
by an engineered deep reinforcement learning platform. Earlier, it was shown that deep
reinforcement learning systems can be trained to automatically play classic arcade Atari
video games.
In general, the field of control theory is the engineering field where automatic control systems
are designed and analyzed. This field has a rich history, including many advances made
during the space programs of the 1950’s and the 1960’s, including concepts such as Kalman
filtering and many related ideas. The area of reinforcement learning can be cast as part
of the control theory world, yet it has much more of a computer science flavour to it, and
with the advent of deep learning, has essentially become part of the deep learning toolkit.
Nevertheless, today, ideas of reinforcement learning are part of the control theory world.
A schematic representation of reinforcement learning is in Figure 8.9. The basic setup is
that a system, or environment, is controlled by an agent. Other processes may be taking
place in the environment as well, perhaps not directly controlled by the agent, these are
modelled via randomness. As time progresses the agent sends their control decisions in the
form of actions to the environment, and in turn it receives reward from the environment as
well as observations.
32
i
i
i
i
i
i
i
i
8.4 Reinforcement Learning
Agent
Environment
Action
Reward
Observation
Figure 8.9: The setup in reinforcement learning is that an agent (or controller) applies actions to
an environment (also known as system). The agent’s decisions of which actions to take, are based
on reward obtained together with observations. One illustrative example of the environment is a
video game where multiple random things happen and the agent is a player.
Mathematically, we cast reinforcement learning in terms of an area called Markov decision
processes.
11
Here, as we explain in this section, the evolution of the environment being
controlled is modeled as a variant of a Markov chain, where the state evolution of the
chain also depends on control decisions made. A key goal in the area of Markov decision
processes is the study and application of algorithms for controlling the system in some
optimal manner. Reinforcement learning, aims to solve the same problem, with the difference
that in the Markov decision processes case we assume to have full knowledge of the system
evolution model, whereas in the reinforcement learning case, the system model is unknown
and hence needs to either be learned, or optimal control strategies need to be learned. With
reinforcement learning, the agent often learn incrementally while controlling actual systems.
In some cases learning is based on simulations of the system before deployment into the real
world, while in other cases, operational systems are controlled and learned simultaneously.
Our focus of this section is to first present the Markov decision processes framework and to
further discuss optimal strategies using Bellman equations. Then, after briefly discussing
solution methods for such equations we move onto the reinforcement learning case, where
the system model is unknown. There are multiple types of algorithms for reinforcement
learning, and we focus only on the Q-learning approach. After understanding the basic
Q-learning algorithm, we finally see how concepts of deep neural networks can be integrated
by replacing the so-called Q-function with a deep neural network. This presentation here
is merely an introduction to the field, and more readings are suggested at the end of the
chapter.
Markov Decision Processes
Assume some system evolving over discrete time
t
= 0
,
1
,
2
, . . .
, where at any time
t
, the
system state is
z
t
. This state may describe the location of a robot, or the vector of velocities
11
In certain cases the observations are only a partial description of the environment state, in which case
the formal framework is that of partially observable Markov decision processes. Yet in our brief exposition,
we only consider full state observations.
33
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
in multiple directions together with accelerations, or a discrete state of a game, or one of
many other possibilities depending on the application.
Let us first ignore decisions and system control. In this case we can model the evolution of
z
0
, z
1
, z
2
, . . .
as a Markov chain, where some transition kernel
p
t
(
z
t+1
|z
t
) determines the
probability of moving from a given state
z
t
to state
z
t+1
, between time
t
and time
t
+ 1. In
this form, the transitions appear to depend on the time
t
, yet we may also assume that the
evolution is time homogenous and does not depend on time
t
. In this case, common to this
section, the transition kernel can be presented with the notation
p
(
z
t+1
|z
t
), i.e., without
a subscript
t
. As already stated in Section 8.2 in the context of Markovian hierarchical
variational autoencoders, a Markov chain satisfies the Markov property. In our case here,
this property is best interpreted as implying that given any history prior to time
t
, if we are
given
z
t
, that history does not have an affect on the future. This then means that the state
information at time t, namely z
t
, is enough to determine probabilities of the future.
As a simple example, consider a scenario focusing on the engagement level of a student.
Assume that there are 10 levels of engagement, 1
,
2
, . . . ,
10, where at level 1 the student is
not engaged at all and at the other extreme, at level 10, the student is maximally engaged.
One may model the probability transitions of engagement as,
p(j |1) =
0.6 j = 1,
0.4 j = 2,
0 otherwise,
p(j |i) =
0.6 j = i 1,
0.2 j = i,
0.2 j = i + 1,
0 otherwise,
for i = 2, . . . , 9,
p(j |10) =
0.6 j = 9,
0.4 j = 10,
0 otherwise.
(8.66)
In this case, it is evident that at any time step, the student’s engagement can either improve
by one level, decrease by one level, or stay unchanged, with the probabilities specified by
p
(
·|·
) as above. If at onset at time 0 we start at some given level
z
0
{
1
, . . . ,
10
}
, as time
progresses, the level will fluctuate according to the Markov chain stochastic process. The
level of engagement is then the system or environment in this very simple case.
A Markov decision process is a generalization of a Markov chain where now some decision
(also known as or control) is used by the agent. In this simple student engagement example,
assume that at any time
t
we can choose to “stimulate” the student, for example by suggesting
a prize, or “not to stimulate”. Naturally, it is sensible to model the fact that stimulation will
improve transitions up towards higher levels. For example, one model for these transitions
34
i
i
i
i
i
i
i
i
8.4 Reinforcement Learning
under stimulation captures transition probabilities via p
(1)
(z
t+1
|z
t
), as,
p
(1)
(j |1) =
0.2 j = 1,
0.8 j = 2,
0 otherwise,
p
(1)
(j |i) =
0.1 j = i 1,
0.1 j = i,
0.8 j = i + 1,
0 otherwise,
for i = 2, . . . , 9,
p
(1)
(j |10) =
0.2 j = 9,
0.8 j = 10,
0 otherwise.
(8.67)
Compare now the original transition probabilities
(8.66)
with the revised transition proba-
bilities under “stimulation”
(8.67)
. With these particular values in the model that we chose,
transitions with “stimulation” generally have a higher probability of pushing engagement
level up. In a Markov decision process, in addition to the state evolution
z
0
, z
1
, z
2
, . . .
, we
also have an action chosen by the controller or agent for any time
t
. Specifically the sequence
a
0
, a
1
, a
2
, . . .
denotes the actions, where in this example we can set that
a
t
= 0 implies “not
to stimulate” at time t and that a
t
= 1 implies to “stimulate” at time t.
If we now define
p
(0)
(
j |i
) as the original probabilities
(8.66)
, then each of the probabilities,
p
(a)
(j |i), for a {0, 1}, i, j {1, . . . , 10}, (8.68)
constitutes the transition probability from
z
t
=
i
to
z
t+1
=
j
if action
a
t
=
a
is chosen at
time
t
. Hence
(8.68)
is a specification of the environment (or system) and of how the agent’s
actions affect that system.
Now with the action sequence,
a
0
, a
1
, a
2
, . . .
, the state sequence
z
0
, z
1
, z
2
, . . .
does not evolve
autonomously, but rather depends on the decisions made
a
0
, a
1
, a
2
, . . .
. So if for example
the first decision is
a
0
= 0, the second is
a
1
= 0, and the third is
a
2
= 1, then given some
initial
z
0
, the first probabilities are
p
(0)
(
·|z
0
) which determine a random
z
1
; then given this
value, the second transition probabilities are
p
(0)
(
·|z
1
) which determine a random
z
2
; and
the third transition probabilities are
p
(1)
(
·|z
2
) which determine a random
z
3
, and so forth.
Having the sequence of decisions
a
0
, a
1
, a
2
, . . .
fixed a-priori is called an open loop control
and in our context this is not common. Instead, we typically wish to determine the action
a
t
based on the state
z
t
. This is called closed loop control or feedback control, because the
control decision is based on the current state. A control policy, or just a policy
12
is a function
of the form,
g
policy
: {1, . . . , 10}
| {z }
State space
{0, 1}
|{z}
Action space
,
where in more general examples the state space and action space may be much more complex.
Any policy function,
g
policy
(
·
), induces a Markov chain specific for that policy. This is because
with
a
t
=
g
policy
(
z
t
), the transitions
p
(a
t
)
(
·|z
t
) are used for that Markov chain. Hence in
12
This is in fact called a deterministic Markovian policy.
35
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
this example a policy can be seen as a rule for mixing (8.66) and (8.67). With this specific
example there are only 2
10
= 1
,
024 possible policies, yet with more complex state spaces or
action spaces, the number of possible policies can be huge.
The activity of finding the best feedback control policy for a Markov decision process is the
act of solving a Markov decision process. To do so, our model requires a reward function to
be specified. This reward function applies to every time step
t
, and captures the benefit of
being in a given state, and choosing a given action at that time. The reward function
13
can
be viewed as part of the Markov decision process model specification, and is of the form,
r : {1, . . . , 10}
| {z }
State space
× {0, 1}
|{z}
Action space
R.
Hence for example,
r
(4
,
0) is the reward obtained at a time where the state is
z
t
= 4 and
the chosen action is a
t
= 0 (no stimulation).
For our specific example, let us assume a reward function,
r(z, a) = z 1.5a, (8.69)
where higher student engagement levels have higher reward with a linear increase via the
z
component, and further, we pay a price of 1
.
5 reward units every time we set
a
t
= 1. Hence
for example
r
(4
,
0) = 4 and
r
(4
,
1) = 2
.
5. From a modelling perspective, this latter reward
being lower can be viewed as due to the cost of the prize for stimulation. Observe that the
reward is always from the viewpoint of the agent controlling the system (the student in this
particular example is the environment).
We now want to find a policy
g
policy
(
·
) that is best in terms of this reward over the whole time
horizon. There are multiple ways to accumulate reward and in our exposition we consider
only the infinite horizon expected discounted reward objective. Here, we first set
γ
(0
,
1)
which a fixed hyper-parameter called a discount factor. Then the contribution at time
t
is
taken to be,
γ
t
r(z
t
, g
policy
(z
t
)
| {z }
a
t
).
These contributions are accumulated to form the infinite horizon expected discounted reward
objective, which depends on the initial state z
0
= z and is,
V
g
policy
(z) = E
X
t=0
γ
t
r(z
t
, g
policy
(z
t
)
| {z }
a
t
)
z
0
= z
. (8.70)
The role of the discount factor,
γ
, is to capture the importance of near present times vs. far
future times. With
γ
low (near 0), far future times
t
have little effect on this contribution.
Similarly with
γ
high (near 1), these far future times have much more of an affect on the
contribution.
Observe that this objective function,
(8.70)
, is parameterized by the policy,
g
policy
(
·
), because
the policy determines how actions
a
t
are chosen, given values of
z
t
. Our goal is to find an
optimal policy that maximizes
(8.70)
, yet it may appear that there are multiple objectives
13
Here we focus on the time-homogenous reward function which does not depend on the current time.
36
i
i
i
i
i
i
i
i
8.4 Reinforcement Learning
here because
(8.70)
depends for every initial state
z
0
=
z
. Nevertheless, a property of the
type of Markov decision processes that we are using is that there exists a policy that can
maximize V
g
policy
(z) for all initial states z
0
= z. Hence we seek,
g
policy
= argmax
g
policy
V
g
policy
(z), for all initial states z. (8.71)
In our simple student engagement example, one may even find such a policy by enumerating
all possible policies and for each policy evaluating the expectation in
(8.70)
either via
Monte-Carlo simulation or via analytic properties of Markov chains. For example, if the
discount factor is at γ = 0.6 then an optimal policy turns out to be,
g
policy
(z) =
0 z {1, 2},
1 z {3, 4, 5, 6, 7},
0 z {8, 9, 10}.
(8.72)
It is not obvious a-priori why this is the best policy, but considering it we see that for
low and high engagement levels (1, 2, 8, 9, and 10) it is not worth to pay the price for
stimulation, whereas otherwise for intermediate engagement levels (3
,
4
,
5
,
6, and 7), it is
worth stimulating the student. The strength of Markov decision processes is that they expose
such policies, where the optimal control for any time
t
and current state
z
t
implicitly takes
the future evolution into account via (8.70).
While such a Markov decision processes framework is in principle very powerful, we are
faced with two problems. The first problem is to have better means than exhaustive search
for solving
(8.71)
to find optimal policies, and we discuss such means shortly. The second
problem is the fact that models can seldom be specified as we did here, with probabilities that
correctly capture reality. That is, realistic scenarios would involve very complex transition
kernels
p
(a)
(
j |i
) in contrast to the simplistic specification of probabilities in
(8.66)
and
(8.67). This second problem is handled by reinforcement learning methods.
Bellman Equations, the Value Function, and the Q-function
In characterizing the solution of
(8.71)
an important object is the value function. This real
valued function, denoted as V
(z), where z is any element of the state space, is defined as,
V
(z) = max
g
policy
V
g
policy
(z), (8.73)
where
V
g
policy
(
z
) is from
(8.70)
. Hence the value function determines the optimal infinite
horizon expected discounted reward, when starting at a state z
0
= z.
A result in the study of Markov decisions processes is that we can characterize the value
function via a non-linear system of equations called the Bellman equations. For the case of
finite state and action spaces, these equations are,
V
(z) = max
a
r(z, a) + γ
X
z
p
(a)
(z
|z) V
(z
)
| {z }
Q
(z,a)
, for all states z. (8.74)
Here the maximum is over all actions
a
(e.g.,
{
0
,
1
}
in the student engagement example).
Further, inside the maximum on the right hand side, we have the Q-function,
Q
(
z, a
), which
37
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
captures the “quality” of choosing action
a
on state
z
. Note that the summation in
(8.74)
is
taken over all states z
(e.g., {1, . . . , 10} in the student engagement example).
One can informally derive the Bellman equations (8.74) via what is known as the dynamic
programming principle where the first term of
Q
(
z, a
) is the immediate reward,
r
(
z, a
) and
the second part is the expected next step reward, discounted by one time step and hence
multiplied by
γ
. Theoretically it can be shown that any function
V
(
·
) that satisfies the
Bellman equations is the value function as in (8.73).
With more complex state spaces that are not necessarily discrete, we can rewrite the
Q-function as,
Q
(z, a) = r(z, a) + γ E
V
(z
t+1
) |z
t
= z, a
t
= a
, (8.75)
while keeping in mind that the time-homogenous assumptions imply that
Q
(
z, a
) is the
same for every time
t
and hence we can use
z
0
,
a
0
, and
z
1
in place of
z
t
,
a
t
, and
z
t+1
respectively. With the formulation of the Q-function as in
(8.75)
, we have a more general
expression for the Bellman equation
(8.74)
, where
P
z
p
(a)
(
z
|z
)
V
(
z
) is replaced by
E
[
V
(
z
1
)
|z
0
=
z, a
0
=
a
]. This formulation encompasses states spaces that are not discrete.
Importantly, knowing either the Q-function or the value function presents us with an optimal
policy. If we know the Q-function,
Q
(
·, ·
), then we can determine an optimal policy
g
policy
(
·
)
as in (8.71) via,
g
policy
(z) = argmax
a
Q
(z, a). (8.76)
If we know the value function,
V
(
·
), then we can first evaluate the Q-function via
(8.75)
where we compute the expectation using the explicit model (sometimes using Monte Carlo
simulation if needed) and then with this computed Q-function we can evaluate (8.76).
The classic study of Markov decision processes deals with existence and optimality of
solutions to the Bellman equations. It also deals with algorithms for solving these equations
to find the value function and hence to find an optimal policy via
(8.75)
and
(8.76)
. We
briefly discuss such solution methods now.
Solving Bellman Equations
The two most common algorithms for solving Bellman equations are value iteration and
policy iteration. We focus on value iteration on discrete state spaces. The algorithm is based
on the recursion,
14
Q
(t+1)
(z, a) = r(z, a) + γ
X
z
p
(a)
(z
|z)
max
a
Q
(t)
(z
, a
)
, for all z and a. (8.77)
Value iteration starts with some initial or arbitrary guess
Q
(0)
(
·, ·
) and then with each step
t
we apply
(8.77)
on all states
z
and all actions
a
to get from
Q
(t)
(
·, ·
) to
Q
(t+1)
(
·, ·
). The
algorithm terminates when some distance measure applied to
Q
(t)
(
·, ·
) and
Q
(t+1)
(
·, ·
) is
below a specified small threshold.
14
Note that one often writes the recursion in terms of
V
(
·
) and not in terms of Q-functions as we did
here. However, the two formulations are equivalent and our representation is preferable for understanding
Q-learning in the sequel.
38
i
i
i
i
i
i
i
i
8.4 Reinforcement Learning
In quite general settings, convergence of repeated applications of
(8.77)
to the optimal
Q-function is guaranteed and a proof of this relies on the fact that the right hand side of
(8.77)
is a contraction mapping. We omit the details. Yet with value iteration we do not
have an indication at what iteration the optimal policy has been discovered. Hence one may
need to apply (8.77) many times.
Note that policy iteration, the other algorithm that we mentioned above, remedies the
situation. On finite state and action spaces, policy iteration needs to execute for only a finite
number of steps before discovering an optimal policy. We do not discuss the policy iteration
algorithm further here because Q-learning is based on value iteration.
Back to the simple student engagement example presented earlier, each
Q
(t)
(
·, ·
) can simply
be implemented as a table with 10
×
2 = 20 entries since the state space is
{
1
, . . . ,
10
}
and
the action space is
{
0
,
1
}
. One sometimes informally calls this a Q-table. If we were to use
value iteration for finding the optimal policy, we would first initialize
Q
(0)
(
·, ·
) with some 20
arbitrary values. We would then apply
(8.77)
for each
z {
1
, . . . ,
10
}
and
a {
0
,
1
}
. This
application would directly use the probabilities specified in
(8.66)
and
(8.67)
, and the reward
function specified in
(8.69)
. Applying such value iteration steps is thus straightforward to
execute recursively. After multiple steps, we can then determine the policy using
(8.76)
and
this is in fact how we obtained the example optimal policy
(8.72)
. However, more complex
examples are harder to implement and importantly in realistic scenarios we often do not
know the exact transition probabilities and reward function. Instead, we take the approach
of learning the Q-function, which we describe now.
Q-learning
The idea with Q-learning is to learn the Q-function
(8.75)
and obtain some estimate
ˆ
Q
(
z, a
)
for all states
z
and all actions
a
. Learning the Q-function does not mean learning, the
individual components,
r
(
·, ·
),
p
(·)
(
·|·
), and
V
(
·
). It rather means learning
ˆ
Q
(
·, ·
) as a whole.
One can then use this estimate in
(8.76)
in place of
Q
(
z, a
) to obtain a policy that is
approximately optimal via,
ˆg
policy
(z) = argmax
a
ˆ
Q(z, a). (8.78)
Before seeing how Q-learning learns
ˆ
Q
(
·, ·
), we mention that as a reinforcement learning
algorithm, one sometimes applies Q-learning in parallel to controlling a system, or controlling
a simulation of the system. This is different from other learning paradigms in this book
where the learning and production activities are often separate. Such a mix of controlling
and learning involves ongoing estimates of the Q-function,
ˆ
Q
(t)
(
·, ·
) where at any time
t
we
use
ˆ
Q
(t)
(·, ·) in place of
ˆ
Q(·, ·) for (8.78).
When carrying out such a mix of learning and controlling, we know that at time
t
, the latest
ˆ
Q
(t)
(
·, ·
) is only an estimate. Hence we also employ state exploration as part of the policy.
For this one typical approach known as the epsilon greedy approach, uses some pre-specified
decreasing sequence of probabilities,
ε
0
, ε
1
, ε
2
, . . .
, and at any time
t
, we control the system
with a randomized policy,
ˆg
(t)
policy
(z) =
argmax
a
ˆ
Q
(t)
(z, a), with probability 1 ε
t
,
random action a, with probability ε
t
.
(8.79)
39
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
With this epsilon greedy approach we know that at time
t
there is a chance of
ε
t
of selecting
an arbitrary action that is most likely sub-optimal. Yet it allows us to potentially navigate
the system to parts of the state space that would otherwise remain unexplored.
Now let us consider the Q-learning algorithm. Here the idea is to update some part of
ˆ
Q
(t)
(
·, ·
) at any time step based on the following available information: (i) the previous
state
z
t
; (ii) the new state
z
t+1
; (iii) the previous action chosen
a
t
; (iv) the observed reward
denoted as
r
t
which is the reward after applying the action at time
t
. For this, Q-learning
relies on hyper-parameters,
α
0
, α
1
, α
2
, . . .
, a pre-specified decreasing sequence of probabilities.
The recursion of Q-learning is,
ˆ
Q
(t+1)
(z, a) =
(1 α
t
)
ˆ
Q
(t)
(z, a) + α
t
r
t
+ γ max
a
ˆ
Q
(t)
(z
t+1
, a
)
| {z }
Single sample Bellman estimate
, if z
t
= z, a
t
= a
ˆ
Q
(t)
(z, a), otherwise.
(8.80)
Key to
(8.80)
is that we only update the Q-function estimate for one specific
z
t
,
a
t
pair at
any time step. Further observe that this update is a weighted average of the previous entry
ˆ
Q
(t)
(
z, a
) and a new component which we denote as the single sample Bellman estimate. This
term is directly motivated by the value iteration recursion
(8.77)
and we can observe that it
agrees with the right hand side of
(8.77)
if we were to ignore the probabilities
p
(a)
(
z
|z
).
This general form of approximation is called a stochastic approximation and its theoretical
analysis allows one to prove certain convergence properties of Q-learning. We omit these
details.
From an operational point of view, we can now integrate the epsilon-greedy control of
(8.79)
with the Q-learning recursion of
(8.80)
to develop a learning scheme that both controls the
system and learns how to control it as time progresses. With such a scheme, calibration
of the hyper-parameter sequences
ε
0
, ε
1
, ε
2
, . . .
and
α
0
, α
1
, α
2
, . . .
is often delicate, and one
often has to experiment with various hyper-parameter settings in order to get useful results.
A theoretical result for Q-learning is that under certain conditions we need the sequence
α
0
, α
1
, α
2
, . . . to satisfy,
X
t=0
α
t
= , and
X
t=0
α
2
t
< . (8.81)
Hence the probabilities need to decay quickly enough, but not too quickly. So for example
probabilities such as
α
t
= 1
/
(
t
+1) suffice since with these the first (harmonic) series diverges
while the second series converges.
While theoretical results based on the condition
(8.81)
help to place Q-learning on a rigorous
footing, from a practical perspective Q-learning on its own is difficult to use effectively.
Even in our simple student engagment example where we try to learn the 10
×
2 Q-table,
Q-learning can be challenging. At any time step we only update a single entry of this table.
This is already slow, and is further slowed down due to the fact that if for non small times
t
,
α
t
is a low probability, then the averaging in
(8.80)
would keep the new entry
ˆ
Q
(t+1)
(
z, a
)
not far from the previous entry
ˆ
Q
(t)
(
z, a
). In more complex and realistic scenarios where the
state and action spaces are big and sometimes non-discrete, we cannot even tabulate the
Q-function in a naive Q-table. Hence approximate Q-learning needs to be carried out. This
brings us to deep reinforcement learning.
40
i
i
i
i
i
i
i
i
8.4 Reinforcement Learning
Deep Reinforcement Learning
The key idea with deep reinforcement learning is to approximate the Q-function,
Q
(
z, a
),
with a neural network,
f
Q
θ
(
z, a
). Such a setup allows us to deal with highly complex state
spaces and action spaces. The parameters of this network are learned during a deep Q-
learning algorithm where as the algorithm evolves, we have a sequence of learned parameters
θ
(1)
, θ
(2)
, . . .
. With each such
θ
(t)
, we can still use an epsilon greedy policy similar to
(8.79)
where the control decision is taken as,
g
policy
(z, θ
(t)
) =
argmax
a
f
Q
θ
(t)
(z, a), with probability 1 ε
t
,
random action a, with probability ε
t
.
(8.82)
The key is now how to learn
f
Q
θ
(
·, ·
) and for this there are many variants of the deep
Q-learning algorithm of which we only outline the simplest form. To create a loss function
that will help to learn the parameters
θ
, we use a reinforcement learning concept called
temporal difference learning. Here the loss at time t is given by,
C
t
(θ ;z
t
, a
t
, r
t
, z
t+1
) =
r
t
+ γ max
a
f
Q
θ
(z
t+1
, a)
| {z }
Single sample Bellman estimate
f
Q
θ
(z
t
, a
t
)
2
, (8.83)
where the observed state is
z
t
, the system is controlled via the action
a
t
, a reward of
r
t
is
obtained, and the resulting next state is
z
t+1
. With such a temporal difference loss, if the
neural network parameters
θ
determine a good approximation for the actual Q-function,
then the loss would be generally low, whereas if
θ
does not describe the Q-function well,
then the loss is higher. To see this, revisit the value iteration equation (8.77).
Now with the neural network loss
(8.83)
defined, we have a very basic deep Q-learning
algorithm. We control the system using the control policy of
(8.82)
and at any time
step
t
we apply a single gradient descent update for the parameters
θ
based on the loss
C
t
(
·
;
z
t
, a
t
, r
t
, z
t+1
). The gradient descent update then modifies the Q-function estimate
from
f
Q
θ
(t)
(
·, ·
) to
f
Q
θ
(t+1)
(
·, ·
). Note that like the Q-learning algorithm (without a neural
network approximation) described above, in this framework every time involves both a
control decision and learning.
In practice, this basic deep Q-learning algorithm typically does not perform well. One
problem is the coupling of the gradient descent step and the control decision, both happening
only once at any time
t
. In practice we often seek an algorithm that can on the one hand
learn the Q-function approximation using multiple control steps, and on the other hand
perform multiple gradient descent steps. For this, one popular variant is to maintain two
copies of the network approximating the Q-function,
f
Q
θ
(
z, a
). One copy is updated only
every several time steps and is used for the control decisions
(8.82)
, and the other copy
is updated with every gradient descent step. Another concept often used is reply memory
where we control the system for some multiple time steps and use the combination of these
time steps for gradient based learning. We do not dive into these technical details here.
41
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
8.5 Graph Neural Networks
In this section we introduce and explore neural networks for graph objects, called graph
neural networks (GNNs). In a similar way to how convolutional neural networks are primarily
used for image data, and recurrent neural networks are primarily used for sequence data,
graph neural networks are applied on data organized as a (combinatorial) graph. The input
data is typically of the form
G
= (
V, E
) together with related feature data, where
V
is some
vertex set of the graph, and
E
is the edge set of that graph. We introduce basic concepts of
graphs in the sequel.
Abstractly, a graph neural network can be viewed as a function
f
θ
: G × F output, (8.84)
where
G
is the set of possible input graphs
G
,
F
is the set of possible input features, and
the output may be a graph, a classification vector, a scalar value, or similar. Specifically,
the features in
F
may include weights on edges, or much more complex features associated
with edges and nodes. In similar spirit to other deep learning models, graph neural networks
often implement
f
θ
(
·
) via a composition of a sequence of steps or layers, each denoted
f
[]
θ
[]
(
·
)
as in (??). At each such layer , the input graph and features are transformed.
Applications of Graph Neural Networks
In contrast to domains such as image classification using convolutional neural networks, or
natural language translation using sequence models, the applications associated with graph
neural networks often require some transformation of the given problem into a graphical
representation. Our focus in this section is not on such transformations for applications;
see the notes and references at the end of this chapter for specific reading suggestions. An
important point to highlight is that most applications of GNNs are not for mimicking human
level performance (e.g., image recognition or text understanding), but are rather for gaining
insights from complex data, that is otherwise hard to process. We now mention a few general
application domains. See Figure 8.10 for an illustration.
42
i
i
i
i
i
i
i
i
8.5 Graph Neural Networks
Emily
Kayley
Yarden
Adaya
?
?
(a)
(b)
C
O N
CH
H
H
C
N
C
N
C H
H
H
C
C
O
N
C
H
H
H
(c)
Figure 8.10: An illustration of graph structures arising in several application domains. (a)
Connections in social networks described via graphs. The presence of edges with question marks
is not known and can be predicted via a graph neural network. (b) Knowledge graphs capturing
relationships between entities. Learning about such relationships can be assisted with graph neural
networks. (c) Molecular bonds can be described via graphs. Thus classification of molecular structures
and the design and discovery of new structures, is also aided by GNN.
One key area of application is social network analysis, where GNNs assist in identifying
influential users, detecting communities, and predicting connections between individuals.
Additionally, in recommendation systems, GNNs enhance the accuracy and personalization of
suggestions by modelling user-item interactions within collaborative filtering settings. Another
prominent application is in fraud detection, particularly in financial and e-commerce contexts,
where GNNs uncover hidden relationships and anomalous patterns within transaction graphs.
In the fields of biology and chemistry, GNNs excel at modelling molecular structures
and interactions, offering valuable insights into drug discovery, protein-protein interaction
prediction, and chemical property estimation. GNNs are also widely used in traffic analysis
for optimizing transportation networks, predicting congestion, and enhancing routing and
scheduling.
All of the above applications are generally cases where the tasks are not supposed to replace
human level tasks, yet there are also situations where GNNs can enhance or replace human
level tasks. Specifically, GNNs have applications in image analysis, enabling tasks like image
segmentation, object tracking, and scene understanding, particularly when graphs represent
relationships between image regions or structures. Further, in the natural language processing
domain, GNNs enhance tasks such as named entity recognition, sentiment analysis, and text
classification by analyzing text data represented as dependency trees or semantic graphs.
Graph Structures
As alluded to in
(8.84)
the input to a GNN consists of a graph and features which are
elements of
G
and
F
respectively. We now discuss possible representations of such objects.
We begin with a brief outline of graph-theoretic terminology. See Figure 8.11 as a guide.
Recall that a graph
G G
, is often denoted
G
= (
V, E
). The graph is composed of a node
set
V
and an edge set
E
. The node set can be denoted as
V
=
{v
1
, v
2
, . . . , v
r
}
where each
v
i
43
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
is a node (also known as a vertex), and each of the
r
nodes can represent a distinct entity
in the application domain (e.g., a single person, an atom, etc.). The edge set
E
is a subset
of
V × V
, or in particular is composed of tuples of the form (
v
i
, v
j
) where each such tuple
represents an edge connecting
v
i
and
v
j
. In our terminology we do not allow the elements
(v
i
, v
i
) to be in E, that is, there are no self loops.
In some cases the graph is a directed graph, where (
v
i
, v
j
) is different from (
v
j
, v
i
) for
i
=
j
,
and the former represents an edge (arrow) from
v
i
to
v
j
, while the latter is in the opposite
direction. In other cases, the graph is an undirected graph in which case we can either treat
(v
i
, v
j
) as unordered, or more formally require that if (v
i
, v
j
) E then also (v
j
, v
i
) E. In
the undirected case, edges are not represented as arrows but rather as links between nodes.
v
1
v
2
v
3
v
4
v
5
v
6
v
7
v
8
0 0 1 0 0 0 0 0
0 0 1 0 1 0 0 0
1 1 0 1 0 1 0 0
0 0 1 0 0 0 0 0
0 1 0 0 0 1 1 0
0 0 1 0 1 0 0 1
0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0
1
2
3
4
5
6
7
8
1 2 3
4
5 6
7
8
(a)
v
1
v
2
v
3
v
4
v
5
v
6
v
7
v
8
0 0 1 0 0 0 0 0
0 0 0 0 1 0 0 0
0 1 0 1 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 0 0 1 1 0
0 0 1 0 0 0 0 1
0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0
1
2
3
4
5
6
7
8
1 2 3
4
5 6
7
8
(b)
Figure 8.11: Directed and undirected graphs each with their associated adjacency matrix. (a) An
undirected graph has a symmetric adjacency matrix. (b) A directed graph has an adjacency matrix
that is typically not symmetric.
44
i
i
i
i
i
i
i
i
8.5 Graph Neural Networks
For undirected graphs, the degree of a node
v
i
is the number of edges connected to it, or more
formally the number of nodes
v
j
such that (
v
i
, v
j
)
E
. For directed graphs we differentiate
between the out-degree of node v
i
and the in-degree. The former is the number of nodes v
j
such that (
v
i
, v
j
)
E
, while the latter is the number of nodes
v
j
such that (
v
j
, v
i
)
E
. In
the undirected case, both out-degree and in-degree are the same at each node.
In the context of undirected graphs, the neighbours of a node
v
i
are the nodes
v
j
that are
connected to
v
i
with an edge. We denote the set of indices of these neighbours via
N
(
v
i
).
Hence the degree is the number of elements in this set. See for example, the undirected
graph illustrated in Figure 8.11 (a) where
N
(
v
1
) = 1,
N
(
v
2
) = 2,
N
(
v
3
) = 4,
N
(
v
4
) = 1,
N(v
5
) = 3, etc.
A path between two nodes
v
i
and
v
j
is a sequence of nodes (
v
k
1
, v
k
2
, . . . , v
k
m
) where
v
k
1
=
v
i
,
v
k
m
= v
j
and (v
k
, v
k
+1
) E for = 1, . . . , m 1. A graph is said to be connected if there
is a path between any two nodes
v
i
, v
j
V
. One trivial graph (which is also connected)
is the complete graph (also known as the fully connected graph) where (
v
i
, v
j
)
E
for all
v
i
, v
j
V .
Mathematically, one way to represent a graph is via an adjacency matrix where we number
the nodes as 1, 2, . . . , r. Such an r × r matrix A has entries that are either 0 or 1 where,
A
ij
=
(
1 if (v
i
, v
j
) E,
0 otherwise.
(8.85)
For undirected graphs, the adjacency matrix is symmetric (
A
ij
=
A
ji
) whereas for directed
graphs it is not necessarily symmetric. In either case, since we do not allow self loops,
A
ii
= 0
for all
i
. When representing a graph in computer memory, sometimes an adjacency matrix is
suitable, yet at other times, when the graph has fewer edges, sparser representations called
adjacency lists are more suitable.
Since the node numbering is typically arbitrary, it is often useful to allow all permutations
of the node numbering to be valid. Mathematically, we can express this property with the
aid of an r ×r permutation matrix P with,
P
ij
=
(
1 if j replaces i in the permutation,
0 otherwise.
(8.86)
When
P
is applied to a vector
x
, namely
˜x
=
P x
, the result
˜x
has
x
j
in at index
i
. Similarly
when
P
is left multiplied to another matrix, the matrix’s rows are permuted according to the
permutation encoded in
P
. Also when
P
is right multiplied to another matrix, the matrix’s
columns are permuted as such. Finally, applying the permutation to the adjacency matrix
we have, that the new matrix
P AP
represents the adjacency matrix after transforming
the numbering of the nodes according to the permutation described by P .
A graph is called a weighted graph if for every edge (
v
i
, v
j
)
E
there is also an associated
weight
w
ij
R
with the edge. One can also associate weights with all elements of
V ×V
and consider a weight of 0 as the absence of the edge. In this case the adjacency matrix can
be enhanced to an adjacency weight matrix which we still denote as
A
, and have
A
ij
=
w
ij
.
45
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
(a) (b)
Figure 8.12: Transductive vs. inductive learning. (a) Transductive learning involves a single large
input graph. (b) Inductive learning involves multiple separate input graphs.
With this representation,
A
ij
=
(
w
ij
if (v
i
, v
j
) E,
0 otherwise.
One example application of a weighted graph is a map of cities where edges between cities
are roads and the weights are the distances in kilometers. Note that this is also a planar
graph since it can be “drawn” on a plane. Not all graphs are planar. We also mention that
a more general object than a graph is a multi-graph which permits multiple distinct edges
(e.g., roads) between vertices. Some graph neural networks deal with multi-graphs, yet we
do not go into this specialization in this section.
The definitions above associated with
G G
are general graph-theoretic concepts, not
specific to graph neural networks. Yet in graph neural networks, there are also additional
features, which we denote as elements of
F
. Specifically, each individual node
v
i
can carry
an associated feature vector
x
(i)
which we generally consider an element of
R
p
V
, for some
p
V
1. For example, in social networks applications where each individual node is a person,
x
(i)
denotes the
p
V
features associated with the person such as age, marital status, etc. We
call these features node level features.
Similarly to node level features, in certain cases we can also consider edge level features.
Here we have the features
x
(ij)
for edge (
v
i
, v
j
), and we assume
x
(ij)
R
p
E
, for some
p
E
1. One can treat a weighted graph as a very simple case where
x
(ij)
=
w
ij
and
p
E
= 1. However, in most cases that involve edge level features,
p
E
>
1. For example, in a
transportation case where edges are roads (or transportation links) we may have each
x
(ij)
represent multiple characteristics of the road such as the number of lanes, the speed limit,
toll information, etc.
In this section’s exposition of graph neural networks we generally ignore edge level features
and focus on data with node level features. In such a case, the features can be organized in a
r ×p
V
matrix
X
where each row represents the features of node
v
i
, namely
x
(i)
. That is,
F
is
the set of all possible feature matrices
15
X
. Note also that if we wish to apply a permutation
to the node numbering as encoded in a permutation matrix
P
, then the permuted feature
matrix is P X.
15
Note that if one was to organize edge level features in such an object then a tensor of dimension
r × r × p
E
would be appropriate.
46
i
i
i
i
i
i
i
i
8.5 Graph Neural Networks
The Structure of Input Data and Tasks
Similarly to the rest of the book, we generally denote the data via
D
. For graph neural
networks this data can come in various forms:
D =
G, X
, for transductive learning (i),

(G
(1)
, X
(1)
), y
(1)
, . . . ,
(G
(n)
, X
(n)
), y
(n)

, for inductive learning (ii),
e
X
(G,X)
, for graph embedding (iii).
(8.87)
The forms (i) and (ii) in
(8.87)
are for the two overarching graph neural network learning
paradigms, namely, transductive learning and inductive learning. In the case of transductive
learning the data is one (big) input graph together with the features. In the case of inductive
learning, the data consists of
n
different graphs (potentially each of them smaller), each with
its own features as well, and also potential labels,
y
(i)
. See Figure 8.12 for an illustration of
the difference between transductive and inductive learning.
To further understand the difference between transductive and inductive learning consider
the following illustrative examples. For transductive learning, consider the social networks
domain where we treat a big social network with some missing data as an input and then
learn a model for predicting potential connections between individuals (predicting missing
edges). As an inductive learning example consider classification of molecules where for
example
y
(i)
= 0 implies a non-toxic molecule and
y
(i)
= 1 implies that molecule is toxic.
The training data is then composed of many input graphs, some of which have
y
(i)
= 0
and some of which have
y
(i)
= 1. We then train a model that operates on an input graph
(molecule structure) and outputs
ˆy
which is the estimated probability of being toxic. It can
then be transformed into a decision rule, as with any binary classifier.
Form (iii) of the data in
(8.87)
is different. In this case the data is no longer a graph but
rather a transformation of graphical data into vector form using techniques called graph
embeddings. Here, in a similar nature to word embeddings in the context of natural language
processing (see Section
??
), input graph data and features (
G, X
) are pre-processed to create
a matrix
e
X
(G,X)
which summarizes the graph and the features via real valued vectors. In
contrast to the first two forms of data (i) and (ii) in
(8.87)
, which are used as input to
graph neural networks, the graph embedding form (iii) can be used as input to feedforward
networks or other (non graphical) models trained on
e
X
(G,X)
. Interestingly, some graph
embedding techniques themselves are based on graph neural networks; we omit the details.
16
Given graphical data of the form
(8.87)
, one can train a model as in
(8.84)
for various
different tasks. These can be dichotomized as tasks on nodes, tasks on edges, or tasks on
graphs. In the first case the output is associated with nodes and our main goal is to predict
outcomes, or impute missing features, for specific nodes in the graph. In the second case the
output is associated with edges and our goal is to determine the presence of certain edges
that were not originally available in the data, or similarly predict outcomes associated with
available edges. In the third case the output is associated with the whole graph and in this
case we predict properties of the graph, including classification of graphs, regression, and
similar. See Figure 8.13 for an illustration.
16
Various techniques for graph embedding include DeepWalk, node2vec, GraphSAGE, LINE (Large-scale
Information Network Embedding), and HOPE (High-Order Proximity preserved Embedding).
47
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
?
(a)
?
?
(b) (c)
Figure 8.13: Various forms of tasks for graph neural networks. (a) Tasks on nodes deal with
inference about individual nodes. For example classification of the type of node. (b) Tasks on edges
deal with inference about individual edges. For example the existence of an edge or not. (c) Tasks
on graphs deal with inference about the whole graph. For example classification of the graph.
Generally, tasks on graphs are carried out in a inductive setting as we require form (ii) of
the training data from
(8.87)
. In contrast, tasks on nodes or tasks on edges can be carried
out both in an inductive and a transductive level.
The General Structure of a Graph Neural Network Model
In general, a graph neural network model is a function of graph data
x
from
D
of
(8.87)
.
Like
(??)
of Chapter
??
we construct the neural network
f
θ
(
x
) of
(8.84)
with a recursive
computation of layers,
f
θ
(
x
) =
f
[L]
θ
[L]
(
f
[L1]
θ
[L1]
(
. . .
(
f
[1]
θ
[1]
(
x
))
. . .
)). Here
f
[]
θ
[]
(
·
) is the
-th layer,
and
θ
[]
are the parameters of that layer. As with other deep learning models, we denote
a
[]
as the result of
f
[]
θ
[]
(
a
[1]
) and
a
[0]
=
x
, keeping in mind that
x
contains both the graph
structure and features.
Some models are dynamic graph neural networks in which case the graph structure is modified
as it is processed by the neural network layers. Other models, are static graph neural networks
for which
G
= (
V, E
) is not modified across layers, and thus
G
has the same value within
each layer. In such a case, it is convenient to represent
a
[]
= (
h
[]
, G
) where
h
[]
is sometimes
called the hidden state. We set
h
[0]
as the input features from
x
and denote the neural
network function as
f
θ
(
h
[0]
, G
). With this representation, the function of each layer can be
denoted as
f
[]
θ
[]
(
h
[1]
;
G
). Note that
G
is fixed for all layers, but with each application
of the complete
f
θ
(
h
[0]
, G
), a different graph structure
G
is possible. As with feedforward
neural network models, the parameters of the model are θ = (θ
[1]
, . . . , θ
[L]
).
48
i
i
i
i
i
i
i
i
8.5 Graph Neural Networks
Our focus in this exposition is only on static graph neural networks for which the forward
pass
17
is
h
[1]
= f
[1]
θ
[1]
(
Input features
z}|{
h
[0]
; G)
h
[2]
= f
[2]
θ
[2]
(h
[1]
; G)
.
.
.
h
[L1]
= f
[L1]
θ
[L1]
(h
[L2]
; G)
ˆy
|{z}
output
= f
[L]
θ
[L]
(h
[L1]
; G)
The specific structure of
h
[]
can vary between applications and may be a matrix, or a higher
dimensional tensor. In our case, for simplicity, let us assume it is a matrix of dimension
r ×p
V
where the
i
-th row represents the node level features for node
v
i
in the graph, and
there are
r
nodes in total. In general one may have different column dimensions (number of
features) per layer
, yet for simplicity let us assume that this is fixed as
p
V
throughout the
layers.
An important requirement of the layer function
f
[]
θ
[]
(
h
[]
;
G
) is permutation invariance where
the order of nodes (and consequently, the order of entries in the adjacency matrix) does
not affect the network’s output. To understand this requirement, assume that we represent
G
via an adjacency matrix
A
, and hence the layer’s operation can be represented with
A
in place of
G
, namely we can denote the layer’s function as
f
[]
θ
[]
(
h
[1]
;
A
). Now for any
permutation on the nodes represented via P as in (8.86), we require that,
f
[]
θ
[]
( Ph
[]
|{z}
Permuted hidden state
;
Permuted adjacency matrix
z }| {
P AP
) = P f
[]
θ
[]
(h
[]
; A)
| {z }
Permuted hidden state at layer +1
. (8.88)
The permutation invariance requirement in
(8.88)
then ensures that node numbering is
indeed arbitrary and does not does not affect the operation of the model.
Let us now dive into the structure of a single layer except for the final layer. Namely,
let us see the structure of
f
[]
θ
[]
(
h
[1]
;
G
) for
= 1
, . . . , L
1. Here, similarly to the
convolutional neural networks of Chapter
??
, graph neural networks try to enforce locality
and translation invariance; see Section
??
. The translation invariance property is analogous
to the permutation invariance property of
(8.88)
. Locality is enforced by requiring that the
output of
f
[]
θ
[]
(
h
[1]
;
G
) for node
v
i
only depends on the neighbours of
v
i
. Specifically in
our data representation it means that only the rows with indices of neighbours
N
(
v
i
) as
well as
v
i
itself are used to compute the
i
-th row of
h
[]
. Permutation invariance is enforced
by using the same form of function (and same parameters) for each target output node, and
not considering the actual index of a node but rather only the graphical structure.
17
Compare with the feedforward network forward pass in (??).
49
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
Specifically, if we denote the
i
-th row of
h
[]
via
h
[]
(i)
then the computation of the layer
associated with node v
i
can be represented via,
h
[]
(i)
= f
[]
node
[]
h
[1]
(j)
for v
j
N(v
i
) {v
i
}
, (8.89)
where the function
f
[]
node
[]
(
·
) determines how the hidden state of a node in the next layer
is determined by the hidden state of the node and its neighbours in the previous layer.
The final layer
L
is typically different because the output
ˆy
is often not of the same dimension
as h
[]
. In this case, we just retain the final layer action via the general function f
[L]
θ
[L]
(·).
Message Passing Schemes
The operation of
(8.89)
is based on a so-called message passing scheme where two steps
called aggregate and update
18
are executed one after the other. These steps break up the
operation of the
f
[]
node
[]
(
·
) function via, a function
f
aggregate
(
·
) and a function
f
[]
update
[]
(
·
)
and are executed as,
m
[]
(i)
= f
aggregate
h
[1]
(j)
for v
j
N(v
i
) {v
i
}
,
h
[]
(i)
= f
[]
update
[]
h
[1]
(i)
, m
[]
(i)
.
Observe that the aggregate function is the same for all layers and is not parameterized, while
the parameters for layer , θ
[]
are associated only with the update function.
As is evident, the aggregation step utilizes all the hidden states of the neighbours of the
node, say
v
i
, to achieve a summary,
m
[]
(i)
, which we call the message. Note that in some cases
it also uses node
v
i
itself (self loops) and in other cases not (no self loops). The message
collects hidden state information of neighbouring nodes.
In our exposition, we assume that the message for node
i
is an element of
R
p
V
. Similarly
to the hidden state, for simplicity, we assume this dimension,
p
V
, does not depend on the
layer .
Typical aggregation functions are the sum, the mean, or the element-wise maximum. In our
exposition we focus on the sum as an illustrative simple case where,
m
[]
(i)
=
P
j : v
j
∈N (v
i
)
h
[1]
(j)
, for sum aggregation without self loops,
P
j : v
j
∈N (v
i
)
h
[1]
(j)
+ h
[1]
(i)
, for sum aggregation with self loops.
(8.90)
The aggregate and update steps are carried out for all nodes in the graph and thus one
may view this scheme as messages passing along neighbours within the graph. Note that
this type of architecture is sometimes called a message passing neural network (MPNN); see
Figure 8.14 for an illustration.
18
Sometimes this is called combine.
50
i
i
i
i
i
i
i
i
8.5 Graph Neural Networks
v
3
v
1
v
2
v
4
v
5
v
6
Target Node
(a)
v
3
Aggregate
v
1
v
2
v
4
Aggregate
Aggregate
Aggregate
v
3
v
2
v
3
v
1
v
6
v
3
v
5
(b)
Figure 8.14: Aggregation in a message passing scheme. (a) An input graph where the messages for
node
v
3
are considered. (b) The application of the aggregation via multiple layers (2 layers in this
case) yielding the message aggreagated for the node. Observe the neighbour of v
3
in each layer.
Considering the whole graph, note that like the
r × p
V
hidden state matrix
h
[]
, we can also
consider an
r × p
V
message matrix
m
[]
, where the
i
-th row is
m
[]
(i)
. Now focusing on the
unweighted case, using the graph’s adjacency matrix
A
from
(8.85)
, one can verify that the
message matrix resulting from the sum aggregation (8.90) is,
m
[]
=
A h
[1]
, for sum aggregation without self loops,
(A + I) h
[1]
, for sum aggregation with self loops.
(8.91)
The update step is where a neural network approach is used. In this step, the hidden state
of the node,
h
[1]
(i)
, and the message
m
[]
(i)
are used to determine the hidden state of the
node in the output of the
-th layer. The update function is typically composed of an affine
transformation together with a non-linear activation. A simple typical form of
f
[]
update
[]
(
·
)
is,
h
[]
(i)
= S
m
[]
(i)
W
[]
m
+ h
[1]
(i)
W
[]
u
+ b
[]
, (8.92)
where we treat the vectors as row vectors. Here
S
(
·
) is a vector activation function typically
formed of scalar activation functions such as
σ
Sig
(
·
), similarly to other neural networks.
With this form, the learned parameters
θ
[]
= (
W
[]
m
, W
[]
u
, b
[]
) include weight matrices and a
bias vector, and under our (simplifying) dimensionality assumptions, we have
W
[]
m
, W
[]
u
R
p
V
×p
V
, and
b
[]
R
p
V
(considered as a row vector). Note that more generally, dimensions
may vary across layers and hence the matrices may be non-square.
If we focus on sum aggregation without self loops, then (8.92) becomes,
h
[]
(i)
= S
X
j:v
j
∈N (v
i
)
h
[1]
(j)
W
[]
m
+ h
[1]
(i)
W
[]
u
+ b
[]
. (8.93)
51
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
Further, in this case, if we consider the whole graph, we can use
(8.91)
to represent
(8.93)
via a network wide equation,
h
[]
= S
Ah
[1]
W
[]
m
+ h
[1]
W
[]
u
+ B
[]
, (8.94)
where in similar nature to
(??)
of Chapter
??
,
B
[]
is a
r ×p
V
matrix with each row equal
to the bias vector
b
[]
. Note that here
S
(
·
) is taken as the activation function over the whole
matrix, typically element wise.
It is also of interest to note that in case where we restrict the parameters with
W
[]
m
=
W
[]
u
,
both denoted as W
[]
, then (8.94) is reduced to
h
[]
= S
(A + I)h
[1]
W
[]
+ B
[]
. (8.95)
This case of
W
[]
=
W
[]
m
=
W
[]
u
yields a similar update rule to what we would have if we
consider sum aggregation with self loops; see (8.91).
In practice, one can make a choice if to use the formulation with more parameters
(8.94)
or
the less parameterized formulation
(8.95)
. With the latter, there can be some restriction on
the expressivity of the graph neural network as there is no separation of the information
from the node and from its neighbours. Nevertheless in some cases, the less parameterized
case, (8.95) suffices.
Model Variants
We close this section with a few model variants of graph neural networks. Each of the variants
has its advantages and disadvantages, and the applicability of the variants for applications
is beyond our scope. Our purpose here is simply to explore the various basic ideas and
equations associated with each of the models. We consider graph convolutional networks,
spectral approaches, and the use of the attention mechanism.
In a graph convolutional network, (8.95) is modified to,
h
[]
= S
˜
D
1
2
(A + I)
˜
D
1
2
h
[1]
W
[]
+ B
[]
, (8.96)
where the matrix
˜
D
is a diagonal matrix where
˜
D
ii
is the degree of node
v
i
plus one. Thus
˜
D
1
2
is a diagonal matrix with entries that are inverse of the square root of the degree plus
one. At the node level, again representing vectors as row vectors, we can unpack (8.96) for
node v
i
as,
h
[]
(i)
= S
X
j:v
j
∈N (v
i
)
1
q
˜
D
ii
˜
D
jj
h
[1]
(j)
W
[]
+
1
˜
D
ii
h
[1]
(i)
W
[]
+ b
[]
. (8.97)
If we compare
(8.97)
with
(8.93)
, we see that a graph convolutional network is similar to sum
aggregation, yet with weighting in the summation proportional to the degrees. Specifically,
when considering the update for the hidden state
h
[]
(i)
of node
v
i
, the weight of the hidden
state of neighbouring nodes that have more neighbours than
i
is reduced, and conversely
52
i
i
i
i
i
i
i
i
8.5 Graph Neural Networks
the weight for nodes that have less neighbours is increased. This scaling acts as a form of
regularization in the network.
Graph convolutional networks can be extended with a spectral approach. Specifically let us
now outline key ideas of spectral graph neural networks, also known as spectral convolutional
graph neural networks. Generally, in the world of signal processing and mathematics, a
spectral approach deals with analyzing a transform of a signal in place of the signal itself.
For example in the context of time signals, as briefly discussed in Section
??
, instead of the
signal, one may sometimes consider the Fourier transform of the signal. Then based on the
so-called convolution theorem, the Fourier transform of the convolution between two signals,
as for example in equation
(??)
of Chapter
??
, can be represented as the product of the
Fourier transforms of the individual signals.
19
In the context of graphs, an analogy to the convolution theorem can be considered using
eigenvalue decompositions of matrices, which is the topic of study of an area called spectral
graph theory. Let us focus on undirected graphs that are connected, and thus the degree of
each node is at least one.
In spectral graph theory, an important matrix associated with a graph with
r
nodes is the
r ×r
Laplacian matrix, appearing in either the unnormalized form, or the normalized form,
and defined as,
L =
(
D A (unnormalized),
I D
1
2
AD
1
2
(normalized).
Here, similarly to
˜
D
from above,
D
a diagonal matrix where this time
D
ii
is the degree of
node v
i
.
One may consider a vector
u
[]
R
r
which represents some state values for each node in the
graph, and the operation
u
[+1]
= Lu
[]
, (8.98)
can encompass the effect of applying the Laplacian matrix of the graph (either unnormalized
or normalized) on the state
u
[]
to achieve the next state
u
[+1]
. Interpretations of this
operation for specific graph contexts are beyond our scope, yet we mention that in the
context of random walks on graphs, electrical networks (Kirchhoff laws), and other contexts,
this operation is common.
A key idea in spectral graph networks is to modify the operation (8.98) to,
u
[+1]
=
˜
Lu
[]
, (8.99)
where the modification from
L
to
˜
L
is the essence of the learned parameters in the model and
is further explained below. With this modification it is useful to use the spectral decomposition
of
L
and learn how to modify the eigenvalues of
L
effectively. This is analogous to learning
how to modify filters, as is done in Chapter
??
, yet unlike Chapter
??
, learning is on the
spectral domain (eigenvalues) and not on the time domain (convolutions).
Now looking at the spectral decomposition, since the graph is undirected,
L
, either in the
unnormalized or normalized form, is a symmetric matrix, and further it can be shown to be
19
Note that in Chapter
??
we actually do not discuss Fourier transforms, yet we point the reader there
for general context of signals and systems.
53
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
positive semidefinite. The spectral decomposition of L is,
L = U Λ U
, (8.100)
where
U R
r×r
has normalized eigenvectors of
L
as columns, and Λ
R
r×r
is a diagonal
matrix with corresponding eigenvalues, each of which is real and non-negative (due to being
positive semidefinite). With this spectral decomposition,
U
is an orthogonal matrix, namely
U
U
=
I
(the inverse of
U
is its transpose). Note that it is also customary to order the
eigenvalues in the diagonal of Λ in descending order (and the associated eigenvectors in the
columns of U are obviously ordered accordingly).
Now when we consider
(8.98)
and wish to modify it to
(8.99)
, we do so by transforming
the eigenvalues of
L
, appearing in the diagonal of Λ. For example with so called high pass
filtering we retain the high valued eigenvalues (above some threshold), while shrinking or
zeroing out the other eigenvalues. Similarly, low pass filtering works in the other direction,
retaining only low eigenvalues. In each such case, we may view the transformation of the
eigenvalues as some function
F
(
·
) which applied to the diagonal matrix Λ is denoted as
F
(Λ)
R
r×r
. With this we have the modified Laplacian matrix as
˜
L
=
UF
(Λ)
U
, and
thus (8.99) is represented as,
u
[+1]
= UF (Λ) U
u
[]
. (8.101)
Note that we may view
(8.101)
as first projecting
u
[]
onto the orthogonal eigenvector space
via the transformation
U
u
[]
, then applying individual (adjusted via
F
(
·
)) eigenvalues on
each coordinate of the basis via the left multiplication by the diagonal matrix
F
(Λ), and
finally, transforming back to the original basis via another left multiplication by U.
Having understood how the spectral decomposition can be used, we now return to the
general setup of graph neural networks where as before
h
[]
R
r×p
V
is the hidden state
matrix which is updated from layer
to layer
+ 1. Now also denote
h
[]
(|i)
as a vector in
R
r
which is the
i
-th column of this matrix. This vector represents the hidden state information
for each node in the network, based on the
i
-th hidden state feature at layer
. With this
notation, the update for this hidden state vector per feature
i
depends on all features in the
previous layer and is,
h
[+1]
(|i)
= S
p
V
X
k=1
U F
[]
(k,i)
U
h
[]
(|k)
+ b
[]
(i)
, for i = 1, . . . , p
V
. (8.102)
Here for each pair of features
k
and
i
,
F
[]
(k,i)
is an
r ×r
diagonal matrix of learned parameters
that we call spectral weights, and
b
[]
(i)
is a bias vector in
R
r
. As always,
S
(
·
) is a vector
activation function. For one such layer
, we can observe that the total number of learned
parameters for spectral weights is
r ×
(
p
V
)
2
and the total number of learned parameters for
the bias vectors is r × p
V
.
To get a better feel for the update equation
(8.102)
, compare it with
(8.101)
. In
(8.101)
we
see that
F
(Λ) is a filter applied to the eigenvalues whereas in
(8.102)
we represent a learned
F
(Λ) via
F
[]
(k,i)
. Similarly to the concept of channels in convolutional neural networks of
Chapter
??
, the summation over all updates
U F
[]
(k,i)
U
h
[]
(|k)
integrates all the features from
layer into the associated feature of layer + 1.
54
i
i
i
i
i
i
i
i
8.5 Graph Neural Networks
The story of spectral graph neural networks does not stop here because ideally one would
like to reduce the dimension of the learned parameters
F
[]
(k,i)
so that they do not depend
on the whole graph and are independent of the number of nodes in the graph
r
. The study
and application of spectral graph neural networks deals with such approaches and indeed
in certain cases it has been shown that spectral graph neural networks generalize well and
sometimes perform better than their convolutional graph neural networks counter parts.
The details are beyond our scope.
A different variant is graph attention networks where the aggregation step uses an attention
mechanism. Concepts of attention were heavily discussed in Chapter
??
in the context of
sequence models. Specifically, in Section
??
we introduced the general concept of attention
where attention weights are calculated as in
(??)
and are then used for linear combinations
of inputs in
(??)
. We then also used attention in the context of transformers as in Section
??
.
In the graph context, attention can be incorporated via the aggregation step. Specifically, we
can enhance the basic sum aggregation as in
(8.90)
, focusing here only on the case without
self loops, to be,
m
[]
(i)
=
X
j : v
j
∈N (v
i
)
α
[]
(i),j
h
[1]
(j)
. (8.103)
Here for node v
i
, and each neigbour v
j
N(v
i
), we have attention weight α
[]
(i),j
where
X
j : v
j
∈N (v
i
)
α
[]
(i),j
= 1.
Similarly to Chapter
??
, attention weights are calculated using a scoring mechanism via
an alignment function
s
:
R
p
V
× R
p
V
R
, where
s
(
h
[1]
(i
1
)
, h
[1]
(i
2
)
) measures the proximity
between the hidden states of nodes
v
i
1
and
v
i
2
at layer
1. The basic alignment function
is the inner product between
h
[1]
(i
1
)
and
h
[1]
(i
2
)
, yet more complex options with learned
parameters are also possible. One option with linear re-weighting is,
s(h
[1]
(i
1
)
, h
[1]
(i
2
)
) =
h
[1]
(i
1
)
W
[]
a

h
[1]
(i
2
)
W
[]
a
(8.104)
where
W
[]
a
R
p
V
×m
is a matrix of learned parameters for layer
with
m
set as some
dimension (recall here that our hidden state vectors are taken as rows).
Based on the alignment function, the attention weights are calculated for each node
v
i
. As
with all attention mechanisms, we use a softmax function over neighbours indices
j
to obtain
the attention weights,
α
[]
(i),j
=
e
s(h
[1]
(i)
,h
[1]
(j)
)
P
k : v
k
∈N (v
i
)
e
s(h
[1]
(i)
,h
[1]
(k)
)
,
which are then used in the aggregation step
(8.103)
. Once the message is computed, it can
be used in an update function as in (8.92).
Note that with our description of graph attention networks here, the learned parameters per
layer are
W
[]
a
for the alignment function
(8.104)
as well as
θ
[]
= (
W
[]
m
, W
[]
u
, b
[]
) used in the
update
(8.92)
. However, other options, typically with reduced parameters, reusing weight
matrices either across layers, or between the alignment function and the update equation,
55
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
are also possible. Also, similar to the transformer architecture, multi-head attention has also
been introduced and in certain graph neural network applications, such architectures are
very popular.
56
i
i
i
i
i
i
i
i
8.5 Graph Neural Networks
Notes and References
This chapter covered a broad range of specialized architectures and paradigms where each section
covers a major topic which could have in fact made a whole chapter. Hence in our notes and
references about the topics of this chapter we only summarize key references and developments in
each of the sub-fields. A further recent overarching text that we recommend is [
73
] with multiple
chapters, one per each of the topics covered here.
The field of generative modelling has multiple origins. Early models include hidden Markov models
and Gaussian mixture models with origins in the 1950’s and 1960’s; see chapters 11 and 17 of [
65
]
for background. Somewhat more recently, some authors consider the study of Boltzman machine
models introduced in the 1980’s in [
1
], and deep Boltzman machines in [
81
], as the initial meaningful
generative models in the context of deep learning. See also chapter 20 of [
25
] for an overview. A more
recent survey of generative models in machine learning is [
34
] and a comparison of deep generative
modelling approaches is in [12].
Up to 2014, while generative models were useful for some applications and certainly interesting,
in terms of images, they lacked the ability to create real life looking data. The big advance came
with the development of generative adversarial networks (GANs), in Goodfellow et al.’s work [
26
].
This opened up possibilities for creation of realistic looking images (and data) and is still a very
active topic. Variational autoencoders, initially introduced in [
52
], grew into multiple directions and
contemporary diffusion models such as [
39
], and those surveyed in [
101
], constitute the state of the
art in image generative modelling. As of the time of publishing of this book, diffusion models and
GANs still compete, with diffusion models generally able to produce more impressive images, while
GANs are much faster in production since they do not require multiple neural networks.
Ideas of variational autoencoders are rooted in modern developments of Bayesian statistics. See [
19
]
for an introductory general text on Bayesian statistics and [
91
] for an accessible review of the area.
Specifically, the variational Bayes methods, a well-known optimization-based approach in the field
of approximate Bayesian computation, captures the key ideas used in variational autoencoders. See
[
11
] and [
103
] for reviews of variational Bayes. This approach also falls in the realm of approximate
Bayesian computation and entails a method for approximating posterior distributions using simpler
surrogate distributions. See [
85
] for a collection of approximate Bayesian computation methods.
Specifically, for more details about variational autoencoders, see [53].
Our presentation of variational autoencoders was geared towards hierarchical Markovian variational
autoencoders of which diffusion models are a special case. Nevertheless, variational autoencoders
and their variants are interesting and useful in their own right. They have been applied to many
fields. In image processing, prediction of the trajectory of pixels of an image is tackled in [
95
] and
natural image modelling is in [
31
]. In the field of speech analysis, voice conversion is handled in
[
40
] and speech synthesis in [
3
]. In the area of text processing as in [
13
], reccurent neural network
based variational autoencoders for generating sentences are put forward and in [
41
], controlled
text generation is handled. Another field is graph-based data analysis as in [
54
] where learning on
graph-structured data is handled, and [
45
] which deals with molecular graph generation. As we
presented diffusion models as special cases of hierarchical variational autoencoders, the literature
on these models is also relevant. In particular, see [
76
] for an application in black box variational
inference and [87] for a variant called ladder variational autoencoder.
Diffusion models, initially introduced in [
86
], gained significant prominence following [
39
], which
showcased exceptional image synthesis results. These models were further improved with [
22
], where
for the first time the prolonged dominance of GANs was broken. For recent surveys of diffusion
models, refer to [
15
], [
21
], and [
101
]. In terms of applications in the realm of image processing,
diffusion models are utilized for tasks such as colorization, inpainting, uncropping, and restoration as
in [
78
]. Other image processing applications include super-resolution as in [
80
], and image editing as
in [
37
]. There is an extensive study on applications of diffusion models for text to image generation
such as for example the work in [
79
] which introduced Imagen. This system utilizes a transformer
based large language model which is used for understanding text combined with a diffusion model
used for image generation. The application of diffusion models extends to video data as well. Notable
contributions include [
35
] where an approach for long-duration video completions is put forward,
and [
38
] which introduced Imagen Video, a text-conditional video generation system based on a
cascade of video diffusion models.
57
i
i
i
i
i
i
i
i
8 Specialized Architectures and Paradigms
A related paradigm to variational autoencoders and diffusion models that we did not cover is
normalizing flows. This paradigm was first introduced in [
77
] for representing the posterior in
variational autoencoders. These generative models construct complex distributions by transforming
a distribution through a series of invertible mappings. See chapter 16 of [
73
] for an overview and
[71] for a recent survey.
As already mentioned, GANs were introduced in [
26
]. After the introduction of this paradigm,
multiple generalizations appeared. Particular early variants included C-GAN [
63
], and convolutional
versions of GANs as in [
75
]. The idea of NS-GAN was already introduced in [
26
]. The W-GAN
concept first appeared in [
6
] and was later developed in [
30
] where the gradient penalty approach was
introduced. See also [
2
] as well as the empirical comparison in [
58
] and the general GAN surveys [
29
]
and [42]. The ideas of AC-GAN were developed in [68] and the ideas of Info-GAN were developed
in [
16
]. The image to image paradigm in GANs is broad. A survey on this topic is [
5
] with initial
ideas in [
18
], [
43
], and [
96
]. Style-GAN was developed in [
50
], and further developments are in [
51
]
and [
49
]. Finally we mention a few general developments in GANs including sequence GAN from
[102], and EditGAN from [57].
Modern approaches to deep reinforcement learning are surveyed extensively in the texts [
23
] and
[
89
], whereas more dated accounts are [
10
] and [
46
]. At the more fundamental level, basics of Markov
decision processes are presented in [
8
], [
9
], and [
74
]. An expository overview of engineering control
theory is in [
4
]. The criterion for convergence of Q-learning as in our equation
(8.81)
is from [
98
].
The success of reinforcement learning in playing Atari video games is in [
64
], and later landmark
results in the game of Go are documented in [
84
]. Apart from games, reinforcement learning is
successfully used in several domains one of which is addressing hard combinatorial problems; see
[
60
] for a survey of such approaches. Nowadays, reinforcement learning is applied for fine tuning
large language models through human feedback as documented in [
70
] and [
100
]. Reinforcement
learning is also critical for some robotic tasks as described in the survey [
105
]. However, as one
would have thought initially that self driving cars are best handled as a system via reinforcement
learning, in practice other techniques have so far prevailed; see [
17
] for a general survey. A related
approach often used is imitation learning, see [
61
] for a survey of this technique in the context of
autonomous vehicles.
General texts about graph neural networks are [
59
], and [
33
], with recent survey papers in [
94
] and
[
99
] as well as a notable chapter in [
73
]. Early ideas in graph neural networks arose in [
88
] where a
concept called at the time the generalized recursive neuron was introduced. Further ideas of graph
neural networks were developed in [
27
] and [
82
]. These days graph neural networks are used in
multiple applications such as those discussed in [
55
], [
104
], and [
106
] among many others. There
are various techniques for graph embedding including DeepWalk [
72
], node2vec [
28
], GraphSAGE
[
32
], LINE (Large-scale Information Network Embedding) [
90
], and HOPE (High-order Proximity
preserved Embedding), [
69
]. For an introductory computer science overview of traditional graph
algorithms and data structures see [20].
General ideas of message passing schemes in graph neural networks were introduced in [
62
]. Ideas
of graph convolutional networks were developed in [
7
], [
24
], and [
67
]. Ideas of spectral convolutional
graph neural networks were developed in [
14
] and [
36
]. Graph attention networks were developed in
[
93
], motivated by the effective application of the attention mechanism to sequence models as in
[
92
]. Indeed many developments in deep learning cross architectures, domains, and sub-disciplines,
and the development of graph attention networks serve as one such example. We also close by
mentioning spatial-temporal graph neural networks which are designed to deal with dynamic graphs,
sometimes also with a spatial component. These models were introduced in [
56
] and [
83
], and are
further described in the recent review [44].
58
i
i
i
i
i
i
i
i
Bibliography
[1]
D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for boltzmann
machines. Cognitive Science, 1985.
[2]
J. Adler and S. Lunz. Banach Wasserstein GAN. Advances in Neural Information
Processing Systems, 2018.
[3]
K. Akuzawa, Y. Iwasawa, and Y. Matsuo. Expressive speech synthesis via modeling
expressions with variational autoencoder. arXiv:1804.02135, 2018.
[4] P. Albertos and I. Mareels. Feedback and control for everyone. Springer, 2010.
[5]
A. Alotaibi. Deep generative adversarial networks for image-to-image translation: A
review. Symmetry, 2020.
[6]
M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks.
In International Conference on Machine Learning, 2017.
[7]
J. Atwood and D. Towsley. Diffusion-convolutional neural networks. Advances in
neural information processing systems, 2016.
[8]
D. P. Bertsekas. Dynamic Programming and Optimal Control, Volume. II. Athena
Scientific, 3rd edition, 2007.
[9]
D. P. Bertsekas. Dynamic programming and optimal control: Volume I. Athena
Scientific, 2012.
[10] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific,
1996.
[11]
D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for
statisticians. J. Amer. Statist. Assoc., 2017.
[12]
S. Bond-Taylor, A. Leach, Y. Long, and C. G. Willcocks. Deep generative modelling: A
comparative review of vaes, gans, normalizing flows, energy-based and autoregressive
models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[13]
S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio.
Generating sentences from a continuous space. In 20th SIGNLL Conference on
Computational Natural Language Learning, CoNLL 2016, 2016.
[14]
J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally
connected networks on graphs. arXiv:1312.6203, 2013.
59
i
i
i
i
i
i
i
i
Bibliography
[15]
H. Cao, C. Tan, Z. Gao, Y. Xu, G. Chen, P. A. Heng, and S. Z. Li. A survey on
generative diffusion model. arXiv:2209.02646, 2022.
[16]
X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan:
Interpretable representation learning by information maximizing generative adversarial
nets. Advances in Neural Information Processing Systems, 2016.
[17]
P. S. Chib and P. Singh. Recent advancements in end-to-end autonomous driving
using deep learning: A survey. IEEE Transactions on Intelligent Vehicles, 2023.
[18]
Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo. Stargan: Unified generative
adversarial networks for multi-domain image-to-image translation. In Proceedings of
the IEEE conference on computer vision and pattern recognition, 2018.
[19] P. Congdon. Bayesian Statistical Modelling. John Wiley & Sons, 2007.
[20]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms.
MIT press, 2022.
[21]
F. A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah. Diffusion models in vision: A
survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[22]
P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances
in Neural Information Processing Systems, 2021.
[23]
V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau. An
introduction to deep reinforcement learning. Foundations and Trends in Machine
Learning, 2018.
[24]
J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message
passing for quantum chemistry. In International conference on machine learning, 2017.
[25] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[26]
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Infor-
mation Processing Systems, 2014.
[27]
M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains.
In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005.,
2005.
[28]
A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Pro-
ceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery
and data mining, 2016.
[29]
J. Gui, Z. Sun, Y. Wen, D. Tao, and J. Ye. A review on generative adversarial networks:
Algorithms, theory, and applications. IEEE Transactions on Knowledge and Data
Engineering, 2021.
60
i
i
i
i
i
i
i
i
Bibliography
[30]
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved
training of Wasserstein GANs. Advances in Neural Information Processing Systems,
2017.
[31]
I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville.
PixelVAE: A Latent Variable Model for Natural Images. In International Conference
on Learning Representations, 2016.
[32]
W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large
graphs. Advances in neural information processing systems, 2017.
[33]
W. L. Hamilton. Graph Representation Learning. Morgan & Claypool Publishers,
2020.
[34]
G. M. Harshvardhan, M. K. Gourisaria, M. Pandey, and S. S. Rautaray. A comprehen-
sive survey and analysis of generative models in machine learning. Computer Science
Review, 2020.
[35]
W. Harvey, S. Naderiparizi, V. Masrani, C. Weilbach, and F. Wood. Flexible diffusion
modeling of long videos. Advances in Neural Information Processing Systems, 2022.
[36]
M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured
data. arXiv:1506.05163, 2015.
[37]
A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or.
Prompt-to-prompt image editing with cross attention control. arXiv:2208.01626, 2022.
[38]
J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole,
M. Norouzi, D. J. Fleet, et al. Imagen video: High definition video generation with
diffusion models. arXiv:2210.02303, 2022.
[39]
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in
Neural Information Processing Systems, 2020.
[40]
C. C. Hsu, H. T. Hwang, Y. C. Wu, Y. Tsao, and H. M. Wang. Voice conversion from
unaligned corpora using variational autoencoding wasserstein generative adversarial
networks. Interspeech 2017, 2017.
[41]
Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled
generation of text. In International Conference on Machine Learning, 2017.
[42]
G. Iglesias, E. Talavera, and A. Díaz-Álvarez. A survey on GANs for computer vision:
Recent research, analysis and taxonomy. Computer Science Review, 2023.
[43]
P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional
adversarial networks. In Proceedings of the IEEE conference on computer vision and
pattern recognition, 2017.
[44]
G. Jin, Y. Liang, Y. Fang, Z. Shao, J. Huang, J. Zhang, and Y. Zheng. Spatio-temporal
graph neural networks for predictive learning in urban computing: A survey. IEEE
Transactions on Knowledge and Data Engineering, 2023.
61
i
i
i
i
i
i
i
i
Bibliography
[45]
W. Jin, R. Barzilay, and T. Jaakkola. Junction tree variational autoencoder for
molecular graph generation. In International Conference on Machine Learning, 2018.
[46]
L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey.
Journal of Artificial Intelligence Research, 4, 1996.
[47]
M. Kang, J. Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, and T. Park. Scaling
up GANs for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2023.
[48]
T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved
quality, stability, and variation. arXiv:1710.10196, 2017.
[49]
T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila. Training
generative adversarial networks with limited data. Advances in Neural Information
Processing Systems, 2020.
[50] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative
adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition, 2019.
[51]
T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and
improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, 2020.
[52]
D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv:1312.6114,
2013.
[53]
D. P. Kingma and M. Welling. An introduction to variational autoencoders. Founda-
tions and Trends® in Machine Learning, 2019.
[54]
T. N. Kipf and M. Welling. Variational graph auto-encoders. arXiv:1611.07308, 2016.
[55]
J. Kuehn, S. Abadie, B. Liquet, and V. Roeber. A deep learning super-resolution
model to speed up computations of coastal sea states. Applied Ocean Research, 2023.
[56]
Y. Li, R. Yu, C. Shahabi, and Y. Liu. Diffusion convolutional recurrent neural network:
Data-driven traffic forecasting. arXiv:1707.01926, 2017.
[57]
H. Ling, K. Kreis, D. Li, S. W. Kim, A. Torralba, and S. Fidler. Editgan: High-precision
semantic image editing. Advances in Neural Information Processing Systems, 2021.
[58]
M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are GANs created
equal? A large-scale study. Advances in Neural Information Processing Systems, 2018.
[59] Y. Ma and J. Tang. Deep Learning on Graphs. Cambridge University Press, 2021.
[60]
N. Mazyavkina, S. Sviridov, S. Ivanov, and E. Burnaev. Reinforcement learning for
combinatorial optimization: A survey. Computers & Operations Research, 2021.
62
i
i
i
i
i
i
i
i
Bibliography
[61]
L. Mero, D. Yi, M. Dianati, and A. Mouzakitis. A survey on imitation learning
techniques for end-to-end autonomous vehicles. IEEE Transactions on Intelligent
Transportation Systems, 2022.
[62]
A. Micheli. Neural network for graphs: A contextual constructive approach. IEEE
Transactions on Neural Networks, 2009.
[63]
M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv:1411.1784,
2014.
[64]
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and
M. Riedmiller. Playing atari with deep reinforcement learning. arXiv:1312.5602, 2013.
[65] K. P. Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.
[66]
A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and
M. Chen. Glide: Towards photorealistic image generation and editing with text-guided
diffusion models. arXiv:2112.10741, 2021.
[67]
M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for
graphs. In International conference on machine learning, 2016.
[68]
A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier
GANs. In International Conference on Machine Learning, 2017.
[69]
M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu. Asymmetric transitivity preserving
graph embedding. In Proceedings of the 22nd ACM SIGKDD international conference
on Knowledge discovery and data mining, 2016.
[70]
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang,
S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions
with human feedback. Advances in Neural Information Processing Systems, 2022.
[71]
G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan.
Normalizing flows for probabilistic modeling and inference. The Journal of Machine
Learning Research, 2021.
[72]
B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social repre-
sentations. In Proceedings of the 20th ACM SIGKDD international conference on
Knowledge discovery and data mining, 2014.
[73] S. J. D. Prince. Understanding Deep Learning. MIT Press, 2023.
[74]
M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming.
John Wiley & Sons, 2014.
[75]
A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep
convolutional generative adversarial networks. arXiv:1511.06434, 2015.
[76]
R. Ranganath, D. Tran, and D. Blei. Hierarchical variational models. In International
conference on machine learning. PMLR, 2016.
63
i
i
i
i
i
i
i
i
Bibliography
[77]
D. Rezende and S. Mohamed. Variational inference with normalizing flows. In
International Conference on Machine Learning. PMLR, 2015.
[78]
C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi.
Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference
Proceedings, 2022.
[79]
C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour,
R. G. Lopes, B. K. Ayan, T. Salimans, et al. Photorealistic text-to-image diffusion
models with deep language understanding. Advances in Neural Information Processing
Systems, 2022.
[80]
C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi. Image
super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2022.
[81]
R. Salakhutdinov and G. Hinton. Deep boltzmann machines. In Artificial Intelligence
and Statistics, 2009.
[82]
F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph
neural network model. IEEE transactions on neural networks, 2008.
[83]
Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson. Structured sequence modeling
with graph convolutional recurrent networks. In Neural Information Processing: 25th
International Conference, ICONIP, 2018, Proceedings, Part I 25, 2018.
[84]
D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche,
J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the
game of Go with deep neural networks and tree search. Nature, 2016.
[85]
S. A. Sisson, Y. Fan, and M. A. Beaumont, editors. Handbook of Approximate Bayesian
Computation. CRC Press, Boca Raton, FL, 2019.
[86]
J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsuper-
vised learning using nonequilibrium thermodynamics. In International Conference on
Machine Learning, 2015.
[87]
C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder
variational autoencoders. Advances in neural information processing systems, 2016.
[88]
A. Sperduti and A. Starita. Supervised neural networks for the classification of
structures. IEEE Transactions on Neural Networks, 1997.
[89]
R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press,
2018.
[90]
J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale information
network embedding. In Proceedings of the 24th international conference on world wide
web, 2015.
64
i
i
i
i
i
i
i
i
Bibliography
[91]
R. van de Schoot, S. Depaoli, R. King, B. Kramer, K. Märtens, M. G. Tadesse,
M. Vannucci, A. Gelman, D. Veen, J. Willemsen, and C. Yau. Bayesian Statistics and
Modelling. Nature Reviews Methods Primers, 2021.
[92]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser,
and I. Polosukhin. Attention is all you need. In Advances in neural information
processing systems, 2017.
[93]
P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph
Attention Networks. International Conference on Learning Representations (ICLR),
2018.
[94]
L. Waikhom and R. Patgiri. A survey of graph neural networks in various learning
paradigms: methods, applications, and challenges. Artificial Intelligence Review, 2023.
[95]
J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting
from static images using variational autoencoders. In Computer Vision–ECCV 2016:
14th European Conference, 2016, Proceedings, Part VII 14, 2016.
[96]
T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image
synthesis and semantic manipulation with conditional GANs. In Proceedings of the
IEEE conference on computer vision and pattern recognition, 2018.
[97]
Z. Wang, L. Zhao, and W. Xing. Stylediffusion: Controllable disentangled style transfer
via diffusion models. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, 2023.
[98] C. J. Watkins and P. Dayan. Q-learning. Machine learning, 1992.
[99]
Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on
graph neural networks. IEEE transactions on neural networks and learning systems,
2020.
[100]
A. Yang, B. Xiao, B. Wang, B. Zhang, C. Bian, C. Yin, C. Lv, D. Pan, D. Wang,
D. Yan, et al. Baichuan 2: Open large-scale language models. arXiv:2309.10305, 2023.
[101]
L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and
M. Yang. Diffusion models: A comprehensive survey of methods and applications.
ACM Computing Surveys, 2023.
[102]
L. Yu, W. Zhang, J. Wang, and Y. Yu. Seqgan: Sequence generative adversarial nets
with policy gradient. In Proceedings of the AAAI conference on artificial intelligence,
2017.
[103]
C. Zhang, J. Bütepage, H. Kjellström, and S. Mandt. Advances in variational inference.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[104]
M. Zhang and Y. Chen. Link prediction based on graph neural networks. Advances in
neural information processing systems, 2018.
65
i
i
i
i
i
i
i
i
Bibliography
[105]
W. Zhao, J. P. Queralta, and T. Westerlund. Sim-to-real transfer in deep reinforcement
learning for robotics: a survey. In 2020 IEEE symposium series on computational
intelligence (SSCI), 2020.
[106]
J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun. Graph
neural networks: A review of methods and applications. AI open, 2020.
66