Mathematical Engineering

of Deep Learning

Chapter 8

Benoit Liquet, Sarat Moka and Yoni Nazarathy

February 28, 2024

Contents

Preface 3

1 Introduction 1

1.1 The Age of Deep Learning ............................ 1

1.2 A Taste of Tasks and Architectures ....................... 7

1.3 Key Ingredients of Deep Learning ........................ 12

1.4 DATA, Data, data! ................................ 17

1.5 Deep Learning as a Mathematical Engineering Discipline ........... 20

1.6 Notation and Mathematical Background .................... 23

Notes and References .................................. 25

2 Principles of Machine Learning 27

2.1 Key Activities of Machine Learning ....................... 27

2.2 Supervised Learning ............................... 32

2.3 Linear Models at Our Core ........................... 39

2.4 Iterative Optimization Based Learning ..................... 48

2.5 Generalization, Regularization, and Validation ................ 52

2.6 A Taste of Unsupervised Learning ....................... 62

Notes and References .................................. 72

3 Simple Neural Networks 75

3.1 Logistic Regression in Statistics ......................... 75

3.2 Logistic Regression as a Shallow Neural Network ............... 82

3.3 Multi-class Problems with Softmax ....................... 86

3.4 Beyond Linear Decision Boundaries ....................... 95

3.5 Shallow Autoencoders .............................. 99

Notes and References .................................. 111

4 Optimization Algorithms 113

4.1 Formulation of Optimization .......................... 113

4.2 Optimization in the Context of Deep Learning ................ 120

4.3 Adaptive Optimization with ADAM ...................... 128

4.4 Automatic Diﬀerentiation ............................ 135

4.5 Additional Techniques for First-Order Methods ................ 143

4.6 Concepts of Second-Order Methods ....................... 152

Notes and References .................................. 164

5 Feedforward Deep Networks 167

5.1 The General Fully Connected Architecture ................... 167

5.2 The Expressive Power of Neural Networks ................... 173

5.3 Activation Function Alternatives ........................ 180

5.4 The Backpropagation Algorithm ........................ 184

5.5 Weight Initialization ............................... 192


5.6 Batch Normalization ............................... 194

5.7 Mitigating Overﬁtting with Dropout and Regularization ........... 197

Notes and References .................................. 203

6 Convolutional Neural Networks 205

6.1 Overview of Convolutional Neural Networks .................. 205

6.2 The Convolution Operation ........................... 209

6.3 Building a Convolutional Layer ......................... 216

6.4 Building a Convolutional Neural Network ................... 226

6.5 Inception, ResNets, and Other Landmark Architectures ........... 236

6.6 Beyond Classiﬁcation ............................... 240

Notes and References .................................. 247

7 Sequence Models 249

7.1 Overview of Models and Activities for Sequence Data ............. 249

7.2 Basic Recurrent Neural Networks ........................ 255

7.3 Generalizations and Modiﬁcations to RNNs .................. 265

7.4 Encoders Decoders and the Attention Mechanism ............... 271

7.5 Transformers ................................... 279

Notes and References .................................. 294

8 Specialized Architectures and Paradigms 297

8.1 Generative Modelling Principles ......................... 297

8.2 Diﬀusion Models ................................. 306

8.3 Generative Adversarial Networks ........................ 315

8.4 Reinforcement Learning ............................. 328

8.5 Graph Neural Networks ............................. 338

Notes and References .................................. 353

Epilogue 355

A Some Multivariable Calculus 357

A.1 Vectors and Functions in R ........................... 357

A.2 Derivatives .................................... 359

A.3 The Multivariable Chain Rule .......................... 362

A.4 Taylor’s Theorem ................................. 364

B Cross Entropy and Other Expectations with Logarithms 367

B.1 Divergences and Entropies ............................ 367

B.2 Computations for Multivariate Normal Distributions ............. 369

Bibliography 399

Index 401

8 Specialized Architectures and

Paradigms

In each of the chapters 5, 6, and 7, we presented one concrete deep learning paradigm, namely

feedforward networks, convolutional neural networks, and sequence models respectively. Such

models are useful in their own right, yet in the world of deep learning one often integrates

them within more complex architectures for speciﬁc activities. For example the convolutional

neural networks of Chapter 6 may be interconnected with sequence models of Chapter 7 for

applications that involve both images and text. In addition, other specialized architectures

and paradigms have also emerged where in each case, non-trivial ideas are employed to

create powerful models. In the current chapter we present such ideas emerging from different

domains, yet all using deep neural networks. Some of these domains include generative

modelling, where we focus on diﬀusion models and generative adversarial networks, after an

overview of variational autoencoders. Other domains are in the area of automatic control and

decision making where we present concepts of reinforcement learning. Finally, we explore the

domain of graph neural networks, an area that is proving to be ever so useful for complex

problems that can be represented with graph structures. Without space constraints, each

of these topics deserves its own chapter or a sequence of chapters, yet within this single

chapter we hope that the reader gains an overarching view.

In Section 8.1 we introduce generative modelling principles. For this we introduce principles

of variational autoencoders which are the basis for diﬀusion models, the topic of Section 8.2.

In Section 8.3, we describe the ideas of generative adversarial networks, which also have

some underpinnings in game theory. These two variants, namely diffusion models of Section 8.2

and generative adversarial networks of Section 8.3, have become the most popular means

of generative modelling to date. We continue in Section 8.4 where we outline principles

of reinforcement learning. Towards that end we ﬁrst deﬁne Markov decision processes

and discuss principles of optimal control, and then tie the ideas to deep reinforcement

learning. Note that while the application domain of reinforcement learning diﬀers from the

generative modelling domain of the earlier sections, ideas of Markov chains used in diﬀusion

models reappear in reinforcement learning of Section 8.4. Finally, graph neural networks

are introduced in Section 8.5. As we see, graph neural networks generalize the convolutional

neural networks of Chapter 6 while allowing us to have general graph structures within the

data in contrast to simple spatial connections.

8.1 Generative Modelling Principles

The ﬁeld of generative modelling deals with algorithms and models for creating (generating)

data such as fake images, generated text, or similar. In this space we often think probabilis-

tically and assume that the data has an underlying probability distribution. Our goal is

then to train models that generate random yet realistic data from that distribution, with

or without explicitly capturing the form of the distribution. Generative modelling can be


applied both in the supervised case (features $x$ and labels $y$) and the unsupervised case (no labels $y$). In Chapter 2 towards the end of Section 2.2 we mentioned a few names of supervised learning generative modelling approaches such as naive Bayes and others. In contrast, now we only focus on unsupervised learning. In such a case the observed data $\mathcal{D}$ is composed of $\{x^{(1)}, \ldots, x^{(n)}\}$ and we assume that all $x^{(i)}$ are distributed according to a probability distribution, which we simply denote as $p(x)$. Note that in general $x^{(i)}$ is a high dimensional object such as a high resolution color image, and hence $p(x)$ is a complicated distribution.

There are many generative approaches, yet our focus in this chapter is on two approaches

that have become very popular due to their ability to generate data that appears realistic.

One approach that has recently become popular is the diﬀusion model approach, and this

is the focus of Section 8.2. Another approach is the generative adversarial network (GAN)

approach which is the focus of Section 8.3. The ideas of the diffusion approach require

understanding of variational autoencoders. For this purpose, in the bulk of this section, we

introduce variational autoencoders, a class of generative models that is also interesting and

useful in their own right.

In Figure 8.1 we present a schematic of both the GAN approach and the diﬀusion approach.

In both cases, an underlying idea is to ﬁrst generate random noise. The space of this noise is

typically called the latent space, and it has a very simple distribution, denoted $p(z)$, for example a multivariate standard normal. In both approaches, the sample $z$ is processed as input to a model whose output is a point $x$, which is approximately distributed according to $p(x)$ and hence "looks realistic". In the GAN case, one often uses $z$ which is of lower dimension than $x$; for example $x$ may be a $3 \times 200 \times 200$ dimensional color image while $z$ may be a vector of length 100. In contrast, in the diffusion models case, the latent variable $z$ has the same dimension as the target sample $x$.

Both the diﬀusion approach and the GAN approach embody the conditional distribution

of the data given the noise, denoted as $p(x \mid z)$. The GAN approach learns an approximate algorithm for sampling from $p(x \mid z)$ with a so-called generator network, without explicitly learning $p(x \mid z)$. In contrast, diffusion models are probabilistic models which approximately learn $p(x \mid z)$ within a so-called decoder. One additional difference between the approaches is that GANs generate $x$ from $z$ in one shot, i.e., via one application of the generator network.

In contrast diﬀusion models iterate over multiple steps of the decoder, starting with the

latent noise variable and eventually reaching the target output; see also Figure 8.4.

Both the diﬀusion decoder and the GAN generator use deep neural networks with learned

parameters. As illustrated in Figure 8.1, these parameters are learned by training the models

using the training data $\mathcal{D}$. In both cases an auxiliary network (not illustrated in Figure 8.1)

also plays a part in training. In GANs this auxiliary network is called the discriminator and

to train the generator we train the discriminator network in parallel. For diﬀusion models

the auxiliary network is called an encoder and in this case, it is ﬁxed in advance and has no

learned parameters. More details are in Section 8.3 for GANs and Section 8.2 for diﬀusion

models. We begin the study of generative models with variational autoencoders which lay

down the foundations for understanding diﬀusion models.

Observe that in the context of this chapter we use $p(\cdot)$ in multiple ways, where the actual distribution used should be inferred from the argument of the function. For example $p(x)$ is the distribution of the data while $p(z)$ is the distribution of the latent space.


2.084

0.412

0.84

0.232

0.424

0.173

0.259

0.985

1.808

0.112

GAN Generator

has learned to sample from p(x|z)

Diusion Decoder

has learned p(x|z)

Figure 8.1: Generative adversarial networks (GANs) and diffusion models are different approaches for generative models. Both use a latent space $z$ and are able to sample from the conditional distribution $p(x \mid z)$ to generate $x$. For image generation, GANs typically have a smaller dimensional $z$ while diffusion models use $z$ of the same dimension as the image. Both cases are trained using data $\mathcal{D}$ where the resulting model in GANs is called a generator, and the resulting model in diffusion models is called a decoder.

Variational Autoencoders

In Section 3.5 we introduced basics of autoencoders within the context of shallow autoen-

coders. We now focus on variational autoencoders which are probabilistic enhancements of

autoencoders. We ﬁrst focus on a statistical perspective of variational autoencoders and

then see a general framework for generating data from p(x) after learning it.


Like many statistical models, a variational autoencoder is a parametric model where we assume some parameters $\theta$ determine $p(x)$ and hence denote the distribution as $p_\theta(x)$. By learning $\theta$ we can learn the distribution and then also generate samples from $p_\theta(x)$. A key principle for estimation of $\theta$ is maximum likelihood estimation (MLE). In Section 3.1, we briefly introduced this commonly used approach in the context of logistic regression.

With the maximum likelihood approach, similarly to (3.10) of Chapter 3, our estimate of $\theta$ for the data $\mathcal{D}$ is

$$\hat{\theta}_{\text{MLE}} := \operatorname*{argmax}_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \log p_\theta\big(x^{(i)}\big), \tag{8.1}$$

where the optimized quantity (barring the $1/n$ term) is the log-likelihood under the assumption of independent and identically distributed elements of $\mathcal{D}$. In practice, solving (8.1) requires information about the expression or form of $p_\theta(x)$.
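As a minimal illustration of (8.1), the sketch below assumes a toy one-dimensional family $p_\theta = N(\theta, 1)$ (our choice, not the variational autoencoder model) and maximizes the averaged log-likelihood over a grid. For this family the maximizer is known to be the sample mean, which gives a simple check.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)  # toy dataset D (assumed)

def avg_log_likelihood(theta, x):
    """(1/n) * sum_i log p_theta(x_i) for the toy family p_theta = N(theta, 1)."""
    return np.mean(-0.5 * (x - theta) ** 2 - 0.5 * np.log(2 * np.pi))

# Crude maximization of (8.1) over a grid of candidate theta values.
grid = np.linspace(0.0, 4.0, 4001)
values = np.array([avg_log_likelihood(t, data) for t in grid])
theta_mle = grid[np.argmax(values)]

print(theta_mle, data.mean())  # the two values agree up to the grid resolution
```

A grid search is used only to keep the example transparent; in deep learning settings the maximization is carried out with gradient based methods.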

In variational autoencoders, as alluded to at the start of the section, an unseen latent variable $z$ plays a key role. More precisely we suppose that $p_\theta(x)$ is coupled with $z$ which has distribution $p(z)$. The distribution $p(z)$ is assumed to be simple and does not depend on the parameters $\theta$. The relationship involving $z$ and $x$ can then be represented as,

$$p_\theta(x, z) = p_\theta(x \mid z)\, p(z). \tag{8.2}$$

Hence in summary, both the joint distribution $p_\theta(x, z)$ and the conditional distribution $p_\theta(x \mid z)$ are parameterized by $\theta$, but $p(z)$ is not parameterized by $\theta$.

We can use (8.2) to obtain $p_\theta(x)$ by marginalizing the joint distribution over $z$ as,

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz. \tag{8.3}$$

Such a representation is helpful in expressing complex $p_\theta(x)$ using relatively simple expressions for $p_\theta(x \mid z)$ and $p(z)$. However, even in this setting, the integral (8.3) is intractable and hence indirect optimization using ELBO, defined and described in the sequel, is used.
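To make the marginalization (8.3) concrete, the following sketch estimates $p_\theta(x)$ by naive Monte Carlo averaging of $p_\theta(x \mid z)$ over latent samples. It assumes a toy linear mean function $Wz + b$ with covariance $\sigma^2 I$ (the names `W`, `b` and `sigma2` are hypothetical), for which the integral happens to have the closed form $N(x;\, b,\, WW^\top + \sigma^2 I)$, so the estimate can be checked. For a neural network mean function no such closed form exists, and in high dimensions this naive estimator needs impractically many samples.

```python
import numpy as np

def log_normal_pdf(x, mean, cov):
    """Log density of the multivariate normal N(mean, cov) evaluated at x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = d @ np.linalg.solve(cov, d)
    return -0.5 * (quad + logdet + len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(1)
W = np.array([[1.0], [-0.5]])   # hypothetical decoder weights (p = 2, m = 1)
b = np.array([0.3, 0.1])        # hypothetical decoder bias
sigma2 = 0.5                    # fixed variance of p(x | z)
x = np.array([0.8, -0.2])       # point at which p(x) is evaluated

# Monte Carlo version of (8.3): average p(x | z) over z ~ N(0, I).
z = rng.standard_normal((200_000, 1))
means = z @ W.T + b                                   # one mean per latent sample
sq_dist = np.sum((x - means) ** 2, axis=1)
cond_density = np.exp(-0.5 * sq_dist / sigma2) / (2 * np.pi * sigma2)  # p = 2
p_mc = cond_density.mean()

# Closed form available only because the mean function is linear.
p_exact = np.exp(log_normal_pdf(x, b, W @ W.T + sigma2 * np.eye(2)))
print(p_mc, p_exact)
```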

To parameterize $p_\theta(x \mid z)$ and describe $p(z)$ we make use of multivariate normal distributions where the probability density of a random vector is represented via $N(\,\cdot\,;\, \mu, \Sigma)$ with $\mu$ denoting the mean vector and $\Sigma$ denoting the covariance matrix. A particular simple case is the standard multivariate normal where $\mu = 0$ (zero vector) and $\Sigma = I$ (identity matrix). A slightly more general case is setting $\Sigma = \sigma^2 I$, where the positive scalar $\sigma^2$ determines the variance shared among all coordinates.

We sometimes reparametrize a covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$ via

$$\Sigma = \tilde{\Sigma}\, \tilde{\Sigma}^\top, \quad \text{where} \quad \tilde{\Sigma} \in \mathbb{R}^{p \times p}. \tag{8.4}$$

Covariance matrices are always positive semidefinite, and characterizing this constraint on the individual entries of the matrix can be difficult. In contrast, with reparameterizations to $\tilde{\Sigma}$, one may end up with an unconstrained matrix $\tilde{\Sigma}$ which is easier to work with.

For example $\tilde{\Sigma}$ in $\Sigma = \tilde{\Sigma}\, \tilde{\Sigma}^\top$ can be obtained via a Cholesky factorization in which case $\tilde{\Sigma}$ is an unconstrained lower triangular matrix. Other factorizations such as the spectral decomposition (or singular value decomposition) can also be used.
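The factorization in (8.4) can be sketched numerically as follows. The example assumes NumPy and an arbitrary random matrix `A` (a hypothetical stand-in for learned entries): any unconstrained square matrix yields a valid covariance matrix via the product with its transpose, and conversely a Cholesky factor recovers a lower triangular factor of a given positive definite covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 3

# Direction 1: an unconstrained matrix always yields a valid covariance matrix.
A = rng.standard_normal((p, p))        # arbitrary entries, no constraints imposed
Sigma = A @ A.T                        # positive semidefinite by construction
eigvals = np.linalg.eigvalsh(Sigma)    # all eigenvalues are non-negative

# Direction 2: a Cholesky factor of Sigma is a lower triangular factor as in (8.4).
L = np.linalg.cholesky(Sigma)          # requires Sigma to be positive definite
recon = L @ L.T                        # reproduces Sigma up to rounding

print(eigvals)
print(np.max(np.abs(recon - Sigma)))
```

Note that the factor is not unique: `A` and `L` differ in general, yet both reproduce the same `Sigma`.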


Starting with $p_\theta(x \mid z)$ we assume that this conditional distribution is multivariate normal where the mean vector and covariance matrix are functions of $z$ that are both parameterized by $\theta$. In particular, keeping in mind that $m$ is the dimension of $z$, and $p$ is the dimension of $x$, we assume that we have learned functions,

$$\mu_\theta : \mathbb{R}^m \to \mathbb{R}^p \quad \text{and} \quad \Sigma_\theta : \mathbb{R}^m \to \mathbb{R}^{p \times p}, \tag{8.5}$$

each parameterized by $\theta$. With these, we have the conditional distribution set as

$$p_\theta(x \mid z) = N\big(x \,;\, \mu_\theta(z),\, \Sigma_\theta(z)\big). \tag{8.6}$$

In this case with (8.3) and (8.6), $p_\theta(x)$ is a Gaussian mixture model with mixture components $p_\theta(x \mid z)$ indexed by the values of $z$, and corresponding mixture weights provided by $p(z)$. It is well known that any distribution can essentially be closely approximated by a Gaussian mixture model if the mixture components and mixture weights are properly selected.

While the general form of Gaussian mixture models is very versatile, in variational autoen-

coders we make some simplifying assumptions. First and most importantly we assume that

the distribution of the latent variable $z$ is multivariate standard normal. Namely, the mixture weights are,

$$p(z) = N(z \,;\, 0, I). \tag{8.7}$$

Further, it is common to reduce the complexity of the mixture components (8.6) and assume that the covariance function is simply $\sigma^2 I$ where $\sigma^2$ is a pre-determined (not learned) hyper-parameter. Thus, (8.6) is reduced to

$$p_\theta(x \mid z) = N\big(x \,;\, \mu_\theta(z),\, \sigma^2 I\big), \tag{8.8}$$

and now (8.8) together with (8.3) implies that $\theta$ parameterizes the distribution of $x$ only via the mean function $\mu_\theta(\cdot)$. In the context of deep learning, this mean function is taken as a neural network with $\theta$ representing the weights and biases. However in the general framework of variational autoencoders the mean function $\mu_\theta(\cdot)$ could be any model and not necessarily a deep neural network.

With the model defined, suppose now that based on data $\mathcal{D}$ we manage to approximate the maximum likelihood estimate (8.1) and learn $\theta$ for $p_\theta(x)$. We can now use the model as a generative model similar to the spirit of Figure 8.1. In particular to generate a new random data sample $x^*$, we first generate a random latent sample $z^*$ from the standard multivariate normal distribution. We then compute $\mu_\theta(z^*)$ using the learned model, and then generate $x^*$ using (8.8) where the mean taken is $\mu_\theta(z^*)$. In this sense, every random $z^*$ yields its own $\mu_\theta(z^*)$ which in turn yields a random $x^*$. This generative process can be illustrated as follows:

$$z^* \;\xrightarrow{\;\mu_\theta(\cdot)\;}\; \mu_\theta(z^*) \;\xrightarrow{\;p_\theta(x \,\mid\, z^*)\;}\; x^*. \tag{8.9}$$
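The two-stage sampling in (8.9) can be sketched as below. This is only an illustration under stated assumptions: the weights `W1`, `b1`, `W2`, `b2` are random placeholders standing in for learned parameters $\theta$, the one-hidden-layer `mu_theta` network is a hypothetical choice of mean function, and the covariance is the fixed $\sigma^2 I$ of (8.8).

```python
import numpy as np

rng = np.random.default_rng(3)
m, p, sigma2 = 2, 4, 0.1   # latent dimension, data dimension, fixed variance

# Placeholder "learned" parameters theta of a small mean network (assumed).
W1, b1 = rng.standard_normal((8, m)), np.zeros(8)
W2, b2 = rng.standard_normal((p, 8)), np.zeros(p)

def mu_theta(z):
    """Hypothetical mean function: one hidden tanh layer mapping R^m to R^p."""
    return W2 @ np.tanh(W1 @ z + b1) + b2

def generate():
    """One pass through (8.9): z* -> mu_theta(z*) -> x*."""
    z_star = rng.standard_normal(m)                            # z* ~ N(0, I)
    mean = mu_theta(z_star)                                    # mean of p(x | z*)
    x_star = mean + np.sqrt(sigma2) * rng.standard_normal(p)   # x* ~ N(mean, sigma2 I)
    return z_star, mean, x_star

z_star, mean, x_star = generate()
print(z_star, x_star)
```

With untrained placeholder weights the output is of course not realistic data; the point is only the order of the sampling steps.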

Note that the diffusion models alluded to in Figure 8.1 are a variant of variational autoencoders and we cover the details of this powerful class of generative models in Section 8.2.

In case $x$ is an image we vectorize it and assume the dimension is $p$.

A Gaussian mixture model is a probability mixture model where a collection of Gaussian distributions are "weighted" to create a new probability distribution. As with any mixture model, it is composed of mixture components, and mixture weights, both of which appear under the integral sign as in (8.3).

Mathematically this property can be phrased as the fact that Gaussian mixture models are dense in the space of probability distributions.


As already mentioned, variational autoencoders are related to the standard autoencoders

presented in Section 3.5, yet standard autoencoders are deterministic while variational

autoencoders are probabilistic. It is this difference that allows variational autoencoders

to be used as generative models while standard autoencoders on their own cannot. To see

this distinction, assume we were to try and carry out a generative procedure similar to (8.9) with a decoder trained on a standard autoencoder. This can be represented as follows:

$$z^* \;\xrightarrow{\;\text{Decoder}\;}\; x^*,$$

where again we would sample $z^*$ from some predetermined distribution. However the standard autoencoder on its own does not capture the distribution in the latent space where meaningful samples of $z$ are present. To see this, consider for example Figure 3.14 in Chapter 3, and observe that $z^*$ would be some random point in the latent space which is then mapped to a random output $x^*$ via the decoder. With this, there is no guarantee of getting any realistic $x^*$ because the model does not capture the region of interest in the latent space. Hence standard autoencoders, while useful for other purposes as discussed in Section 3.5, are not useful as generative models in their own right.

The Encoder-Decoder Architecture for Variational Autoencoders

As alluded to above, a variational autoencoder has both an encoder and a decoder. The main

object used for generative models, $p_\theta(x \mid z)$, is represented via the decoder and as described above this distribution is parameterized by the mean function $\mu_\theta(\cdot)$ as well as sometimes by the covariance function $\Sigma_\theta(\cdot)$. However, for learning the parameters $\theta$, we require the full (variational) encoder-decoder architecture as illustrated in Figure 8.2.

[Figure 8.2 graphic. Labels: Training; Production; Data sample $x$; Encoder $q_\phi(z \mid x)$; Latent variable $z^*$; Decoder $p_\theta(x \mid z^*)$; Generated data sample $x^*$.]

Figure 8.2: The variational autoencoder architecture comprised of an encoder $q_\phi(z \mid x)$ and a decoder $p_\theta(x \mid z)$.

Like the decoder, the encoder can be a neural network parameterized by weights and biases denoted as $\phi$. Also like the decoder, the encoder is probabilistic in the sense that it describes a probability distribution, denoted as $q_\phi(z \mid x)$ and known as the variational posterior of the latent variable, that provides a distribution of the latent variable $z$ given the data sample $x$. That is, the encoder transforms an input data sample $x \in \mathbb{R}^p$ from $\mathcal{D}$ to a latent variable


$z \in \mathbb{R}^m$. Like the decoder, the encoder has a mean function and a covariance function which we denote respectively as,

$$\mu_\phi : \mathbb{R}^p \to \mathbb{R}^m \quad \text{and} \quad \Sigma_\phi : \mathbb{R}^p \to \mathbb{R}^{m \times m}, \tag{8.10}$$

and the conditional distribution of the latent variable $z$ given a data point $x$ is assumed to be normally distributed. Namely,

$$q_\phi(z \mid x) = N\big(z \,;\, \mu_\phi(x),\, \Sigma_\phi(x)\big). \tag{8.11}$$

Hence compare the encoder's (8.10) and (8.11) with the decoder's (8.5) and (8.6) respectively.

Taking all $x \in \mathcal{D}$ under consideration, the latent variable sample marginal distribution is given by,

$$q_{\phi,\mathcal{D}}(z) = \frac{1}{n} \sum_{x \in \mathcal{D}} q_\phi(z \mid x). \tag{8.12}$$

Ideally, like (8.7), we would like $q_{\phi,\mathcal{D}}(z)$ to be a standard normal distribution in which case it no longer depends on $\phi$ and $\mathcal{D}$. This is then set as a training goal when jointly training the encoder (learning $\phi$) and decoder (learning $\theta$) in a variational autoencoder. To achieve such a goal, we aim to minimize a loss function of the general form,

$$C(\phi, \theta \,;\, \mathcal{D}) = C_{\text{PriorMatching}}(\phi) + C_{\text{Reconstruction}}(\phi, \theta). \tag{8.13}$$

In general this loss depends on the data $\mathcal{D}$, yet for simplicity we omit this relationship for the two terms on the right hand side. Minimization of the first term, $C_{\text{PriorMatching}}(\phi)$, aims to set the latent distribution (8.12) to be as close as possible to a standard normal. It is called prior matching because when treating the variational autoencoder as a Bayesian inference model, the latent distribution can be viewed as a prior distribution. Minimization of the second term, $C_{\text{Reconstruction}}(\phi, \theta)$, aims to capture maximum likelihood estimation with respect to both encoder and decoder parameters and hence achieve optimal reconstruction of the output distribution. Details of implementing these loss functions are below, after we introduce the concept of the evidence lower bound.

Relations to Maximal Likelihood and ELBO

Our aim with training variational autoencoders is that minimization of (8.13) will act as a means for maximum likelihood estimation as in (8.1). In our exposition here we focus on a simple stochastic gradient descent setting and thus we consider a single datapoint $x \in \mathcal{D}$. Extending to multiple datapoints, or mini-batches, is straightforward. Our goal is to optimize,

$$\max_{\theta} \; \log p_\theta(x). \tag{8.14}$$

For this goal, let us decompose $p_\theta(x)$ where in our decomposition, we use the encoder distribution $q_\phi(z \mid x)$ as well as distributions associated with the joint distribution of the decoder from (8.2). Namely we use the joint distribution $p_\theta(x, z)$, as well as $p_\theta(z \mid x)$ which uses the decoder to describe the distribution of the latent variable $z$ given $x$. This distribution can be obtained via,

$$p_\theta(z \mid x) = \frac{p_\theta(x, z)}{p_\theta(x)}. \tag{8.15}$$


Note that $p_\theta(x, z)$ and $p_\theta(z \mid x)$ are not operationally used (in learning or inference algorithms) but rather only appear theoretically for the decomposition. A key quantity in the decomposition is called the evidence lower bound (ELBO), and is defined as,

$$\text{ELBO}(\theta, \phi \,;\, x) = \int q_\phi(z \mid x) \log \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \, dz. \tag{8.16}$$

Note that ELBO is an expectation of $\log \frac{p_\theta(x, Z)}{q_\phi(Z \mid x)}$ where $Z$ is distributed according to $q_\phi(\cdot \mid x)$.

With ELBO defined, the decomposition of the log-likelihood $\log p_\theta(x)$ is,

$$\log p_\theta(x) = \text{ELBO}(\theta, \phi \,;\, x) + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big), \tag{8.17}$$

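where $D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$ denotes the KL-divergence (see (B.6) in Appendix B) which is a measure of how $q_\phi(z \mid x)$ is different from $p_\theta(z \mid x)$. Note that the KL-divergence is always non-negative, and equal to zero if and only if both distributions are identical. To see (8.17) observe that,

$$\underbrace{\int q_\phi(z \mid x) \log \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \, dz}_{\text{ELBO}(\theta, \phi \,;\, x)} + \underbrace{\int q_\phi(z \mid x) \log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)} \, dz}_{D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x))} = \int q_\phi(z \mid x) \log \frac{p_\theta(x, z)}{p_\theta(z \mid x)} \, dz = \log p_\theta(x),$$

where moving from left to right, in the first step we combine the log terms and in the second step we use (8.15) together with the fact that $q_\phi(z \mid x)$ integrates to one.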

A consequence of the decomposition (8.17) as well as the non-negativity of the KL-divergence is that ELBO is a lower bound on the log-likelihood. Namely,

$$\log p_\theta(x) \ge \text{ELBO}(\theta, \phi \,;\, x), \tag{8.18}$$

hence the name "lower bound" in ELBO. Note that the log-likelihood is sometimes called the evidence, and hence the term "evidence" in ELBO. Equality holds in (8.18) if and only if $q_\phi(z \mid x)$ and $p_\theta(z \mid x)$ are the same.

Now let us assume that the encoder model is flexible enough to yield any mean and covariance function as in (8.10). With this assumption on the flexibility of the encoder, there exists some $\phi$ such that $q_\phi(z \mid x)$ is equal to $p_\theta(z \mid x)$. Since $\log p_\theta(x)$ does not depend on $\phi$ we have from (8.18) and (8.17) that such a $\phi$ maximizes the evidence lower bound. Namely,

$$\log p_\theta(x) = \max_{\phi} \; \text{ELBO}(\theta, \phi \,;\, x). \tag{8.19}$$

With such a representation of $\log p_\theta(x)$ for any $\theta$, we can also maximize (8.19) over $\theta$ to obtain,

$$\max_{\theta} \; \log p_\theta(x) = \max_{\theta,\, \phi} \; \text{ELBO}(\theta, \phi \,;\, x). \tag{8.20}$$

We thus see that the maximum likelihood estimate (8.1), when constrained to a single $x \in \mathcal{D}$, can be obtained by maximization of ELBO in terms of both the encoder parameters $\phi$ and the decoder parameters $\theta$. This general idea of indirect likelihood optimization via maximization of ELBO (with the additional $\phi$ set of parameters for the encoder) is the crux of training variational autoencoders. The importance of this approach is that approximate maximization of ELBO is computationally feasible in contrast to direct maximum likelihood estimation, where the integration in (8.3) is a computational barrier.


Details of the Loss Function

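We now see how minimization of the encoder-decoder loss function (8.13) can be constructed for approximate maximization of ELBO. For this let us expand the expression in (8.16) as,

$$\text{ELBO}(\theta, \phi \,;\, x) = \int q_\phi(z \mid x) \log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)} \, dz = \underbrace{\int q_\phi(z \mid x) \log p_\theta(x \mid z) \, dz}_{\mathbb{E} \log p_\theta(x \mid Z)} - \underbrace{\int q_\phi(z \mid x) \log \frac{q_\phi(z \mid x)}{p(z)} \, dz}_{D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))}. \tag{8.21}$$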

In the first equality we used the representation of the joint distribution $p_\theta(x, z)$ from (8.2). Then in the second equality we expand the logarithm. The resulting first term is an expectation of the function $\log p_\theta(x \mid Z)$ when $Z$ is distributed according to $q_\phi(\cdot \mid x)$. For the second term, keeping in mind that $p(z)$ as in (8.7) is a parameter-free multivariate standard normal, we have a KL-divergence that measures how close the encoder distribution is to the standard normal. Keep in mind that this is a different application of the KL-divergence to the one used for lower bounding the evidence in (8.17).

With the ELBO expression present, we can now fill in the details for the terms in the loss expression (8.13), such that minimization of this loss works towards maximization of ELBO. Our focus is on a stochastic gradient descent approach as in Section 4.2 where at any gradient descent step, instead of considering the whole data $\mathcal{D}$, we consider a single arbitrary $x \in \mathcal{D}$. In such a case, the loss components on the right hand side of (8.13) can be defined based on a single data sample $x \in \mathcal{D}$ where, as shown below, for the second component we also generate a random latent variable $z^* \in \mathbb{R}^m$ from (8.11).

The idea of minimizing the first term of (8.13), $C_{\text{PriorMatching}}(\phi)$, is to drive the parameters $\phi$ such that $q_{\phi,\mathcal{D}}(z)$ of (8.12) is approximately a multivariate standard normal as in (8.7). For this we set,

$$C_{\text{PriorMatching}}(\phi) = \mu_\phi(x)^\top \mu_\phi(x) - \log \det\big(\Sigma_\phi(x)\big) + \operatorname{tr}\big(\Sigma_\phi(x)\big), \tag{8.22}$$

where $\det(\cdot)$ is the determinant of a matrix and $\operatorname{tr}(\cdot)$ is the trace of a matrix. Minimization of this expression is equivalent to minimization of the KL-divergence where the first argument is a multivariate normal distribution with mean vector $\mu_\phi(x)$ and covariance matrix $\Sigma_\phi(x)$ and the second argument is an $m$-dimensional standard multivariate normal distribution. This is because,

$$D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \frac{1}{2} \Big( C_{\text{PriorMatching}}(\phi) - m \Big),$$

as can be verified with (B.11) of Appendix B.
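As a small numerical check of the relationship between the prior matching term (8.22) and the KL-divergence, the sketch below evaluates both for a diagonal covariance matrix. The values of `mu` and `var` are arbitrary stand-ins for encoder outputs at some $x$, and the diagonal-case KL formula used for comparison is the standard coordinate-wise expression for normal distributions.

```python
import numpy as np

def prior_matching(mu, Sigma):
    """The expression (8.22): mu^T mu - log det(Sigma) + tr(Sigma)."""
    _, logdet = np.linalg.slogdet(Sigma)
    return mu @ mu - logdet + np.trace(Sigma)

def kl_to_standard_normal_diag(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ), summed coordinate by coordinate."""
    return np.sum(0.5 * (mu**2 + var - np.log(var) - 1.0))

mu = np.array([0.5, -1.0, 0.2])    # stand-in for the encoder mean at some x
var = np.array([0.8, 1.5, 0.3])    # stand-in for the diagonal of the encoder covariance
m = len(mu)

c_pm = prior_matching(mu, np.diag(var))
kl = kl_to_standard_normal_diag(mu, var)
print(kl, 0.5 * (c_pm - m))        # the two values coincide

# The term is smallest, with zero KL-divergence, at mu = 0 and Sigma = I.
c_pm_at_prior = prior_matching(np.zeros(m), np.eye(m))
```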

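Moving onto the second term in (8.13), using a single $x$ and a single $z^*$, we construct this term as,

$$C_{\text{Reconstruction}}(\phi, \theta) = \big(x - \mu_\theta(z^*)\big)^\top \Sigma_\theta(z^*)^{-1} \big(x - \mu_\theta(z^*)\big) + \log \det\big(\Sigma_\theta(z^*)\big). \tag{8.23}$$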

This term aims to capture the $\mathbb{E} \log p_\theta(x \mid Z)$ term in (8.21). Observe that $C_{\text{Reconstruction}}(\phi, \theta)$ is a random variable with distribution influenced by $\phi$ through the distribution of $z^*$. The form of (8.23) arises from the log-density expression in (B.8) of Appendix B. We do not have an explicit expression for the expectation $\mathbb{E} \log p_\theta(x \mid Z)$ term of (8.21), therefore, as


a simple estimate we use a single sample $z^*$ of the latent variable from the distribution $q_\phi(z \mid x)$ and rely on the approximate relationship,

$$\mathbb{E} \log p_\theta(x \mid Z) \approx \log p_\theta(x \mid z^*), \tag{8.24}$$

as a proxy for the optimization. Now considering (8.23) we have,

$$\log p_\theta(x \mid z^*) = -\frac{1}{2} \Big( C_{\text{Reconstruction}}(\phi, \theta) + p \log(2\pi) \Big).$$

Hence minimization of $C_{\text{Reconstruction}}(\phi, \theta)$ maximizes $\log p_\theta(x \mid z^*)$, which approximates maximization of $\mathbb{E} \log p_\theta(x \mid Z)$ via (8.24).

We have thus seen that the encoder-decoder architecture, learned via stochastic gradient

descent with loss function

(8.13)

and individual terms

(8.22)

and

(8.23)

approximates

maximization of ELBO, and thus facilitates maximum likelihood estimation of ◊.

The Reparameterization Trick

During training of the variational autoencoder, we need to compute the gradient of the loss function (8.13). With standard backpropagation we can compute the gradient of the $C_{\text{PriorMatching}}(\phi)$ component in a straightforward manner. However, since the $C_{\text{Reconstruction}}(\phi, \theta)$ component is based on a random sample $z^*$, it is not obvious how to use backpropagation through the random number generator. Luckily, via a reparametrization of the random vector $z^*$ as an affine function of a standard random vector $\epsilon$, we can overcome this difficulty. Specifically, if $\epsilon$ is a standard $m$-dimensional multivariate normal random vector, then we may set,

$$z^* = \mu_\phi(x) + \tilde{\Sigma}_\phi(x)\, \epsilon, \tag{8.25}$$

where as before $\mu_\phi(\cdot)$ is the learned mean function of the encoder. With the reparameterization (8.25), the desired learned covariance function $\Sigma_\phi(\cdot)$ is decomposed using $\Sigma_\phi(\cdot) = \tilde{\Sigma}_\phi(\cdot)\, \tilde{\Sigma}_\phi(\cdot)^\top$, as in (8.4). Now $\tilde{\Sigma}_\phi(\cdot)$ is learned in the encoder neural network in place of $\Sigma_\phi(\cdot)$ of (8.10). With this, the distribution of $z^*$ is as required, and further there are no longer issues with enforcing the positive definite constraint that is required for the covariance matrix.

Importantly, backpropagation can now be carried out, because the random numbers generated

‘

are now generated independently of the parameters whose gradients we seek. In some

cases, the desired covariance matrix is diagonal. In such cases, the reparameterization is

especially simple since 

„

(

) is also diagonal with each entry being the square root of the

associated diagonal entry of 

„

(x).
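The diagonal case just described can be illustrated with a minimal NumPy sketch. It is not the book's implementation: the function name `reparameterize` and the fixed arrays standing in for the encoder outputs are assumptions made purely for illustration. The point it shows is that the noise is drawn without reference to the parameters, so the sample is a deterministic, differentiable function of the mean and of the diagonal factor.

```python
import numpy as np

def reparameterize(mu, var_diag, rng):
    """Return z = mu + S * eps, with S = diag(sqrt(var_diag)) and eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)  # noise drawn independently of the parameters
    s_diag = np.sqrt(var_diag)           # diagonal factor S satisfying S S^T = Sigma
    return mu + s_diag * eps, eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])         # hypothetical stand-in for the encoder mean output
var_diag = np.array([0.25, 4.0])   # hypothetical stand-in for the covariance diagonal

# Many draws, to check that the samples have the intended mean and variance.
samples = np.stack([reparameterize(mu, var_diag, rng)[0] for _ in range(20000)])
```

Given `eps`, the derivative of each sample with respect to `mu` is one and with respect to `s_diag` is `eps`, which is what allows gradients to flow to the encoder parameters.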

8.2 Diffusion Models

Diffusion models are a class of generative models that have shown great promise in creating both realistic looking images, as well as highly impressive surreal artwork. Figure 8.3 presents a few examples with several paradigms of application including image generation, colorization, style transfer, and text-to-image creation. The ease of use of such platforms from a user’s perspective, and the quality of the images created, are impressive.

Figure 8.3: Images generated via various diffusion architectures with various types of paradigms. (a) Generated 256 × 256 images based on a label. (b) Image-to-image generation (colorization). (c) Applying a style (left column) to an image (middle column). (d) A text-to-image application.

The principles underlying diffusion models hinge on variational autoencoders presented in the previous section, as well as on a generalisation of such models, called hierarchical variational autoencoders and in particular Markovian hierarchical variational autoencoders. Diffusion models are a special case of such models, hence in exploring diffusion models, we first study Markovian hierarchical variational autoencoders. We then construct diffusion models as a special case of Markovian hierarchical variational autoencoders, based on auto-regressive Gaussian processes.

Hierarchical Variational Autoencoders

In the autoencoders of Section 8.1 the encoder acts as a noising mechanism that converts the input $x$ into a noisy latent variable $z$. On the other hand the decoder can be viewed as a denoising mechanism for converting a noisy latent variable $z$ to meaningful output. The main idea of hierarchical variational autoencoders is to break up this process into multiple levels. In the encoder, the input $x$ is noised slightly to create $z_1$, then $z_1$ is further noised to create $z_2$, up until some final level $T$ with a fully noisy latent variable $z_T$. In the decoder the reverse process takes place where each step from $z_t$ to $z_{t-1}$ enacts a slight denoising operation.

Hence in hierarchical variational autoencoders, instead of a single latent variable $z$, we have a sequence of $T$ levels with latent variables, $z_1, \ldots, z_T$. Such models also enforce a specific dependence structure within this sequence and are called “hierarchical”, since in the encoder each $z_t$ can be generated based on the values of the lower-level latent variables $z_1, \ldots, z_{t-1}$ and in the decoder each latent variable $z_t$ can be generated based on the values of the higher latent variables $z_{t+1}, \ldots, z_T$. In such a model, one can view the input $x \in \mathcal{D}$ as the 0-th level, namely the complete sequence of levels is $x, z_1, \ldots, z_T$, where we can treat $x$ as $z_0$.

The most common hierarchical variational autoencoders are Markovian, in which case the sequence $x, z_1, \ldots, z_T$ is a Markov chain, and the model is called a Markovian hierarchical variational autoencoder. That is, the sequence follows the Markov property which can be defined in multiple ways. One way is that for any $t = 1, \ldots, T-1$, given the value of $z_t$, the sequence $x, z_1, \ldots, z_{t-1}$ and the sequence $z_{t+1}, \ldots, z_T$ are independent.

Considering the decoder, a consequence of the Markov property is that the joint distribution $p_\theta(x, z_1, \ldots, z_T)$ can be described as a product of one step transition probabilities. We have,

$$p_\theta(x, z_1, \ldots, z_T) = p_{\theta_1}(x \mid z_1)\, p_{\theta_2}(z_1 \mid z_2) \cdots p_{\theta_T}(z_{T-1} \mid z_T)\, p(z_T), \tag{8.26}$$

where $\theta = (\theta_1, \ldots, \theta_T)$ are the decoder parameters, and each $p_{\theta_t}(z_{t-1} \mid z_t)$ is the decoder conditional one step transition probability from level $t$ to level $t-1$. Note that in this expression, $p(z_T)$, the distribution of level $T$, is parameter free. Since it is assumed to be fully noisy, it is often taken as a multivariate standard normal similar to (8.7) from the (non-hierarchical) variational autoencoder.

In such models, the one step transition probabilities $p_{\theta_t}(\cdot \mid z_t)$ are generally represented as multivariate Gaussian distributions where for level $t$, we learn neural networks for the mean function $\mu_{\theta_t}(\cdot)$ and potentially for the covariance function $\Sigma_{\theta_t}(\cdot)$. Hence the learned parameters $\theta_t$ describe the distribution at level $t-1$ given the value $z_t$ at the previous level.

Footnote to Figure 8.3: Image (a) is thanks to Ho et al. [183]. Image (b) is thanks to Saharia et al. [358]. Image (c) is thanks to Wang et al. [418]. Image (d) is thanks to Nichol et al. [307].

Observe that the Markovian hierarchical variational autoencoder joint distribution in (8.26) is the parallel of the joint distribution (8.2) for the non-hierarchical case. A potential benefit of using (8.26) is that by breaking up the parameterization of the decoder into $T$ smaller steps, each with its own learned parameters, more expressive models can be achieved.

A decoder parameterized by $\theta = (\theta_1, \ldots, \theta_T)$ implements a generative model, similarly to the non-hierarchical case. First $z_T$ is drawn from the distribution $p(z_T)$, and then for each transition we compute $\mu_{\theta_t}(z_t)$ and $\Sigma_{\theta_t}(z_t)$ and then draw $z_{t-1}$ from the transition probability $p_{\theta_t}(\cdot \mid z_t)$ parameterized by these values. Hence while a non-hierarchical variational autoencoder draws $z$ once and then generates $x$, here we draw $z_T$, then $z_{T-1}$, and so forth until attaining a generated data sample $x = z_0$.
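The top-down sampling procedure of the decoder can be sketched as follows. This is only an illustration under simplifying assumptions: the covariance at each level is taken to be isotropic with a fixed standard deviation, and the functions in `mean_fns` are hypothetical stand-ins for the learned mean networks. The name `sample_decoder` is likewise hypothetical.

```python
import numpy as np

def sample_decoder(mean_fns, sigmas, dim, rng):
    """Draw z_T ~ N(0, I), then z_{t-1} ~ N(mean_fns[t-1](z_t), sigmas[t-1]^2 I) for t = T, ..., 1."""
    T = len(mean_fns)
    z = rng.standard_normal(dim)          # z_T from the parameter-free distribution p(z_T)
    path = [z]
    for t in range(T, 0, -1):
        mean = mean_fns[t - 1](z)         # stand-in for the mean network at level t
        z = mean + sigmas[t - 1] * rng.standard_normal(dim)
        path.append(z)
    return z, path                        # z is z_0, the generated sample

# Toy usage: three levels, each mean function simply shrinks its input.
rng = np.random.default_rng(0)
toy_means = [lambda z: 0.9 * z] * 3
x_generated, path = sample_decoder(toy_means, sigmas=[0.1, 0.1, 0.1], dim=4, rng=rng)
```

The returned `path` holds the whole sequence from the fully noisy level down to the generated sample, which is convenient for inspecting how each step changes the latent variable.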



Figure 8.4: A Markovian hierarchical variational autoencoder has a latent space $z_1, \ldots, z_T$ with $z_0 = x$. The encoder transforms from $z_t$ to $z_{t+1}$ (adding noise) and the decoder works in the opposite direction (removing noise).

Like non-hierarchical variational autoencoders, in Markovian hierarchical variational autoencoders, the encoder serves as an aid for training the generative model (the decoder). The encoder also incorporates a Markovian structure within levels. The encoder parameters are $\phi = (\phi_0, \ldots, \phi_{T-1})$, where this time for $t = 0, \ldots, T-1$, we have conditional one step transition probabilities, $q_{\phi_t}(z_{t+1} \mid z_t)$, describing the transition from level $t$ to level $t+1$. Like the decoder, in the encoder we can assume Gaussian transition probabilities which are this time parameterized by $\mu_{\phi_t}(z_t)$ and $\Sigma_{\phi_t}(z_t)$. See Figure 8.4.

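Considering the encoder, it is useful to represent the conditional distribution of $z_1, \ldots, z_t$ given $x$ for any $t = 1, \ldots, T$, denoted as $q_\phi(z_1, \ldots, z_t \mid x)$. Due to the Markovian assumption we have,

$$q_\phi(z_1, \ldots, z_t \mid x) = q_{\phi_0}(z_1 \mid x)\, q_{\phi_1}(z_2 \mid z_1) \cdots q_{\phi_{t-1}}(z_t \mid z_{t-1}), \quad \text{for } t = 1, \ldots, T. \tag{8.27}$$

Further, the distribution of $z_t$ for a given $x$ is denoted as $q_\phi(z_t \mid x)$ which may formally be represented as a marginal distribution from $q_\phi(z_1, \ldots, z_t \mid x)$, and obtained via,

$$q_\phi(z_t \mid x) = \int \cdots \int q_\phi(z_1, \ldots, z_t \mid x)\, dz_1 \cdots dz_{t-1}. \tag{8.28}$$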

In the specific diffusion models below this distribution is constructed with an explicit form, yet for now the general expression (8.28) is appropriate.

With this notation and the construction of the encoder and decoder in place, we can seek to approximate maximum likelihood estimation of $\theta$. This is done in a similar manner to the non-hierarchical variational autoencoder case outlined in Section 8.1. Specifically we maximize an ELBO term over both $\phi$ and $\theta$.

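The ELBO term in this case is similar to (8.16) yet can be represented via $z_1, \ldots, z_T$ in place of $z$. Namely,

$$\mathrm{ELBO}(\theta, \phi\,;\, x) = \int q_\phi(z_1, \ldots, z_T \mid x) \log \frac{p_\theta(x, z_1, \ldots, z_T)}{q_\phi(z_1, \ldots, z_T \mid x)}\, dz_1 \cdots dz_T. \tag{8.29}$$

Now using the Markovian structure of (8.26) and (8.27), and after some extensive manipulation, the following expansion of $\mathrm{ELBO}(\theta, \phi\,;\, x)$ arises:

$$
\begin{aligned}
& \underbrace{\int q_\phi(z_1 \mid x) \log p_{\theta_1}(x \mid z_1)\, dz_1}_{\mathbb{E} \log p_{\theta_1}(x \mid Z_1) \ \text{(Reconstruction)}}
\;-\; \underbrace{\int q_\phi(z_T \mid x) \log \frac{q_\phi(z_T \mid x)}{p(z_T)}\, dz_T}_{D_{\mathrm{KL}}(q_\phi(z_T \mid x) \,\|\, p(z_T)) \ \text{(Prior matching)}} \\
& \;-\; \sum_{t=2}^{T} \underbrace{\int q_\phi(z_t \mid x) \int q_\phi(z_{t-1} \mid z_t, x) \log \frac{q_\phi(z_{t-1} \mid z_t, x)}{p_{\theta_t}(z_{t-1} \mid z_t)}\, dz_{t-1}\, dz_t}_{\mathbb{E}\, D_{\mathrm{KL}}(q_\phi(z_{t-1} \mid Z_t, x) \,\|\, p_{\theta_t}(z_{t-1} \mid Z_t)) \ \text{(Denoising matching)}}
\end{aligned}
\tag{8.30}
$$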

With this expansion ELBO is now represented in terms of three types of terms, namely a reconstruction term, a prior matching term, and denoising matching terms. The first two terms are also present in non-hierarchical variational autoencoders whereas the denoising matching terms arise due to the multi-level structure of the latent space.

The reconstruction term in (8.30) is an expectation of the conditional log-likelihood with respect to $Z_1$, distributed according to $q_\phi(z_1 \mid x)$. Maximization of this term drives maximum likelihood estimation. The prior matching term in (8.30) is the KL-divergence between the standard normal distribution $p(z_T)$ and the encoder based last step distribution $q_\phi(z_T \mid x)$. Minimization of this KL-divergence attempts to enforce that $z_T$ at the exit of the encoder is distributed approximately as a standard normal.

In the additional $T-1$ terms, we use $q_\phi(z_{t-1} \mid z_t, x)$ which can be viewed as a denoising distribution in the encoder. Note that in contrast to the direction of the encoder ($z_{t-1} \to z_t$), this distribution is in the opposite direction. The expected KL-divergence in each denoising matching term is between $q_\phi(z_{t-1} \mid z_t, x)$ and the decoder’s one step transition $p_{\theta_t}(z_{t-1} \mid z_t)$. Ideally both the encoder and the decoder are similar at level $t$ and minimization of the expected KL-divergence enforces the denoising to be as close as possible. Note that when represented as an expectation, the expectation is with respect to the random variable $Z_t$ distributed as $q_\phi(z_t \mid x)$.

As is evident, the $\mathrm{ELBO}(\theta, \phi\,;\, x)$ expressions rely on $q_\phi(z_t \mid x)$ which in general needs to be computed via (8.27) and (8.28). It also relies on $q_\phi(z_{t-1} \mid z_t, x)$, which is generally difficult to compute. We now see that in the special case of a diffusion model, with the assumptions imposed, the expressions for $q_\phi(z_t \mid x)$ and $q_\phi(z_{t-1} \mid z_t, x)$ simplify and are efficient to compute.

The process of learning parameters $\theta$ for Markovian hierarchical variational autoencoders is in principle similar to learning variational autoencoders as presented in Section 8.1. Specifically, we approximately maximize $\mathrm{ELBO}(\theta, \phi\,;\, x)$ over both $\theta$ and $\phi$ using estimates obtained via samples of the latent variables. The resulting decoder with parameters $\theta$ emerges and approximates maximum likelihood estimation as in the non-hierarchical case.

The Diffusion Model Assumptions

A diffusion model is a Markovian hierarchical variational autoencoder with a few specific assumptions. All latent variables are of the same dimension as the data samples. Further, recalling that $z_0 = x$, the one step encoder and decoder transition probabilities are respectively,

$$q(z_{t+1} \mid z_t) = \mathcal{N}\big(z_{t+1}\,;\, \sqrt{1-\beta_t}\, z_t,\ \beta_t I\big) \quad \text{for } t = 0, \ldots, T-1, \tag{8.31}$$

$$p_{\theta_t}(z_{t-1} \mid z_t) = \mathcal{N}\big(z_{t-1}\,;\, \mu_{\theta_t}(z_t),\ \sigma_t^2 I\big) \quad \text{for } t = T, \ldots, 1. \tag{8.32}$$

Here we have two sequences of scalar hyper-parameters. The encoder hyper-parameters are $\beta_0, \ldots, \beta_{T-1}$, each in the range $[0, 1]$. The decoder hyper-parameters are $\sigma_1^2, \ldots, \sigma_T^2$, each a positive number. Note that the encoder has no learned parameters and thus there are no $\phi$ subscripts for $q(z_{t+1} \mid z_t)$. The decoder learned parameters, $\theta_1, \ldots, \theta_T$, are used for the mean functions $\mu_{\theta_1}(\cdot), \ldots, \mu_{\theta_T}(\cdot)$, but not for the one step covariance matrices that are of the form $\sigma_1^2 I, \ldots, \sigma_T^2 I$. Each of these mean functions, $\mu_{\theta_t}(\cdot)$, is a neural network, and hence the model has $T$ learned neural networks.

A key aspect of the diffusion model is the structure of the one step transition probabilities of the encoder (8.31). Since the encoder steps have a conditional mean vector and covariance matrix of $\sqrt{1-\beta_t}\, z_t$ and $\beta_t I$ respectively, they describe an auto-regressive stochastic sequence of the form,

$$z_{t+1} = \sqrt{1-\beta_t}\, z_t + \sqrt{\beta_t}\, \epsilon_t, \tag{8.33}$$

where $\epsilon_t$ is a multivariate standard normal vector. Conditioned on the value of $z_t$, the expected value of $z_{t+1}$ resulting from (8.33) is $\sqrt{1-\beta_t}\, z_t$ and the covariance matrix is $\beta_t I$. Hence, this stochastic recursion follows the next step distribution (8.31). The name “diffusion model” can be attributed to the fact that if (8.33) were in continuous time, then the process would be a type of stochastic diffusion process.

A strength of the recursion (8.33) is that it also enables closed form expressions of the multi-step transition probabilities $q_\phi(z_t \mid x)$. Such probabilities are heavily used in the ELBO expression (8.30). For this purpose it is useful to define the products $\gamma_t$ for $t = 1, \ldots, T$ with $\gamma_t = (1-\beta_0) \cdots (1-\beta_{t-1})$.

With this notation, standard recursive computations of means and variances involving geometric sums yield,

$$q(z_t \mid x) = \mathcal{N}\big(z_t\,;\, \sqrt{\gamma_t}\, x,\ (1-\gamma_t) I\big) \quad \text{for } t = 1, \ldots, T. \tag{8.34}$$

Hence for such an auto-regressive stochastic sequence, the distribution of $z_t$ given the initial value $x$ has a mean vector $\sqrt{\gamma_t}\, x$ and a covariance matrix $(1-\gamma_t) I$.
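As a sanity check of the closed form (8.34), the following sketch simulates the recursion (8.33) for many independent scalar chains and compares the empirical mean and variance at the final level with the closed-form values. The schedule `betas`, the number of levels, and the data point are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

T = 10
betas = np.linspace(0.01, 0.2, T)    # hypothetical schedule beta_0, ..., beta_{T-1}
gammas = np.cumprod(1.0 - betas)     # gammas[t - 1] holds gamma_t

rng = np.random.default_rng(1)
x = 2.0                              # scalar data point, for illustration only
z = np.full(100_000, x)              # many independent chains, all starting at z_0 = x
for t in range(T):
    # One step of the recursion (8.33) with fresh standard normal noise.
    z = np.sqrt(1.0 - betas[t]) * z + np.sqrt(betas[t]) * rng.standard_normal(z.shape)

empirical_mean, empirical_var = z.mean(), z.var()
closed_form_mean = np.sqrt(gammas[-1]) * x   # sqrt(gamma_T) x, as in (8.34)
closed_form_var = 1.0 - gammas[-1]           # 1 - gamma_T, as in (8.34)
```

In practice this is the reason the closed form matters: a sample at any level can be drawn in one step from the data point, without simulating the intermediate levels.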

It is of further interest to have explicit expressions for $q(z_{t-1} \mid z_t, x)$, a probability appearing in the denoising matching term in (8.30). The diffusion model assumptions enable us to obtain this conditional probability as well. Namely,

$$
\begin{aligned}
q(z_{t-1} \mid z_t, x) &= \frac{q(z_t \mid z_{t-1}, x)\, q(z_{t-1} \mid x)}{q(z_t \mid x)} = \frac{q(z_t \mid z_{t-1})\, q(z_{t-1} \mid x)}{q(z_t \mid x)} \\
&= \mathcal{N}\Big(z_{t-1}\,;\, \underbrace{\frac{(1-\gamma_{t-1})\sqrt{1-\beta_{t-1}}}{1-\gamma_t}\, z_t + \frac{\sqrt{\gamma_{t-1}}\,\beta_{t-1}}{1-\gamma_t}\, x}_{\text{Mean vector}},\ \underbrace{\frac{\beta_{t-1}(1-\gamma_{t-1})}{1-\gamma_t}\, I}_{\text{Covariance matrix}}\Big).
\end{aligned}
\tag{8.35}
$$

The first equality follows from Bayes’ rule of conditional probability, maintaining the condition on $x$. In the second equality we use the Markovian structure which implies that $q(z_t \mid z_{t-1}, x) = q(z_t \mid z_{t-1})$. Then to obtain the final expression we manipulate normal densities as given by (8.31) and (8.34). Note that this final manipulation requires a few algebraic steps involving completion of the squares; we omit the details. Hence we have explicit expressions for the conditional (on $z_t$ and $x$) mean vector and covariance matrix of $z_{t-1}$.
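Since the completion of the squares is omitted, a numerical cross-check of (8.35) may be reassuring. The sketch below, for the scalar case, computes the mean and variance from (8.35) and compares them with the generic formula for conditioning one component of a bivariate Gaussian on the other. The function names, the array layout (with $\gamma_0 = 1$ stored at index 0) and the schedule are assumptions made for illustration.

```python
import numpy as np

def posterior_params(t, z_t, x, betas, gammas):
    """Mean and variance of q(z_{t-1} | z_t, x) as given in (8.35).

    betas[k] holds beta_k and gammas[k] holds gamma_k, with gammas[0] = 1.
    """
    b = betas[t - 1]
    g_prev, g = gammas[t - 1], gammas[t]
    mean = ((1 - g_prev) * np.sqrt(1 - b) * z_t + np.sqrt(g_prev) * b * x) / (1 - g)
    var = b * (1 - g_prev) / (1 - g)
    return mean, var

def posterior_by_conditioning(t, z_t, x, betas, gammas):
    """The same quantities via standard conditioning of a bivariate Gaussian."""
    b = betas[t - 1]
    g_prev, g = gammas[t - 1], gammas[t]
    cov = np.sqrt(1 - b) * (1 - g_prev)   # Cov(z_{t-1}, z_t) given x
    mean = np.sqrt(g_prev) * x + cov / (1 - g) * (z_t - np.sqrt(g) * x)
    var = (1 - g_prev) - cov ** 2 / (1 - g)
    return mean, var

betas = np.array([0.05, 0.1, 0.15, 0.2])                    # hypothetical beta_0, ..., beta_3
gammas = np.concatenate(([1.0], np.cumprod(1.0 - betas)))   # gamma_0 = 1, gamma_1, ..., gamma_4
```

The second function uses only (8.31) and (8.34): given $x$, the pair $(z_{t-1}, z_t)$ is jointly Gaussian with the variance of $z_t$ equal to $1-\gamma_t$, so the two routes should agree for every level.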

Loss Function

Training a diffusion model involves learning the parameter vectors $\theta_1, \ldots, \theta_T$. This is the learning of $T$ distinct neural networks together. While it is a special case of training general Markovian hierarchical variational autoencoders, the fact that there are no learned parameters in the encoder, and the fact that the decoder covariance functions are also without learned parameters, eases training.

With any Markovian hierarchical variational autoencoder, we would ideally like to maximize ELBO represented in (8.30), yet in the special case of a diffusion model, the closed form diffusion expressions (8.31), (8.32), (8.34), and (8.35) simplify the training process. Some of the general parameters and associated probabilities in (8.30) are now captured by closed form expressions and hyper-parameters in the diffusion model case.

Considering the ELBO expression (8.30) we see that for diffusion models we do not need a prior matching term, since this term only depends on the encoder parameters $\phi$, and diffusion models do not have learned parameters for the encoder. With this, a single observation loss function for the diffusion model can then be represented as,

$$C(\theta_1, \ldots, \theta_T\,;\, x) = C_{\text{Reconstruction}}(\theta_1) + \sum_{t=2}^{T} C_{\text{DenoisingMatching}}(\theta_t), \tag{8.36}$$

where $x \in \mathcal{D}$. The loss component $C_{\text{Reconstruction}}(\theta_1)$ relates to the reconstruction term in (8.30), and the loss components $C_{\text{DenoisingMatching}}(\theta_t)$ relate to the denoising matching terms in (8.30). Note that for brevity we omit the dependence on $x$ in the terms on the right hand side. These loss components are implemented as,

$$C_{\text{Reconstruction}}(\theta_1) = \frac{1}{2\sigma_1^2}\, \big\| x - \mu_{\theta_1}(z_1) \big\|^2, \tag{8.37}$$

$$C_{\text{DenoisingMatching}}(\theta_t) = \frac{1}{2\sigma_t^2}\, \Big\| \frac{(1-\gamma_{t-1})\sqrt{1-\beta_{t-1}}}{1-\gamma_t}\, z_t + \frac{\sqrt{\gamma_{t-1}}\,\beta_{t-1}}{1-\gamma_t}\, x - \mu_{\theta_t}(z_t) \Big\|^2, \tag{8.38}$$

for $t = 2, \ldots, T$. Here each $z_t$ is generated randomly using (8.34) for the given data sample $x$.

Minimization of the terms (8.37) and (8.38) is approximately equivalent to ELBO maximization, in a similar nature to the variational autoencoder. The reconstruction loss (8.37) is similar to the standard variational autoencoder reconstruction loss (8.23), based on the log-density expression (B.8) of Appendix B. Here we exploit the fact that there are no learned covariance parameters. Similarly to the standard variational autoencoder of Section 8.1, minimization of this reconstruction loss serves to approximately maximize $\mathbb{E} \log p_{\theta_1}(x \mid Z_1)$ in (8.30).

The denoising matching loss terms (8.38) are based on the KL-divergence expression (B.10) of Appendix B and in fact for every level $t$,

$$D_{\mathrm{KL}}\big(q(z_{t-1} \mid z_t, x) \,\|\, p_{\theta_t}(z_{t-1} \mid z_t)\big) = C_{\text{DenoisingMatching}}(\theta_t) + \text{constant},$$

where the constant on the right hand side depends only on the (non-learnable) hyper-parameters $\beta_{t-1}$, $\gamma_{t-1}$, and $\sigma_t$. From this expression, it is clear that minimization of the KL-divergence is equivalent to minimization of the Euclidean distance between the mean vectors of the distributions $q(z_{t-1} \mid z_t, x)$ and $p_{\theta_t}(z_{t-1} \mid z_t)$. Further, this term serves as an estimate of $\mathbb{E}\, D_{\mathrm{KL}}\big(q(z_{t-1} \mid Z_t, x) \,\|\, p_{\theta_t}(z_{t-1} \mid Z_t)\big)$ from (8.30), where the mean and covariance expressions of (8.35) are used for the inner integral, and the single $z_t$ drawn from $q(z_t \mid x)$ serves as an approximation of the outer integral.

This loss formulation in (8.36), (8.37), and (8.38) presents us with a simple mechanism for learning the parameters of the neural networks $\mu_{\theta_1}(\cdot), \ldots, \mu_{\theta_T}(\cdot)$ using gradient based optimization where at each epoch we draw random $z_1, \ldots, z_T$ according to (8.34). Similar to other deep learning cases, we can also integrate mini-batch based learning and other gradient descent variations. As we show now, these loss expressions can be simplified even further using the reparameterization trick.

The Reparameterization Trick and Simplified Loss

In (8.25) of Section 8.1, we saw the reparameterization trick in the context of variational autoencoders. This allows us to carry out model training using backpropagation. A similar reparameterization trick can be applied to diffusion models. In the context of diffusion models, this trick simplifies the loss function (8.36) and often yields better numerical stability during training. Such simplification is achieved by replacing the neural networks $\mu_{\theta_1}(\cdot), \ldots, \mu_{\theta_T}(\cdot)$, which are used for sequentially denoising the latent variables to obtain a data sample $x$, with a different collection of neural networks $\hat{\mu}_{\theta_1}(\cdot), \ldots, \hat{\mu}_{\theta_T}(\cdot)$. Each such $\hat{\mu}_{\theta_t}(\cdot)$ is trained to predict a standard noise component as constructed below.

Similar to the variational autoencoder case, the reparameterization trick we use is applied on the samples of the latent variables. In particular, based on (8.34) we can represent

$$z_t = \sqrt{\gamma_t}\, x + \sqrt{1-\gamma_t}\, \epsilon, \tag{8.39}$$

where $\epsilon$ denotes an independent multivariate standard normal vector with dimension equal to the dimension of $x$. Using this reparameterization trick, as we show below, at a given data sample $x$, the loss function can be of the form,

$$C(\theta_1, \ldots, \theta_T\,;\, x) = \sum_{t=1}^{T} C_{\text{Reparameterized}}(\theta_t), \tag{8.40}$$

where each $C_{\text{Reparameterized}}(\theta_t)$ is defined as,

$$C_{\text{Reparameterized}}(\theta_t) = \frac{\beta_{t-1}^2}{2\sigma_t^2\,(1-\gamma_t)(1-\beta_{t-1})}\, \Big\| \epsilon - \hat{\mu}_{\theta_t}\big(\underbrace{\sqrt{\gamma_t}\, x + \sqrt{1-\gamma_t}\, \epsilon}_{z_t}\big) \Big\|^2, \tag{8.41}$$

and the dependence on $x$ is notationally suppressed. Before delving into the mathematical formulation of obtaining the loss component (8.41), let us focus on how the diffusion model is trained.

When using (8.40) and (8.41) we train the neural networks $\hat{\mu}_{\theta_t}(\cdot)$ either with a single $x \in \mathcal{D}$ or with a mini-batch approach. Importantly, in each gradient evaluation during training, we generate a standard multivariate normal $\epsilon$ which is mapped to $z_t$ using (8.39). We then obtain the gradient of $\hat{\mu}_{\theta_t}(\cdot)$ at $z_t$ and this allows us to compute the gradient of $C_{\text{Reparameterized}}(\theta_t)$.
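The forward computation behind one such gradient evaluation can be sketched as below. This is a one-sample evaluation of the loss component (8.41) only; gradient computation is left to whichever automatic differentiation framework is used. The function `reparameterized_loss`, the callable `noise_model`, and the array layout are hypothetical choices for illustration, and the "oracle" model is an artificial device that is only available here because the data point is known.

```python
import numpy as np

def reparameterized_loss(noise_model, t, x, betas, gammas, sigmas, rng):
    """One-sample evaluation of the loss component (8.41) at level t.

    noise_model(z_t, t) is a stand-in for the noise-prediction network.
    betas[k] = beta_k, gammas[k] = gamma_k (gammas[0] = 1), sigmas[k] = sigma_k (sigmas[0] unused).
    """
    eps = rng.standard_normal(x.shape)
    z_t = np.sqrt(gammas[t]) * x + np.sqrt(1 - gammas[t]) * eps   # the mapping (8.39)
    b = betas[t - 1]
    weight = b ** 2 / (2 * sigmas[t] ** 2 * (1 - gammas[t]) * (1 - b))
    return weight * np.sum((eps - noise_model(z_t, t)) ** 2)

betas = np.array([0.05, 0.1, 0.15, 0.2])                     # hypothetical schedule
gammas = np.concatenate(([1.0], np.cumprod(1.0 - betas)))
sigmas = np.array([np.nan, 0.3, 0.3, 0.3, 0.3])              # index 0 is unused padding
x = np.array([0.5, -1.0, 2.0])

# An "oracle" that inverts (8.39) using the known x recovers the noise exactly.
oracle = lambda z, t: (z - np.sqrt(gammas[t]) * x) / np.sqrt(1 - gammas[t])
loss_oracle = reparameterized_loss(oracle, 3, x, betas, gammas, sigmas, np.random.default_rng(0))
loss_zero = reparameterized_loss(lambda z, t: np.zeros_like(z), 3, x, betas, gammas, sigmas,
                                 np.random.default_rng(0))
```

The oracle attains a loss of essentially zero while a model that always predicts zero noise does not, which illustrates what the trained networks are being asked to approximate.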

In the remaining part of the section, we provide a derivation of the loss expressions and then discuss how the trained diffusion model is used in production. With (8.39), a sample of $z_t$ can be obtained by first generating a sample of $\epsilon$ and then using a data sample $x$ to get $z_t$. Thus, if the values of $\epsilon$ and the corresponding $z_t$ are given, we can represent $x$ as,

$$x = \frac{1}{\sqrt{\gamma_t}}\, z_t - \frac{\sqrt{1-\gamma_t}}{\sqrt{\gamma_t}}\, \epsilon.$$

Using this, we can now manipulate the mean vector expression in (8.35). After some simplification, based on the fact that $\gamma_t / \gamma_{t-1} = 1 - \beta_{t-1}$, we obtain

$$\frac{(1-\gamma_{t-1})\sqrt{1-\beta_{t-1}}}{1-\gamma_t}\, z_t + \frac{\sqrt{\gamma_{t-1}}\,\beta_{t-1}}{1-\gamma_t}\, x = \frac{1}{\sqrt{1-\beta_{t-1}}}\, z_t - \frac{\beta_{t-1}}{\sqrt{(1-\gamma_t)(1-\beta_{t-1})}}\, \epsilon.$$

Consequently, the denoising loss component in (8.38) can be expressed as

$$C_{\text{DenoisingMatching}}(\theta_t) = \frac{1}{2\sigma_t^2}\, \Big\| \frac{1}{\sqrt{1-\beta_{t-1}}}\, z_t - \frac{\beta_{t-1}}{\sqrt{(1-\gamma_t)(1-\beta_{t-1})}}\, \epsilon - \mu_{\theta_t}(z_t) \Big\|^2. \tag{8.42}$$

Hence the neural network $\mu_{\theta_t}(\cdot)$ attempts to predict $\frac{1}{\sqrt{1-\beta_{t-1}}}\, z_t - \frac{\beta_{t-1}}{\sqrt{(1-\gamma_t)(1-\beta_{t-1})}}\, \epsilon$. Now the essence of the reparameterization trick is to use a different neural network, denoted as $\hat{\mu}_{\theta_t}(\cdot)$, that takes $z_t$ to predict the noise $\epsilon$. Note that only for notational convenience we use the same notation $\theta_t$ to denote the parameters of the new neural network as well.

With the new neural network, $\hat{\mu}_{\theta_t}(\cdot)$, we replace $\mu_{\theta_t}(z_t)$ in (8.42) with a new random latent variable,

$$\hat{z}_{t-1} = \frac{1}{\sqrt{1-\beta_{t-1}}}\, z_t - \frac{\beta_{t-1}}{\sqrt{(1-\gamma_t)(1-\beta_{t-1})}}\, \hat{\mu}_{\theta_t}(z_t), \tag{8.43}$$

where $z_t$ is obtained from $\epsilon$ using (8.39). Now with basic manipulation the $C_{\text{DenoisingMatching}}(\theta_t)$ expression in (8.42) can be replaced by $C_{\text{Reparameterized}}(\theta_t)$ of (8.41).

With a similar argument, we can show that for $t = 1$, $C_{\text{Reconstruction}}(\theta_1)$ in (8.37) can be replaced with

$$C_{\text{Reparameterized}}(\theta_1) = \frac{\beta_0}{2\sigma_1^2\,(1-\beta_0)}\, \big\| \epsilon - \hat{\mu}_{\theta_1}(z_1) \big\|^2.$$

As a consequence of these neural network replacements and the fact that $\gamma_1 = (1-\beta_0)$, the loss component for each $t$ must be of the form (8.41) and hence the final loss function is given by (8.40).

In production, once the diffusion model is trained, to generate a realistic looking sample $x$, we start at $t = T$ with a sample $z_T$ from a multivariate standard normal. Iteratively, as we decrement $t$, for each step we generate a different standard multivariate normal $\epsilon$ and compute $\hat{\mu}_{\theta_t}(z_t)$. By using

$$z_{t-1} = \underbrace{\frac{1}{\sqrt{1-\beta_{t-1}}}\, z_t - \frac{\beta_{t-1}}{\sqrt{(1-\gamma_t)(1-\beta_{t-1})}}\, \hat{\mu}_{\theta_t}(z_t)}_{\hat{z}_{t-1}} + \sigma_t\, \epsilon, \tag{8.44}$$

we obtain the (denoised) input to the next iteration, where $\hat{z}_{t-1}$ is given by (8.43). We repeat until $t = 1$ yields a realistic looking generated data sample. Observe that randomness of this sample is due to the initial random $z_T$ as well as to the $\epsilon$ that was generated in each iteration.

Notice from (8.44) that $z_{t-1}$ follows a multivariate normal distribution with the conditional (on $z_t$) mean vector being $\hat{z}_{t-1}$ and the covariance matrix being $\sigma_t^2 I$. Since with the reparameterization trick, $\hat{z}_{t-1}$ is a replacement of $\mu_{\theta_t}(z_t)$, in production, the distribution of $z_{t-1}$ is approximately equal to $p_{\theta_t}(\cdot \mid z_t)$ which is what we aim to achieve.
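The production loop can be sketched as follows, following (8.43) and (8.44) literally, including the added noise at every step. As in the earlier illustrations, `noise_model` is a hypothetical stand-in for the trained noise-prediction networks, and the name `generate` and the array layout are assumptions rather than anything prescribed by the text.

```python
import numpy as np

def generate(noise_model, betas, gammas, sigmas, dim, rng):
    """Iterate (8.44) for t = T, ..., 1, starting from z_T ~ N(0, I).

    betas[k] = beta_k, gammas[k] = gamma_k (gammas[0] = 1), sigmas[k] = sigma_k (sigmas[0] unused).
    """
    T = len(betas)
    z = rng.standard_normal(dim)                     # z_T
    for t in range(T, 0, -1):
        b = betas[t - 1]
        scale = b / np.sqrt((1 - gammas[t]) * (1 - b))
        z_hat = z / np.sqrt(1 - b) - scale * noise_model(z, t)   # the mean (8.43)
        z = z_hat + sigmas[t] * rng.standard_normal(dim)         # the draw (8.44)
    return z                                         # z_0, the generated sample

betas = np.array([0.05, 0.1, 0.15, 0.2])                     # hypothetical schedule
gammas = np.concatenate(([1.0], np.cumprod(1.0 - betas)))
sigmas = np.array([np.nan, 0.3, 0.3, 0.3, 0.3])              # index 0 is unused padding
untrained = lambda z, t: np.zeros_like(z)                    # placeholder for a trained model
sample = generate(untrained, betas, gammas, sigmas, dim=3, rng=np.random.default_rng(0))
```

With the placeholder model the output is of course not a meaningful data sample; the sketch only shows the control flow and the roles of the hyper-parameters.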

8.3 Generative Adversarial Networks

Generative adversarial networks, or GANs, are generative deep learning architectures that have made great impact on the field of synthetic data generation. For example in Figure 8.5 we can see synthetic images, all created via various GAN architectures, with several paradigms of application. In this section our aim is to highlight the key ideas of generative adversarial networks, focusing on mathematical principles. The basic setup of using such a generative model to approximately create samples from the distribution $p(x \mid z)$ was illustrated in Figure 8.1. We now present details of generative adversarial network architectures.

Figure 8.5: Images generated via various GAN architectures with various types of paradigms. (a) Faces with 1024 × 1024 resolution. (b) Picture to picture generation. (c) A text to image application.

The GAN Approach to Generative Modelling

The general idea of GAN modelling is to train a neural network, called the generator and denoted here as $f_\theta(\cdot)$. This model applied to a noise vector $z^*$, generates a data point $x^*$ that ideally appears similar to other data points $x \in \mathcal{D}$. Training is motivated by an adversarial game like approach, where we simultaneously train both the generator $f_\theta(\cdot)$ and another network called the discriminator, and denoted here as $f_\phi(\cdot)$. Hence training implies learning the generator parameters $\theta$ while in parallel also learning the discriminator’s parameters $\phi$.

The generator’s purpose is to create “fake” data samples and the discriminator’s purpose is to try and distinguish between fake and real samples. As such, the discriminator is a binary classifier, which when presented with some data point $\tilde{x}$, would ideally be able to output a probability value near 1 if $\tilde{x} \in \mathcal{D}$ (real) and a probability value near 0 if $\tilde{x} = x^*$ (fake).

While in a normal classification setting, having a well performing discriminator $f_\phi(\cdot)$ is desirable, in the GAN setting, the discriminator is actually used to help train the generator $f_\theta(\cdot)$. Once training is complete, we would generally have that $f_\phi(\tilde{x})$ is on average near 1/2 for both real $\tilde{x} \in \mathcal{D}$ and fake $\tilde{x}$

,with

„

(

) . As such, during training of a

GAN, we can expect to see a sequence of network parameters,

$$\underbrace{(\phi^{(1)}, \theta^{(1)})}_{\text{iteration 1}},\ \underbrace{(\phi^{(2)}, \theta^{(2)})}_{\text{iteration 2}},\ \ldots, \tag{8.45}$$

such that as training progresses to high iterations $t$, $\theta^{(t)}$ are the parameters of a generator that creates “sensible fake samples”, while the discriminator parameters $\phi^{(t)}$ are no longer able to discriminate between “fake” and “real”, and then yield performance near 1/2.

Empirically it has been experienced that one needs to train both the generator and discriminator together, in order to achieve the desired behaviour. The alternative of starting with an already trained discriminator, and then training a generator is not feasible, first because such a trained discriminator is not available without the generator trained, and secondly because the joint training provides a loss landscape for the generator parameters $\theta$ that is suitable for gradient based learning. In this sense, training a GAN follows a dynamic equilibrium approach which can often be posed as a mathematical game between two players. In fact, as we present below, at the equilibrium point of this game, one can properly use minimax game analysis to analyze some properties of the optimization.

Footnote to Figure 8.5: Image (a) is thanks to Karras et al. [222]. Image (b) is thanks to Isola et al. [204]. Image (c) is thanks to Kang et al. [219].

Figure 8.6: The GAN architecture for learning the generator $f_\theta(\cdot)$. A random $z^*$ passes through the generator to produce $x^*$. The generator is trained simultaneously with the discriminator $f_\phi(\cdot)$ whose role is to determine if its input is fake, $x^*$, or real, $x \in \mathcal{D}$.

Figure 8.6 illustrates the general GAN architecture where in training we use both the generator and discriminator networks, while in production only the generator network is used for creating samples from $p(x \mid z)$. Note that the latent space size of $z^*$, denoted $m$, is typically in the order of several dozen to several hundred dimensions, and $z^*$ is typically taken to follow a multivariate standard normal distribution as in the previous generative models of this chapter.

Training a GAN

The key to training a GAN with steps of a sequence such as (8.45) is to alternate between learning $\phi$ for the discriminator, and learning $\theta$ for the generator. For the discriminator, training involves improving the binary classifier $f_\phi(\cdot)$ and for the generator, training is adversarial since it involves seeking parameters $\theta$ that yield the “worst possible” generator $f_\theta(\cdot)$ for a given discriminator $f_\phi(\cdot)$. We cast these alternative goals as a minimax objective, a notion that we now describe.

First note that as in Section 8.1 we use $\mathcal{D}$ to denote the set of data points and assume that there are $n$ such data points. In addition we assume some set of random latent variable samples which we denote as $\mathcal{Z}^*$ with each $z^* \in \mathcal{Z}^*$ being an $m$-dimensional random vector. In practice, one does not need to randomly sample all of $\mathcal{Z}^*$ in one go, but can rather resample from a standard multivariate normal distribution every time we seek an arbitrary element of $\mathcal{Z}^*$. Yet for simplicity of the exposition and analysis we assume that the number of elements in $\mathcal{Z}^*$ is the same as in $\mathcal{D}$, i.e., $n$.

With this notation, based on $\mathcal{D}$ and $\mathcal{Z}^*$, for fixed generator parameters $\theta$, the objective for the discriminator parameters $\phi$ can be taken as the classic binary cross entropy objective; recall (3.11) of Chapter 3. Specifically, from the discriminator’s perspective we want $x \in \mathcal{D}$ to be considered as a positive example and we want $f_\theta(z^*)$ for $z^* \in \mathcal{Z}^*$ to be considered as a negative example. That is, the discriminator applied to $x$ should be as close to 1 as possible, and the discriminator applied to $f_\theta(z^*)$ should be as close to 0 as possible. Now given the generator’s parameters $\theta$, together with $\mathcal{D}$ and $\mathcal{Z}^*$, a binary cross entropy loss function (scaled by a factor of 2) can be formulated as,

$$C_\theta(\phi\,;\, \mathcal{D}, \mathcal{Z}^*) = -\frac{1}{n} \sum_{x \in \mathcal{D}} \log f_\phi(x) - \frac{1}{n} \sum_{z^* \in \mathcal{Z}^*} \log\Big(1 - f_\phi\big(f_\theta(z^*)\big)\Big). \tag{8.46}$$

Hence the objective of learning the discriminator parameters is,

$$\min_{\phi}\ C_\theta(\phi\,;\, \mathcal{D}, \mathcal{Z}^*). \tag{8.47}$$

The generator’s learning process is adversarial as the generator wishes for $C_\theta(\phi\,;\, \mathcal{D}, \mathcal{Z}^*)$ to be as high as possible. Specifically, from the generator’s point of view, we seek to find parameters $\theta$ that are the worst possible parameters (maximization of $C_\theta$) for the optimal choice of the discriminator. That is, we seek to solve

$$\max_{\theta}\ \min_{\phi}\ C_\theta(\phi\,;\, \mathcal{D}, \mathcal{Z}^*). \tag{8.48}$$

The joint objective (8.48) is generally called a minimax objective as it captures the competing goals of both players in the game.

To implement

(8.48)

we use an iterative algorithm that goes through iterates as in

(8.45)

and in each iteration alternates between

„

and

◊

, with multiple steps within the iteration for

each. Spec iﬁcally, at iteration

we ﬁrst carry out mini-batch based gradient descent steps

on the loss function

(8.46)

where

◊

(t≠1)

is ﬁxed and we improve

„

, starting with

„

(t≠1)

in the iteration. The result of these mini-batch gradient descent steps for iteration

„

(t)

In the second part of iteration

, we se ek to maximize

(8.46)

over

◊

while keeping

„

(t)

ﬁxed.

Now since the ﬁrst term in

(8.46)

does not depend on

◊

, this maximization is equivalent to

gradient descent steps (minimization) for,

min

◊

œZ

log

1 ≠ f

„

(t)

◊

)

. (8.49)

Such iterative gradient based learning ideally approximates a solution of (8.48). In practice one often tunes the number of mini-batch discriminator steps and the number of samples for the generator steps, carried out per iteration $t$. One problem that sometimes arises during training is called mode collapse. This is a situation where the generator is stuck at a point in the parameter space of $\theta$ where it generates samples that do not cover the breadth of the data $\mathcal{D}$. In general, when carrying out diagnostics of GAN training, we expect the performance of the discriminator to converge to about 1/2.
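The alternating procedure above can be sketched on a toy one-dimensional problem. This is only an illustration under strong assumptions: the logistic discriminator `sigmoid(a*x + b)`, the affine generator `c*z + d`, the Gaussian toy data with mean 3, and all step counts and learning rates are hypothetical choices made here, and gradients are written by hand instead of using automatic differentiation.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def discriminator(phi, x):
    """Hypothetical logistic discriminator f_phi(x) = sigmoid(a*x + b)."""
    a, b = phi
    return sigmoid(a * x + b)


def generator(theta, z):
    """Hypothetical affine generator f_theta(z) = c*z + d."""
    c, d = theta
    return c * z + d


def discriminator_step(phi, theta, xs, zs, lr):
    """One gradient descent step over phi on a mini-batch version of (8.46)."""
    a, b = phi
    grad_a = grad_b = 0.0
    for x in xs:
        p = discriminator(phi, x)
        grad_a += -(1.0 - p) * x   # from -log f_phi(x)
        grad_b += -(1.0 - p)
    for z in zs:
        g = generator(theta, z)
        p = discriminator(phi, g)
        grad_a += p * g            # from -log(1 - f_phi(f_theta(z)))
        grad_b += p
    n = len(xs)
    return (a - lr * grad_a / n, b - lr * grad_b / n)


def generator_step(phi, theta, zs, lr):
    """One gradient descent step over theta on a mini-batch version of (8.49)."""
    a, _ = phi
    c, d = theta
    grad_c = grad_d = 0.0
    for z in zs:
        p = discriminator(phi, generator(theta, z))
        grad_c += -p * a * z       # d/dc of log(1 - f_phi(c*z + d))
        grad_d += -p * a           # d/dd of log(1 - f_phi(c*z + d))
    n = len(zs)
    return (c - lr * grad_c / n, d - lr * grad_d / n)


def train(iterations=100, disc_steps=3, gen_steps=1, batch=32, lr=0.05, seed=0):
    rng = random.Random(seed)
    phi, theta = (0.0, 0.0), (1.0, 0.0)
    for _ in range(iterations):
        for _ in range(disc_steps):   # improve phi with theta fixed
            xs = [rng.gauss(3.0, 1.0) for _ in range(batch)]  # toy "data"
            zs = [rng.gauss(0.0, 1.0) for _ in range(batch)]
            phi = discriminator_step(phi, theta, xs, zs, lr)
        for _ in range(gen_steps):    # improve theta with phi fixed
            zs = [rng.gauss(0.0, 1.0) for _ in range(batch)]
            theta = generator_step(phi, theta, zs, lr)
    return phi, theta


phi, theta = train()
```

The arguments `disc_steps` and `gen_steps` correspond to the per-iteration step counts that are tuned in practice.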


Minimization of the Jensen-Shannon Divergence

The training procedure outlined above can be cast in a theoretical setting. Motivated by

the discriminator loss

(8.46)

we can describe a theoretical construct which abstracts what

a GAN training procedure actually optimizes. Speciﬁcally, we now shift to probabilistic

thinking, similar to the notions of Section 8.1 in the context of variational autoencoders and

Section 8.2 in the context of diﬀusion models.

For this let us define three probability distributions used in the analysis. First, the distribution of the data $\mathcal{D}$ is assumed to be captured by $p_{\mathcal{D}}(\cdot)$. Let us assume this is a probability distribution covering all of $\mathbb{R}^p$. Then the distribution of each of the latent variables $z \in \mathcal{Z}$ is assumed to be captured by $p_{Z^*}(\cdot)$. Let us assume that this is a probability distribution over the latent space, typically multivariate standard Gaussian, and hence covering the whole latent space. Finally, the distribution of the output of the generator is captured by $p_{G_\theta}(\cdot)$. This last distribution can in theory be obtained by applying the generator $f_{\theta}(\cdot)$ as a transformation on the random variables in $\mathcal{Z}$. Like the distribution of the data, this is a distribution over $\mathbb{R}^p$ and we assume it also covers the whole space.

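Ideally we would like a GAN to learn the generator parameters $\theta$ such that $p_{G_\theta} = p_{\mathcal{D}}$, i.e., that the distribution of the generator is the same as the distribution of the data. To capture such a goal, it turns out that in the context of GANs, a good measure of the distance between the two distributions is the Jensen-Shannon divergence, defined in (B.7) of Appendix B. Specifically, let us denote,

$$
J_{\theta} = \mathrm{JSD}(p_{\mathcal{D}} \,\|\, p_{G_\theta}) = \frac{1}{2} D_{\mathrm{KL}}(p_{\mathcal{D}} \,\|\, p_M) + \frac{1}{2} D_{\mathrm{KL}}(p_{G_\theta} \,\|\, p_M),
\quad \text{where} \quad
p_M(u) = \frac{p_{\mathcal{D}}(u) + p_{G_\theta}(u)}{2},
$$

and $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ is the KL-divergence. Hence for generator parameters $\theta$, the divergence value $J_{\theta}$ captures how far off our generator is from the actual data. In the ideal situation of $p_{G_\theta} = p_{\mathcal{D}}$ we have that $J_{\theta} = 0$, otherwise it is greater than 0. Note that upon representing the KL-divergences as integrals and manipulating the 2 constant we have,

$$
J_{\theta} = \frac{1}{2} \int \left( p_{\mathcal{D}}(u) \log \frac{p_{\mathcal{D}}(u)}{p_{\mathcal{D}}(u) + p_{G_\theta}(u)} + p_{G_\theta}(u) \log \frac{p_{G_\theta}(u)}{p_{\mathcal{D}}(u) + p_{G_\theta}(u)} \right) du + \log 2. \tag{8.50}
$$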

Consider now the discriminator loss (8.46) and assume that $n \to \infty$. In this case, due to laws of large numbers, the loss can be rephrased in terms of expectations and expressed as,

$$
C(\phi, \theta \,;\, p_{\mathcal{D}}) = -\mathbb{E}\big[\log f_{\phi}(X)\big] - \mathbb{E}\Big[\log\Big(1 - f_{\phi}\big(f_{\theta}(Z)\big)\Big)\Big], \tag{8.51}
$$

where the first expectation is of the random variable $X$ distributed according to $p_{\mathcal{D}}(\cdot)$ and the second expectation is with respect to the latent random variable $Z$ distributed according to $p_{Z^*}(\cdot)$. This expected loss can further be manipulated as,

$$
\begin{aligned}
C(\phi, \theta \,;\, p_{\mathcal{D}})
&= -\mathbb{E}\big[\log f_{\phi}(X)\big] - \mathbb{E}\big[\log\big(1 - f_{\phi}(W)\big)\big] \\
&= -\int p_{\mathcal{D}}(u) \log f_{\phi}(u) \, du - \int p_{G_\theta}(u) \log\big(1 - f_{\phi}(u)\big) \, du \\
&= -\int \Big( p_{\mathcal{D}}(u) \log f_{\phi}(u) + p_{G_\theta}(u) \log\big(1 - f_{\phi}(u)\big) \Big) \, du,
\end{aligned}
$$

where in the first equation we set $W = f_{\theta}(Z)$ which is a random variable distributed according to $p_{G_\theta}(\cdot)$. Hence the second expectation in the first equation is with respect to the distribution $p_{G_\theta}(\cdot)$. Moving from the first equation to the second step we simply represent the expectations as integrals and then since both integrations are on the same domain we can move from the second step to the third step.

Now let us identify a pattern inside the final integral of the form $a \log(y) + b \log(1 - y)$, where $a = p_{\mathcal{D}}(u)$, $b = p_{G_\theta}(u)$, and $y = f_{\phi}(u)$. It is easy to check that for $a, b > 0$, the function $g(y) = a \log(y) + b \log(1 - y)$ is maximized at $y = a/(a + b)$. Hence, for any $y \in (0, 1)$,

$$
a \log(y) + b \log(1 - y) \le a \log\Big(\frac{a}{a + b}\Big) + b \log\Big(\frac{b}{a + b}\Big) \tag{8.52}
$$

with the inequality being an equality at $y = a/(a + b)$.
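As a quick numerical sanity check of (8.52), the following sketch evaluates $g(y)$ on a grid and compares the best grid point with $a/(a+b)$. The values of `a` and `b` and the grid resolution are arbitrary choices for illustration.

```python
import math


def g(y, a, b):
    """The function g(y) = a*log(y) + b*log(1 - y) for y in (0, 1)."""
    return a * math.log(y) + b * math.log(1.0 - y)


a, b = 0.3, 0.9                                 # arbitrary positive constants
y_star = a / (a + b)                            # claimed maximizer
grid = [k / 1000.0 for k in range(1, 1000)]     # points strictly inside (0, 1)
y_best = max(grid, key=lambda y: g(y, a, b))    # best point found on the grid

print(y_star, y_best)
```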

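We can now use (8.52) to bound the integrand in the final expression for $C(\phi, \theta \,;\, p_{\mathcal{D}})$ as follows,

$$
-\int \left( p_{\mathcal{D}}(u) \log \frac{p_{\mathcal{D}}(u)}{p_{\mathcal{D}}(u) + p_{G_\theta}(u)} + p_{G_\theta}(u) \log \frac{p_{G_\theta}(u)}{p_{\mathcal{D}}(u) + p_{G_\theta}(u)} \right) du \le C(\phi, \theta \,;\, p_{\mathcal{D}}).
$$

Or, using the expression for $J_{\theta}$ in (8.50) we have

$$
2 \log 2 - 2 J_{\theta} \le C(\phi, \theta \,;\, p_{\mathcal{D}}), \tag{8.53}
$$

where at the best theoretical generator parameters, $\theta$, we have that $J_{\theta} = 0$ and, for the matching optimal discriminator $f_{\phi}(u) = 1/2$, the inequality is an equality with $C = 2 \log 2 \approx 1.386$.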
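The bound (8.53) can be checked on a discrete analogue, where integrals are replaced by sums over three outcomes. The probability vectors below are made-up values and the helper names are hypothetical. With the pointwise optimal discriminator values $a/(a+b)$ the loss should equal $2\log 2 - 2J_\theta$, and any other discriminator should give a loss at least as large.

```python
import math


def kl(p, q):
    """Discrete KL-divergence, with the convention 0*log(0/q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Discrete Jensen-Shannon divergence via the mixture (p + q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def expected_loss(p_data, p_gen, f):
    """Discrete analogue of the final expression for the expected loss."""
    return -sum(pd * math.log(fu) + pg * math.log(1 - fu)
                for pd, pg, fu in zip(p_data, p_gen, f))


p_data = [0.5, 0.3, 0.2]   # made-up data distribution
p_gen = [0.2, 0.3, 0.5]    # made-up generator distribution

f_opt = [pd / (pd + pg) for pd, pg in zip(p_data, p_gen)]  # y = a/(a+b) pointwise
lower_bound = 2 * math.log(2) - 2 * jsd(p_data, p_gen)
loss_opt = expected_loss(p_data, p_gen, f_opt)
loss_half = expected_loss(p_data, p_gen, [0.5, 0.5, 0.5])  # a non-optimal discriminator

print(lower_bound, loss_opt, loss_half)
```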

Let us note some parallels between this Jensen-Shannon based analysis for GANs and the ELBO analysis of Section 8.1 for variational autoencoders. In the variational autoencoder case our inherent implicit goal is maximum likelihood estimation of $\theta$ and since we cannot achieve such estimation in a computationally efficient manner, we resort to (approximate) optimization of the lower bound, ELBO. Nevertheless, a theoretical inequality such as (8.18) and the additional optimization over the encoder's $\phi$ in (8.20) ensure that once we have optimized ELBO both in terms of $\theta$ and $\phi$, then the desired maximum likelihood estimation is also achieved.

In our GAN case, the inequality (8.53) helps justify a similar idea. If we apply a minimax style optimization on it, as in (8.48), then we obtain,

$$
2 \log 2 = \max_{\theta} \, \min_{\phi} \; C(\phi, \theta \,;\, p_{\mathcal{D}}),
$$

which is the best possible value.

Variations to the Objective Function

While the ideas of GANs presented up to now are useful in their own right, there is much room for architectural improvements. On the one hand one may consider different structures for $f_{\phi}(\cdot)$ and $f_{\theta}(\cdot)$, for example setting the discriminator and generator neural networks to be convolutional networks. On the other hand, one may revise the objective function (8.46) as well as its expected value formulation (8.51), to yield better training performance. We now focus on this approach, where variations of the objective function are considered.


Figure 8.7: A schematic of a potential scenario during early phases of GAN training. The data distribution $p_{\mathcal{D}}(\cdot)$ and the generator distribution $p_{G_\theta}(\cdot)$ are far apart while the discriminator $f_{\phi^{(t)}}(\cdot)$ is already capable of separating the two distributions well. This reduces the “signal” for generator training.

In general, problems may occur in initial training iterations when generator parameter values, $\theta^{(t)}$, are nearly arbitrary. In such a case there is often extreme separation between $p_{\mathcal{D}}(\cdot)$ and $p_{G_\theta}(\cdot)$, because the latter distribution is nearly arbitrary. Such a difficult situation requires a “signal” from the discriminator that will force learning to improve subsequent $\theta^{(t)}$ values. However, an objective such as (8.46) does not always cater for such learning. In particular, see Figure 8.7, where for simplicity we assume the data is one dimensional and we plot data points together with $p_{\mathcal{D}}(\cdot)$ as well as the generator distribution $p_{G_\theta}(\cdot)$ associated with some arbitrary $\theta^{(1)}$. In such a case, after only a few iterations, the discriminator parameters $\phi^{(t)}$ may quickly be trained to separate between the two distributions. The problem with this is that at that point, when considering gradients for minimization of the generator, as in (8.49), there will often be a “lack of signal” (or nearly zero gradients) for learning the generator parameters $\theta^{(t)}$. This situation is quite common in practice, and often prohibits effective learning of GANs.

As a consequence of such scenarios, multiple alternative objective formulations have been proposed, some of which yield much better training performance. We now outline a few of these ideas. We first describe a simple modification called non-saturating GAN (NS-GAN). We then describe a deeper idea using Wasserstein distances which yields a framework called Wasserstein GAN (W-GAN), and finally we discuss improvements to W-GAN, which involve regularization concepts.

The first idea of non-saturating GAN (NS-GAN) is simply to modify the generator objective (8.49) to use $-\log(u)$ instead of $\log(1 - u)$. Namely, the NS-GAN generator objective is,

$$
\min_{\theta} \; -\sum_{z \in \mathcal{Z}} \log f_{\phi^{(t)}}\big(f_{\theta}(z)\big), \tag{8.54}
$$

and NS-GAN retains the same discriminator objective (8.46). The motivation for such a modification is simply the fact that at $u \approx 0$, the derivative of $\log(1 - u)$ (as in the original formulation) is approximately $-1$, while the derivative of $-\log(u)$ (as in NS-GAN) is nearly negative infinite. This may yield an improvement for initial phases of training and is particularly useful for cases when $p_{G_\theta}(\cdot)$ is highly separated from $p_{\mathcal{D}}(\cdot)$ and the discriminator parameters already yield good separation as in Figure 8.7. In this case, $f_{\phi^{(t)}}\big(f_{\theta^{(t)}}(z)\big) \approx 0$ and thus the NS-GAN generator objective (8.54) yields much higher magnitude gradients than the original generator objective (8.49), due to the nearly negative infinite derivative.
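The difference in gradient magnitude can be seen directly by evaluating the two derivatives at a few values of $u$, where $u$ stands for the discriminator output on a generated sample. The function names below are hypothetical.

```python
def grad_original(u):
    """Derivative of log(1 - u) with respect to u (original generator objective)."""
    return -1.0 / (1.0 - u)


def grad_non_saturating(u):
    """Derivative of -log(u) with respect to u (NS-GAN generator objective)."""
    return -1.0 / u


for u in (0.5, 1e-1, 1e-2, 1e-4):
    print(u, grad_original(u), grad_non_saturating(u))
```

As $u$ shrinks toward 0 the first derivative stays near $-1$ while the second grows without bound in magnitude, and at $u = 1/2$ the two coincide.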

For both the original GAN and NS-GAN, as training progresses and the expected value of $f_{\phi^{(t)}}\big(f_{\theta}(z)\big)$ approaches 1/2, the objectives (8.54) and (8.49) behave similarly. Hence in principle, the NS-GAN modification should not hamper training. However, while the simple idea of NS-GAN may appear like a good improvement, in practice it has been shown to yield unstable gradient updates. We chose to present NS-GAN here merely as a simple idea where modification of the objective can potentially yield training performance improvement.

A more fundamental way to improve GANs is based on the Wasserstein distance. This is a concept used to determine the “distance” between two probability distributions, with a goal similar to that of the JS-divergence, yet with a completely different approach that yields different properties of the distance metric.

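Formally when presented with two probability distributions, in our case, $p_{\mathcal{D}}(\cdot)$ and $p_{G_\theta}(\cdot)$, the Wasserstein distance, $W(\cdot, \cdot)$, may be represented as

$$
W(p_{\mathcal{D}}, p_{G_\theta}) = \inf_{\pi \in \Pi(p_{\mathcal{D}}, p_{G_\theta})} \mathbb{E}_{(x, \tilde{x}) \sim p_{\pi}} \, \| x - \tilde{x} \|. \tag{8.55}
$$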

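Let us unpack (8.55) by assuming that our data and generator are each in $\mathbb{R}^p$. Here $\Pi(p_{\mathcal{D}}, p_{G_\theta})$ is the space of all probability distributions over $\mathbb{R}^{2p}$ where the first $p$ coordinates are for the data and the following $p$ coordinates are for the generator. This space of distributions is defined such that each element distribution $\pi \in \Pi(p_{\mathcal{D}}, p_{G_\theta})$ has a marginal distribution of the first $p$ coordinates of $\pi$ that agrees with $p_{\mathcal{D}}$, and likewise the marginal distribution of the remaining $p$ coordinates of $\pi$ agrees with $p_{G_\theta}$. That is, for every distribution $\pi \in \Pi(p_{\mathcal{D}}, p_{G_\theta})$,

$$
\begin{aligned}
p_{\mathcal{D}}(x_1, \ldots, x_p) &= \int \cdots \int p_{\pi}(x_1, \ldots, x_p, u_1, \ldots, u_p) \, du_1 \cdots du_p, \\
p_{G_\theta}(\tilde{x}_1, \ldots, \tilde{x}_p) &= \int \cdots \int p_{\pi}(u_1, \ldots, u_p, \tilde{x}_1, \ldots, \tilde{x}_p) \, du_1 \cdots du_p.
\end{aligned}
\tag{8.56}
$$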

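The infimum in (8.55) acts “like a minimum” and formally seeks the greatest lower bound. This minimization is over the expectation of the Euclidean norm $\| x - \tilde{x} \|$ with $x$ and $\tilde{x}$ each elements of $\mathbb{R}^p$. Thus the pair $(x, \tilde{x}) \in \mathbb{R}^{2p}$ is distributed according to $\pi$ and hence $x$ is (marginally) distributed according to $p_{\mathcal{D}}$ and $\tilde{x}$ is (marginally) distributed according to $p_{G_\theta}$. For clarity, note that the expectation in (8.55) can be represented as,

$$
\mathbb{E} \| x - \tilde{x} \| = \int \cdots \int \sqrt{\sum_{i=1}^{p} (x_i - \tilde{x}_i)^2} \; p_{\pi}(x_1, \ldots, x_p, \tilde{x}_1, \ldots, \tilde{x}_p) \, dx_1 \cdots dx_p \, d\tilde{x}_1 \cdots d\tilde{x}_p. \tag{8.57}
$$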

As an extreme example, consider the case where $p_{\mathcal{D}} = p_{G_\theta}$. Then one particular $\pi \in \Pi(p_{\mathcal{D}}, p_{G_\theta})$ is a distribution that only has support on elements $(u_1, \ldots, u_p, u_{p+1}, \ldots, u_{2p})$ where $(u_1, \ldots, u_p) = (u_{p+1}, \ldots, u_{2p})$, and the density over each such element is $p_{\mathcal{D}}(u_1, \ldots, u_p)$ which is the same as $p_{G_\theta}(u_{p+1}, \ldots, u_{2p})$. In this case, using $\pi$ in (8.57) we get 0 because probability mass (or density) is only concentrated on points $(x, \tilde{x}) \in \mathbb{R}^{2p}$ where $x = \tilde{x}$. Hence this $\pi$ minimizes (8.57), and hence the Wasserstein distance is 0. However, when $p_{\mathcal{D}} \neq p_{G_\theta}$, one can show that the Wasserstein distance is strictly positive.

In general, the joint distribution $\pi$ captures an “earth moving” plan which describes how to shift mass from one distribution $p_{\mathcal{D}}$ to the other $p_{G_\theta}$. To get a feel for this interpretation, let us grossly simplify the situation and assume that $p_{\mathcal{D}}$ and $p_{G_\theta}$ are discrete one dimensional ($p = 1$) distributions on a finite domain with values, such as 1, 2, and 3. In this case, the joint distribution $\pi$ is on the $3 \times 3$ grid, capturing the probability mass over the 9 possible joint locations. In this simplistic case, the marginal distribution assumptions (8.56) can be written as,

$$
\begin{aligned}
p_{\mathcal{D}}(x) &= \sum_{\tilde{x}=1}^{3} p_{\pi}(x, \tilde{x}) \quad \text{for } x = 1, 2, 3, \\
p_{G_\theta}(\tilde{x}) &= \sum_{x=1}^{3} p_{\pi}(x, \tilde{x}) \quad \text{for } \tilde{x} = 1, 2, 3,
\end{aligned}
\tag{8.58}
$$

where $p_{\pi}(\cdot, \cdot)$, $p_{\mathcal{D}}(\cdot)$, and $p_{G_\theta}(\cdot)$ are probability masses.

Now we may interpret $p_{\pi}(x, \tilde{x})$ as a “plan” of how much mass (or earth) to shift from location $x$ to location $\tilde{x}$. For this, the marginal distribution assumptions (8.58) serve as constraints where the first constraints imply that mass is shifted properly out of $p_{\mathcal{D}}$ and the second constraints imply that all resulting mass ends up creating $p_{G_\theta}$. The Wasserstein distance is then the expectation of $\| x - \tilde{x} \|$ for the optimal plan, or optimal joint distribution. Note that in such a discrete finite case, one may actually formulate and solve this optimization problem using linear programming. While such an approach is not important in deep learning practice, ideas of duality theory from linear programming play a role as we describe below.
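To make the “plan” interpretation concrete, the sketch below compares two feasible plans on the locations 1, 2, 3. The probability masses are made up for illustration. No linear program is solved here. Instead, the optimal value is computed with the closed form available for one dimensional distributions, namely the accumulated absolute difference between the two cumulative distribution functions.

```python
locations = [1, 2, 3]
p_data = [0.4, 0.2, 0.4]   # made-up masses for the data distribution
p_gen = [0.2, 0.6, 0.2]    # made-up masses for the generator distribution


def marginals(plan):
    """Row sums and column sums of a plan, to be compared with (8.58)."""
    rows = [sum(row) for row in plan]
    cols = [sum(plan[i][j] for i in range(len(plan))) for j in range(len(plan[0]))]
    return rows, cols


def plan_cost(plan):
    """Expected distance |x - x~| under the joint distribution given by the plan."""
    return sum(plan[i][j] * abs(locations[i] - locations[j])
               for i in range(len(locations)) for j in range(len(locations)))


def wasserstein_1d(p, q):
    """Closed form in one dimension: sum of |CDF_p - CDF_q| times the grid gaps."""
    total, cdf_gap = 0.0, 0.0
    for k in range(len(locations) - 1):
        cdf_gap += p[k] - q[k]
        total += abs(cdf_gap) * (locations[k + 1] - locations[k])
    return total


# A plan that moves 0.2 from location 1 to 2 and 0.2 from location 3 to 2.
plan_direct = [[0.2, 0.2, 0.0],
               [0.0, 0.2, 0.0],
               [0.0, 0.2, 0.2]]

# The independent coupling: feasible, but it moves mass wastefully.
plan_independent = [[pd * pg for pg in p_gen] for pd in p_data]

print(plan_cost(plan_direct), plan_cost(plan_independent), wasserstein_1d(p_data, p_gen))
```

Both plans satisfy the marginal constraints, yet only the first attains the minimal cost.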

One important attribute of the Wasserstein distance is that it is sensitive to differences in the “location” of the distributions and captures such differences in a much better way than the JS-divergence (based on the KL-divergence). As an illustration let us focus on a case with $p = 1$ where we can easily plot $p_{\mathcal{D}}$ and $p_{G_\theta}$ as marginals. Consider Figure 8.8 where we plot a red pair $(p_{\mathcal{D}}, p_{G_\theta})$ with the distributions concentrated at quite far away locations, and a green pair $(p_{\mathcal{D}}, p_{G_\theta})$ with the distributions closer together. We cannot exactly see the Wasserstein distance because it requires minimization over all possible $\Pi(p_{\mathcal{D}}, p_{G_\theta})$ distributions. However, in this example any possible $\pi \in \Pi(p_{\mathcal{D}}, p_{G_\theta})$ is concentrated around the respective red or green cloud. Hence we roughly see the Wasserstein distance as marked in the figure, representing the difference between the center of the cloud and the diagonal $x = \tilde{x}$ line. Importantly, the distance for the red pair is much higher than that of the green pair. Now in contrast, if we were to consider the JS-divergence, then in both the red and the green pair cases the two distributions hardly overlap, so the divergence would be very close to its maximal value of $\log 2$ and would barely distinguish between the two cases.
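The contrast between the two measures can be reproduced with discrete distributions on a grid. In the sketch below (grid size, masses, and helper names are illustrative assumptions) the generator distribution is placed either near or far from the data distribution, with no overlap in either case. For non-overlapping supports the Jensen-Shannon divergence equals $\log 2$ regardless of the separation, while the one dimensional Wasserstein distance grows with it.

```python
import math


def jsd(p, q):
    """Discrete Jensen-Shannon divergence via the mixture (p + q)/2."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(p, q):
    """One dimensional closed form on a unit-spaced grid: sum of |CDF_p - CDF_q|."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


def two_point_mass(start, size=20):
    """Mass 1/2 on grid points start and start + 1, zero elsewhere."""
    masses = [0.0] * size
    masses[start] = masses[start + 1] = 0.5
    return masses


p_data = two_point_mass(0)
p_near = two_point_mass(3)    # generator close to the data
p_far = two_point_mass(15)    # generator far from the data

print(jsd(p_data, p_near), jsd(p_data, p_far))                        # both log 2
print(wasserstein_1d(p_data, p_near), wasserstein_1d(p_data, p_far))  # 3 versus 15
```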

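An important result for the Wasserstein distance, called the Kantorovich-Rubinstein duality theorem, provides us with a dual representation of the minimization problem in (8.55). This result implies the following representation of $W(p_{\mathcal{D}}, p_{G_\theta})$, for any $K > 0$,

$$
W(p_{\mathcal{D}}, p_{G_\theta}) = \frac{1}{K} \sup_{\| h \|_L \le K} \Big( \mathbb{E}_{x \sim p_{\mathcal{D}}} \big[ h(x) \big] - \mathbb{E}_{\tilde{x} \sim p_{G_\theta}} \big[ h(\tilde{x}) \big] \Big). \tag{8.59}
$$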

Let us unpack (8.59). The supremum in (8.59) acts “like a maximum” and formally seeks the lowest upper bound. Continuing to assume that $p_{\mathcal{D}}$ and $p_{G_\theta}$ are distributions over $\mathbb{R}^p$, this maximization searches over functions $h : \mathbb{R}^p \to \mathbb{R}$ that also satisfy the $K$-Lipschitz property, denoted by $\| h \|_L \le K$. A function $h(\cdot)$ is $K$-Lipschitz if

$$
| h(u) - h(v) | \le K \| u - v \|, \quad \text{for all } u, v \in \mathbb{R}^p.
$$

This implies that the function does not ascend or descend in any direction with steepness greater than $K$. Note that stating the Kantorovich-Rubinstein duality theorem using $K = 1$ is common, yet for our GAN purposes keeping $K$ free is preferable.

The elegance and applicability of (8.59) is that instead of searching over all possible joint probability distributions (earth moving plans) as in (8.55), we now only need to search over all $K$-Lipschitz functions. It turns out that in the context of generative adversarial networks, the latter formulation is much easier to implement. We omit further details of the duality derivation between (8.55) and (8.59), and now show how to use (8.59) in a GAN setting.

Figure 8.8: Two pairs of distributions where in the red pair, the distributions are farther away than in the green pair. While we cannot exactly see the Wasserstein distance $W(p_{\mathcal{D}}, p_{G_\theta})$ in this case, we get an approximate feeling for it using the distances marked with vertical lines between the centers of the point clouds and the diagonal line.

In fact, one needs to assume that the support of the $p_{\mathcal{D}}$ and $p_{G_\theta}$ distributions overlaps for the JS-divergence to be properly defined.

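Since $p_{G_\theta}$ is determined by the generator $f_{\theta}(\cdot)$, with a Wasserstein distance approach our overall goal in a generative setting is to find generator parameters $\theta$ that minimize $W(p_{\mathcal{D}}, p_{G_\theta})$. Replacing the supremum in (8.59) by a maximum, and dropping the $1/K$ constant term, we have an overall objective,

$$
\min_{\theta} \, \max_{\| h \|_L \le K} \Big( \mathbb{E}_{x \sim p_{\mathcal{D}}} \big[ h(x) \big] - \underbrace{\mathbb{E}_{\tilde{x} \sim p_{G_\theta}} \big[ h(\tilde{x}) \big]}_{\text{Depends on } \theta} \Big). \tag{8.60}
$$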

The key idea of W-GAN is now to treat the function $h(\cdot)$ as a discriminator, even though it is no longer a binary classifier. Following the previous notation, we still denote this discriminator $f_{\phi}(\cdot)$ with parameters $\phi$, even though now the architecture of this discriminator allows it to output any potential real value, not just in $[0, 1]$. With this, (8.60) can be approximated with the objective,

$$
\min_{\theta} \, \max_{\phi \in \Phi} \; \frac{1}{n} \sum_{x \in \mathcal{D}} f_{\phi}(x) - \frac{1}{n} \sum_{z \in \mathcal{Z}} f_{\phi}\big(f_{\theta}(z)\big), \tag{8.61}
$$

where now $\Phi$ is some space of discriminator parameters that approximately enforces a $K$-Lipschitz condition on $f_{\phi}(\cdot)$. That is, with every $\phi \in \Phi$ we have that $f_{\phi}(\cdot)$ is $K$-Lipschitz. Hence the discriminator objective, posed as a minimization problem constrained on $\Phi$ (and dropping the $1/n$ terms) is,

$$
\min_{\phi \in \Phi} \; -\sum_{x \in \mathcal{D}} f_{\phi}(x) + \sum_{z \in \mathcal{Z}} f_{\phi}\big(f_{\theta}(z)\big). \tag{8.62}
$$

In training a W-GAN we iterate between discriminator and generator steps, and hence in the discriminator steps we carry out mini-batch gradient descent steps for (8.62). The constraint of keeping $\phi \in \Phi$ is discussed below. The outcome of such a discriminator iteration is $\phi^{(t)}$ and then in carrying out generator steps for (8.61) we carry out iterations for

$$
\min_{\theta} \; -\sum_{z \in \mathcal{Z}} f_{\phi^{(t)}}\big(f_{\theta}(z)\big). \tag{8.63}
$$

Compare the W-GAN discriminator problem (8.62) with the original GAN discriminator problem (8.47), and similarly compare the W-GAN generator problem (8.63) with the original GAN generator problem (8.49) (similarly we may compare to NS-GAN with generator problem as in (8.54)). Quite surprisingly, if we ignore the $\Phi$ constraint, the differences are only very minor where the W-GAN does not use logarithms. In practice, W-GAN generally overcomes the problems illustrated in Figure 8.7 by enhancing the algorithm with a much better discrimination signal.

We still need to specify how to enforce the $K$-Lipschitz constraint in (8.62). For this there are multiple techniques. The most basic technique is called weight clipping where we constrain the weights for the neural network $f_{\phi}(\cdot)$ to lie in some range, e.g., $[-0.05, 0.05]$. For almost all standard deep learning architectures, this will mean that $f_{\phi}(\cdot)$ will be $K$-Lipschitz for some $K$ that depends on the architecture of the network and on the clipping half-range 0.05.
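A minimal sketch of weight clipping follows, assuming the discriminator weights are stored as a list of rows of floats. This storage layout and the function name are hypothetical. In a training loop such a projection would typically be applied after every discriminator gradient step.

```python
def clip_weights(weights, half_range=0.05):
    """Project every weight onto the interval [-half_range, half_range]."""
    return [[min(max(w, -half_range), half_range) for w in row] for row in weights]


weights = [[0.30, -0.01], [-0.20, 0.04]]   # made-up weights after a gradient step
clipped = clip_weights(weights)
print(clipped)   # weights inside the range are unchanged, the others are cut off
```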

A different approach which in practice has generally shown better performance is gradient penalty. This approach relies on the following property of (8.59). If a function $h : \mathbb{R}^p \to \mathbb{R}$ attains the supremum in (8.59), then this function has, with probability one, a unit gradient norm on any random point that is a convex combination between a point drawn from $p_{\mathcal{D}}$ and a point drawn from $p_{G_\theta}$. That is, if we take $x$ as a random point from $p_{\mathcal{D}}$ and $\tilde{x}$ as a random point from $p_{G_\theta}$ (for some generator), then for any $\eta \in [0, 1]$, the point $x_{\eta} = \eta x + (1 - \eta) \tilde{x}$ satisfies,

$$
\| \nabla h(x_{\eta}) \| = 1, \quad \text{with probability 1}. \tag{8.64}
$$

We omit the proof, yet we use this property algorithmically to approximately enforce the constraint $\phi \in \Phi$. To do so, the gradient penalty approach uses (8.64) as an optimization constraint in place of $\phi \in \Phi$. This constraint is then integrated in the objective using an approximate Lagrange multiplier approach.

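Specifically for some tunable $\lambda > 0$, we modify the discriminator objective (8.62) to

$$
\min_{\phi} \; -\sum_{x \in \mathcal{D}} f_{\phi}(x) + \sum_{z \in \mathcal{Z}} f_{\phi}\big(f_{\theta}(z)\big) + \lambda \sum_{x_{\eta} \in \mathcal{D}_{\eta}} \Big( \| \nabla f_{\phi}(x_{\eta}) \| - 1 \Big)^2, \tag{8.65}
$$

where the set of random points $\mathcal{D}_{\eta}$ is a set of points based on pairs $x \in \mathcal{D}$ and $z \in \mathcal{Z}$ where each $x$ has a single matching $z$, and for each pair we generate a uniformly random $\eta \in [0, 1]$ and set $x_{\eta} = \eta x + (1 - \eta) f_{\theta}(z)$.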


In practice the gradient penalty approach works very well and out of all of the GAN objective variations that we presented, it is the most common approach used. In a typical algorithm that implements mini-batch gradient descent steps for (8.65), we would sample a subset of the data $\mathcal{D}$ and a matching sized sample of $\mathcal{Z}$ followed by a sample of $\mathcal{D}_{\eta}$ with each element determined by a uniform $\eta$.
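The penalty term can be sketched for a one dimensional toy discriminator whose gradient is available in closed form. The discriminator `v * tanh(w*x + b)` and all names below are hypothetical choices for illustration, and the squared form of the penalty is assumed. A practical implementation would obtain the gradient with automatic differentiation.

```python
import math
import random


def critic_grad(phi, x):
    """Gradient in x of the toy discriminator f_phi(x) = v * tanh(w*x + b)."""
    v, w, b = phi
    return v * w * (1.0 - math.tanh(w * x + b) ** 2)


def interpolate(x, x_fake, eta):
    """Convex combination of a data point and a generated point."""
    return eta * x + (1.0 - eta) * x_fake


def gradient_penalty(phi, xs, fakes, lam, rng):
    """lam times the sum over pairs of (|gradient at the mixed point| - 1)^2."""
    total = 0.0
    for x, x_fake in zip(xs, fakes):     # each data point has one matching fake
        eta = rng.random()               # uniform on [0, 1)
        x_eta = interpolate(x, x_fake, eta)
        total += (abs(critic_grad(phi, x_eta)) - 1.0) ** 2
    return lam * total


rng = random.Random(0)
xs = [2.9, 3.1, 3.4]       # made-up mini-batch of data points
fakes = [0.1, -0.2, 0.3]   # made-up generator outputs f_theta(z)
penalty = gradient_penalty((1.0, 1.0, 0.0), xs, fakes, 10.0, rng)
print(penalty)
```

The penalty is zero only when the gradient norm equals 1 at every interpolated point, which is the property (8.64) of an optimal $h$.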

Beyond Data Generation with GANs

Up to now we considered the application of generative adversarial networks for generative modelling where an unlabelled dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}$ is used to train a generator $f_{\theta}(\cdot)$ and then this generator can create samples similar to those in $\mathcal{D}$. This is already useful for a variety of applications, one of which is data augmentation, where we can then use additional generated (fake) samples to train other models. Yet, there are many more tasks where generative adversarial networks can be employed. We now outline a few paradigms that extend the basic GAN architecture and allow us to handle additional tasks. We outline the conditional generation paradigm, the image to image paradigm, and the style transfer paradigm.

With the conditional generation paradigm the resulting generator from training is not just a function of the latent space, but is also a function of additional variables. Namely we can describe the generator as $f_{\theta}(z, w)$ where $z$ is still a random latent variable and the newly introduced $w$ contains some additional information. One such example of conditional generation, called a conditional generative adversarial network (C-GAN), is where the data is labeled and is of the form $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$. Here we can use the additional information $w$ as some attribute vector that potentially encodes the particular label. Hence for example in a case of the MNIST digits dataset, our GAN $f_{\theta}(z, w)$ will create random digits, where $w$ may indicate which digit to create using one-hot encoding. The basic way in which such a GAN is trained is by also supplying the attribute vector to the discriminator during training. We omit further details.
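One simple way to feed both $z$ and a one-hot $w$ to a generator network is to concatenate them into a single input vector. The sketch below shows only this input construction. The latent dimension of 8 and the function names are assumptions, and concatenation is one possible design among several.

```python
import random


def one_hot(label, num_classes=10):
    """One-hot attribute vector w for a class label in {0, ..., num_classes - 1}."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def generator_input(z, label, num_classes=10):
    """Concatenate the latent noise z with the one-hot attribute vector w."""
    return list(z) + one_hot(label, num_classes)


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]   # latent noise, dimension chosen arbitrarily
digit_input = generator_input(z, label=7)     # request the digit 7
print(len(digit_input))                       # 8 latent + 10 attribute coordinates
```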

A slightly different form of conditional generation is also described in the auxiliary classifier generative adversarial network (AC-GAN) architecture. Here the generator is similar to the C-GAN case where $w$ is specifically a class of the desired sample to generate, yet in training, the discriminator has two outputs, one of which is the binary classification fake/real as before, and another is a probability vector determining which class is detected. Such a training process often results in a better architecture for conditional generation than the original C-GAN.

Ideas of conditional generation can be extended in multiple ways, where one notable way is the interpretable representation learning generative adversarial network (Info-GAN) approach. Here we are back to unlabeled data such as $\mathcal{D} = \{x^{(1)}, \ldots, x^{(n)}\}$, and we use the GAN training process to separate some elements of $w$ “out of $z$”, where typically $w$ is of smaller dimension than $z$. The resulting generator, $f_{\theta}(z, w)$, then uses $w$ for tweaking specific attributes of the image, where these attributes are not controlled a priori, but are rather discovered during the learning process. As a concrete example, again in the case of MNIST digits, we may have $w \in \mathbb{R}^4$ where after the training process it turns out that the first coordinate of $w$ controls the stroke widths of the digits, the second coordinate controls the slant of the digit, and the other two coordinates of $w$ either have some interpretation or not. The beauty of Info-GAN is in the self discovery of these properties. The general idea of training in such a framework is that the discriminator outputs additional variables that are meant to match the inputs $w$. The analysis involves mutual information with basic ideas from information theory, and is beyond our scope.

With the image to image paradigm, or more generally data to data paradigm, our goal is not to generate random general images or data examples, but rather to be able to enhance images or data. Assume a training dataset such as $\mathcal{D} = \{(x^{(1)}, x^{(1)}_{\mathrm{out}}), \ldots, (x^{(n)}, x^{(n)}_{\mathrm{out}})\}$, where each $x^{(i)}$ has lower information content than the corresponding $x^{(i)}_{\mathrm{out}}$. For example $x^{(i)}$ may be a black and white image of a portion of a street map, while $x^{(i)}_{\mathrm{out}}$ may be a color satellite image of the same geographic location.

The goal of an image to image generator, say $f_{\theta}(\cdot)$, is to operate on input images similar to $x^{(i)}$ in structure, and create the corresponding enhanced output image. Then when presented with a new input image $\tilde{x}$, we would have that the trained generator applied to it, $f_{\theta}(\tilde{x})$, is the enhanced image. Hence for example in the geographic case, we can generate fake satellite images, based on street map images. Other applications include the color enhancement of images or films, and almost any domain where datasets such as $\mathcal{D} = \{(x^{(1)}, x^{(1)}_{\mathrm{out}}), \ldots, (x^{(n)}, x^{(n)}_{\mathrm{out}})\}$ appear.

The basic training architecture in the image to image paradigm is based on a discriminator that distinguishes between real pairs $(x^{(i)}, x^{(i)}_{\mathrm{out}})$, and fake pairs $(x^{(i)}, x^{*(i)}_{\mathrm{out}})$, where in the first pair, $x^{(i)}_{\mathrm{out}}$ matches $x^{(i)}$ from the data, while in the second pair, $x^{*(i)}_{\mathrm{out}}$ is generated. That is, the discriminator is the function $f_{\phi}(\tilde{x}, \tilde{x}_{\mathrm{out}})$ that tries to determine if the pair it is fed with is completely real or if the second image fed to it is generated. In this sense, the image to image paradigm is similar to the C-GAN paradigm. Key differences include the fact that in the image to image case, one would often want a complicated generator network such as a U-Net.

In deep learning, style transfer, sometimes called neural style transfer, is an area broader than generative adversarial networks, mostly focusing on image data. The general theme of style transfer is to modify input images such that they appear to resemble a certain style. Examples that have been popularized include the recreation of arbitrary images using the distinctive drawing style of Vincent van Gogh. Within the context of generative adversarial networks, the style transfer paradigm uses GANs for style transfer and for similar image modifications.

A notable GAN architecture specifically focused on style transfer is the Style-GAN. Here, multiple new ideas are introduced in the context of the generator network, specifically for image data using convolutional architectures. One of the notable features of Style-GAN is its ability to generate high-resolution images with remarkable detail and diversity. Style-GAN has been widely used in applications such as image synthesis, artistic expression, and creating realistic faces with customizable attributes.

Several of the key ideas of Style-GAN are specific to convolutional architectures and include upsampling, as well as adaptive instance normalization which is somewhat similar to batch normalization of Section 5.6 and layer normalization, used in Section 7.5. We do not focus on these ideas here but rather comment on the general structure of the generator which differs from generators used in other paradigms. The Style-GAN generator is composed of two distinct functions, a mapping network and a synthesis network, which we denote as $f^{\mathrm{map}}_{\theta}(z_1)$ and $f^{\mathrm{syn}}_{\theta}(z_2, w)$ respectively.

For this speciﬁc example and other examples of image to image generation, it is not hard to create such

training data.

See the notes and references in Chapter 6 for background on the U-Net architecture.


The mapping network's role is to convert a latent space noise vector $z_1$ into a more meaningful representation $w = f^{\mathrm{map}}_{\theta}(z_1)$ which is an intermediate variable that is later amenable to manipulation. The synthesis network's role is somewhat similar to the Info-GAN generator where $z_2$ serves as noise, while $w$ captures style information. One way to use a trained generator without any style modifications is via,

$$
x = f^{\mathrm{syn}}_{\theta}\big(z_2, \underbrace{f^{\mathrm{map}}_{\theta}(z_1)}_{w}\big),
$$

where $z_1$ and $z_2$ are noise vectors, and $x$ is the resulting random output image. However, in more advanced applications we may keep one latent noise vector fixed, while perturbing the other noise vector, and/or varying the intermediate variable $w$. Variations of this form enable Style-GAN to create meaningful modifications of images.

8.4 Reinforcement Learning

The topic of reinforcement learning deals with controlling dynamic processes over time. This is different from most tasks presented in the book, which on their own are typically applied at one instant of time. For example, classification is based on a static input $x$ at some time, to determine an output $\hat{y}$ relevant for that time. Yet the classification task does not care about previous or future classification decisions or the time at which it is carried out. In contrast, with reinforcement learning, we have the task of making decisions as time evolves, where our decisions often affect the evolution of the system and thus our decisions need to take planning ahead into consideration.

Typical applications of reinforcement learning include decisions for automatic pilots, robot control, playing strategy games, financial management, and other applications that involve decisions over a time horizon. Indeed, in the second decade of this century, the deep learning variant of reinforcement learning, namely deep reinforcement learning, made big headlines with the game of Go, where the world's best Go players were eventually beaten by an engineered deep reinforcement learning platform. Earlier, it was shown that deep reinforcement learning systems can be trained to automatically play classic arcade Atari video games.

In general, the ﬁeld of control theory is the engineering ﬁeld where automatic control systems

are designed and analyzed. This ﬁeld has a rich history, including many advances made

during the space programs of the 1950’s and the 1960’s, including concepts such as Kalman

ﬁltering and many related ideas. The area of reinforcement learning can be cast as part

of the control theory world, yet it has much more of a computer science ﬂavour to it, and

with the advent of deep learning, has essentially become part of the deep learning toolkit.

Nevertheless, today, ideas of reinforcement learning are part of the control theory world.

A schematic representation of reinforcement learning is in Figure 8.9. The basic setup is that a system, or environment, is controlled by an agent. Other processes may be taking place in the environment as well, perhaps not directly controlled by the agent; these are modelled via randomness. As time progresses the agent sends its control decisions in the form of actions to the environment, and in turn it receives reward from the environment as well as observations.
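The interaction pattern can be sketched as a simple loop. Everything in the example below is a made-up illustration: the environment is a noisy walk on the integers with reward equal to minus the distance from 0, and the agent follows a hand-written rule, not a learned one.

```python
import random


class ToyEnvironment:
    """Hypothetical environment: a noisy walk on the integers, rewarded for staying near 0."""

    def __init__(self, start=5, seed=0):
        self.state = start
        self.rng = random.Random(seed)

    def step(self, action):
        noise = self.rng.choice([-1, 0, 1])   # randomness not controlled by the agent
        self.state += action + noise
        reward = -abs(self.state)
        observation = self.state              # full state observation
        return observation, reward


def agent_policy(observation):
    """A fixed rule chosen by hand: push the state back toward 0."""
    return -1 if observation > 0 else 1


env = ToyEnvironment()
observation, rewards = env.state, []
for t in range(20):
    action = agent_policy(observation)        # agent decides based on its observation
    observation, reward = env.step(action)    # environment responds
    rewards.append(reward)
total_reward = sum(rewards)
print(total_reward)
```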


[Diagram: the Agent sends an Action to the Environment; the Environment returns a Reward and an Observation to the Agent.]

Figure 8.9: The setup in reinforcement learning is that an agent (or controller) applies actions to an environment (also known as system). The agent's decisions of which actions to take are based on reward obtained together with observations. One illustrative example of the environment is a video game where multiple random things happen and the agent is a player.

Mathematically, we cast reinforcement learning in terms of an area called Markov decision

processes.

Here, as we explain in this sec tion, the evolution of the environment being

controlled is modeled as a variant of a Markov chain, where the state evolution of the

chain also depends on control decisions made. A key goal in the area of Markov decision

processes is the study and application of algorithms for controlling the system in some

optimal manner. Reinforcement learning aims to solve the same problem, with the difference

that in the Markov decision processes case we assume full knowledge of the system

evolution model, whereas in the reinforcement learning case, the system model is unknown

and hence either needs to be learned, or optimal control strategies need to be learned. With

reinforcement learning, the agent often learns incrementally while controlling actual systems.

In some cases learning is based on simulations of the system before deployment into the real

world, while in other cases, operational systems are controlled and learned simultaneously.

Our focus of this section is to ﬁrst present the Markov decision processes framework and to

further discuss optimal strategies using Bellman equations. Then, after brieﬂy discussing

solution methods for such equations we move onto the reinforcement learning case, where

the system model is unknown. There are multiple types of algorithms for reinforcement

learning, and we focus only on the Q-learning approach. After understanding the basic

Q-learning algorithm, we finally see how concepts of deep neural networks can be integrated

by replacing the so-called Q-function with a deep neural network. This presentation here

is merely an introduction to the ﬁeld, and more readings are suggested at the end of the

chapter.

Markov Decision Processes

Assume some system evolving over discrete time $t = 0, 1, 2, \ldots$, where at any time $t$, the system state is $z_t$. This state may describe the location of a robot, or the vector of velocities in multiple directions together with accelerations, or a discrete state of a game, or one of many other possibilities depending on the application.

Footnote: In certain cases the observations are only a partial description of the environment state, in which case the formal framework is that of partially observable Markov decision processes. Yet in our brief exposition, we only consider full state observations.

Let us first ignore decisions and system control. In this case we can model the evolution of $z_0, z_1, z_2, \ldots$ as a Markov chain, where some transition kernel $p_t(z_{t+1} \mid z_t)$ determines the probability of moving from a given state $z_t$ to state $z_{t+1}$, between time $t$ and time $t+1$. In this form, the transitions appear to depend on the time $t$, yet we may also assume that the evolution is time homogeneous and does not depend on time $t$. In this case, common to this section, the transition kernel can be presented with the notation $p(z_{t+1} \mid z_t)$, i.e., without a subscript $t$. As already stated in Section 8.2 in the context of Markovian hierarchical variational autoencoders, a Markov chain satisfies the Markov property. In our case here, this property is best interpreted as implying that given any history prior to time $t$, if we are given $z_t$, that history does not have an effect on the future. This then means that the state information at time $t$, namely $z_t$, is enough to determine probabilities of the future.

As a simple example, consider a scenario focusing on the engagement level of a student. Assume that there are 10 levels of engagement, $1, \ldots, 10$, where at level 1 the student is not engaged at all and at the other extreme, at level 10, the student is maximally engaged. One may model the probability transitions of engagement as,

\[
p(j \mid 1) =
\begin{cases}
0.6 & j = 1, \\
0.4 & j = 2, \\
0 & \text{otherwise},
\end{cases}
\qquad
p(j \mid i) =
\begin{cases}
0.6 & j = i - 1, \\
0.2 & j = i, \\
0.2 & j = i + 1, \\
0 & \text{otherwise},
\end{cases}
\quad \text{for } i = 2, \ldots, 9,
\qquad
p(j \mid 10) =
\begin{cases}
0.6 & j = 9, \\
0.4 & j = 10, \\
0 & \text{otherwise}.
\end{cases}
\tag{8.66}
\]

In this case, it is evident that at any time step, the student's engagement can either improve by one level, decrease by one level, or stay unchanged, with the probabilities specified by $p(\cdot \mid \cdot)$ as above. If at onset at time 0 we start at some given level $z_0 \in \{1, \ldots, 10\}$, as time progresses, the level will fluctuate according to the Markov chain stochastic process. The level of engagement is then the system or environment in this very simple case.
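The chain in (8.66) can be written as a $10 \times 10$ row-stochastic matrix and simulated directly. The following is a minimal sketch, assuming NumPy; the helper names `kernel` and `simulate`, the starting level 5, the number of steps, and the random seed are our own illustrative choices and are not part of the text.

```python
import numpy as np

def kernel(first, interior, last, n=10):
    """Row-stochastic matrix with P[i-1, j-1] = p(j | i) for levels 1..n."""
    P = np.zeros((n, n))
    P[0, :2] = first                      # level 1: (stay at 1, move to 2)
    for i in range(1, n - 1):             # levels 2..n-1
        P[i, i - 1:i + 2] = interior      # (down, stay, up)
    P[-1, -2:] = last                     # level n: (move to n-1, stay at n)
    return P

# Transition probabilities of (8.66).
P0 = kernel(first=(0.6, 0.4), interior=(0.6, 0.2, 0.2), last=(0.6, 0.4))

def simulate(P, z0, steps, rng):
    """Sample z_0, z_1, ..., z_steps, where states are the levels 1..n."""
    path = [z0]
    for _ in range(steps):
        z_next = rng.choice(P.shape[0], p=P[path[-1] - 1]) + 1
        path.append(int(z_next))
    return path

rng = np.random.default_rng(0)            # seed chosen arbitrarily
path = simulate(P0, z0=5, steps=20, rng=rng)
print(path)
```

Each row of `P0` sums to one, and consecutive entries of `path` differ by at most one level, in line with the description above.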

A Markov decision process is a generalization of a Markov chain where now some decision (also known as an action or control) is used by the agent. In this simple student engagement example, assume that at any time $t$ we can choose to “stimulate” the student, for example by suggesting a prize, or “not to stimulate”. Naturally, it is sensible to model the fact that stimulation will improve transitions up towards higher levels. For example, one model for these transitions under stimulation captures transition probabilities via $p^{(1)}(z_{t+1} \mid z_t)$, as,

\[
p^{(1)}(j \mid 1) =
\begin{cases}
0.2 & j = 1, \\
0.8 & j = 2, \\
0 & \text{otherwise},
\end{cases}
\qquad
p^{(1)}(j \mid i) =
\begin{cases}
0.1 & j = i - 1, \\
0.1 & j = i, \\
0.8 & j = i + 1, \\
0 & \text{otherwise},
\end{cases}
\quad \text{for } i = 2, \ldots, 9,
\qquad
p^{(1)}(j \mid 10) =
\begin{cases}
0.2 & j = 9, \\
0.8 & j = 10, \\
0 & \text{otherwise}.
\end{cases}
\tag{8.67}
\]

Compare now the original transition probabilities (8.66) with the revised transition probabilities under “stimulation” (8.67). With these particular values in the model that we chose, transitions with “stimulation” generally have a higher probability of pushing the engagement level up. In a Markov decision process, in addition to the state evolution $z_0, z_1, z_2, \ldots$, we also have an action chosen by the controller or agent for any time $t$. Specifically, the sequence $a_0, a_1, a_2, \ldots$ denotes the actions, where in this example we can set that $a_t = 0$ implies “not to stimulate” at time $t$ and that $a_t = 1$ implies to “stimulate” at time $t$.

If we now define $p^{(0)}(j \mid i)$ as the original probabilities (8.66), then each of the probabilities,

\[
p^{(a)}(j \mid i), \qquad \text{for } a \in \{0, 1\}, \quad i, j \in \{1, \ldots, 10\},
\tag{8.68}
\]

constitutes the transition probability from $z_t = i$ to $z_{t+1} = j$ if action $a_t = a$ is chosen at time $t$. Hence (8.68) is a specification of the environment (or system) and of how the agent's actions affect that system.

Now with the action sequence, $a_0, a_1, a_2, \ldots$, the state sequence $z_0, z_1, z_2, \ldots$ does not evolve autonomously, but rather depends on the decisions made $a_0, a_1, a_2, \ldots$. So if for example the first decision is $a_0 = 0$, the second is $a_1 = 0$, and the third is $a_2 = 1$, then given some initial $z_0$, the first probabilities are $p^{(0)}(\cdot \mid z_0)$ which determine a random $z_1$; then given this value, the second transition probabilities are $p^{(0)}(\cdot \mid z_1)$ which determine a random $z_2$; and the third transition probabilities are $p^{(1)}(\cdot \mid z_2)$ which determine a random $z_3$, and so forth.

Having the sequence of decisions $a_0, a_1, a_2, \ldots$ fixed a-priori is called an open loop control and in our context this is not common. Instead, we typically wish to determine the action $a_t$ based on the state $z_t$. This is called closed loop control or feedback control, because the control decision is based on the current state. A control policy, or just a policy, is a function of the form,

\[
g_{\text{policy}} : \underbrace{\{1, \ldots, 10\}}_{\text{State space}} \to \underbrace{\{0, 1\}}_{\text{Action space}},
\]

where in more general examples the state space and action space may be much more complex. Any policy function, $g_{\text{policy}}(\cdot)$, induces a Markov chain specific for that policy. This is because with $g_{\text{policy}}(\cdot)$, the transitions $p^{(g_{\text{policy}}(z))}(\cdot \mid z)$ are used for that Markov chain. Hence in this example a policy can be seen as a rule for mixing (8.66) and (8.67). With this specific example there are only $2^{10} = 1{,}024$ possible policies, yet with more complex state spaces or action spaces, the number of possible policies can be huge.

Footnote: This is in fact called a deterministic Markovian policy.
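To make the idea of a policy inducing its own Markov chain concrete, the sketch below stacks the kernels (8.66) and (8.67) and picks, for each state, the row that belongs to the action the policy prescribes. It assumes NumPy; the helper names and the particular policy (stimulate only at levels 1 to 5) are hypothetical and chosen purely for illustration.

```python
import numpy as np

def kernel(first, interior, last, n=10):
    """Row-stochastic matrix with P[i-1, j-1] = p(j | i) for levels 1..n."""
    P = np.zeros((n, n))
    P[0, :2] = first
    for i in range(1, n - 1):
        P[i, i - 1:i + 2] = interior
    P[-1, -2:] = last
    return P

# P[a, i-1, j-1] = p^{(a)}(j | i) as in (8.68).
P = np.stack([
    kernel((0.6, 0.4), (0.6, 0.2, 0.2), (0.6, 0.4)),   # a = 0, from (8.66)
    kernel((0.2, 0.8), (0.1, 0.1, 0.8), (0.2, 0.8)),   # a = 1, from (8.67)
])

def induced_chain(P, policy):
    """Row i-1 of the result is p^{(policy(i))}(. | i)."""
    n = P.shape[1]
    return P[policy, np.arange(n), :]

# Hypothetical policy: policy[i-1] is the action taken at level i.
policy = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
P_pi = induced_chain(P, policy)

num_policies = 2 ** P.shape[1]            # one binary choice per state
print(num_policies)
```

The matrix `P_pi` is again row-stochastic, so any tool for ordinary Markov chains applies to it once the policy is fixed.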

The activity of finding the best feedback control policy for a Markov decision process is the act of solving a Markov decision process. To do so, our model requires a reward function to be specified. This reward function applies to every time step $t$, and captures the benefit of being in a given state, and choosing a given action at that time. The reward function can be viewed as part of the Markov decision process model specification, and is of the form,

\[
r : \underbrace{\{1, \ldots, 10\}}_{\text{State space}} \times \underbrace{\{0, 1\}}_{\text{Action space}} \to \mathbb{R}.
\]

Hence for example, $r(4, 0)$ is the reward obtained at a time where the state is $z_t = 4$ and the chosen action is $a_t = 0$ (no stimulation).

For our specific example, let us assume a reward function,

\[
r(z, a) = z - 1.5\,a,
\tag{8.69}
\]

where higher student engagement levels have higher reward with a linear increase via the $z$ component, and further, we pay a price of 1.5 reward units every time we set $a_t = 1$. Hence for example $r(4, 0) = 4$ and $r(4, 1) = 2.5$. From a modelling perspective, this latter reward being lower can be viewed as due to the cost of the prize for stimulation. Observe that the reward is always from the viewpoint of the agent controlling the system (the student in this particular example is the environment).

We now want to find a policy $g_{\text{policy}}(\cdot)$ that is best in terms of this reward over the whole time horizon. There are multiple ways to accumulate reward and in our exposition we consider only the infinite horizon expected discounted reward objective. Here, we first set $\gamma \in (0, 1)$ which is a fixed hyper-parameter called a discount factor. Then the contribution at time $t$ is taken to be,

\[
\gamma^{t}\, r\big(z_t, \underbrace{g_{\text{policy}}(z_t)}_{a_t}\big).
\]

These contributions are accumulated to form the infinite horizon expected discounted reward objective, which depends on the initial state $z_0 = z$ and is,

\[
V_{\text{policy}}(z) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t}\, r\big(z_t, \underbrace{g_{\text{policy}}(z_t)}_{a_t}\big) \,\Big|\, z_0 = z \Big].
\tag{8.70}
\]

The role of the discount factor, $\gamma$, is to capture the importance of near present times vs. far future times. With $\gamma$ low (near 0), far future times $t$ have little effect on this contribution. Similarly with $\gamma$ high (near 1), these far future times have much more of an effect on the contribution.
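As a small worked illustration of this effect (our own arithmetic, not taken from the text), write $V_{\text{policy}}(z)$ for the objective (8.70) and note that the weights $\gamma^{t}$ form a geometric series. The share of the total weight carried by times $t \ge 10$ is exactly $\gamma^{10}$:

\[
\sum_{t=0}^{\infty} \gamma^{t} = \frac{1}{1-\gamma},
\qquad
\frac{\sum_{t=10}^{\infty} \gamma^{t}}{\sum_{t=0}^{\infty} \gamma^{t}} = \gamma^{10} \approx
\begin{cases}
0.006 & \text{for } \gamma = 0.6, \\
0.904 & \text{for } \gamma = 0.99.
\end{cases}
\]

In the student engagement example the rewards (8.69) lie between $r(1,1) = -0.5$ and $r(10,0) = 10$, so for $\gamma = 0.6$ every policy satisfies

\[
-1.25 = \frac{-0.5}{1-\gamma} \;\le\; V_{\text{policy}}(z) \;\le\; \frac{10}{1-\gamma} = 25 .
\]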

Observe that this objective function, (8.70), is parameterized by the policy, $g_{\text{policy}}(\cdot)$, because the policy determines how actions $a_t$ are chosen, given values of $z_t$. Our goal is to find an optimal policy that maximizes (8.70), yet it may appear that there are multiple objectives here because (8.70) is defined separately for every initial state $z$. Nevertheless, a property of the type of Markov decision processes that we are using is that there exists a policy that can maximize $V_{\text{policy}}(z)$ for all initial states $z_0 = z$. Hence we seek,

\[
g^{*}_{\text{policy}} = \operatorname*{argmax}_{g_{\text{policy}}} V_{\text{policy}}(z), \qquad \text{for all initial states } z.
\tag{8.71}
\]

Footnote: Here we focus on the time-homogeneous reward function which does not depend on the current time.

In our simple student engagement example, one may even find such a policy by enumerating all possible policies and for each policy evaluating the expectation in (8.70) either via Monte Carlo simulation or via analytic properties of Markov chains. For example, if the discount factor is $\gamma = 0.6$ then an optimal policy turns out to be,

\[
g^{*}_{\text{policy}}(z) =
\begin{cases}
0 & z \in \{1, 2\}, \\
1 & z \in \{3, 4, 5, 6, 7\}, \\
0 & z \in \{8, 9, 10\}.
\end{cases}
\tag{8.72}
\]

It is not obvious a-priori why this is the best policy, but considering it we see that for low and high engagement levels (1, 2, 8, 9, and 10) it is not worth paying the price for stimulation, whereas for intermediate engagement levels (3, 4, 5, 6, and 7), it is worth stimulating the student. The strength of Markov decision processes is that they expose such policies, where the optimal control for any time $t$ and current state $z_t$ implicitly takes the future evolution into account via (8.70).
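The exhaustive search mentioned above can be sketched as follows. For a fixed policy, the vector of values $V$ over the 10 states satisfies the linear system $V = r_{\text{policy}} + \gamma P_{\text{policy}} V$, which is one way to use "analytic properties of Markov chains"; this particular evaluation route, the helper names, and the use of NumPy are our assumptions rather than the authors' stated procedure.

```python
import itertools
import numpy as np

def kernel(first, interior, last, n=10):
    """Row-stochastic matrix with P[i-1, j-1] = p(j | i) for levels 1..n."""
    P = np.zeros((n, n))
    P[0, :2] = first
    for i in range(1, n - 1):
        P[i, i - 1:i + 2] = interior
    P[-1, -2:] = last
    return P

n, gamma = 10, 0.6
P = np.stack([
    kernel((0.6, 0.4), (0.6, 0.2, 0.2), (0.6, 0.4)),   # a = 0, from (8.66)
    kernel((0.2, 0.8), (0.1, 0.1, 0.8), (0.2, 0.8)),   # a = 1, from (8.67)
])
levels = np.arange(1, n + 1)
R = levels[:, None] - 1.5 * np.arange(2)[None, :]       # R[z-1, a] = r(z, a), (8.69)

def evaluate(policy):
    """Solve (I - gamma * P_policy) V = r_policy for the values of one policy."""
    idx = np.arange(n)
    P_pi = P[policy, idx, :]
    r_pi = R[idx, policy]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

policies = [np.array(p) for p in itertools.product((0, 1), repeat=n)]
values = np.array([evaluate(p) for p in policies])      # one row per policy
best_per_state = values.max(axis=0)

# Policies that attain the best value at every initial state simultaneously
# (up to numerical tolerance).
optimal = [p for p, v in zip(policies, values) if np.allclose(v, best_per_state)]
for p in optimal:
    print(p)
```

The printed policies can be compared with (8.72). That the list `optimal` is non-empty reflects the property stated before (8.71): a single policy can be best for all initial states at once.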

While such a Markov decision processes framework is in principle very powerful, we are faced with two problems. The first problem is to have better means than exhaustive search for solving (8.71) to find optimal policies, and we discuss such means shortly. The second problem is the fact that models can seldom be specified as we did here, with probabilities that correctly capture reality. That is, realistic scenarios would involve very complex transition kernels $p^{(a)}(j \mid i)$ in contrast to the simplistic specification of probabilities in (8.66) and (8.67). This second problem is handled by reinforcement learning methods.

Bellman Equations, the Value Function, and the Q-function

In characterizing the solution of (8.71) an important object is the value function. This real valued function, denoted as $V^{*}(z)$, where $z$ is any element of the state space, is defined as,

\[
V^{*}(z) = \max_{g_{\text{policy}}} V_{\text{policy}}(z),
\tag{8.73}
\]

where $V_{\text{policy}}(\cdot)$ is from (8.70). Hence the value function determines the optimal infinite horizon expected discounted reward, when starting at a state $z_0 = z$.

A result in the study of Markov decision processes is that we can characterize the value function via a non-linear system of equations called the Bellman equations. For the case of finite state and action spaces, these equations are,

\[
V^{*}(z) = \max_{a} \underbrace{\Big\{ r(z, a) + \gamma \sum_{z'} p^{(a)}(z' \mid z)\, V^{*}(z') \Big\}}_{Q^{*}(z, a)}, \qquad \text{for all states } z.
\tag{8.74}
\]

Here the maximum is over all actions $a$ (e.g., $\{0, 1\}$ in the student engagement example). Further, inside the maximum on the right hand side, we have the Q-function, $Q^{*}(z, a)$, which captures the “quality” of choosing action $a$ on state $z$. Note that the summation in (8.74) is taken over all states $z'$ (e.g., $\{1, \ldots, 10\}$ in the student engagement example).

One can informally derive the Bellman equations (8.74) via what is known as the dynamic programming principle where the first term of $Q^{*}(z, a)$ is the immediate reward, $r(z, a)$, and the second part is the expected optimal reward from the next step onwards, discounted by one time step and hence multiplied by $\gamma$. Theoretically it can be shown that any function $V^{*}(\cdot)$ that satisfies the Bellman equations is the value function as in (8.73).

With more complex state spaces that are not necessarily discrete, we can rewrite the Q-function as,

\[
Q^{*}(z, a) = r(z, a) + \gamma\, \mathbb{E}\big[ V^{*}(z_{t+1}) \,\big|\, z_t = z,\ a_t = a \big],
\tag{8.75}
\]

while keeping in mind that the time-homogeneous assumptions imply that $Q^{*}(z, a)$ is the same for every time $t$ and hence we can use $z_0$, $a_0$, and $z_1$ in place of $z_t$, $a_t$, and $z_{t+1}$ respectively. With the formulation of the Q-function as in (8.75), we have a more general expression for the Bellman equation (8.74), where $\sum_{z'} p^{(a)}(z' \mid z)\, V^{*}(z')$ is replaced by $\mathbb{E}[V^{*}(z_{t+1}) \mid z_t = z, a_t = a]$. This formulation encompasses state spaces that are not discrete.

Importantly, knowing either the Q-function or the value function presents us with an optimal policy. If we know the Q-function, $Q^{*}(\cdot, \cdot)$, then we can determine an optimal policy $g^{*}_{\text{policy}}(\cdot)$ as in (8.71) via,

\[
g^{*}_{\text{policy}}(z) = \operatorname*{argmax}_{a} Q^{*}(z, a).
\tag{8.76}
\]

If we know the value function, $V^{*}(\cdot)$, then we can first evaluate the Q-function via (8.75) where we compute the expectation using the explicit model (sometimes using Monte Carlo simulation if needed) and then with this computed Q-function we can evaluate (8.76).

The classic study of Markov decision processes deals with existence and optimality of solutions to the Bellman equations. It also deals with algorithms for solving these equations to find the value function and hence to find an optimal policy via (8.75) and (8.76). We briefly discuss such solution methods now.

Solving Bellman Equations

The two most common algorithms for solving Bellman equations are value iteration and policy iteration. We focus on value iteration on discrete state spaces. The algorithm is based on the recursion,

\[
Q^{(t+1)}(z, a) = r(z, a) + \gamma \sum_{z'} p^{(a)}(z' \mid z)\, \max_{a'} Q^{(t)}(z', a'), \qquad \text{for all } z \text{ and } a.
\tag{8.77}
\]

Value iteration starts with some initial or arbitrary guess $Q^{(0)}(\cdot, \cdot)$ and then with each step we apply (8.77) on all states $z$ and all actions $a$ to get from $Q^{(t)}(\cdot, \cdot)$ to $Q^{(t+1)}(\cdot, \cdot)$. The algorithm terminates when some distance measure applied to $Q^{(t)}(\cdot, \cdot)$ and $Q^{(t+1)}(\cdot, \cdot)$ is below a specified small threshold.

Footnote: Note that one often writes the recursion in terms of $V(\cdot)$ and not in terms of Q-functions as we did here. However, the two formulations are equivalent and our representation is preferable for understanding Q-learning in the sequel.

In quite general settings, convergence of repeated applications of

(8.77)

to the optimal

Q-function is guaranteed and a proof of this relies on the fact that the right hand side of

(8.77)

is a contraction mapping. We omit the details. Yet with value iteration we do not

have an indication at what iteration the optimal policy has been discovered. Hence one may

need to apply (8.77) many times.

Note that policy iteration, the other algorithm that we mentioned above, remedies the

situation. On ﬁnite state and action spaces, policy iteration needs to execute for only a ﬁnite

number of steps before discovering an optimal policy. We do not discuss the policy iteration

algorithm further here because Q-learning is based on value iteration.

Back to the simple student engagement example presented earlier, each $Q^{(t)}(\cdot, \cdot)$ can simply be implemented as a table with $10 \times 2 = 20$ entries since the state space is $\{1, \ldots, 10\}$ and the action space is $\{0, 1\}$. One sometimes informally calls this a Q-table. If we were to use value iteration for finding the optimal policy, we would first initialize $Q^{(0)}(\cdot, \cdot)$ with some 20 arbitrary values. We would then apply (8.77) for each $z \in \{1, \ldots, 10\}$ and $a \in \{0, 1\}$. This application would directly use the probabilities specified in (8.66) and (8.67), and the reward function specified in (8.69). Applying such value iteration steps is thus straightforward to execute recursively. After multiple steps, we can then determine the policy using (8.76) and this is in fact how we obtained the example optimal policy (8.72). However, more complex examples are harder to implement and importantly in realistic scenarios we often do not know the exact transition probabilities and reward function. Instead, we take the approach of learning the Q-function, which we describe now.
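The Q-table version of value iteration described above can be sketched in a few lines. This is an illustrative implementation under our own assumptions (NumPy, an all-zeros initial table, the maximum absolute difference as the distance measure, and a tolerance of $10^{-10}$); it is not the authors' code.

```python
import numpy as np

def kernel(first, interior, last, n=10):
    """Row-stochastic matrix with P[i-1, j-1] = p(j | i) for levels 1..n."""
    P = np.zeros((n, n))
    P[0, :2] = first
    for i in range(1, n - 1):
        P[i, i - 1:i + 2] = interior
    P[-1, -2:] = last
    return P

gamma = 0.6
P = np.stack([
    kernel((0.6, 0.4), (0.6, 0.2, 0.2), (0.6, 0.4)),   # a = 0, from (8.66)
    kernel((0.2, 0.8), (0.1, 0.1, 0.8), (0.2, 0.8)),   # a = 1, from (8.67)
])
R = np.arange(1, 11)[:, None] - 1.5 * np.arange(2)[None, :]   # R[z-1, a], (8.69)

def bellman_update(Q):
    """One application of the recursion (8.77) to the whole Q-table."""
    V = Q.max(axis=1)                                   # max over a' of Q(z', a')
    return R + gamma * np.einsum("aij,j->ia", P, V)     # sum over next states z'

def value_iteration(tol=1e-10, max_iter=10_000):
    Q = np.zeros_like(R)                                # arbitrary initial guess
    for _ in range(max_iter):
        Q_new = bellman_update(Q)
        if np.max(np.abs(Q_new - Q)) < tol:             # distance between iterates
            return Q_new
        Q = Q_new
    return Q

Q = value_iteration()
greedy_policy = Q.argmax(axis=1)                        # (8.76), entry z-1 for level z
print(greedy_policy)
```

The printed vector lists the greedy action for levels 1 to 10 and can be compared with (8.72).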

Q-learning

The idea with Q-learning is to learn the Q-function (8.75) and obtain some estimate $\hat{Q}(z, a)$ for all states $z$ and all actions $a$. Learning the Q-function does not mean learning the individual components, $r(\cdot, \cdot)$, $p^{(\cdot)}(\cdot \mid \cdot)$, and $V^{*}(\cdot)$. It rather means learning $Q^{*}(\cdot, \cdot)$ as a whole. One can then use this estimate in (8.76) in place of $Q^{*}(z, a)$ to obtain a policy that is approximately optimal via,

\[
\hat{g}_{\text{policy}}(z) = \operatorname*{argmax}_{a} \hat{Q}(z, a).
\tag{8.78}
\]

Before seeing how Q-learning learns $\hat{Q}(\cdot, \cdot)$, we mention that as a reinforcement learning algorithm, one sometimes applies Q-learning in parallel to controlling a system, or controlling a simulation of the system. This is different from other learning paradigms in this book where the learning and production activities are often separate. Such a mix of controlling and learning involves ongoing estimates of the Q-function, $\hat{Q}^{(t)}(\cdot, \cdot)$, where at any time $t$ we use $\hat{Q}^{(t)}(\cdot, \cdot)$ in place of $\hat{Q}(\cdot, \cdot)$ for (8.78).

When carrying out such a mix of learning and controlling, we know that at time $t$, the latest $\hat{Q}^{(t)}(\cdot, \cdot)$ is only an estimate. Hence we also employ state exploration as part of the policy. For this, one typical approach, known as the epsilon greedy approach, uses some pre-specified decreasing sequence of probabilities, $\varepsilon_0, \varepsilon_1, \varepsilon_2, \ldots$, and at any time $t$, we control the system with a randomized policy,

\[
\hat{g}^{(t)}_{\text{policy}}(z) =
\begin{cases}
\operatorname*{argmax}_{a} \hat{Q}^{(t)}(z, a), & \text{with probability } 1 - \varepsilon_t, \\
\text{random action } a, & \text{with probability } \varepsilon_t.
\end{cases}
\tag{8.79}
\]

With this epsilon greedy approach we know that at time $t$ there is a chance of $\varepsilon_t$ of selecting an arbitrary action that is most likely sub-optimal. Yet it allows us to potentially navigate the system to parts of the state space that would otherwise remain unexplored.
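An epsilon greedy choice of action can be sketched as a single function. This is an illustration only: the function name `epsilon_greedy`, the example row of Q-values, and the use of a NumPy random generator are our own assumptions.

```python
import numpy as np

def epsilon_greedy(q_row, eps, rng):
    """Pick an action for one state, given that state's row of Q-estimates.

    With probability eps a uniformly random action is returned (exploration),
    otherwise the action with the largest current estimate (exploitation).
    """
    if rng.random() < eps:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

rng = np.random.default_rng(0)              # seed chosen arbitrarily
q_row = np.array([0.1, 0.7])                # hypothetical estimates for a in {0, 1}
greedy_action = epsilon_greedy(q_row, eps=0.0, rng=rng)    # never explores
explored_action = epsilon_greedy(q_row, eps=1.0, rng=rng)  # always explores
print(greedy_action, explored_action)
```

With `eps=0.0` the function reduces to the pure argmax of (8.78), while with `eps=1.0` the estimates are ignored entirely; a decreasing sequence of probabilities moves gradually from the second regime towards the first.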

Now let us consider the Q-learning algorithm. Here the idea is to update some part of $\hat{Q}^{(t)}(\cdot, \cdot)$ at any time step based on the following available information: (i) the previous state $z_t$; (ii) the new state $z_{t+1}$; (iii) the previous action chosen $a_t$; (iv) the observed reward denoted as $r_t$ which is the reward after applying the action at time $t$. For this, Q-learning relies on hyper-parameters, $\alpha_0, \alpha_1, \ldots$, a pre-specified decreasing sequence of probabilities. The recursion of Q-learning is,

\[
\hat{Q}^{(t+1)}(z, a) =
\begin{cases}
(1 - \alpha_t)\, \hat{Q}^{(t)}(z, a) + \alpha_t \underbrace{\Big( r_t + \gamma \max_{a'} \hat{Q}^{(t)}(z_{t+1}, a') \Big)}_{\text{Single sample Bellman estimate}}, & \text{if } z_t = z,\ a_t = a, \\[2ex]
\hat{Q}^{(t)}(z, a), & \text{otherwise.}
\end{cases}
\tag{8.80}
\]

Key to (8.80) is that we only update the Q-function estimate for one specific $(z, a)$ pair at any time step. Further observe that this update is a weighted average of the previous entry $\hat{Q}^{(t)}(z, a)$ and a new component which we denote as the single sample Bellman estimate. This term is directly motivated by the value iteration recursion (8.77) and we can observe that it agrees with the right hand side of (8.77) if we were to ignore the probabilities $p^{(a)}(z' \mid z)$, replacing the sum over next states with the single observed next state $z_{t+1}$. This general form of approximation is called a stochastic approximation and its theoretical analysis allows one to prove certain convergence properties of Q-learning. We omit these details.
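To see how the pieces fit together, the sketch below runs tabular Q-learning with epsilon greedy control on a simulator of the student engagement example. Several ingredients are our own assumptions for illustration: the function `env_step` plays the role of the unknown environment (the learner never reads the transition matrices directly), the schedules $\varepsilon_t = (t+1)^{-0.3}$ and $\alpha_t = (t+1)^{-0.6}$ are arbitrary decreasing sequences, and the number of steps and the seed are arbitrary.

```python
import numpy as np

def kernel(first, interior, last, n=10):
    """Row-stochastic matrix with P[i-1, j-1] = p(j | i) for levels 1..n."""
    P = np.zeros((n, n))
    P[0, :2] = first
    for i in range(1, n - 1):
        P[i, i - 1:i + 2] = interior
    P[-1, -2:] = last
    return P

rng = np.random.default_rng(1)
gamma = 0.6

# Used only inside the simulator below, never by the learning rule.
P = np.stack([
    kernel((0.6, 0.4), (0.6, 0.2, 0.2), (0.6, 0.4)),   # a = 0, from (8.66)
    kernel((0.2, 0.8), (0.1, 0.1, 0.8), (0.2, 0.8)),   # a = 1, from (8.67)
])

def env_step(z, a):
    """Hypothetical simulator: returns (next level, observed reward r_t)."""
    z_next = int(rng.choice(10, p=P[a, z - 1])) + 1
    return z_next, z - 1.5 * a                          # reward (8.69)

Q = np.zeros((10, 2))                                   # Q-table estimate
z = 5                                                   # arbitrary initial level
for t in range(20_000):
    eps_t = (t + 1) ** -0.3                             # exploration probability
    alpha_t = (t + 1) ** -0.6                           # averaging weight
    if rng.random() < eps_t:                            # epsilon greedy control (8.79)
        a = int(rng.integers(2))
    else:
        a = int(np.argmax(Q[z - 1]))
    z_next, r = env_step(z, a)
    target = r + gamma * Q[z_next - 1].max()            # single sample Bellman estimate
    Q[z - 1, a] = (1 - alpha_t) * Q[z - 1, a] + alpha_t * target   # update (8.80)
    z = z_next

print(Q.argmax(axis=1))                                 # greedy policy from the estimate
```

Only one entry of `Q` changes per step, so levels that are rarely visited keep poor estimates for a long time; the resulting greedy policy need not match the one found by value iteration after this many steps.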

From an operational point of view, we can now integrate the epsilon-greedy control of (8.79) with the Q-learning recursion of (8.80) to develop a learning scheme that both controls the system and learns how to control it as time progresses. With such a scheme, calibration of the hyper-parameter sequences $\varepsilon_0, \varepsilon_1, \ldots$ and $\alpha_0, \alpha_1, \ldots$ is often delicate, and one often has to experiment with various hyper-parameter settings in order to get useful results.

A theoretical result for Q-learning is that under certain conditions we need the sequence $\alpha_0, \alpha_1, \ldots$ to satisfy,

\[
\sum_{t=0}^{\infty} \alpha_t = \infty, \qquad \text{and} \qquad \sum_{t=0}^{\infty} \alpha_t^{2} < \infty.
\tag{8.81}
\]

Hence the probabilities need to decay quickly enough, but not too quickly. So for example probabilities such as $\alpha_t = 1/(t+1)$ suffice since with these the first (harmonic) series diverges while the second series converges.
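For reference, the two series for the harmonic choice, together with two choices that each violate one of the conditions, can be written out as follows (standard facts about series, added here as a worked check):

\[
\alpha_t = \frac{1}{t+1}: \qquad
\sum_{t=0}^{\infty} \frac{1}{t+1} = 1 + \tfrac{1}{2} + \tfrac{1}{3} + \cdots = \infty,
\qquad
\sum_{t=0}^{\infty} \frac{1}{(t+1)^{2}} = \frac{\pi^{2}}{6} \approx 1.645 < \infty .
\]

\[
\alpha_t = c \in (0, 1] \ \text{(constant)}: \quad \sum_{t=0}^{\infty} \alpha_t^{2} = \infty,
\qquad\qquad
\alpha_t = \frac{1}{(t+1)^{2}}: \quad \sum_{t=0}^{\infty} \alpha_t = \frac{\pi^{2}}{6} < \infty .
\]

A constant sequence does not decay at all and fails the second condition, while $1/(t+1)^{2}$ decays too quickly and fails the first.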

While theoretical results based on the condition (8.81) help to place Q-learning on a rigorous footing, from a practical perspective Q-learning on its own is difficult to use effectively. Even in our simple student engagement example where we try to learn the $10 \times 2$ Q-table, Q-learning can be challenging. At any time step we only update a single entry of this table. This is already slow, and is further slowed down by the fact that for non-small times $t$, when $\alpha_t$ is a low probability, the averaging in (8.80) keeps the new entry $\hat{Q}^{(t+1)}(z, a)$ not far from the previous entry $\hat{Q}^{(t)}(z, a)$. In more complex and realistic scenarios where the state and action spaces are big and sometimes non-discrete, we cannot even tabulate the Q-function in a naive Q-table. Hence approximate Q-learning needs to be carried out. This brings us to deep reinforcement learning.


Deep Reinforcement Learning

The key idea with deep reinforcement learning is to approximate the Q-function, $Q^{*}(z, a)$, with a neural network, $f_{\theta}(z, a)$. Such a setup allows us to deal with highly complex state spaces and action spaces. The parameters of this network are learned during a deep Q-learning algorithm where as the algorithm evolves, we have a sequence of learned parameters $\theta^{(1)}, \theta^{(2)}, \ldots$. With each such $\theta^{(t)}$, we can still use an epsilon greedy policy similar to (8.79) where the control decision is taken as,

\[
g_{\text{policy}}(z, \theta^{(t)}) =
\begin{cases}
\operatorname*{argmax}_{a} f_{\theta^{(t)}}(z, a), & \text{with probability } 1 - \varepsilon_t, \\
\text{random action } a, & \text{with probability } \varepsilon_t.
\end{cases}
\tag{8.82}
\]

The key is now how to learn $f_{\theta}(\cdot, \cdot)$ and for this there are many variants of the deep Q-learning algorithm of which we only outline the simplest form. To create a loss function that will help to learn the parameters $\theta$, we use a reinforcement learning concept called temporal difference learning. Here the loss at time $t$ is given by,

\[
C(\theta \,;\, z_t, a_t, r_t, z_{t+1}) = \Big( \underbrace{r_t + \gamma \max_{a} f_{\theta}(z_{t+1}, a)}_{\text{Single sample Bellman estimate}} - f_{\theta}(z_t, a_t) \Big)^{2},
\tag{8.83}
\]

where the observed state is $z_t$, the system is controlled via the action $a_t$, a reward of $r_t$ is obtained, and the resulting next state is $z_{t+1}$. With such a temporal difference loss, if the neural network parameters $\theta$ determine a good approximation for the actual Q-function, then the loss would be generally low, whereas if $f_{\theta}$ does not describe the Q-function well, then the loss is higher. To see this, revisit the value iteration equation (8.77).
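The temporal difference loss can be illustrated numerically with a very small stand-in for the network. In the sketch below, everything beyond the loss itself is an assumption made for illustration: the approximator is linear in hand-picked features rather than a deep network, the loss is taken as the squared temporal difference, the sample $(z_t, a_t, r_t, z_{t+1}) = (4, 1, 2.5, 5)$ is made up, and the update treats the Bellman estimate as a fixed target when differentiating (often called a semi-gradient step).

```python
import numpy as np

gamma = 0.6

def features(z):
    """Hypothetical feature map for an engagement level z in 1..10."""
    return np.array([1.0, z / 10.0])

def f(theta, z, a):
    """Linear stand-in for the network f_theta(z, a); one weight row per action."""
    return float(theta[a] @ features(z))

def bellman_estimate(theta, r_t, z_next):
    """Single sample Bellman estimate r_t + gamma * max_a f_theta(z_next, a)."""
    return r_t + gamma * max(f(theta, z_next, a) for a in range(theta.shape[0]))

def td_loss(theta, z_t, a_t, r_t, z_next):
    """Squared temporal difference for one observed transition."""
    return (bellman_estimate(theta, r_t, z_next) - f(theta, z_t, a_t)) ** 2

def semi_gradient_step(theta, z_t, a_t, r_t, z_next, lr=0.1):
    """One descent step on the loss with the Bellman estimate held fixed."""
    delta = bellman_estimate(theta, r_t, z_next) - f(theta, z_t, a_t)
    new_theta = theta.copy()
    new_theta[a_t] += lr * 2.0 * delta * features(z_t)
    return new_theta

theta = np.zeros((2, 2))                     # all Q-estimates start at zero
sample = dict(z_t=4, a_t=1, r_t=2.5, z_next=5)

loss_before = td_loss(theta, **sample)       # (2.5 + 0.6 * 0 - 0)^2 = 6.25
theta = semi_gradient_step(theta, **sample)
loss_after = td_loss(theta, **sample)
print(loss_before, loss_after)
```

On this sample the loss drops after one step, though in general the target moves together with the parameters, which is one source of the instability discussed for the basic algorithm.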

Now with the neural network loss (8.83) defined, we have a very basic deep Q-learning algorithm. We control the system using the control policy of (8.82) and at any time step $t$ we apply a single gradient descent update for the parameters $\theta$ based on the loss $C(\theta \,;\, z_t, a_t, r_t, z_{t+1})$. The gradient descent update then modifies the Q-function estimate from $f_{\theta^{(t)}}(\cdot, \cdot)$ to $f_{\theta^{(t+1)}}(\cdot, \cdot)$. Note that like the Q-learning algorithm (without a neural network approximation) described above, in this framework every time step involves both a control decision and learning.

In practice, this basic deep Q-learning algorithm typically does not perform well. One problem is the coupling of the gradient descent step and the control decision, both happening only once at any time $t$. In practice we often seek an algorithm that can on the one hand learn the Q-function approximation using multiple control steps, and on the other hand perform multiple gradient descent steps. For this, one popular variant is to maintain two copies of the network approximating the Q-function, $f_{\theta}(z, a)$. One copy is updated only every several time steps and is used for the control decisions (8.82), and the other copy is updated with every gradient descent step. Another concept often used is replay memory where we control the system for some multiple time steps and use the combination of these time steps for gradient based learning. We do not dive into these technical details here.
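Although the details are left out of the text, the data structure behind replay memory is simple enough to sketch. The class below is a hypothetical minimal version built on the Python standard library: it keeps the most recent transitions $(z_t, a_t, r_t, z_{t+1})$ up to a fixed capacity and hands back random batches of them for gradient based learning. The class name, method names, and capacity are our own choices.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of observed transitions (z, a, r, z_next)."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, z, a, r, z_next):
        self.buffer.append((z, a, r, z_next))

    def sample(self, batch_size):
        """Uniformly sample a batch without replacement."""
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=3)
for step in range(5):                        # made-up transitions for illustration
    memory.push(z=step + 1, a=step % 2, r=float(step), z_next=step + 2)

batch = memory.sample(batch_size=2)
print(len(memory), batch)
```

A training loop would interleave calls to `push` while controlling the system with gradient steps on losses of the form (8.83) evaluated over sampled batches.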


8.5 Graph Neural Networks

In this section we introduce and explore neural networks for graph objects, called graph neural networks (GNNs). In a similar way to how convolutional neural networks are primarily used for image data, and recurrent neural networks are primarily used for sequence data, graph neural networks are applied on data organized as a (combinatorial) graph. The input data is typically of the form $G = (V, E)$ together with related feature data, where $V$ is some vertex set of the graph, and $E$ is the edge set of that graph. We introduce basic concepts of graphs in the sequel.

Abstractly, a graph neural network can be viewed as a function

\[
f_{\theta} : \mathcal{G} \times \mathcal{F} \to \text{output},
\tag{8.84}
\]

where $\mathcal{G}$ is the set of possible input graphs, $\mathcal{F}$ is the set of possible input features, and the output may be a graph, a classification vector, a scalar value, or similar. Specifically, the features in $\mathcal{F}$ may include weights on edges, or much more complex features associated with edges and nodes. In similar spirit to other deep learning models, graph neural networks often implement $f_{\theta}(\cdot)$ via a composition of a sequence of steps or layers, each denoted $f^{[\ell]}_{\theta^{[\ell]}}(\cdot)$ as in (5.1). At each such layer $\ell$, the input graph and features are transformed.

Applications of Graph Neural Networks

In contrast to domains such as image classiﬁcation using convolutional neural networks, or

natural language translation using sequence models, the applications associated with graph

neural networks often require some transformation of the given problem into a graphical

representation. Our focus in this section is not on such transformations for applications;

see the notes and references at the end of this chapter for specific reading suggestions. An

important point to highlight is that most applications of GNNs are not for mimicking human

level performance (e.g., image recognition or text understanding), but are rather for gaining

insights from complex data that is otherwise hard to process. We now mention a few general

application domains. See Figure 8.10 for an illustration.

[Figure 8.10: three example graphs. (a) A social network with nodes Emily, Kayley, Yarden, and Adaya, where some edges carry question marks. (b) A knowledge graph with nodes Sheep, Animal, Herb, Plant, Being, and Frog, and edges labelled “eat” and “is”. (c) A molecular graph with atoms O, N, C, and H.]

Figure 8.10: An illustration of graph structures arising in several application domains. (a) Connections in social networks described via graphs. The presence of edges with question marks is not known and can be predicted via a graph neural network. (b) Knowledge graphs capturing relationships between entities. Learning about such relationships can be assisted with graph neural networks. (c) Molecular bonds can be described via graphs. Thus classification of molecular structures and the design and discovery of new structures is also aided by GNNs.

One key area of application is social network analysis, where GNNs assist in identifying

inﬂuential users, detecting communities, and predicting connections between individuals.

Additionally, in recommendation systems, GNNs enhance the accuracy and personalization of

suggestions by modelling user-item interactions within collaborative ﬁltering settings. Another

prominent application is in fraud detection, particularly in ﬁnancial and e-commerce contexts,

where GNNs uncover hidden relationships and anomalous patterns within transaction graphs.

In the ﬁelds of biology and chemistry, GNNs excel at modelling molecular structures

and interactions, offering valuable insights into drug discovery, protein-protein interaction

prediction, and chemical property estimation. GNNs are also widely used in traﬃc analysis

for optimizing transportation networks, predicting congestion, and enhancing routing and

scheduling.

All of the above applications are generally cases where the tasks are not supposed to replace

human level tasks, yet there are also situations where GNNs can enhance or replace human

level tasks. Speciﬁcally, GNNs have applications in image analysis, enabling tasks like image

segmentation, object tracking, and scene understanding, particularly when graphs represent

relationships between image regions or structures. Further, in the natural language processing

domain, GNNs enhance tasks such as named entity recognition, sentiment analysis, and text

classiﬁcation by analyzing text data represented as dependency trees or semantic graphs.

Graph Structures

As alluded to in (8.84) the input to a GNN consists of a graph and features which are elements of $\mathcal{G}$ and $\mathcal{F}$ respectively. We now discuss possible representations of such objects. We begin with a brief outline of graph-theoretic terminology. See Figure 8.11 as a guide.

Recall that a graph $G \in \mathcal{G}$ is often denoted $G = (V, E)$. The graph is composed of a node set $V$ and an edge set $E$. The node set can be denoted as $V = \{v_1, \ldots, v_r\}$ where each $v_i$ is a node (also known as a vertex), and each of the $r$ nodes can represent a distinct entity in the application domain (e.g., a single person, an atom, etc.). The edge set $E$ is a subset of $V \times V$, or in particular is composed of tuples of the form $(v_i, v_j)$ where each such tuple represents an edge connecting $v_i$ and $v_j$. In our terminology we do not allow the elements $(v_i, v_i)$ to be in $E$, that is, there are no self loops.

In some cases the graph is a directed graph, where $(v_i, v_j)$ is different from $(v_j, v_i)$ for $i \neq j$ and the former represents an edge (arrow) from $v_i$ to $v_j$, while the latter is in the opposite direction. In other cases, the graph is an undirected graph in which case we can either treat $(v_i, v_j)$ as unordered, or more formally require that if $(v_i, v_j) \in E$ then also $(v_j, v_i) \in E$. In the undirected case, edges are not represented as arrows but rather as links between nodes.

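[Figure 8.11: two graphs on 8 nodes, each drawn next to its adjacency matrix.]

\[
\text{(a)} \quad
\begin{bmatrix}
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0
\end{bmatrix}
\qquad
\text{(b)} \quad
\begin{bmatrix}
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0
\end{bmatrix}
\]

Figure 8.11: Directed and undirected graphs each with their associated adjacency matrix. (a) An undirected graph has a symmetric adjacency matrix. (b) A directed graph has an adjacency matrix that is typically not symmetric.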


For undirected graphs, the degree of a node $v_i$ is the number of edges connected to it, or more formally the number of nodes $v_j$ such that $(v_i, v_j) \in E$. For directed graphs we differentiate between the out-degree of node $v_i$ and the in-degree. The former is the number of nodes $v_j$ such that $(v_i, v_j) \in E$, while the latter is the number of nodes $v_j$ such that $(v_j, v_i) \in E$. In the undirected case, both out-degree and in-degree are the same at each node.

In the context of undirected graphs, the neighbours of a node $v_i$ are the nodes $v_j$ that are connected to $v_i$ with an edge. We denote the set of these neighbours via $\mathcal{N}(v_i)$. Hence the degree is the number of elements in this set. See for example, the undirected graph illustrated in Figure 8.11 (a) where $|\mathcal{N}(v_1)| = 1$, $|\mathcal{N}(v_2)| = 2$, $|\mathcal{N}(v_3)| = 4$, $|\mathcal{N}(v_4)| = 1$, $|\mathcal{N}(v_5)| = 3$, etc.

A path between two nodes $v_i$ and $v_j$ is a sequence of nodes $(v_{k_1}, \ldots, v_{k_m})$ where $v_{k_1} = v_i$, $v_{k_m} = v_j$, and $(v_{k_\ell}, v_{k_{\ell+1}}) \in E$ for $\ell = 1, \ldots, m-1$. A graph is said to be connected if there is a path between any two nodes $v_i, v_j \in V$. One trivial graph (which is also connected) is the complete graph (also known as the fully connected graph) where $(v_i, v_j) \in E$ for all distinct $v_i, v_j \in V$.
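The definition of connectedness can be checked mechanically. The following is a minimal sketch, assuming an undirected graph given as an edge list with nodes numbered 1 to r; the helper names `neighbours` and `is_connected` are illustrative choices, and the edge list is read off the adjacency matrix of Figure 8.11 (a). A breadth-first search from node 1 reaches every node exactly when a path exists between any two nodes.

```python
from collections import deque

def neighbours(edges, r):
    """Neighbour sets (1-based node numbers) of an undirected graph."""
    nbrs = {i: set() for i in range(1, r + 1)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    return nbrs

def is_connected(edges, r):
    """Breadth-first search from node 1; connected iff every node is reached."""
    nbrs = neighbours(edges, r)
    seen, queue = {1}, deque([1])
    while queue:
        i = queue.popleft()
        for j in nbrs[i] - seen:
            seen.add(j)
            queue.append(j)
    return len(seen) == r

# Undirected edges read off the adjacency matrix of Figure 8.11 (a).
edges_a = [(1, 3), (2, 3), (2, 5), (3, 4), (3, 6), (5, 6), (5, 7), (6, 8)]
connected_a = is_connected(edges_a, 8)

# Removing the only edge of node 8 isolates it, so the graph is no longer connected.
connected_without_68 = is_connected([e for e in edges_a if e != (6, 8)], 8)
```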

Mathematically, one way to represent a graph is via an adjacency matrix where we number the nodes as $1, 2, \ldots, r$. Such an $r \times r$ matrix $A$ has entries that are either 0 or 1 where,

$$
A_{i,j} =
\begin{cases}
1 & \text{if } (v_i, v_j) \in E,\\
0 & \text{otherwise}.
\end{cases}
\tag{8.85}
$$

For undirected graphs, the adjacency matrix is symmetric ($A = A^\top$) whereas for directed graphs it is not necessarily symmetric. In either case, since we do not allow self loops, $A_{i,i} = 0$ for all $i$. When representing a graph in computer memory, sometimes an adjacency matrix is suitable, yet at other times, when the graph has fewer edges, sparser representations called adjacency lists are more suitable.
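As an illustration of (8.85), the sketch below builds adjacency matrices from edge lists and derives degrees and an adjacency list from them. It is a plain-Python sketch under stated assumptions: the function name `adjacency_matrix` is hypothetical, matrices are stored as lists of rows, and the two edge lists are read off the matrices shown with Figure 8.11.

```python
def adjacency_matrix(edges, r, directed=False):
    """Return the r x r 0/1 matrix of (8.85) as a list of rows (nodes are 1-based)."""
    A = [[0] * r for _ in range(r)]
    for i, j in edges:
        if i == j:
            raise ValueError("self loops are not allowed")
        A[i - 1][j - 1] = 1
        if not directed:
            A[j - 1][i - 1] = 1  # undirected: (v_j, v_i) is also in E
    return A

# Undirected graph of Figure 8.11 (a).
edges_a = [(1, 3), (2, 3), (2, 5), (3, 4), (3, 6), (5, 6), (5, 7), (6, 8)]
A = adjacency_matrix(edges_a, 8)
is_symmetric = all(A[i][j] == A[j][i] for i in range(8) for j in range(8))
degrees = [sum(row) for row in A]  # row sums give the degrees
adjacency_list = {i + 1: [j + 1 for j in range(8) if A[i][j]] for i in range(8)}

# Directed graph of Figure 8.11 (b); each pair (i, j) is an arrow from v_i to v_j.
edges_b = [(1, 3), (2, 5), (3, 2), (3, 4), (4, 3), (5, 6),
           (5, 7), (6, 3), (6, 8), (7, 5), (8, 6)]
A_directed = adjacency_matrix(edges_b, 8, directed=True)
out_degrees = [sum(row) for row in A_directed]        # row sums
in_degrees = [sum(col) for col in zip(*A_directed)]   # column sums
```

The adjacency list stores only the existing edges, which is why it is the more economical representation when most entries of the matrix are zero.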

Since the node numbering is typically arbitrary, it is often useful to allow all permutations of the node numbering to be valid. Mathematically, we can express this property with the aid of an $r \times r$ permutation matrix $P$ with,

$$
P_{i,j} =
\begin{cases}
1 & \text{if } j \text{ replaces } i \text{ in the permutation},\\
0 & \text{otherwise}.
\end{cases}
\tag{8.86}
$$

When $P$ is applied to a vector $x$, namely $\tilde{x} = P x$, the result $\tilde{x}$ has $x_j$ at index $i$ whenever $P_{i,j} = 1$. Similarly when $P$ is left multiplied to another matrix, the matrix's rows are permuted according to the permutation encoded in $P$. Also when $P^\top$ is right multiplied to another matrix, the matrix's columns are permuted as such. Finally, applying the permutation to the adjacency matrix we have that the new matrix $P A P^\top$ represents the adjacency matrix after transforming the numbering of the nodes according to the permutation described by $P$.
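The effect of a permutation matrix is easy to verify on a small case. The sketch below assumes 0-based indices and encodes the permutation as a list `perm` in which `perm[i] = j` means that old index j is placed at new index i; the helpers `permutation_matrix`, `matmul`, and `transpose` are illustrative and not part of any library.

```python
def permutation_matrix(perm):
    """P[i][j] = 1 when old index j is placed at new index i, as in (8.86)."""
    r = len(perm)
    return [[1 if perm[i] == j else 0 for j in range(r)] for i in range(r)]

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

perm = [2, 0, 1]                    # new node 0 is old node 2, and so on
P = permutation_matrix(perm)

x = [[10], [20], [30]]              # a column vector
x_tilde = matmul(P, x)              # entries of x reordered according to perm

A = [[0, 1, 0],                     # path graph 0 - 1 - 2
     [1, 0, 1],
     [0, 1, 0]]
A_perm = matmul(matmul(P, A), transpose(P))   # adjacency matrix after renumbering
```

In the renumbered graph the old middle node becomes node 2, so `A_perm` has its two edges in the last row and column, while the graph itself is unchanged.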

Figure 8.12: Transductive vs. inductive learning. (a) Transductive learning involves a single large input graph. (b) Inductive learning involves multiple separate input graphs.

A graph is called a weighted graph if for every edge $(v_i, v_j) \in E$ there is also an associated weight $w_{i,j} \in \mathbb{R}$ with the edge. One can also associate weights with all elements of $V \times V$ and consider a weight of 0 as the absence of the edge. In this case the adjacency matrix can be enhanced to an adjacency weight matrix which we still denote as $A$. With this representation,

$$
A_{i,j} =
\begin{cases}
w_{i,j} & \text{if } (v_i, v_j) \in E,\\
0 & \text{otherwise}.
\end{cases}
$$

One example application of a weighted graph is a map of cities where edges between cities

are roads and the weights are the distances in kilometers. Note that this is also a planar

graph since it can be “drawn” on a plane. Not all graphs are planar. We also mention that

a more general object than a graph is a multi-graph which permits multiple distinct edges

(e.g., roads) between vertices. Some graph neural networks deal with multi-graphs, yet we

do not go into this specialization in this section.

The definitions above associated with $G \in \mathcal{G}$ are general graph-theoretic concepts, not specific to graph neural networks. Yet in graph neural networks, there are also additional features, which we denote as elements of $\mathcal{X}$. Specifically, each individual node $v_i$ can carry an associated feature vector $x_{(i)}$ which we generally consider an element of $\mathbb{R}^p$, for some $p \geq 1$. For example, in social network applications where each individual node is a person, $x_{(i)}$ denotes the $p$ features associated with the person such as age, marital status, etc. We call these features node level features.

Similarly to node level features, in certain cases we can also consider edge level features. Here we have the features $x_{(i \to j)}$ for edge $(v_i, v_j)$, and we assume $x_{(i \to j)} \in \mathbb{R}^{q}$, for some $q \geq 1$. One can treat a weighted graph as a very simple case where $x_{(i \to j)} = w_{i,j}$ and $q = 1$. However, in most cases that involve edge level features, $q > 1$. For example, in a transportation case where edges are roads (or transportation links) we may have each $x_{(i \to j)}$ represent multiple characteristics of the road such as the number of lanes, the speed limit, toll information, etc.

In this section's exposition of graph neural networks we generally ignore edge level features and focus on data with node level features. In such a case, the features can be organized in an $r \times p$ matrix $X$ where each row $i$ represents the features of node $v_i$, namely $x_{(i)}$. That is, $X \in \mathcal{X}$, the set of all possible feature matrices, which here is $\mathbb{R}^{r \times p}$. Note also that if we wish to apply a permutation to the node numbering as encoded in a permutation matrix $P$, then the permuted feature matrix is $PX$.

Note that if one were to organize edge level features in such an object, then a tensor of dimension $r \times r \times q$, with $q$ the number of features per edge, would be appropriate.

The Structure of Input Data and Tasks

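Similarly to the rest of the book, we generally denote the data via $\mathcal{D}$. For graph neural networks this data can come in various forms:

$$
\mathcal{D} =
\begin{cases}
(G, X), & \text{for transductive learning (i)},\\[2pt]
\big\{\big((G^{(1)}, X^{(1)}), y^{(1)}\big), \ldots, \big((G^{(n)}, X^{(n)}), y^{(n)}\big)\big\}, & \text{for inductive learning (ii)},\\[2pt]
Z_{(G,X)}, & \text{for graph embedding (iii)}.
\end{cases}
\tag{8.87}
$$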

The forms (i) and (ii) in (8.87) are for the two overarching graph neural network learning paradigms, namely, transductive learning and inductive learning. In the case of transductive learning the data is one (big) input graph together with the features. In the case of inductive learning, the data consists of $n$ different graphs (potentially each of them smaller), each with its own features as well, and also potential labels, $y^{(i)}$. See Figure 8.12 for an illustration of the difference between transductive and inductive learning.

To further understand the difference between transductive and inductive learning consider the following illustrative examples. For transductive learning, consider the social networks domain where we treat a big social network with some missing data as an input and then learn a model for predicting potential connections between individuals (predicting missing edges). As an inductive learning example consider classification of molecules where for example $y^{(i)} = 0$ implies a non-toxic molecule and $y^{(i)} = 1$ implies that the molecule is toxic. The training data is then composed of many input graphs, some of which have $y^{(i)} = 0$ and some of which have $y^{(i)} = 1$. We then train a model that operates on an input graph (molecule structure) and outputs $\hat{y}$ which is the estimated probability of being toxic. It can then be transformed into a decision rule, as with any binary classifier.
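To make the shape of inductive training data concrete, here is a small sketch of how labelled graphs of different sizes might be stored, together with a thresholding decision rule. Everything in it is an assumption made for illustration: the class name `LabelledGraph`, the single made-up feature per node, and the threshold of 0.5 are not taken from the text.

```python
from dataclasses import dataclass

@dataclass
class LabelledGraph:
    """One inductive-learning example ((G, X), y), as in form (ii) of (8.87)."""
    A: list   # r x r adjacency matrix representing the graph G
    X: list   # r x p node feature matrix
    y: int    # label, here 0 (non-toxic) or 1 (toxic)

def decide(y_hat, threshold=0.5):
    """Turn an estimated probability of toxicity into a 0/1 decision."""
    return 1 if y_hat >= threshold else 0

# Two toy "molecules" of different sizes; all numbers are made up.
dataset = [
    LabelledGraph(A=[[0, 1], [1, 0]],
                  X=[[6.0], [8.0]], y=0),
    LabelledGraph(A=[[0, 1, 0], [1, 0, 1], [0, 1, 0]],
                  X=[[1.0], [8.0], [1.0]], y=1),
]

sizes = [len(g.A) for g in dataset]        # the graphs need not share a size
decisions = [decide(0.2), decide(0.9)]     # hypothetical model outputs
```

The point of the sketch is only that each example carries its own graph and its own feature matrix, in contrast to the single pair (G, X) of the transductive case.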

Form (iii) of the data in (8.87) is different. In this case the data is no longer a graph but rather a transformation of graphical data into vector form using techniques called graph embeddings. Here, in a similar nature to word embeddings in the context of natural language processing (see Section 7.1), input graph data and features $(G, X)$ are pre-processed to create a matrix $Z_{(G,X)}$ which summarizes the graph and the features via real valued vectors. In contrast to the first two forms of data (i) and (ii) in (8.87), which are used as input to graph neural networks, the graph embedding form (iii) can be used as input to feedforward networks or other (non graphical) models trained on $Z_{(G,X)}$. Interestingly, some graph embedding techniques themselves are based on graph neural networks; we omit the details.

Given graphical data of the form (8.87), one can train a model as in (8.84) for various different tasks. These can be categorized as tasks on nodes, tasks on edges, or tasks on graphs. In the first case the output is associated with nodes and our main goal is to predict outcomes, or impute missing features, for specific nodes in the graph. In the second case the output is associated with edges and our goal is to determine the presence of certain edges that were not originally available in the data, or similarly predict outcomes associated with available edges. In the third case the output is associated with the whole graph and in this case we predict properties of the graph, including classification of graphs, regression, and similar. See Figure 8.13 for an illustration.

Various techniques for graph embedding include DeepWalk, node2vec, GraphSAGE, LINE (Large-scale Information Network Embedding), and HOPE (High-Order Proximity preserved Embedding).

Figure 8.13: Various forms of tasks for graph neural networks. (a) Tasks on nodes deal with inference about individual nodes, for example classification of the type of node. (b) Tasks on edges deal with inference about individual edges, for example the existence of an edge or not. (c) Tasks on graphs deal with inference about the whole graph, for example classification of the graph.

Generally, tasks on graphs are carried out in an inductive setting as we require form (ii) of the training data from (8.87). In contrast, tasks on nodes or tasks on edges can be carried out in both inductive and transductive settings.

The General Structure of a Graph Neural Network Model

In general, a graph neural network model is a function of graph data $(G, X)$ from (8.87). Similarly to (5.1) of Chapter 5, we construct the neural network $f_\theta(\cdot)$ of (8.84) with a recursive computation of layers,

$$
f_\theta(G, X) = f^{[L]}_{\theta^{[L]}}\Big(f^{[L-1]}_{\theta^{[L-1]}}\big(\cdots f^{[1]}_{\theta^{[1]}}(G, X) \cdots\big)\Big).
$$

Here $f^{[\ell]}_{\theta^{[\ell]}}(\cdot)$ is the $\ell$-th layer, and $\theta^{[\ell]}$ are the parameters of that layer. As with other deep learning models, we denote $a^{[\ell]}$ as the result of $f^{[\ell]}_{\theta^{[\ell]}}(a^{[\ell-1]})$ and $a^{[0]} = (G, X)$, keeping in mind that $a^{[0]}$ contains both the graph structure and features.

Some models are dynamic graph neural networks in which case the graph structure is modified as it is processed by the neural network layers. Other models are static graph neural networks for which $G = (V, E)$ is not modified across layers, and thus $G$ has the same value within each layer. In such a case, it is convenient to represent $a^{[\ell]} = (G, h^{[\ell]})$ where $h^{[\ell]}$ is sometimes called the hidden state. We set $h^{[0]}$ as the input features from $X$ and denote the neural network function as $f_\theta(h^{[0]})$. With this representation, the function of each layer can be denoted as $f^{[\ell]}_{\theta^{[\ell]}}(h^{[\ell-1]}; G)$. Note that $G$ is fixed for all layers, but with each application of the complete $f_\theta(h^{[0]})$, a different graph structure $G$ is possible. As with feedforward neural network models, the parameters of the model are $\theta = (\theta^{[1]}, \ldots, \theta^{[L]})$.

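Our focus in this exposition is only on static graph neural networks for which the forward pass is,

$$
\begin{aligned}
h^{[1]} &= f^{[1]}_{\theta^{[1]}}\big(\overbrace{h^{[0]}}^{\text{Input features}}; G\big),\\
h^{[2]} &= f^{[2]}_{\theta^{[2]}}\big(h^{[1]}; G\big),\\
&\;\;\vdots\\
h^{[L-1]} &= f^{[L-1]}_{\theta^{[L-1]}}\big(h^{[L-2]}; G\big),\\
\underbrace{\hat{y}}_{\text{output}} &= f^{[L]}_{\theta^{[L]}}\big(h^{[L-1]}; G\big).
\end{aligned}
$$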
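The recursion above is a plain loop over layer functions that all receive the same graph. The sketch below shows only this control flow; the two toy layers (`add_neighbour_sum` and `readout`) are hypothetical stand-ins chosen so that the numbers can be followed by hand, and they carry no learned parameters.

```python
def forward(h0, G, layers):
    """Apply the layer functions in order; the graph G is the same in every layer."""
    h = h0
    for layer in layers:
        h = layer(h, G)
    return h

# Hypothetical toy layers acting on one number per node.
def add_neighbour_sum(h, A):
    # each node adds the sum of its neighbours' values to its own value
    r = len(h)
    return [h[i] + sum(A[i][j] * h[j] for j in range(r)) for i in range(r)]

def readout(h, A):
    # final layer: collapse the node values into a single graph-level output
    return sum(h)

A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]                      # path graph with 3 nodes
h1 = add_neighbour_sum([1.0, 2.0, 3.0], A)
y_hat = forward([1.0, 2.0, 3.0], A, [add_neighbour_sum, add_neighbour_sum, readout])
```

Here the last layer returns an object of a different shape than the hidden state, which mirrors the fact that the output need not have the dimension of the hidden states.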

The specific structure of $h^{[\ell]}$ can vary between applications and may be a matrix, or a higher dimensional tensor. In our case, for simplicity, let us assume it is a matrix of dimension $r \times p$ where the $i$-th row represents the node level features for node $v_i$ in the graph, and there are $r$ nodes in total. In general one may have different column dimensions (number of features) per layer $\ell$, yet for simplicity let us assume that this is fixed as $p$ throughout the layers.

An important requirement of the layer function $f^{[\ell]}_{\theta^{[\ell]}}(h^{[\ell-1]}; G)$ is permutation invariance where the order of nodes (and consequently, the order of entries in the adjacency matrix) does not affect the network's output. To understand this requirement, assume that we represent $G$ via an adjacency matrix $A$, and hence the layer's operation can be represented with $A$ in place of $G$, namely we can denote the layer's function as $f^{[\ell]}_{\theta^{[\ell]}}(h^{[\ell-1]}; A)$. Now for any permutation on the nodes represented via $P$ as in (8.86), we require that,

$$
f^{[\ell]}_{\theta^{[\ell]}}\big(\underbrace{P h^{[\ell-1]}}_{\text{Permuted hidden state}};\ \overbrace{P A P^\top}^{\text{Permuted adjacency matrix}}\big)
= \underbrace{P f^{[\ell]}_{\theta^{[\ell]}}\big(h^{[\ell-1]}; A\big)}_{\text{Permuted hidden state at layer } \ell}.
\tag{8.88}
$$

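The permutation invariance requirement in (8.88), which strictly speaking is an equivariance property since the rows of the output are permuted in the same way as the rows of the input, then ensures that node numbering is indeed arbitrary and does not affect the operation of the model.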
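Requirement (8.88) can be checked numerically for a concrete layer. The sketch below assumes one simple layer, relu((A + I) h W) applied element-wise with a fixed weight matrix W; this particular layer, the numbers, and the helper names are assumptions made for illustration. It then compares both sides of (8.88) for one permutation.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def layer(h, A, W):
    """Assumed toy layer: relu((A + I) h W), applied element-wise."""
    r = len(A)
    A_plus_I = [[A[i][j] + (1 if i == j else 0) for j in range(r)] for i in range(r)]
    Z = matmul(matmul(A_plus_I, h), W)
    return [[max(0.0, z) for z in row] for row in Z]

A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]          # path graph, r = 3
h = [[1.0, -2.0], [0.5, 3.0], [-1.0, 4.0]]     # p = 2 features per node
W = [[1.0, -1.0], [2.0, 0.5]]
P = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]          # a permutation matrix

# Left side of (8.88): permute the inputs, then apply the layer.
lhs = layer(matmul(P, h), matmul(matmul(P, A), transpose(P)), W)
# Right side of (8.88): apply the layer, then permute the output rows.
rhs = matmul(P, layer(h, A, W))
```

Both sides agree because the layer treats every node with the same function and refers to other nodes only through the graph structure, never through their index.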

Let us now dive into the structure of a single layer except for the final layer. Namely, let us see the structure of $f^{[\ell]}_{\theta^{[\ell]}}(h^{[\ell-1]}; G)$ for $\ell = 1, \ldots, L-1$. Here, similarly to the convolutional neural networks of Chapter 6, graph neural networks try to enforce locality and translation invariance; see Section 6.1. The translation invariance property is analogous to the permutation invariance property of (8.88). Locality is enforced by requiring that the output of $f^{[\ell]}_{\theta^{[\ell]}}(h^{[\ell-1]}; G)$ for node $v_i$ only depends on the neighbours of $v_i$. Specifically in our data representation it means that only the rows with indices of neighbours $\mathcal{N}(v_i)$ as well as $i$ itself are used to compute the $i$-th row of $h^{[\ell]}$. Permutation invariance is enforced by using the same form of function (and same parameters) for each target output node, and not considering the actual index of a node but rather only the graphical structure.

The forward pass of a static graph neural network can be compared with the feedforward network forward pass in (5.4).

Specifically, if we denote the $i$-th row of $h^{[\ell]}$ via $h^{[\ell]}_{(i)}$ then the computation of the layer associated with node $v_i$ can be represented via,

$$
h^{[\ell]}_{(i)} = f^{[\ell]}_{\text{node},\theta^{[\ell]}}\Big(h^{[\ell-1]}_{(j)} \ \text{for}\ v_j \in \mathcal{N}(v_i) \cup \{v_i\}\Big),
\tag{8.89}
$$

where the function $f^{[\ell]}_{\text{node},\theta^{[\ell]}}(\cdot)$ determines how the hidden state of a node in the next layer is determined by the hidden state of the node and its neighbours in the previous layer.

The final layer $L$ is typically different because the output $\hat{y}$ is often not of the same dimension as $h^{[\ell]}$. In this case, we just retain the final layer action via the general function $f^{[L]}_{\theta^{[L]}}(\cdot)$.

Message Passing Schemes

The operation of (8.89) is based on a so-called message passing scheme where two steps called aggregate and update are executed one after the other. These steps break up the operation of the $f^{[\ell]}_{\text{node},\theta^{[\ell]}}(\cdot)$ function via a function $f_{\text{aggregate}}(\cdot)$ and a function $f^{[\ell]}_{\text{update},\theta^{[\ell]}}(\cdot)$ and are executed as,

$$
\begin{aligned}
m^{[\ell]}_{(i)} &= f_{\text{aggregate}}\Big(h^{[\ell-1]}_{(j)} \ \text{for}\ v_j \in \mathcal{N}(v_i) \cup \{v_i\}\Big),\\
h^{[\ell]}_{(i)} &= f^{[\ell]}_{\text{update},\theta^{[\ell]}}\Big(h^{[\ell-1]}_{(i)},\ m^{[\ell]}_{(i)}\Big).
\end{aligned}
$$

Observe that the aggregate function is the same for all layers and is not parameterized, while the parameters for layer $\ell$, $\theta^{[\ell]}$, are associated only with the update function.

As is evident, the aggregation step utilizes all the hidden states of the neighbours of the node, say $v_i$, to achieve a summary, $m^{[\ell]}_{(i)}$, which we call the message. Note that in some cases it also uses node $v_i$ itself (self loops) and in other cases not (no self loops). The message collects hidden state information of neighbouring nodes.

In our exposition, we assume that the message for node $v_i$ is an element of $\mathbb{R}^p$. Similarly to the hidden state, for simplicity, we assume this dimension, $p$, does not depend on the layer $\ell$.

Typical aggregation functions are the sum, the mean, or the element-wise maximum. In our exposition we focus on the sum as an illustrative simple case where,

$$
m^{[\ell]}_{(i)} =
\begin{cases}
\displaystyle\sum_{j \,:\, v_j \in \mathcal{N}(v_i)} h^{[\ell-1]}_{(j)}, & \text{for sum aggregation without self loops},\\[10pt]
\displaystyle\sum_{j \,:\, v_j \in \mathcal{N}(v_i)} h^{[\ell-1]}_{(j)} + h^{[\ell-1]}_{(i)}, & \text{for sum aggregation with self loops}.
\end{cases}
\tag{8.90}
$$

The aggregate and update steps are carried out for all nodes in the graph and thus one may view this scheme as messages passing along neighbours within the graph. Note that this type of architecture is sometimes called a message passing neural network (MPNN); see Figure 8.14 for an illustration.
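The sum aggregation of (8.90) can be sketched at the level of a single node. The code below assumes 0-based node indices, neighbour lists stored in a dictionary, and hidden states stored as a list of rows; the function name `aggregate_sum` and the toy numbers are illustrative only.

```python
def aggregate_sum(i, h_prev, nbrs, self_loop=False):
    """Message for node i as in (8.90): the sum of the neighbours' hidden-state rows."""
    p = len(h_prev[i])
    message = [0.0] * p
    sources = list(nbrs[i]) + ([i] if self_loop else [])
    for j in sources:
        message = [m + v for m, v in zip(message, h_prev[j])]
    return message

# Path graph 0 - 1 - 2 with p = 2 features per node.
nbrs = {0: [1], 1: [0, 2], 2: [1]}
h_prev = [[1.0, 0.0],
          [0.0, 2.0],
          [3.0, 1.0]]

messages = [aggregate_sum(i, h_prev, nbrs) for i in range(3)]
messages_self = [aggregate_sum(i, h_prev, nbrs, self_loop=True) for i in range(3)]
```

Replacing the running sum by a mean or an element-wise maximum would give the other aggregation functions mentioned in the text.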

The update step is sometimes called combine.

Figure 8.14: Aggregation in a message passing scheme. (a) An input graph where the messages for the target node are considered. (b) The application of the aggregation via multiple layers (2 layers in this case) yielding the message aggregated for the node. Observe the neighbours of the target node in each layer.

Considering the whole graph, note that like the $r \times p$ hidden state matrix $h^{[\ell]}$, we can also consider an $r \times p$ message matrix $M^{[\ell]}$, where the $i$-th row is $m^{[\ell]}_{(i)}$. Now focusing on the unweighted case, using the graph's adjacency matrix $A$ from (8.85), one can verify that the message matrix resulting from the sum aggregation (8.90) is,

$$
M^{[\ell]} =
\begin{cases}
A\, h^{[\ell-1]}, & \text{for sum aggregation without self loops},\\
(A + I)\, h^{[\ell-1]}, & \text{for sum aggregation with self loops}.
\end{cases}
\tag{8.91}
$$
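The claim that the sum aggregation of all nodes is a single matrix product can be verified on a small graph. The following sketch assumes matrices stored as lists of rows and uses a hypothetical `matmul` helper; it compares the matrix form of (8.91) with the node-by-node sums of (8.90).

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]                       # path graph 0 - 1 - 2
h_prev = [[1.0, 0.0],
          [0.0, 2.0],
          [3.0, 1.0]]                 # r = 3 nodes, p = 2 features
r, p = len(A), len(h_prev[0])

A_plus_I = [[A[i][j] + (1 if i == j else 0) for j in range(r)] for i in range(r)]

M = matmul(A, h_prev)                 # sum aggregation without self loops
M_self = matmul(A_plus_I, h_prev)     # sum aggregation with self loops

# The same messages computed node by node, directly from the definition.
M_nodewise = [[sum(h_prev[j][c] for j in range(r) if A[i][j]) for c in range(p)]
              for i in range(r)]
```

Row i of A has ones exactly at the neighbours of node i, so multiplying by A adds up the corresponding rows of the hidden state matrix.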

The update step is where a neural network approach is used. In this step, the hidden state of the node, $h^{[\ell-1]}_{(i)}$, and the message $m^{[\ell]}_{(i)}$ are used to determine the hidden state of the node in the output of the $\ell$-th layer. The update function is typically composed of an affine transformation together with a non-linear activation. A simple typical form of $f^{[\ell]}_{\text{update},\theta^{[\ell]}}(\cdot)$ is,

$$
h^{[\ell]}_{(i)} = S\Big(m^{[\ell]}_{(i)} W_1^{[\ell]} + h^{[\ell-1]}_{(i)} W_2^{[\ell]} + b^{[\ell]}\Big),
\tag{8.92}
$$

where we treat the vectors as row vectors. Here $S(\cdot)$ is a vector activation function typically formed of scalar activation functions such as $\sigma_{\text{Sig}}(\cdot)$, similarly to other neural networks. With this form, the learned parameters $\theta^{[\ell]} = (W_1^{[\ell]}, W_2^{[\ell]}, b^{[\ell]})$ include weight matrices and a bias vector, and under our (simplifying) dimensionality assumptions, we have $W_1^{[\ell]}, W_2^{[\ell]} \in \mathbb{R}^{p \times p}$, and $b^{[\ell]} \in \mathbb{R}^p$ (considered as a row vector). Note that more generally, dimensions may vary across layers and hence the matrices may be non-square.

If we focus on sum aggregation without self loops, then (8.92) becomes,

$$
h^{[\ell]}_{(i)} = S\Big(\sum_{j \,:\, v_j \in \mathcal{N}(v_i)} h^{[\ell-1]}_{(j)} W_1^{[\ell]} + h^{[\ell-1]}_{(i)} W_2^{[\ell]} + b^{[\ell]}\Big).
\tag{8.93}
$$

Further, in this case, if we consider the whole graph, we can use (8.91) to represent (8.93) via a network wide equation,

$$
h^{[\ell]} = S\Big(A\, h^{[\ell-1]} W_1^{[\ell]} + h^{[\ell-1]} W_2^{[\ell]} + B^{[\ell]}\Big),
\tag{8.94}
$$

where in similar nature to (5.10) of Chapter 5, $B^{[\ell]}$ is an $r \times p$ matrix with each row equal to the bias vector $b^{[\ell]}$. Note that here $S(\cdot)$ is taken as the activation function over the whole matrix, typically element wise.
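A whole-graph layer of the form (8.94) takes only a few lines. The sketch below writes the two weight matrices as `W1` (applied to the aggregated messages) and `W2` (applied to the node's own hidden state); these names, the `gnn_layer` function, and all numbers are assumptions for illustration, and the sigmoid is used as one possible element-wise activation.

```python
import math

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gnn_layer(h_prev, A, W1, W2, b, activation=sigmoid):
    """One layer of the form (8.94): S(A h W1 + h W2 + B), where B repeats the row b."""
    neighbour_term = matmul(matmul(A, h_prev), W1)   # messages times W1
    self_term = matmul(h_prev, W2)                   # own hidden state times W2
    return [[activation(n + s + bias) for n, s, bias in zip(n_row, s_row, b)]
            for n_row, s_row in zip(neighbour_term, self_term)]

A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]                # path graph, r = 3
h_prev = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]        # p = 2
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
b = [1.0, -1.0]

h_linear = gnn_layer(h_prev, A, W1, W2, b, activation=lambda z: z)  # pre-activation values
h_next = gnn_layer(h_prev, A, W1, W2, b)                            # with the sigmoid
```

Passing the identity as the activation exposes the affine part, which makes it easy to check the three terms of (8.94) by hand before the non-linearity is applied.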

It is also of interest to note that in the case where we restrict the parameters with $W_1^{[\ell]} = W_2^{[\ell]}$, both denoted as $W^{[\ell]}$, then (8.94) is reduced to

$$
h^{[\ell]} = S\Big((A + I)\, h^{[\ell-1]} W^{[\ell]} + B^{[\ell]}\Big).
\tag{8.95}
$$

This case of $W_1^{[\ell]} = W_2^{[\ell]}$ yields a similar update rule to what we would have if we consider sum aggregation with self loops; see (8.91).

In practice, one can choose between the formulation with more parameters, (8.94), and the less parameterized formulation, (8.95). With the latter, there can be some restriction on the expressivity of the graph neural network as there is no separation of the information from the node and from its neighbours. Nevertheless in some cases, the less parameterized case, (8.95), suffices.
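The step from (8.94) to (8.95) is a one-line factoring, written out here as a sketch. The labels $W_1^{[\ell]}$ and $W_2^{[\ell]}$ for the two weight matrices of (8.94) are an assumption made for illustration, and the parameter counts rely on the simplifying assumption that every weight matrix is $p \times p$ and the bias has $p$ entries.

```latex
\begin{align*}
A\, h^{[\ell-1]} W_1^{[\ell]} + h^{[\ell-1]} W_2^{[\ell]}
  &= A\, h^{[\ell-1]} W^{[\ell]} + I\, h^{[\ell-1]} W^{[\ell]}
  && \text{(setting } W_1^{[\ell]} = W_2^{[\ell]} = W^{[\ell]}\text{)} \\
  &= (A + I)\, h^{[\ell-1]} W^{[\ell]}.
\end{align*}

\begin{align*}
\text{learned parameters per layer in (8.94):} \quad & 2p^2 + p, \\
\text{learned parameters per layer in (8.95):} \quad & p^2 + p.
\end{align*}
```

For example, with $p = 16$ the two formulations have $528$ and $272$ parameters per layer respectively, independently of the number of nodes $r$.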

Model Variants

We close this section with a few model variants of graph neural networks. Each of the variants has its advantages and disadvantages, and the applicability of the variants for applications is beyond our scope. Our purpose here is simply to explore the various basic ideas and equations associated with each of the models. We consider graph convolutional networks, spectral approaches, and the use of the attention mechanism.

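In a graph convolutional network, (8.95) is modified to,

$$
h^{[\ell]} = S\Big(\tilde{D}^{-1/2} (A + I)\, \tilde{D}^{-1/2}\, h^{[\ell-1]} W^{[\ell]} + B^{[\ell]}\Big),
\tag{8.96}
$$

where the matrix $\tilde{D}$ is a diagonal matrix where $\tilde{D}_{i,i}$ is the degree of node $v_i$ plus one. Thus $\tilde{D}^{-1/2}$ is a diagonal matrix with entries that are the inverse of the square root of the degree plus one. At the node level, again representing vectors as row vectors, we can unpack (8.96) for node $v_i$ as,

$$
h^{[\ell]}_{(i)} = S\Big(\sum_{j \,:\, v_j \in \mathcal{N}(v_i)} \frac{1}{\sqrt{\tilde{D}_{i,i}\, \tilde{D}_{j,j}}}\, h^{[\ell-1]}_{(j)} W^{[\ell]} + \frac{1}{\tilde{D}_{i,i}}\, h^{[\ell-1]}_{(i)} W^{[\ell]} + b^{[\ell]}\Big).
\tag{8.97}
$$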

If we compare (8.97) with (8.93), we see that a graph convolutional network is similar to sum aggregation, yet with weighting in the summation proportional to the degrees. Specifically, when considering the update for the hidden state $h^{[\ell]}_{(i)}$ of node $v_i$, the weight of the hidden state of neighbouring nodes that have more neighbours than $v_i$ is reduced, and conversely the weight for nodes that have fewer neighbours is increased. This scaling acts as a form of regularization in the network.
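The degree-based weights of a graph convolutional network can be computed directly from the adjacency matrix. The sketch below forms the matrix that multiplies the hidden state in (8.96), written here with a diagonal matrix holding "degree plus one"; the function name `gcn_propagation_matrix` is an illustrative choice and matrices are plain lists of rows.

```python
import math

def gcn_propagation_matrix(A):
    """Entries of D^{-1/2} (A + I) D^{-1/2}, with D holding degree-plus-one on its diagonal."""
    r = len(A)
    d = [sum(A[i]) + 1 for i in range(r)]          # degree of each node plus one
    return [[(A[i][j] + (1 if i == j else 0)) / math.sqrt(d[i] * d[j])
             for j in range(r)] for i in range(r)]

A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]                                    # path graph: degrees 1, 2, 1
A_norm = gcn_propagation_matrix(A)

weight_self_end = A_norm[0][0]        # 1 / (1 + 1) for an end node
weight_self_mid = A_norm[1][1]        # 1 / (2 + 1) for the middle node
weight_edge = A_norm[0][1]            # 1 / sqrt(2 * 3) for an end-middle edge
```

The entry for an edge shrinks as either endpoint gains neighbours, which is the degree-dependent weighting visible in the node-level form (8.97).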

Graph convolutional networks can be extended with a spectral approach. Speciﬁcally let us

now outline key ideas of spectral graph neural networks, also known as spectral convolutional

graph neural networks. Generally, in the world of signal processing and mathematics, a

spectral approach deals with analyzing a transform of a signal in place of the signal itself.

For example in the context of time signals, as brieﬂy discussed in Section 6.2, instead of

the signal, one may sometimes consider the Fourier transform of the signal. Then based on

the so-called convolution theorem, the Fourier transform of the convolution between two

signals, as for example in equation

(6.2)

of Chapter 6, can be represented as the product of

the Fourier transforms of the individual signals.

In the context of graphs, an analogy to the convolution theorem can be considered using

eigenvalue decompositions of matrices, which is the topic of study of an area called spectral

graph theory. Let us focus on undirected graphs that are connected, and thus the degree of

each node is at least one.

In spectral graph theory, an important matrix associated with a graph with $r$ nodes is the $r \times r$ Laplacian matrix, appearing in either the unnormalized form, or the normalized form, and defined as,

$$
L =
\begin{cases}
D - A & \text{(unnormalized)},\\
I - D^{-1/2} A\, D^{-1/2} & \text{(normalized)}.
\end{cases}
$$

Here, similarly to $\tilde{D}$ from above, $D$ is a diagonal matrix where this time $D_{i,i}$ is the degree of node $v_i$.

One may consider a vector $u^{[\ell]} \in \mathbb{R}^r$ which represents some state values for each node in the graph, and the operation

$$
u^{[\ell+1]} = L\, u^{[\ell]},
\tag{8.98}
$$

can encompass the effect of applying the Laplacian matrix of the graph (either unnormalized or normalized) on the state $u^{[\ell]}$ to achieve the next state $u^{[\ell+1]}$. Interpretations of this operation for specific graph contexts are beyond our scope, yet we mention that in the context of random walks on graphs, electrical networks (Kirchhoff laws), and other contexts, this operation is common.
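Both forms of the Laplacian, and one application of it to a state vector, can be sketched as follows. The code assumes an undirected graph in which every node has degree at least one (so that the normalized form is defined); the function names `laplacians` and `laplacian_step` are illustrative.

```python
import math

def laplacians(A):
    """Return the (unnormalized, normalized) Laplacians of an undirected graph."""
    r = len(A)
    d = [sum(row) for row in A]                    # node degrees, assumed positive
    L_un = [[(d[i] if i == j else 0) - A[i][j] for j in range(r)] for i in range(r)]
    L_norm = [[(1.0 if i == j else 0.0) - A[i][j] / math.sqrt(d[i] * d[j])
               for j in range(r)] for i in range(r)]
    return L_un, L_norm

def laplacian_step(L, u):
    """One application of the operation (8.98): the next state is L times u."""
    r = len(u)
    return [sum(L[i][j] * u[j] for j in range(r)) for i in range(r)]

A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]                                    # path graph 0 - 1 - 2
L_un, L_norm = laplacians(A)

u_const = laplacian_step(L_un, [1.0, 1.0, 1.0])    # a constant state is mapped to zero
u_spike = laplacian_step(L_un, [1.0, 0.0, 0.0])    # a spike at node 0
```

With the unnormalized Laplacian, each entry of the result is the node's degree times its own value minus the sum of its neighbours' values, so a constant state gives zero.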

A key idea in spectral graph networks is to modify the operation (8.98) to,

$$
u^{[\ell+1]} = \tilde{L}\, u^{[\ell]},
\tag{8.99}
$$

where the modification from $L$ to $\tilde{L}$ is the essence of the learned parameters in the model and is further explained below. With this modification it is useful to use the spectral decomposition and learn how to modify the eigenvalues of $L$ effectively. This is analogous to learning how to modify filters, as is done in Chapter 6, yet unlike Chapter 6, learning is on the spectral domain (eigenvalues) and not on the time domain (convolutions).

Regarding the Fourier transform mentioned earlier, note that in Chapter 6 we actually do not discuss Fourier transforms, yet we point the reader there for general context of signals and systems.

Now looking at the spectral decomposition, since the graph is undirected, $L$, either in the unnormalized or normalized form, is a symmetric matrix, and further it can be shown to be positive semidefinite. The spectral decomposition of $L$ is,

$$
L = U \Lambda U^\top,
\tag{8.100}
$$

where $U \in \mathbb{R}^{r \times r}$ has normalized eigenvectors of $L$ as columns, and $\Lambda \in \mathbb{R}^{r \times r}$ is a diagonal matrix with corresponding eigenvalues, each of which is real and non-negative (due to $L$ being positive semidefinite). With this spectral decomposition, $U$ is an orthogonal matrix, namely $U^{-1} = U^\top$ (the inverse of $U$ is its transpose). Note that it is also customary to order the eigenvalues in the diagonal of $\Lambda$ in descending order (and the associated eigenvectors in the columns of $U$ are obviously ordered accordingly).

Now when we consider (8.98) and wish to modify it to (8.99), we do so by transforming the eigenvalues of $L$, appearing in the diagonal of $\Lambda$. For example with so called high pass filtering we retain the high valued eigenvalues (above some threshold), while shrinking or zeroing out the other eigenvalues. Similarly, low pass filtering works in the other direction, retaining only low eigenvalues. In each such case, we may view the transformation of the eigenvalues as some function $F(\cdot)$ which applied to the diagonal matrix $\Lambda$ is denoted as $F(\Lambda) \in \mathbb{R}^{r \times r}$. With this we have the modified Laplacian matrix as $\tilde{L} = U F(\Lambda) U^\top$, and thus (8.99) is represented as,

$$
u^{[\ell+1]} = U F(\Lambda)\, U^\top u^{[\ell]}.
\tag{8.101}
$$

Note that we may view (8.101) as first projecting $u^{[\ell]}$ onto the orthogonal eigenvector space via the transformation $U^\top u^{[\ell]}$, then applying individual (adjusted via $F(\cdot)$) eigenvalues on each coordinate of the basis via the left multiplication by the diagonal matrix $F(\Lambda)$, and finally, transforming back to the original basis via another left multiplication by $U$.
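The three stages just described (project, rescale, transform back) can be traced on a graph whose spectral decomposition is known in closed form. The sketch below assumes the path graph on three nodes, whose unnormalized Laplacian has eigenvalues 3, 1, 0 with the eigenvectors listed in the columns of `U`; the helper names `spectral_apply` and `high_pass`, and the threshold value, are illustrative assumptions.

```python
import math

s2, s3, s6 = math.sqrt(2.0), math.sqrt(3.0), math.sqrt(6.0)

# Spectral decomposition of L = [[1,-1,0],[-1,2,-1],[0,-1,1]] (path graph, 3 nodes),
# with eigenvalues in descending order and normalized eigenvectors as columns of U.
eigenvalues = [3.0, 1.0, 0.0]
U = [[1 / s6,  1 / s2, 1 / s3],
     [-2 / s6, 0.0,    1 / s3],
     [1 / s6, -1 / s2, 1 / s3]]

def spectral_apply(U, filtered_eigenvalues, u):
    """Compute U F(Lambda) U^T u in the three stages of (8.101)."""
    r = len(u)
    coords = [sum(U[i][k] * u[i] for i in range(r)) for k in range(r)]      # U^T u
    scaled = [lam * c for lam, c in zip(filtered_eigenvalues, coords)]      # F(Lambda) U^T u
    return [sum(U[i][k] * scaled[k] for k in range(r)) for i in range(r)]   # back via U

def high_pass(eigs, threshold):
    """Retain eigenvalues above the threshold and zero out the others."""
    return [lam if lam > threshold else 0.0 for lam in eigs]

u = [1.0, 0.0, 0.0]
Lu = spectral_apply(U, eigenvalues, u)                        # unfiltered: equals L u
u_high = spectral_apply(U, high_pass(eigenvalues, 2.0), u)    # only eigenvalue 3 is kept
```

With the unfiltered eigenvalues the result reproduces the plain Laplacian operation, so the filter is the only place where the modification enters.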

Having understood how the spectral decomposition can be used, we now return to the general setup of graph neural networks where as before $h^{[\ell]} \in \mathbb{R}^{r \times p}$ is the hidden state matrix which is updated from layer $\ell$ to layer $\ell + 1$. Now also denote $h^{[\ell]}_{(|i)}$ as a vector in $\mathbb{R}^r$ which is the $i$-th column of this matrix. This vector represents the hidden state information for each node in the network, based on the $i$-th hidden state feature at layer $\ell$. With this notation, the update for this hidden state vector per feature $i$ depends on all features in the previous layer and is,

$$
h^{[\ell+1]}_{(|i)} = S\Big(\sum_{k=1}^{p} U F^{[\ell]}_{(k,i)} U^\top h^{[\ell]}_{(|k)} + b^{[\ell]}_{(i)}\Big), \qquad \text{for } i = 1, \ldots, p.
\tag{8.102}
$$

Here for each pair of features $k$ and $i$, $F^{[\ell]}_{(k,i)}$ is an $r \times r$ diagonal matrix of learned parameters that we call spectral weights, and $b^{[\ell]}_{(i)}$ is a bias vector in $\mathbb{R}^r$. As always, $S(\cdot)$ is a vector activation function. For one such layer $\ell$, we can observe that the total number of learned parameters for spectral weights is $r \times (p \times p)$ and the total number of learned parameters for the bias vectors is $r \times p$.

To get a better feel for the update equation (8.102), compare it with (8.101). In (8.101) see that $F(\Lambda)$ is a filter applied to the eigenvalues whereas in (8.102) we represent a learned $F(\Lambda)$ via $F^{[\ell]}_{(k,i)}$. Similarly to the concept of channels in convolutional neural networks of Chapter 6, the summation over all updates $U F^{[\ell]}_{(k,i)} U^\top h^{[\ell]}_{(|k)}$ integrates all the features from layer $\ell$ into the associated feature of layer $\ell + 1$.

The story of spectral graph neural networks does not stop here because ideally one would like to reduce the dimension of the learned parameters $F^{[\ell]}_{(k,i)}$ so that they do not depend on the whole graph and are independent of the number of nodes in the graph $r$. The study and application of spectral graph neural networks deals with such approaches and indeed in certain cases it has been shown that spectral graph neural networks generalize well and sometimes perform better than their convolutional graph neural network counterparts. The details are beyond our scope.

A different variant is graph attention networks where the aggregation step uses an attention mechanism. Concepts of attention were heavily discussed in Chapter 7 in the context of sequence models. Specifically, in Section 7.4 we introduced the general concept of attention where attention weights are calculated as in (7.20) and are then used for linear combinations of inputs in (7.21). We then also used attention in the context of transformers as in Section 7.5.

In the graph context, attention can be incorporated via the aggregation step. Specifically, we can enhance the basic sum aggregation as in (8.90), focusing here only on the case without self loops, to be,

$$
m^{[\ell]}_{(i)} = \sum_{j \,:\, v_j \in \mathcal{N}(v_i)} \alpha^{[\ell]}_{(i),j}\, h^{[\ell-1]}_{(j)}.
\tag{8.103}
$$

Here for node $v_i$, and each neighbour $v_j \in \mathcal{N}(v_i)$, we have attention weight $\alpha^{[\ell]}_{(i),j}$ where $\sum_{j \,:\, v_j \in \mathcal{N}(v_i)} \alpha^{[\ell]}_{(i),j} = 1$.

Similarly to Chapter 7, attention weights are calculated using a scoring mechanism via an alignment function $s : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$, where $s(h^{[\ell-1]}_{(i)}, h^{[\ell-1]}_{(j)})$ measures the proximity between the hidden states of nodes $v_i$ and $v_j$ at layer $\ell - 1$. The basic alignment function is the inner product between $h^{[\ell-1]}_{(i)}$ and $h^{[\ell-1]}_{(j)}$, yet more complex options with learned parameters are also possible. One option with linear re-weighting is,

$$
s\big(h^{[\ell-1]}_{(i)}, h^{[\ell-1]}_{(j)}\big) = \big(h^{[\ell-1]}_{(i)} W_a^{[\ell]}\big)\big(h^{[\ell-1]}_{(j)} W_a^{[\ell]}\big)^\top,
\tag{8.104}
$$

where $W_a^{[\ell]} \in \mathbb{R}^{p \times m}$ is a matrix of learned parameters for layer $\ell$ with $m$ set as some dimension (recall here that our hidden state vectors are taken as rows).

Based on the alignment function, the attention weights are calculated for each node $v_i$. As with all attention mechanisms, we use a softmax function over neighbour indices $j$ to obtain the attention weights,

$$
\alpha^{[\ell]}_{(i),j} = \frac{\exp\Big(s\big(h^{[\ell-1]}_{(i)}, h^{[\ell-1]}_{(j)}\big)\Big)}{\displaystyle\sum_{k \,:\, v_k \in \mathcal{N}(v_i)} \exp\Big(s\big(h^{[\ell-1]}_{(i)}, h^{[\ell-1]}_{(k)}\big)\Big)},
$$

which are then used in the aggregation step (8.103). Once the message is computed, it can be used in an update function as in (8.92).
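The attention-based aggregation can be sketched for a single node as follows. The code assumes the basic inner-product alignment function (no learned matrix), 0-based node indices, and neighbour lists in a dictionary; the function name `attention_aggregate` and the toy hidden states are illustrative. Subtracting the largest score before exponentiating is a standard numerical safeguard that leaves the softmax weights unchanged.

```python
import math

def attention_aggregate(i, h_prev, nbrs):
    """Attention weights and message for node i, as in (8.103), with inner-product scores."""
    scores = [sum(a * b for a, b in zip(h_prev[i], h_prev[j])) for j in nbrs[i]]
    top = max(scores)                                  # for numerical stability only
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]                 # softmax over the neighbours of i
    p = len(h_prev[i])
    message = [sum(alpha * h_prev[j][c] for alpha, j in zip(alphas, nbrs[i]))
               for c in range(p)]
    return alphas, message

# Path graph 0 - 1 - 2 with p = 2 features per node.
nbrs = {0: [1], 1: [0, 2], 2: [1]}
h_prev = [[1.0, 0.0],
          [1.0, 1.0],
          [0.0, 1.0]]

alphas_1, message_1 = attention_aggregate(1, h_prev, nbrs)   # two equally aligned neighbours
alphas_0, message_0 = attention_aggregate(0, h_prev, nbrs)   # a single neighbour
```

A node with a single neighbour always assigns it weight one, so attention only changes the message for nodes with at least two neighbours whose alignment scores differ.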

Note that with our description of graph attention networks here, the learned parameters per layer are $W_a^{[\ell]}$ for the alignment function (8.104) as well as $\theta^{[\ell]} = (W_1^{[\ell]}, W_2^{[\ell]}, b^{[\ell]})$ used in the update (8.92). However, other options, typically with reduced parameters, reusing weight matrices either across layers, or between the alignment function and the update equation, are also possible. Also, similar to the transformer architecture, multi-head attention has also been introduced and in certain graph neural network applications, such architectures are very popular.

Notes and References

This chapter covered a broad range of specialized architectures and paradigms where each section covers a major topic which could have in fact made a whole chapter. Hence in our notes and references about the topics of this chapter we only summarize key references and developments in each of the sub-fields. A further recent overarching text that we recommend is [336] with multiple chapters, one per each of the topics covered here.

The field of generative modelling has multiple origins. Early models include hidden Markov models and Gaussian mixture models with origins in the 1950's and 1960's; see chapters 11 and 17 of [298] for background. Somewhat more recently, some authors consider the study of Boltzmann machine models introduced in the 1980's in [ ], and deep Boltzmann machines in [361], as the initial meaningful generative models in the context of deep learning. See also chapter 20 of [142] for an overview. A more recent survey of generative models in machine learning is [162] and a comparison of deep generative modelling approaches is in [46].

Up to 2014, while generative models were useful for some applications and certainly interesting, in terms of images, they lacked the ability to create real life looking data. The big advance came with the development of generative adversarial networks (GANs), in Goodfellow et al.'s work [143]. This opened up possibilities for creation of realistic looking images (and data) and is still a very active topic. Variational autoencoders, initially introduced in [234], grew into multiple directions and contemporary diffusion models such as [183], and those surveyed in [430], constitute the state of the art in image generative modelling. As of the time of publishing of this book, diffusion models and GANs still compete, with diffusion models generally able to produce more impressive images, while GANs are much faster in production since generating a sample does not require multiple passes through a neural network.

Ideas of variational autoencoders are rooted in modern developments of Bayesian statistics. See [ ]
for an introductory general text on Bayesian statistics and [407] for an accessible review of the
area. Specifically, variational Bayes methods, a well-known optimization-based approach in the field
of approximate Bayesian computation, capture the key ideas used in variational autoencoders. See
[ ] and [444] for reviews of variational Bayes. This approach entails a method for approximating
posterior distributions using simpler surrogate distributions. See [381] for a collection of
approximate Bayesian computation methods. For more details about variational autoencoders in
particular, see [235].

Our presentation of variational autoencoders was geared towards hierarchical Markovian variational
autoencoders, of which diffusion models are a special case. Nevertheless, variational autoencoders
and their variants are interesting and useful in their own right. They have been applied to many
fields. In image processing, prediction of the trajectory of pixels of an image is tackled in [414]
and natural image modelling is in [152]. In the field of speech analysis, voice conversion is
handled in [191] and speech synthesis in [ ]. In the area of text processing, recurrent neural
network based variational autoencoders for generating sentences are put forward in [ ], and
controlled text generation is handled in [193]. Another field is graph-based data analysis, as in
[236] where learning on graph-structured data is handled, and [210] which deals with molecular graph
generation. As we presented diffusion models as special cases of hierarchical variational
autoencoders, the literature on these models is also relevant. In particular, see [343] for an
application in black box variational inference and [385] for a variant called ladder variational
autoencoder.

Diffusion models, initially introduced in [384], gained significant prominence following [183],
which showcased exceptional image synthesis results. These models were further improved in [104],
where the prolonged dominance of GANs was broken for the first time. For recent surveys of diffusion
models, refer to [ ], [ ], and [430]. In terms of applications in the realm of image processing,
diffusion models are utilized for tasks such as colorization, inpainting, uncropping, and
restoration as in [358]. Other image processing applications include super-resolution as in [360],
and image editing as in [176]. There is extensive work on applications of diffusion models to
text-to-image generation, for example [359], which introduced Imagen. This system combines a
transformer based large language model, used for understanding text, with a diffusion model used for
image generation. The application of diffusion models extends to video data as well. Notable
contributions include [163], where an approach for long-duration video completions is put forward,
and [182], which introduced Imagen Video, a text-conditional video generation system based on a
cascade of video diffusion models.


A paradigm related to variational autoencoders and diffusion models that we did not cover is
normalizing flows. This paradigm was first introduced in [347] for representing the posterior in
variational autoencoders. These generative models construct complex distributions by transforming
a simple distribution through a series of invertible mappings. See chapter 16 of [336] for an
overview and [322] for a recent survey.
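
To make the idea of a series of invertible mappings concrete, the following is a minimal
one-dimensional sketch and not the construction of any of the cited works. It assumes a standard
normal base distribution, and the names Affine, SinhMap, flow_forward, and flow_logpdf are
hypothetical and chosen here only for illustration. The log-density of the transformed variable
follows from the change of variables formula: the base log-density at the inverted point plus the
log absolute derivative of each inverse mapping.

```python
import math


def std_normal_logpdf(z):
    """Log-density of the assumed base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


class Affine:
    """Invertible map x = a * z + b with a != 0 (illustrative layer)."""

    def __init__(self, a, b):
        self.a, self.b = a, b

    def forward(self, z):
        return self.a * z + self.b

    def inverse(self, x):
        return (x - self.b) / self.a

    def log_abs_det_inverse(self, x):
        # The derivative of the inverse map is 1 / a everywhere.
        return -math.log(abs(self.a))


class SinhMap:
    """Invertible map x = sinh(z), a smooth increasing bijection of the real line."""

    def forward(self, z):
        return math.sinh(z)

    def inverse(self, x):
        return math.asinh(x)

    def log_abs_det_inverse(self, x):
        # The derivative of asinh(x) is 1 / sqrt(1 + x^2).
        return -0.5 * math.log1p(x * x)


def flow_forward(layers, z):
    """Push a base sample z through the layers, first layer first."""
    for layer in layers:
        z = layer.forward(z)
    return z


def flow_logpdf(layers, x):
    """Log-density of x = flow_forward(layers, z) with z drawn from N(0, 1)."""
    log_det = 0.0
    for layer in reversed(layers):
        # Each term is evaluated at the output of the layer being inverted.
        log_det += layer.log_abs_det_inverse(x)
        x = layer.inverse(x)
    return std_normal_logpdf(x) + log_det
```

For example, with layers = [Affine(0.8, 0.2), SinhMap(), Affine(2.0, -1.0)], a sample is obtained
by drawing z from a standard normal and calling flow_forward(layers, z), while
flow_logpdf(layers, x) evaluates the resulting density exactly. In practical normalizing flows the
mappings are multivariate and parameterized by neural networks, with the derivative replaced by a
Jacobian determinant.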

As already mentioned, GANs were introduced in [143]. After the introduction of this paradigm,
multiple generalizations appeared. Early variants in particular included C-GAN [293], and
convolutional versions of GANs as in [339]. The idea of NS-GAN was already introduced in [143]. The
W-GAN concept first appeared in [ ] and was later developed in [151], where the gradient penalty
approach was introduced. See also [ ] as well as the empirical comparison in [271] and the general
GAN surveys [150] and [200]. The ideas of AC-GAN were developed in [316] and the ideas of Info-GAN
were developed in [ ]. The image to image paradigm in GANs is broad. A survey on this topic is [ ],
with initial ideas in [ ], [204], and [417]. Style-GAN was developed in [224], and further
developments are in [225] and [223]. Finally, we mention a few general developments in GANs,
including sequence GAN from [437] and EditGAN from [264].

Modern approaches to deep reinforcement learning are surveyed extensively in the texts [125] and
[393], whereas more dated accounts are [ ] and [217]. At the more fundamental level, basics of
Markov decision processes are presented in [ ], [ ], and [337]. An expository overview of
engineering control theory is in [ ]. The criterion for convergence of Q-learning as in our equation
(8.81) is from [419]. The success of reinforcement learning in playing Atari video games is in
[294], and later landmark results in the game of Go are documented in [377]. Apart from games,
reinforcement learning is successfully used in several domains, one of which is addressing hard
combinatorial problems; see [282] for a survey of such approaches. Nowadays, reinforcement learning
is applied for fine tuning large language models through human feedback as documented in [318] and
[429]. Reinforcement learning is also critical for some robotic tasks as described in the survey
[448]. However, although one might initially have thought that self-driving cars are best handled as
a system via reinforcement learning, in practice other techniques have so far prevailed; see [ ] for
a general survey. A related approach often used is imitation learning; see [286] for a survey of
this technique in the context of autonomous vehicles.

General texts about graph neural networks are [273] and [157], with recent survey papers in [413]
and [427] as well as a notable chapter in [336]. Early ideas in graph neural networks arose in
[387], where a concept then called the generalized recursive neuron was introduced. Further ideas of
graph neural networks were developed in [144] and [365]. These days graph neural networks are used
in multiple applications such as those discussed in [242], [445], and [451], among many others.
There are various techniques for graph embedding including DeepWalk [329], node2vec [148],
GraphSAGE [156], LINE (Large-scale Information Network Embedding) [397], and HOPE (High-order
Proximity preserved Embedding) [317]. For an introductory computer science overview of traditional
graph algorithms and data structures see [93].

General ideas of message passing schemes in graph neural networks were introduced in [287]. Ideas of
graph convolutional networks were developed in [ ], [134], and [309]. Ideas of spectral
convolutional graph neural networks were developed in [ ] and [174]. Graph attention networks were
developed in [411], motivated by the effective application of the attention mechanism to sequence
models as in [410]. Indeed, many developments in deep learning cross architectures, domains, and
sub-disciplines, and the development of graph attention networks serves as one such example. We
close by mentioning spatial-temporal graph neural networks, which are designed to deal with dynamic
graphs, sometimes also with a spatial component. These models were introduced in [257] and [371],
and are further described in the recent review [209].
