The world is has some 7.7 billion beautiful people. But in terms of images, infinitly more… Watch this video which shows an advanced application of GANs:
In this unit we overview some of the basics of GANs, a new branch of deep learning that emerged out of a 2014 paper by Ian Goodfellow et. al.
Generative modelling is an unsupervised learning task in machine learning that involves automatically discovering and learning the regularities or patterns in input data in such a way that the model can be used to generate or output new examples that plausibly could have been drawn from the original dataset.
In general, machine learning distinguishes between the generative approach and the discriminative approach. In the generative approach data such as XX or (X,Y)(X,Y) is used to learn about the probability distribution p(X)p(X) or p(X,Y)p(X,Y). In the discriminative approach we consider the (X,Y)(X,Y) data but learn about the conditional distribution P(Y | X=x)P(Y | X=x).
Logistic regression and the classification and regression neural networks we have used up to now are discriminative. However, some classic machine learning methods are generative. A notable generative model to quickly review is the naive Bayes classifier:
Note that naive Bayes is not a Bayesian method.
The Naive Bayes algorithm is not related to GANs or deep learning, but is a classic machine learning (generative) algorithm that is worth knowing.
We have data {(x(1),y1),…(x(n),yn)}{(x(1),y1),…(x(n),yn)} with dd-vector features and finite labels, say {1,…,K}{1,…,K}. We assume some family of distributions pθ(⋅)pθ(⋅), with parameters θθ, over the data which has a key property (the ‘Naive’ property):
So the key assumption is that the features of xx are condtionally independent given the value of yy. pθ(x,y)=pθ(x | y)pθ(y)=⏟Naive(d∏i=1pθ(xi | y))pθ(y).pθ(x,y)=pθ(x | y)pθ(y)=Naive(d∏i=1pθ(xi | y))pθ(y).
The naive Bayes algorithm aims to produce a classifier such that with a new observation x∈Rdx∈Rd, we have a prediction for the label yy. Learning is carried out by estimating ˆθ^θ from the training data. Then in production/test time, with ˆθ^θ at hand, prediction is as follows,
This is called the MAP (maximum a posteriori probability) decision rule.
Notice Bayes’ rule.
ˆy=argmaxy∈{1,…,K}pˆθ(y | x)=argmaxy∈{1,…,K}pˆθ(x | y)pˆθ(y)pˆθ(x)=argmaxy∈{1,…,K}pˆθ(x | y)pˆθ(y)=argmaxy∈{1,…,K}(∏di=1pˆθ(xi | y))pˆθ(y)^y=argmaxy∈{1,…,K}p^θ(y | x)=argmaxy∈{1,…,K}p^θ(x | y)p^θ(y)p^θ(x)=argmaxy∈{1,…,K}p^θ(x | y)p^θ(y)=argmaxy∈{1,…,K}(∏di=1p^θ(xi | y))p^θ(y)
One can now suggest different probability families and use them accordingly. One of the simplest examples which holds when the features are {0,1}{0,1} boolean is the Bernoulli naive Bayes. Here we assume, p(x | y)=d∏i=1pxiyi(1−pyi)1−xi.p(x | y)=d∏i=1pxiyi(1−pyi)1−xi. Hence the parameters pyipyi, for y=1,…,Ky=1,…,K and i=1,…,di=1,…,d is the probability of class yy generating the term xixi.
Naive Bayes of this sort (and generalizations) has been very popular in e-mail spam detection based on a bag of words summary of the document in question.
Generally, Adversarial machine learning is the collection of techniques that attempt to fool models by supplying deceptive input. Adversarial training includes a variety of methods and applications, many of which are aimed at improving the quality of classifiers or at the same time exposing vulnerabilities.
A famous example is in this video:
More details are in this paper:
2018: Synthesizing robust adversarial examples by Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok.
In the context of GANs, we harness the “adversarial” nature of two sub-models which we train.
It is ‘different’ from Naive bayes or density estimation methods which attempt to learn about p(X,Y)p(X,Y) directly. A somewhat different type of generative model does not learn about p(X)p(X) or p(X,Y)p(X,Y) but rather learns of a mechanism to generate samples of XX or (X,Y)(X,Y) that follow the same (or similar) distribution to the data. Generative Adversrial Networks (GANs) are such models.
Here we discuss the data as only XX, but later we can expand to (X,Y)(X,Y). The basic idea of a GAN is to simultaneously create two models, a generator (GG) and a discriminator (DD). The role of GG is to be a generative model which can transform noise samples ZZ, (not having any information) also called latent space, into examples of XX. The role of DD is to act as a binary classifier between “generated samples” G(Z)G(Z) and real samples XX. The basic output of training a GAN is G(⋅)G(⋅), the generator. In contrast the discriminator is there helping along the way, but is eventually “useless” because once G(⋅)G(⋅) is good in generating samples, D(⋅)D(⋅) cannot classify between a randomly generated sample G(Z)G(Z) and a data sample XX.
This image describes the basic architecture well:
The generator GG works on noise vectors from the latent space to create “fake samples”. The discriminator aims to discriminate between real samples of the data and fake samples. The training of GG and DD is simultanious where eventually DD cannot distinguish between “real” and “fake”.
When we say a generator model G(⋅)G(⋅) and a discriminator model D(⋅)D(⋅) we assume they each have some (typically) fixed architectures and learned parameters θGθG and θDθD. These are going to be neural networks. so θGθG and θDθD, are the weights, biases, filter parameters, batch-normalization parameters, etc.
In training GG and DD we want an objective that will motivate GG to create better samples and at the same time improve DD in detecting if a sample is from GG, or is alternativly a real data sample. To formulate this object lets make use of the two probability distributions that are at play:
We often take the probability distribution of the noise to be an i.i.d. Gaussians distribution of some given dimension (called the latent dimension).
pdata(x)pdata(x) is the probability distributions of data samples.
pz(z)pz(z) is the probability distribution of the noise.
Take D(⋅)D(⋅) as a classifier that returns 11 for a real image and 00 for a fake image. Then the discriminator wants high values of: EX∼pdata [logD(X)].EX∼pdata [logD(X)].
Further for every ZZ we have a fake sample G(Z)G(Z) and D(G(Z))D(G(Z)) is the discriminators clasification which is ideally 00. Hence the discriminator also wants high values of: EZ∼pz[log(1−D(G(Z)))].EZ∼pz[log(1−D(G(Z)))].
Put together for some fixed discriminator D(⋅)D(⋅) and fixed generator G(⋅)G(⋅) the discriminator wishes to maximize,
V(G,D):=EX∼pdata [logD(X)]+EZ∼pz[log(1−D(G(Z)))].V(G,D):=EX∼pdata [logD(X)]+EZ∼pz[log(1−D(G(Z)))].
Now for any fixed discriminator (parameters/model) that achieves maxDV(D,G),maxDV(D,G), the generator wants to “fool” the discriminator, so it wishes to achieve, minGmaxDV(D,G).minGmaxDV(D,G).
Since by now there are several other GAN formulations (see below), this type of original GAN formulation is these days refered to as vanilla GAN. This is the basic GAN formulation, formulated as a mini-max game and if we found parameters θGθG and θDθD such that achieve the “saddle point” minGmaxDV(D,G)minGmaxDV(D,G), then we have trained the GAN.
Achieving the idealized minGmaxDV(D,G)minGmaxDV(D,G) can be obtained approximately via basic iteration of stochastic gradient descent where GG and DD are trained jointly. The basic algorithm (still from the original GAN paper) is as follows:
Minibatches of size mm are used to train both the discriminator and the generator. There is an option to carry out kk discriminator training iterations for every training step of the generator, however (more) recent empirical evidence generally shows that there isn’t typically a need for this.
Here are the first ever published images generated by GAN experiments from the original 2014 paper.
The rightmost yellow highlighted images are from the actual dataset closest to the GAN sample left of it. They are there to ‘show’ that these GANs did not memorize the data.
GAN Lab is similar to the well know Tensor Flow Playground with which you might have played around previously. Before we dive into more examples and details, let’s play with the wonderful online GAN visualization/simulator, GAN Lab. Here is one visualization from GAN Lab:
The use of the “latent space”, also sometimes just called a “feature vector” is a very powerful technique sometimes related to auto-encoders. More on this is in the sections below and the next unit dealing with sequence data.
Having an image represeted via the latent space is very powerful. What happens if you slightly perturb the ZZ in the latent space? What if you interpolate between two images?
Lets get a feel for such perturnabtions of the latent space by playing with a very simple analogy (almost too simple). Say we have data from a univariate exponential distribution with mean 11.
Note that p(x)p(x) is not a probability but p(x)⋅dxp(x)⋅dx for small dxdx approximates the probability of falling in the region [x,x+dx)[x,x+dx). The density for x≥0x≥0 is, p(x)=e−x.p(x)=e−x.
If we were to somehow have a generative model for generating xx, what would it be?
One general method for generating (Monte Carlo sampling) a random variable from effectively any uni-variate distribution is the inverse probability transform. Say the CDF (cumulative distribution function) of XX is F(x)F(x) (and for simplicity assume it is continuous). The inverse CDF is denoted F(−1)(u)F(−1)(u) (it is the quantile function). If we take a uniform [0,1][0,1] random variable, UU, then the random variable F−1(U)F−1(U) will be distributed with CDF F(⋅)F(⋅).
In the case of exponential, F(x)={0x<0,1−e−xx≥0.F(x)={0x<0,1−e−xx≥0. Hence we can show that the inverse function is Just observe that F−1(F(x))=xF−1(F(x))=x. F(−1)(u)=−log(1−u).F(−1)(u)=−log(1−u).
So now we can call U∈[0,1]U∈[0,1] our latent random variable. For every UU we have an XX which we generate via,
In this example the distribution of −log(1−U)−log(1−U) is the same as that of −logU−logU, so for simplicity we can use the latter.
X=−logU.X=−logU.
Hence the model −log(⋅)−log(⋅) is a ‘generative’ model for generating samples from p(⋅)p(⋅) (this is a very simple example!!!).
Now if we perturb the value UU in the latent space slightly then XX will be perturbed slightly also! The same happens with the images perturbations on higher dimensional latent spaces as well, it is simply that the latent variable ZZ is a vector and the generative model G(⋅)G(⋅) is (much more) complicated than our −log(⋅)−log(⋅) function.
Since the (quite recent) original 2014 GAN paper there has been huge progress on GANs, here is a very partial quick tour of some advances. See also the GAN ZOO which presents links to dozens of key GAN papers - an evolving list.
The original GAN was generative with respect to (X)(X) with CGAN we also consider the labels YY.
2014: Conditional Generative Adversarial Nets by Mehdi Mirza and Simon Osindero.
minGmaxDV(D,G)=Ex∼pdata (x)[logD(x∣y)]+Ez∼pz(z)[log(1−D(G(z∣y)))]
Here is conditional GAN used for MNIST:
The original GAN used fully connected networks. But it is very sensible to use convolutional networks
2016: Unsupervised representation learning with deep convolutional generative adversarial networks, by Alec Radford, Luke Metz, and Soumit Chintala.
This hints that there may be meaningful semantics in the latent space.
The strength of the latent space is further demonstrated.
It took a while to generate ‘ImageNet like’ (quality) images. It is now doable. Note that this 2018 work relies on a combination of large and well-managed compute, and several other mathematical advances along the way.
2018: Large scale GAN training for high fidelity natural image synthesis by Andrew Brock, Jeff Donahue, and Karen Simonyan.
Note that in ImageNet there are about 100 dog classes and only one tennis ball class…
See the video at the top of this unit.
2019: A style-based generator architecture for generative adversarial networks by Tero Karras, Samuli Laine and Timo Aila.
Let’s dive into further mathematical details.
Observe first that if p=q then DKL(p||q)=0. Observe further that KL is asymmetric: In general DKL(p||q)≠DKL(q||p). First review the KL (Kullback-Leibler) divergence. It considers two probability distributions p and q and measures “how one diverges from the other”:
DKL(p||q)=∫xp(x)logp(x)q(x)dx.
The Jensen-Shanon (JS) divergence also measures similarity between two probability distributions. It is defined as follows:
DJS(p||q)=12(DKL(p||r)+DKL(q||r)),wherer=p+q2.
Here are some properties of the JS divergence:
Note that if we use log base 2 then it is bounded between in [0,1].
0≤DJS(p||q)≤log2.
It is sometimes called the information radius or total divergence to the average.
It can be used to make a proper metric (its square root is a metric).
We denote pg(x) as the distribution of samples generated by G(⋅). Let’s first observe that for fixed G, the optimal discriminator D is, D∗(x)=pdata(x)pdata(x)+pg(x), where pg(⋅) is the density of G(z) based on the density pz(⋅) of the latent space.
To see this observe that for fixed G, the objective of D is to maximize,
V(G,D)=∫xpdata (x)log(D(x))dx+∫zpz(z)log(1−D(G(z)))dz=∫xpdata (x)log(D(x))+pg(x)log(1−D(x))dx.
Now consider the function f(y)=alog(y)+blog(1−y). It achieves it’s maximum at, y=aa+b, so for any fixed x, ∫xpdata (x)log(D(x))+pg(x)log(1−D(x))dx≤∫xpdata (x)log(pdata(x)pdata(x)+pg(x))+pg(x)log(1−pdata(x)pdata(x)+pg(x))dx:=C(G)
Observe that, C(G)=−log4+2DJS(pdata||pg)
See proof details in the original 2014 GAN paper. Now the generator wants to minimize C(G). It can be shown that it is achieved when pg(⋅)=pdata(⋅) at which point C(G)=−2log2.
Hence the theoretical “solution” of a GAN has pg=pdata in which case the value V(G,D) is −log4≈−1.3862.
While the game has a theoretical equilibrium, in practice we may sometimes not converge to pg=pdata. This is sometimes called mode colapse. There are several possible reasons for this:
We focus on the last bullet point by modifing the value function, as described below.
A general survey papers whcih presents several GAN loss functions and compares them empirically is the following:
2018: Are GANs Created Equal? A Large-Scale Study by Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly and Olivier Bousquet.
We will focus on the Wasserstein GAN initially presented here:
Below are key snippets from the 2017 WGAN paper where some theoretical justification for using the Wasserstein GAN are presented.
The infimum in the Wasserstein distance,
W(Pr,Pg)=infγ∈Π(Pr,Pg)E(x,y)∼γ[‖x−y‖]
is highly non-attainable. However, the Kantorovich-Rubinstein duality tells us that,
W(Pr,Pθ)=sup‖f‖L≤1Ex∼Pr[f(x)]−Ex∼Pθ[f(x)], where the supremum is over all 1-Lipschitz functions, i.e., ||f(x)−f(y)||≤1⋅||x−y||
It still isn’t easy to compute this metric, but it inspires using the discrimentator D(⋅) as f(⋅). Note that now the discriminator is not longer a binary classifier. Hence the WGAN objective is
minGmaxD:||D||L≤1Ex∼pdata[D(x)]−Ez∼pz[D(G(x))],
There is the problem of how to deal with the constraint, :||D||L≤1. A crude (first) way to do this is via clipping. Here is the algorithm from the WGAN paper
Here are comments also taken from the WGAN paper:
It isn’t obvious how to evaluate the performance of a GAN. Here are some current methods:
2016: Improved techniques for training gans by Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.
Taken from that paper:
For a definition see:
2017: Gans trained by a two time-scale update rule converge to a local nash equilibrium by Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
See for example this blog post. These are some of the applications of GANs:
This is still an evolving field. This (quite recent) talk by Ian Goodfellow presents an overview of GANs and the evolving applications:
However we must also acknowledge that there is significant danger with GAN technology, specifically dealing with DeepFakes. Deep fakes pose not only political danger but also private danger. See for example this video:
they are more resilient to vanishing gradients
Page built: 2021-03-04 using R version 4.0.3 (2020-10-10)