Vishakh Gopu

Notes on deep generative models part 1

Motivation

Over the past few years a plethora of generative models have been introduced, and it can be difficult to see how they all relate to each other. The objective of this post is to review the underlying characteristics of generative modeling approaches rather than any specific model. A specific approach can then be seen as the consequence of a series of decisions along certain axes, each decision leading to particular characteristics.

From this vantage point it's easier for a practitioner to tailor the modeling approach to the application at hand and for a researcher to see natural connections to explore.

Introduction

Generative modeling is the task of learning, from data, a model that we can sample new points from. This often translates into learning a probability distribution that is close to some real data distribution from which we have access to samples. In recent years deep neural networks have been leveraged to amortize inference across observed data in latent variable models, to sample new points via flexible implicit models, and more.

Phrasing the task more formally, as done in Bottou et al 2017:

View the data as a sample from an underlying probability distribution $\mathcal{Q}$ defined over a Polish space $\mathcal{X}$ (complete and separable, with topology induced by a distance function). Denote by $\mathcal{P}_{\mathcal{X}}$ the space of probability measures $\mu$ defined on $(\mathcal{X},\mathcal{U})$, where $\mathcal{U}$ is the Borel $\sigma$-algebra generated by the open sets of $\mathcal{X}$. Consider a way to compare elements of $\mathcal{P}_{\mathcal{X}}$: $$(P,Q) \rightarrow D(P,Q) \in [0, \infty]$$ The goal, given a family of distributions $P_{\theta} \in \mathcal{P}_{\mathcal{X}}$, is to find $$\min_{\theta} L(\theta), \quad L(\theta)=D(Q, P_{\theta})$$

Organizing the landscape of deep generative modeling approaches is a daunting task, but we can begin by stating some high level delineating attributes.

  1. The model: How the parameterized family of distributions $P_{\theta}$ and any other auxiliary models are represented.
  2. The principle of learning: How to bring the distribution defined by our generative model closer to the true data-generating distribution, i.e., minimize $L(\theta)=D(Q, P_{\theta})$. This is often very closely tied to the choice of model.
  3. The underlying distance/divergence: Which measure $D(Q, P_{\theta})$ is being minimized. This is often a consequence of the model or the principle of learning.

Many popular deep generative models in use today can be indexed by how they fall along these axes. The categories are often very intertwined; a modeling choice might also fix the techniques that are available for learning, or vice versa. There are also many other factors that can be used to compare approaches to each other. Despite these shortcomings, it's a reasonable way to start organizing the landscape of deep generative models.

In this review we will take a look at each of these attributes in turn and the different choices that can be made under each of them. We will also provide specific examples of approaches that make these choices to give an idea of how things unfold in practice.

We will pay special attention to (3) and discuss how different choices of divergence/distance impact the final learned model and how the different measures of $D$ relate to each other. We will also discuss downstream applications of deep generative modeling, and what the desiderata are for those tasks. Finally we will examine how well different choices of $D$ meet those desiderata.

The model

There are many different ways to model the family of distributions $P_{\theta}$ and any auxiliary models in the context of deep generative modeling. Two general archetypal model classes are latent variable models and implicit models.

Within each class there are innumerable extensions and variations left out of this post. Additionally, there are modeling approaches that fall outside these classes.

Latent variable models

Latent variable models tie observations $x \in \mathbb{R}^{d_{x}}$ with unobserved variables $z \in \mathbb{R}^{d_{z}}$ and model parameters $\theta \in \mathbb{R}^{d_{\theta}}$. They typically specify a story for how a sample is generated with respect to the latent structure. The goal is then to infer the hidden structure from observed data.

One of the simplest instantiations of this approach would be to associate a single latent variable with each data point and have the generative story: $$z \sim p(z)$$ $$x \sim p(x|z)$$ This is the model shown in the illustration below, and corresponds to a generative modeling approach called the variational autoencoder (Rezende et al 2014, Kingma et al 2013).

A simple latent variable model
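To make the generative story concrete, here is a minimal sampling sketch in Python. The specific prior and conditional (a standard normal prior and a Gaussian likelihood whose mean is a fixed nonlinear function of $z$) are illustrative assumptions, not part of any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent_variable_model(n, d_z=2, d_x=5):
    """Ancestral sampling: z ~ p(z), then x ~ p(x|z).

    Illustrative choices: p(z) = N(0, I) and
    p(x|z) = N(W tanh(z), 0.1^2 I) for a fixed random W.
    """
    W = rng.standard_normal((d_x, d_z))  # stands in for learned parameters theta
    z = rng.standard_normal((n, d_z))    # z ~ p(z)
    mean = np.tanh(z) @ W.T              # deterministic part of p(x|z)
    x = mean + 0.1 * rng.standard_normal((n, d_x))  # x ~ p(x|z)
    return z, x

z, x = sample_latent_variable_model(1000)
```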

Implicit models

In many practical situations there may be an easy mechanism to sample new points, or to specify how to do so, but no explicit likelihood function. These types of models are typically referred to as implicit (Bottou et al 2017).

The implicit models that appear in deep generative modeling typically transform noise from some base distribution into a sample using a differentiable, parameterized map.

As an example, consider the following mechanism to sample from the generative distribution $P_{g}$: first sample a vector from some base distribution $z \sim p(z)$, then pass it through a function $g: \mathcal{Z} \to \mathcal{X}$, where $g$ has parameters $\theta$. This seemingly simple formalism has been leveraged to much success in models like generative adversarial networks (Goodfellow et al 2014) and their many varieties. The procedure is depicted below.

A basic implicit model, where $g$ is a differentiable map
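A minimal sketch of such a sampler, assuming a small feed-forward network as the map $g$ (the architecture is an arbitrary illustrative choice):

```python
import torch
import torch.nn as nn

class ImplicitSampler(nn.Module):
    """Pushes base noise z ~ N(0, I) through a differentiable map g."""

    def __init__(self, d_z=16, d_x=2):
        super().__init__()
        self.d_z = d_z
        # g: Z -> X, with parameters theta; any differentiable network works
        self.g = nn.Sequential(
            nn.Linear(d_z, 64), nn.ReLU(),
            nn.Linear(64, d_x),
        )

    def forward(self, n):
        z = torch.randn(n, self.d_z)  # sample from the base distribution
        return self.g(z)              # a sample from P_g; its density is unavailable

sampler = ImplicitSampler()
x = sampler(128)  # we can draw 128 samples but cannot evaluate their likelihood
```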

Principle of learning

It should come as no surprise that the principle of learning employed is a key property of a generative modeling procedure. It should be equally unsurprising that this area is incredibly rich and vast. Consequently there are many specific mechanisms for learning, and they are often closely tied to the modeling choices made.

For each of the two archetypal modeling choices discussed earlier there is an accompanying preferred mechanism for learning. Many alternatives to, extensions of, and combinations of these two broad principles of learning exist. We follow the general exposition found in Mohamed et al 2018.

For latent variable models, in the context of deep learning, the general principle of learning most commonly employed is maximum likelihood estimation via variational inference.

Maximum likelihood and variational inference

To recall the definition of a latent variable model, take $$x \in \mathbb{R}^{d_{x}}$$ $$z \in \mathbb{R}^{d_z}$$ $$\theta \in \mathbb{R}^{d_{\theta}}$$ Let $\mathcal{D}$ be a dataset of $N$ elements.

The log-likelihood of any single example under the model is then $$\log p_{\theta}(x)= \log \int p_{\theta}(x|z)p(z)dz$$ which, written as an expectation, is $$\log \mathbb{E}_{p(z)}[p_{\theta}(x|z)]$$ The log-likelihood of $\mathcal{D}$ is then the sum over individual examples: $$\log p_{\theta}(\mathcal{D}) = \sum_{i=1}^N \log \mathbb{E}_{p(z)}[p_{\theta}(x_i|z)]$$
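A naive way to approximate this expectation is plain Monte Carlo over the prior. The sketch below, with an arbitrary one-dimensional Gaussian toy model standing in for $p_{\theta}(x|z)$, shows the estimator; in high dimensions most prior samples contribute almost nothing, which is part of why this route is impractical:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def naive_log_marginal(x, n_samples=10_000):
    """Monte Carlo estimate of log p(x) = log E_{p(z)}[p(x|z)].

    Toy model for illustration: z ~ N(0, 1) and x|z ~ N(z, 1), both scalar.
    Uses log-sum-exp for numerical stability.
    """
    z = rng.standard_normal(n_samples)      # z ~ p(z)
    log_px_given_z = norm.logpdf(x, loc=z)  # log p(x|z) for each prior sample
    m = log_px_given_z.max()
    return m + np.log(np.exp(log_px_given_z - m).mean())

# The exact marginal here is x ~ N(0, 2), so we can sanity-check the estimate
print(naive_log_marginal(0.5), norm.logpdf(0.5, scale=np.sqrt(2)))
```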

Though this seems straightforward enough, it is difficult to optimize directly. One problem is that $z$ is usually high-dimensional in this setting, so handling the resulting integral may be prohibitively expensive. Additionally, $p_{\theta}(x_i|z)$ may be represented by a complex function like a neural network, in which case the integral has no tractable closed form.

A way to deal with this is to introduce an approximating family of densities $q$ that is much easier to handle computationally, and then try to approximate the true posterior $p(z|x)$ with this simpler family.

Doing some manipulation shows how the introduction of this approximate density changes the relevant expectation from before.

$$\log p_{\theta}(\mathcal{D}) = \sum_{i=1}^N \log \mathbb{E}_{p(z)}[p_{\theta}(x_i|z)]$$ $$\log \mathbb{E}_{p(z)}[p_{\theta}(x_i|z)]=\log \mathbb{E}_{q_i(z)}\left[\frac{p_{\theta}(x_i|z)p(z)}{q_i(z)}\right]$$

By Jensen's inequality:

$$\log \mathbb{E}_{q_{i}(z)}\left[\frac{p_{\theta}(x_i|z)p(z)}{q_i(z)}\right] \geq \mathbb{E}_{q_i(z)}\left[\log\frac{p_{\theta}(x_i|z)p(z)}{q_i(z)}\right]$$ $$\log p_{\theta}(\mathcal{D})\geq \sum_{i=1}^N \mathbb{E}_{q_i(z)}\left[\log \frac{p_{\theta}(x_i|z)p(z)}{q_i(z)}\right]$$

We can write $$\mathbb{E}_{q_i(z)}\left[\log \frac{p_{\theta}(x_i|z)p(z)}{q_i(z)}\right] = \mathbb{E}_{q_i(z)}[\log p_{\theta}(x_i|z)] - \mathrm{KL}(q_i \| p)$$ which is often referred to as the evidence lower bound, or ELBO. It consists of a reconstruction term $$\mathbb{E}_{q_i(z)}[\log p_{\theta}(x_i|z)]$$ and a regularizer that controls how far $q_i$ strays from the prior: $$\mathrm{KL}(q_i \| p)$$

We made a tradeoff, as is often the case with any approximation scheme. The introduction of $q$ leads to an easier expectation to work with, but also results in another thing to optimize. One natural way to do this optimization is to alternate between finding a good setting for $q$ and good model parameters, which is known as the variational EM algorithm. Since the way $q$ is actually represented depends on various factors, we will simplify notation and refer to the optimal $q$ at any time step for data point $i$ as $q_{i}^*$. In the (E)xpectation step: $$q_i^{*}(z) = \text{argmax}_{q_i}\,\mathbb{E}_{q_{i}(z)}\left[\log \frac{p_{\theta}(x_i|z)p(z)}{q_{i}(z)}\right]$$ In the (M)aximization step: $$\theta^{*} = \text{argmax}_{\theta}\sum_{i=1}^N \mathbb{E}_{q_i^* (z)}\left[\log\frac{p_{\theta}(x_i|z)p(z)}{q_{i}^{*}(z)}\right]$$

Instead of searching over all possible $q$s, we can introduce a parametric family of densities $q_{\phi}(z|x)$ with parameters $\phi$ shared across data points, replacing the per-datapoint E-step with a single amortized optimization: $$\phi^* = \text{argmax}_{\phi}\sum_{i=1}^N\mathbb{E}_{q_{\phi}(z|x_i)}\left[\log \frac{p_{\theta}(x_i|z)p(z)}{q_{\phi}(z|x_i)}\right]$$

The popular generative model known as the variational autoencoder (VAE) does exactly this (Rezende et al 2014, Kingma et al 2013).

The VAE uses a Gaussian variational family $q$ indexed by a mean $\mu$ and a covariance $\Sigma$. The latent variable $z_i$ associated with a data point $x_i$ is sampled from $\mathcal{N}(\mu_i, \Sigma_i)$. The conditional distribution $p(x_i|z_i)$ is modeled by a neural network, as is the approximate posterior $q_{\phi}(z_i|x_i)$, which maps data points to variational parameters $\eta = (\mu_i, \Sigma_i)$. The objective is then $$\mathbb{E}_{q_i(z)}[\log p_{\theta}(x_i|z)] - \mathrm{KL}(q_i \| p)$$

Summary:

Prior: $$z \sim \mathcal{N}(0, \mathbb{I})$$ Data sufficient statistics: $$\eta = f_{\theta}(z)$$ Data conditional likelihood: $$x \sim \mathcal{N}(\eta)$$ Data sample: $$x \sim \mathcal{D}$$ Latent sufficient statistics: $$\eta = f_{\phi}(x)$$ Posterior sample: $$z \sim \mathcal{N}(\eta)$$
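A minimal sketch of this setup in PyTorch, assuming a diagonal Gaussian posterior and a fixed-variance Gaussian likelihood (so the reconstruction term reduces to a scaled squared error); the sizes and architectures are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d_x=784, d_z=16, d_h=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_x, d_h), nn.ReLU())
        self.mu = nn.Linear(d_h, d_z)       # variational mean
        self.log_var = nn.Linear(d_h, d_z)  # log of the diagonal of Sigma
        self.dec = nn.Sequential(nn.Linear(d_z, d_h), nn.ReLU(), nn.Linear(d_h, d_x))

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterized sample z ~ N(mu, Sigma)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.dec(z), mu, log_var

def negative_elbo(x, x_recon, mu, log_var):
    # Reconstruction term, assuming p(x|z) = N(x_recon, I)
    recon = 0.5 * ((x - x_recon) ** 2).sum(dim=1)
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians
    kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1).sum(dim=1)
    return (recon + kl).mean()

model = VAE()
x = torch.rand(32, 784)  # placeholder batch
loss = negative_elbo(x, *model(x))
loss.backward()
```

Minimizing this negative ELBO with any stochastic optimizer trains the generative and recognition networks jointly.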

Some attempts to extend this approach include positing richer families of approximate densities (Gemici et al 2016), better stochastic optimization techniques to optimize the ELBO (Ranganath et al 2014), and automating variational inference so its use is not model specific (Kucukelbir et al 2015).

For implicit models with no explicit likelihood, the principle of learning typically employed is learning by comparison. We will refer to this interchangeably as learning by comparison and GAN-style training.

Learning by comparison

Let the data generating distribution be $p^*$, the approximating distribution be $q$, and our samples from $p^*$ be $\mathcal{D}$.

Often we don't have access to the likelihood function $p(\mathcal{D}|\theta)$, but we can easily sample from our model $q_{\phi}$ and have access to samples from the data generating distribution $p^{*}$.

We are also primarily interested in cases where our sampler takes the form of a differentiable map from random vectors to samples, $f_{\phi}: \mathcal{Z} \to \mathcal{X}$. The function $f_{\phi}$ can be thought of as a deep neural network in this context.

In this scenario our only tractable path to learning is to compare real samples from $p^*$ to samples from the approximating distribution defined by the generative model. We can compare the difference of the two, $q - p^*$, or their ratio, $\frac{q}{p^*}$. Let's focus on the ratio for now, since it corresponds to the formulation popularized by generative adversarial networks (Goodfellow et al 2014).

Let the ratio of the two distributions be $r_{\phi} = \frac{p^*}{q}$, so that the two distributions are identical when $r_{\phi}$ is 1 everywhere. We can develop a learning procedure that drives $r_{\phi}$ toward 1 by alternating between comparison and estimation steps.

In the comparison step we build an auxiliary model to evaluate differences between observed data and generated data. In the estimation step we adjust the model parameters accordingly. The general framework for this type of learning is as follows. First, combine the real data and simulated data to form $\{\boldsymbol{x}_1 \ldots \boldsymbol{x}_N\}$. Next, assign the real and fake samples the binary labels $+1$ and $-1$ respectively, denoting their class membership. Observe that $$p^*(\boldsymbol{x})=p(\boldsymbol{x}|y=+1)$$ and $$q(\boldsymbol{x}) = p(\boldsymbol{x}|y=-1)$$ We can now write the ratio as $$\frac{p^*(\boldsymbol{x})}{q(\boldsymbol{x})} = \frac{p(\boldsymbol{x}|y=+1)}{p(\boldsymbol{x}|y=-1)}$$ Using Bayes' theorem to invert the conditional probabilities gives $$\frac{p(y=+1|\boldsymbol{x})p(\boldsymbol{x})}{p(y=+1)} \bigg/ \frac{p(y=-1|\boldsymbol{x})p(\boldsymbol{x})}{p(y=-1)}$$ so that $$r_{\phi} \propto \frac{p(y=+1|\boldsymbol{x})}{p(y=-1|\boldsymbol{x})}$$ Next, assume a scoring function (discriminator) $$D_{\theta}(\boldsymbol{x}) = p(y=+1|\boldsymbol{x})$$ Since we have binary labels it makes sense to use a standard Bernoulli loss: $$\mathcal{F}(\theta, \phi) = \mathbb{E}_{p^*(x)}[\log D_{\theta}(\boldsymbol{x})] + \mathbb{E}_{q_{\phi}(x)}[\log (1-D_{\theta}(\boldsymbol{x}))]$$ Our optimization can now be written $$\min_{\phi}\max_{\theta}\mathcal{F}(\theta, \phi)$$ This naturally leads to two alternating steps, one with respect to each set of parameters:

  1. Update $\theta$: $\nabla_{\theta} \mathbb{E}_{p^*(x)}[\log D_{\theta}(\boldsymbol{x})] + \nabla_{\theta} \mathbb{E}_{q_{\phi}(x)}[\log(1-D_{\theta}(\boldsymbol{x}))]$
  2. Update $\phi$: $-\nabla_{\phi}\mathbb{E}_{p(z)} [\log D_{\theta}(f_{\phi}(\boldsymbol{z}))]$

This form corresponds to the approach popularized by generative adversarial networks, where $D_{\theta}$, called the discriminator, and $f_{\phi}$, called the generator, are both neural networks.
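A minimal sketch of these alternating updates in PyTorch, using the non-saturating generator update from step 2; the networks and the stand-in data distribution are illustrative placeholders:

```python
import torch
import torch.nn as nn

d_z, d_x = 16, 2
generator = nn.Sequential(nn.Linear(d_z, 64), nn.ReLU(), nn.Linear(64, d_x))
discriminator = nn.Sequential(nn.Linear(d_x, 64), nn.ReLU(), nn.Linear(64, 1))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)

def real_batch(n=128):
    # Placeholder for samples from p*: a shifted 2-D Gaussian
    return torch.randn(n, d_x) + torch.tensor([2.0, 0.0])

for step in range(1000):
    # Comparison step: update theta so D separates real from generated samples
    x_real = real_batch()
    x_fake = generator(torch.randn(128, d_z)).detach()
    d_real = torch.sigmoid(discriminator(x_real))
    d_fake = torch.sigmoid(discriminator(x_fake))
    loss_d = -(torch.log(d_real).mean() + torch.log(1 - d_fake).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Estimation step: non-saturating update for phi
    x_gen = generator(torch.randn(128, d_z))
    loss_g = -torch.log(torch.sigmoid(discriminator(x_gen))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```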

Note that we can vary the comparison objective to get different generative models; this choice is closely related to which $D$ is minimized. It is also related to a class of divergences called integral probability metrics: $$M_{\mathcal{F}}(p,q)= \sup_{f\in \mathcal{F}}|\mathbb{E}_{p(x)}[f(x)]- \mathbb{E}_{q_{\theta}(x)}[f(x)]|$$ where the $f$ in question is sometimes referred to as the critic, test function, or witness (Mohamed et al 2018).
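As a concrete instance, the maximum mean discrepancy (MMD) is an integral probability metric whose critic class is the unit ball of a reproducing kernel Hilbert space, and it admits a simple closed-form estimator from samples (see Bińkowski et al 2018 for its use in GAN-style training). A small sketch with an RBF kernel and an arbitrary bandwidth:

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    """Biased estimator of MMD^2(p, q) from samples x ~ p and y ~ q."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2 * rbf_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.standard_normal((500, 2)), rng.standard_normal((500, 2)))
shifted = mmd2(rng.standard_normal((500, 2)), rng.standard_normal((500, 2)) + 1.0)
print(same, shifted)  # the shifted pair should give a clearly larger value
```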

References

Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN.

Arjovsky, M., & Bottou, L. (2017). Towards Principled Methods for Training Generative Adversarial Networks.

Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014 (pp. 1278–1286). Retrieved from http://jmlr.org/proceedings/papers/v32/rezende14.html

Mohamed, S., & Rezende, D. (2018). Tutorial on deep generative models. UAI 2017 Australia. Australia. Retrieved from http://www.shakirm.com/slides/DeepGenModelsTutorial.pdf

Bottou, L., Arjovsky, M., Lopez-Paz, D., & Oquab, M. (2017). Geometrical Insights for Implicit Generative Modeling.

Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2), 125–139. https://doi.org/10.1023/A:1008923215028

Johnson, M. J., Duvenaud, D., Wiltschko, A. B., Datta, S. R., & Adams, R. P. (2016). Composing graphical models with neural networks for structured representations and fast inference.

Siddharth, N., Paige, B., van de Meent, J.-W., Desmaison, A., Goodman, N. D., Kohli, P., … Torr, P. H. S. (2017). Learning Disentangled Representations with Semi-Supervised Deep Generative Models.

Li, Y., & Turner, R. E. (2016). Rényi Divergence Variational Inference.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative Adversarial Nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27 (pp. 2672–2680). Curran Associates, Inc. Retrieved from http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf

Rizzo, M. L., & Szekely, G. J. (2016). Energy Distance. WIREs Comput. Stat., 8(1), 27–38. https://doi.org/10.1002/wics.1375

Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. CoRR, abs/1312.6114. Retrieved from http://arxiv.org/abs/1312.6114

Kucukelbir, A., Ranganath, R., Gelman, A., & Blei, D. (2015). Automatic Variational Inference in Stan. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (pp. 568–576). Curran Associates, Inc.

Bińkowski, M., Sutherland, D. J., Arbel, M., & Gretton, A. (2018). Demystifying MMD GANs.

Ranganath, R., Gerrish, S., & Blei, D. M. (2014). Black Box Variational Inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014 (pp. 814–822). Retrieved from http://jmlr.org/proceedings/papers/v33/ranganath14.html

Gemici, M. C., Rezende, D. J., & Mohamed, S. (2016). Normalizing Flows on Riemannian Manifolds. CoRR, abs/1611.02304. Retrieved from http://arxiv.org/abs/1611.02304

Zolotarev, V. M. (1976). Metric Distances In Spaces Of Random Variables And Their Distributions. Mathematics of the USSR-Sbornik, 30(3), 373–401. https://doi.org/10.1070/sm1976v030n03abeh002280