Generative Neural Networks

Note: these notes are in a pretty questionable state (at least after the first few lectures) – there are lot of TODOs that I won’t do. If you’re taking this class and have better notes, I’d be happy to update the website.


This website contains my lecture notes from a lecture by Ullrich Köthe from the academic year 2023/2024 (University of Heidelberg). If you find something incorrect/unclear, or would like to contribute, feel free to submit a pull request (or let me know via email).

Introduction to generative modeling

What I cannot create, I do not understand. (Richard Feynman)

Warm up: 1D generative models

Basic principle: p(X)p(X) is complicated     \implies reduce it to something simple.

Apply to probability distribution:

Now we can substitute z=f(z),dz=f(x)dx,x0=f1(z0),x1=f1(z1)z = f(z), dz = f'(x) dx, x_0 = f^{-1}(z_0), x_1 = f^{-1}(z_1), getting f1(f(x0))f1(f(x1))q(z=f(x))f(x)  dx\int_{f^{-1}(f(x_0))}^{f^{-1}(f(x_1))} q(z=f(x)) f'(x)\;dx which must hold for any interval x0,x1x_0, x_1, meaning p(x)=q(z=f(x))f(x)\boxed{p(x) = q(z=f(x)) f'(x)} which is called the change-of-variables formula, allowing us to express the complex distribution in terms of the simple distribution, transformation and its derivative (this is where the fact that it’s always positive comes in handy, since it’s just a scaling factor).

For 1D, q(Z)uniform(0,1)q(Z) \sim \text{uniform}(0, 1) works well (numerically easy, a generator is available anywhere)

Ex.: exponential distribution p(X)={1vex/τx00otherwisep(X) = \begin{cases} \frac{1}{v}e^{-x/\tau} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}

for τ\tau hyperparameter vv normalization, which has to be v=0ex/τdx=τex/τx=0=τv = \int_{0}^{\infty} e^{-x/\tau} dx' = -\tau e^{-x/\tau} \mid_{x=0}^{\infty} = \boxed{\tau}

To get f(x)f(x), we again do the integral with upper bound being xx, getting 0x1τex/τdx=1ex/τ\int_{0}^{x} \frac{1}{\tau}e^{-x/\tau} dx' = 1 - e^{-x/\tau}

To sample from it, we solve for xx in the equation above and get f1(z)=τlog(1z)\boxed{f^{-1}(z) = -\tau \log(1-z)} which is either inverse CDF, quanti function or the percent-point function (PPF)

There was another example of a Gaussian PDF/CDF/PPF. Unlike the exponential, it the CDF doesn’t have a closed-form solution and has to be approximated.

Ex.: f(X)f(X) and f1(X)f^{-1}(X) need not be continuous (they can have jumps) – Bernoulli p(X)={+1probability π[0,1]1probability 1πp(X) = \begin{cases} +1 & \text{probability}\ \pi \in \left[0, 1\right] \\ -1 & \text{probability}\ 1 - \pi \\ \end{cases}

We want to make it continuous, we do embedding via the δ\delta-distribution p(x)=(1π)δ(x+1)+πδ(x1)p(x) = (1 - \pi) \delta(x + 1) + \pi \delta (x - 1)

Bernoulli example.

Ex.: spike-and-slab distribution p(X)={0probability π(nothing happened)N(μ,σ2)probability 1π(something happened)p(X) = \begin{cases} 0 & \text{probability}\ \pi & \text{(nothing happened)} \\ \mathcal{N}(\mu, \sigma^2) & \text{probability}\ 1 - \pi &\text{(something happened)} \end{cases}

Spike-and-slab example.

Classical approaches

What if we don’t know p(x)p^*(x) but we have a TS={Xip(X)}i=1NTS = \left\{X_i \sim p^*(X)\right\}_{i=1}^N?

  1. histogram – define Xmin=miniXiX_{\text{min}} = \min_{i} X_i (smallest value), h=2 IQR(TS)N3h = \frac{2\ \mathrm{IQR}(TS)}{\sqrt[3]{N}} (bin width)
    • the IQR(TS)\mathrm{IQR}(TS) “inter-quantile range” is obtained by sorting TS and getting the difference between values X[N4]X\left[\frac{N}{4}\right] and X[3N4]X\left[\frac{3N}{4}\right] (i.e. quarter to the left, quarter to the right)
    • the formula gives a good balance between number of bins vs. the number of items per bin binl={xi:xmin+(l1)hxi<xmin+lh}Nl=#binl\mathrm{bin}_l = \left\{x_i: x_{\text{min}} + (l - 1)h \le x_i < x_{\text{min}} + lh\right\} \qquad N_l = \# \mathrm{bin}_l p(X)=l=1L1[Xbinl]NlNh\boxed{p(X) = \sum_{l = 1}^{L} \mathrm{1}\left[X \in \mathrm{bin}_l\right] \cdot \frac{N_l}{N \cdot h}}
  2. generalization of a histogram – mixture distribution
    • idea: express a complicated p(X)p(X) as a superposition of LL simple pl(X)p_l(X), i.e. p(X)=l=1Lπlpl(X)p(X) = \sum_{l = 1}^{L} \pi_l p_l(X)
      • πl\pi_l is weight, plp_l is some simple distribution
      • πl0,l=1Lπl=1\pi_l \ge 0, \sum_{l = 1}^{L} \pi_l = 1 (has to be a probability distribution…)
    • for histogram, pl(X)p_l(X) is uniform
    • for Gaussians, we have Gaussian mixture model (GMM); solved with the EM algorithm
  3. kernel density estimation pl(X)=N(μl,σ2I)p_l(X) = \mathcal{N}(\mu_l, \sigma^2\mathbb{I}) (or any other simple distribution)
    • L=NL = N (one component per training sample)
    • μl=Xl,πl=1N\mu_l = X_l, \pi_l = \frac{1}{N}
    • similar to mixture distribution (has less distributions), here σ2\sigma^2 is the hyperparameter

Warmed up: more dimensions!

Extensions of classical approaches

  1. multi-variate standard normal p(X)=N(μ=0,Σ=I)=1(2π)D/2exp(XXT2)p(X) = \mathcal{N}\left(\mu=0, \Sigma=\mathbb{I}\right) = \frac{1}{\left(2\pi\right)^{D/2}}\mathrm{exp}\left(-\frac{X X^T}{2}\right)
    • can be expressed as a product of 1-D normal distributions: XXT=i=1Dxi2    exp(XXT2)=exp(iXi22)=i=1Dexp(xi22)X X^T = \sum_{i = 1}^{D} x^2_i \implies \mathrm{exp}\left(-\frac{X X^T}{2}\right) = \mathrm{exp}\left(-\frac{\sum_{i} X_i^2}{2}\right) = \prod_{i = 1}^{D} \mathrm{exp}\left(-\frac{x_i^2}{2}\right)
    • \Rightarrow sampling in DD dimensions boils down to many samplings in 1-D
  2. multi-variate Gaussian distribution with mean μ\mu, covariance Σ\Sigma p(X)=N(μ,Σ)=1det2πΣexp((Xμ)Σ1(Xμ)T2)p(X) = \mathcal{N}(\mu, \Sigma) = \frac{1}{\sqrt{\det 2\pi \Sigma}} \mathrm{exp}\left(-\frac{\left(X-\mu\right)\Sigma^{-1}\left(X - \mu\right)^T}{2}\right)
    • to sample, we compute the eigen decomposition of Σ=UΛUT\Sigma = U \Lambda U^T (for URD×DU \in \mathbb{R}^{D \times D} orthonormal and Λ\Lambda diagonal with eigen values) and do the following:
      • invert Σ1=UΛ1UT\Sigma^{-1} = U \Lambda^{-1} U^{T} (inverse is transposition for orthogonal)
      • then Λ1/2=(1λ11λD)\Lambda^{-1/2} = \begin{pmatrix} \frac{1}{\sqrt{\lambda_1}} & & \\ & \ddots & \\ & & \frac{1}{\sqrt{\lambda_D}} \end{pmatrix}
      • define new coordinates X~=(Xμ)UΛ1/2\tilde{X} = \left(X - \mu\right) \cdot U \Lambda^{-1/2}
      • then X~X~T=(Xμ)Σ1(Xμ)T\tilde{X}\tilde{X}^T = \left(X - \mu\right) \Sigma^{-1} \left(X - \mu\right)^T
    • we can sample DD 1-D gaussians and put them in the X~\tilde{X} vector, which we invert
      • X=X~Λ1/2UT+μX =\tilde{X} \Lambda^{1/2} U^T + \mu: reparametrization trick (from arbitraty Gaussian to std. normal)
    • in our language: Z=(Xμ)UΛ1/2N(0,I)    X=f1(Z)=ZΛ1/2UT+μN(μ,Σ)Z = \left(X - \mu\right) U \Lambda^{-1/2} \sim \mathcal{N}(0, \mathbb{I}) \implies X = f^{-1}(Z) = Z \Lambda^{1/2} U^T + \mu \sim \mathcal{N}(\mu, \Sigma)

What if p(X)p(X) must be learned?

Mixture model idea naturally generalizes to DD dimensions:

  1. histograms are defined by a regular grid with KK levels per dimension     L=kD\implies L = k^D (exponential, does not scale)
    • solution: define bins by recursive subdivision (eg. density tree)
    • quality of the models is only medium – each subdivision doubles the number but looks at only one variable, which is bad if you have 100s of variables… the number of coorelations to be considered is bounded by tree depth, which must be O(logN)\mathcal{O}\left(\log N\right)
  2. for Gaussians, we have to learn co-variance (instead of variance)     \implies change the EM algorithm accordingly
  3. kernel density estimation is also unchanged, but finding a bandwidth σ2\sigma^{2} that works equally well Xi\forall X_i is very difficult

Reducing p(X)p(X) to many 1-D distributions

Measuring model quality

Forward KL divergence

KL[pp]=p(X)logp(X)p(X) dx=Exp(X)forward[logp(X)p(X)]\mathrm{KL}\left[p^* \mid\mid p\right] = \int p^*(X) \log \frac{p^*(X)}{p(X)}\ dx = \mathbb{E}_{\underbrace{x \sim p^*(X)}_{\text{forward}}} \left[\log \frac{p^*(X)}{p(X)}\right]

A few useful properties:

It’s a divergence and not a distance (i.e. a metric) – not symmetric, triangular inequality doesn’t hold

There was an example here with a discrete distribution.

Caveat: if there are datapoints in XX s.t. p(X)>0p^*(X) > 0 but p(X)=0p(X) = 0, then KL=\mathrm{KL} = \infty

Relationship between forward KL and maximum likelihood training: KL[pp]=p(X)logp(X)dxH[p](entropy)p(X)logp(X)dxEXp(X)[logp(X)]\mathrm{KL}\left[p^* \mid \mid p\right] = \underbrace{\int p^*(X) \log p^*(X) dx}_{H\left[p^*\right] \text{(entropy)}} - \underbrace{\int p^*(X) \log p(X) dx}_{\mathbb{E}_{X \sim p^*(X) \left[-\log p(X)\right]}}

Given TS={Xip(X)}i=1NTS = \left\{X_i \sim p^*(X)\right\}_{i=1}^N, p^(X)arg minp(X)1Ni=1Nlogp(Xi)\boxed{\hat p(X) \approx \argmin_{p(X)} \frac{1}{N} \sum_{i=1}^{N} -\log p(X_i)}

Reverse KL divergence

KL[pp]=p(X)logp(X)p(X) dx=Exp(X)reverse[logp(X)p(X)]\mathrm{KL}\left[p \mid\mid p^*\right] = \int p(X) \log \frac{p(X)}{p^*(X)}\ dx = \mathbb{E}_{\underbrace{x \sim p(X)}_{\text{reverse}}} \left[\log \frac{p(X)}{p^*(X)}\right]

Comparison (forward x reverse)

Maximum Mean Discrepancy

The idea is to use the kernel trick

Typical kernels (for h,Lh, L hyperparameters):

Generative Modeling with NN

Here we have 3 conflicting goals:

  1. small codes: d=dim(Z)dim(X)=Dd = \mathrm{dim}(Z) \ll \mathrm{dim}(X) = D
  2. accurate reconstruction: dist(X,X~)0\mathrm{dist}(X, \tilde X) \approx 0 (for some distance function)
  3. preserve data distribution: dist(p(X),p(X^))0\mathrm{dist}(p^*(X), p(\hat X)) \approx 0 (for some distribution)

Note that 3⇏23 \not\Rightarrow 2 (shown during lecture).


Generating data

  1. expert learning of pE(Z)p_E(Z) by a second generative model
    • often simpler than learning p(X)p^*(X) directly (e.g. d<Dd < D)
    • Stable Diffusion (image generation) does this
  2. joined optimization:
    • predefine q(Z)q(Z) (desired code dimension)
    • measure MMD(pE(Z),q(Z))    \mathrm{MMD}\left(p_E(Z), q(Z)\right) \implies add new loss term
      • choosing a kernel is another hyperparameter f^,g^=arg minf,gEXp(X)[Xq(f(X))2]+λMMD(f#p,q)\hat f, \hat g = \argmin_{f, g} \mathbb{E}_{X \sim p^*(X)} \left[||X - q(f(X))||^2\right] + \lambda \mathrm{MMD}\left(f_\#p^*, q\right)
    • paper: variant of INfoVAE [2019]
  3. variational autoencoder (VAE) [2014]
    • idea: replace deterministic functions f(X)f(X) and q(Z)q(Z) with conditional distributions
      • encoder: pE(ZX)p_E(Z \mid X), decoder pD(XZ)p_D(X \mid Z)
      • we also have data p(X)p^*(X) and desired code dist. q(Z)q(Z)
    • implies two version of joint distribution of XX and ZZ
      • encoder pE(X,Z)=p(X)pE(ZX)p_E(X, Z) = p^*(X) \cdot p_E(Z \mid X) (Bayes)
      • decoder pD(X,Z)=q(Z)pD(XZ)p_D(X, Z) = q(Z) \cdot p_D(X \mid Z) (also Bayes)
    • two requirements:
      1. decoder marginal p(X)=pD(X,Z)  dzp(X)p(X) = \int p_D(X, Z)\;dz \approx p^*(X)
      2. encoder-decoder pair must be self-consistent: pE(X,Z)=pD(X,Z)p_E(X, Z) = p_D(X, Z)
    • here we will use the ELBO\mathrm{ELBO} loss: ELBO=KL[pE(ZX)q(Z)]how close is pE(ZX) to q(Z)+EpE(ZX)[logpD(XZ)]reconstruction error\mathrm{ELBO} = \underbrace{-\mathrm{KL}\left[p_E(Z \mid X) \mid\mid q(Z)\right]}_{\text{how close is $p_E(Z \mid X)$ to $q(Z)$}} + \underbrace{\mathbb{E}_{p_E(Z \mid X)} \left[\log p_D(X \mid Z)\right]}_{\text{reconstruction error}}
      • trade-off between two objectives:
        • reconstruction error minimized if encoder & decoder are deterministic with perfect reconstruction
        • first term minimized if pE(ZX)=q(Z)p_E(Z \mid X) = q(Z) – encoder ignores the data, which is the opposite of perfect reconstruction
    • maximizing ELBO\mathrm{ELBO} loss enforces (2) (proved during the lecture)
      • in literature, ELBO\mathrm{ELBO} is usually maximized
      • conceptually, minimizing ELBO-\mathrm{ELBO} is simpler
    • maximizing the ELBO\mathrm{ELBO} also indirectly maximizes the data likelihood under the model – ML principle: TS should be a typical model outcome
      •     \iff minimize expected negative log-likelihood (NLL) (proved during the lecture)
      • ELBO-\mathrm{ELBO} is an upper bound for NLL loss
    • most common implementation encoder and decoder are diagonal Gaussians: pE(ZX)=N(ZμE(X),Diag(σE(X)2))p_E(Z \mid X) = \mathcal{N}(Z \mid \mu_E(X), \mathrm{Diag}(\sigma_E(X)^2)) pD(XZ)=N(XμD(Z),β2I)p_D(X \mid Z) = \mathcal{N}(X \mid \mu_D(Z), \beta^2 \mathbb{I})
      • since we only have diagonal Gaussians, we can’t rotate the ellipses
      • since we only have Gaussians, the shape is restricted to circles
      • if q(Z)=N(0,I)q(Z) = \mathcal{N}(0, \mathbb{I}) then KL[pEg]\mathrm{KL}\left[p_E \mid\mid g\right] can be calculated analytically
      • if pD(XZ)=N(ZμD(Z),β2I)p_D(X \mid Z) = \mathcal{N}(Z \mid \mu_D(Z), \beta^2 \mathbb{I}), reconstruction error becomes squared loss
      • β21\beta^2 \gg 1 downscales squared loss     \implies reconstruction error unimportant
      • β21\beta^2 \ll 1 upscales squared loss     \implies reconstruction error dominant
      • generation: Zq(Z),XpD(XZ)    μD(Z)+β2εnoiseZ \sim q(Z), X \sim p_D(X \mid Z) \iff \mu_D(Z) + \overbrace{\beta^2 \varepsilon}^{\text{noise}}
      • inference: if p(X)=p(X)p(X) = p^*(X) and pE(X,Z)=pD(X,Z)p_E(X, Z) = p_D(X, Z) then pE(X,Z)=p(X)pE(ZX)=q(Z)pD(XZ)=pD(X,Z)p_E(X, Z) = p(X) p_E(Z \mid X) = q(Z) p_D(X \mid Z) = p_D(X, Z)     p(X)=q(Z)pD(XZ)pE(ZX)\implies p(X) = \frac{q(Z) p_D(X \mid Z)}{p_E(Z \mid X)} must give the same value for all ZpE(ZX)Z \sim p_E(Z \mid X)
  4. conditional VAE
    • instead of learning p(X)p(X), we learn p(XY)p(X \mid Y) for some variable YY
    • e.g. Y{0,,9}Y \in \left\{0, \ldots, 9\right\} for MNIST label p(XY)=q(Z)pD(XY,Z)  dz    p(X)=Yp(XY)p(Y)p(X \mid Y) = \int q(Z) \cdot p_D (X \mid Y, Z)\;dz \implies p(X) = \sum_{Y} p(X \mid Y) p^*(Y)
    • if encoder & decoder are Gaussians, add YY as input to the networks
    • we can supply different labels for encoding / decoding – style transfer
      • one digit in style of another / one image in the style of another
    • here we can do operative classification: test XX with unknown label, try encoding with every label and calculate the corresponding p(XY)p(X \mid Y), returning the one with maximum probability

Generative Adversarial Networks (GANs)

Normalized Flows (NF)

Goals Autoencoder VAE GAN NF
small codes hyperp. hyperp. hyperp. lossless
accurate distribution p(X)p(X)p(X) \approx p^*(X) bad (doesn’t care) trade-off good good
good reconstruction X^=D(E(X))X\hat X = D(E(X)) \approx X good trade-off can’t good

Idea: generalize the inverse transformation method to arbitrary dimensions:

Since consistency must hold for any AA, the integrals must be equal, we get the multi-variate change-of-variables formula p(X)=q(Z=f(X))  detJf(X)\boxed {p(X) = q(Z = f(X))\; | \det \mathcal{J}_f (X) | }


Recall the 11-D case: q(Z)=uniform(0,1)    f(X)=CDFp(X)q(Z) = \text{uniform}(0, 1) \implies f(X) = \mathrm{CDF}_{p^*}(X)

Two difficult problems in practice, covered in the next sections:

  1. how to ensure that f(X)f(X) is invertible & efficiently compute f1(Z)f^{-1}(Z)
  2. how to (efficiently) calculate detJf\det \mathcal{J}_f

(2) Calculating detJf\det \mathcal{J}_f

If f(X)f(X) is a multi-layer network, f(X)f(X) is a composition of functions f(l)f^{(l)}

(1) Ensuring that f(X)f(X) is invertible and computable

TODO: add the drawing here



Spline coupling

Conditional derivatives

Simulation-based inference (SBI)

Main tasks of SBI

  1. surrogate modelling: train a model p(XY)p(X \mid Y) that emulates the simulation
    • good for speed-up (since X=ϕ(Y,η)X = \phi(Y, \eta) is often slow)
    • forward inference: often, X=ϕ(Y,η)X = \phi(Y, \eta) only defines pS(XY)=ϕ#(Y,η)p^S(X \mid Y) = \phi_{\#} (Y, \eta) implicitly, but doesn’t allow to calculate pS(X=XY)p^{S}(X=X \mid Y) (“likelihood-free inference, implicit likelihood”)
      • approximate true likelihood by p(XY)pS(XY)p(X \mid Y) \approx p^S(X \mid Y)
  2. inverse inference: run the simulation backwards: Y=ϕ1(X)Y = \phi^{-1}(X)
    • usually intractable (no analytic solution) and/or ill-posed (no inverse)
    • traditional solution would be to pick constrainst & regularization that select one
    • SBI solution: pick probabilistic via Bayes rule pS(YX)=pS(Y)pS(XY)pS(X)p^S(Y \mid X) = \frac{p^S(Y) p^S(X \mid Y)}{p^S(X)}
      • define equivalence classes F(X)={Y:η with ϕ(Y,η)=X}\mathcal{F}(X) = \left\{Y : \exists \eta \ \text{with}\ \phi(Y, \eta) = X\right\}
        • all YYs that could have produced given XX
      • posterior pS(YX)p^S(Y \mid X) assigns a “possibility” to every YF(X)Y \in \mathcal{F}(X)
      • problems:
        • if likelihood pS(XY)p^S(X \mid Y) is only implicitly defined then Bayes rule canot be calculated (i.e. surrogate model above)
        • even if pS(XY)p^S(X \mid Y) (or a surrogate) is known, Bayes rules is usually intractable
          • \Rightarrow learn generative model for posterior p(YX)pS(YX)p(Y \mid X) \approx p^S(Y \mid X)
  3. model missclasification & outliner detection – a simulation is not reality: pS(Y)pS(XY)simulationp(Y)p(XY)reality\underbrace{p^S(Y) \cdot p^S(X \mid Y)}_{\text{simulation}} \approx \underbrace{p^*(Y)p^*(X \mid Y)}_{\text{reality}}