
Generative Neural Networks

Note: these notes are in a pretty questionable state (at least after the first few lectures) – there are a lot of TODOs that I won’t do. If you’re taking this class and have better notes, I’d be happy to update the website.

Preface

This website contains my lecture notes from a lecture by Ullrich Köthe from the academic year 2023/2024 (University of Heidelberg). If you find something incorrect/unclear, or would like to contribute, feel free to submit a pull request (or let me know via email).

Introduction to generative modeling

What I cannot create, I do not understand. (Richard Feynman)

Warm up: 1D generative models

Basic principle: $p(X)$ is complicated $\implies$ reduce it to something simple.

Apply this to a probability distribution – probability mass must be preserved under the transformation: $\int_{x_0}^{x_1} p(x)\,dx = \int_{z_0}^{z_1} q(z)\,dz$ for $z_0 = f(x_0)$, $z_1 = f(x_1)$.

Now we can substitute $z = f(x)$, $dz = f'(x)\,dx$, $x_0 = f^{-1}(z_0)$, $x_1 = f^{-1}(z_1)$, getting $\int_{x_0}^{x_1} q(z=f(x))\, f'(x)\;dx$, which must hold for any interval $x_0, x_1$, meaning $\boxed{p(x) = q(z=f(x))\, f'(x)}$, which is called the change-of-variables formula. It allows us to express the complex distribution in terms of the simple distribution, the transformation, and its derivative (this is where the fact that $f'(x)$ is always positive comes in handy, since it’s just a scaling factor).

For 1D, $q(Z) \sim \text{uniform}(0, 1)$ works well (numerically easy, a generator is available everywhere)

Ex.: exponential distribution $p(X) = \begin{cases} \frac{1}{v}e^{-x/\tau} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}$

for $\tau$ a hyperparameter and $v$ the normalization, which has to be $v = \int_{0}^{\infty} e^{-x/\tau}\, dx = -\tau e^{-x/\tau} \,\big|_{x=0}^{\infty} = \boxed{\tau}$

To get $f(x)$, we again do the integral with the upper bound being $x$, getting $\int_{0}^{x} \frac{1}{\tau}e^{-x'/\tau}\, dx' = 1 - e^{-x/\tau}$

To sample from it, we solve for $x$ in the equation above and get $\boxed{f^{-1}(z) = -\tau \log(1-z)}$, which is called the inverse CDF, the quantile function, or the percent-point function (PPF).
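As a sanity check, a minimal numpy sketch of this sampling recipe (the value of $\tau$ is an arbitrary placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(n, tau):
    """Inverse transform sampling: z ~ uniform(0, 1), x = f^{-1}(z) = -tau * log(1 - z)."""
    z = rng.uniform(0.0, 1.0, size=n)   # sample from the simple distribution q(Z)
    return -tau * np.log(1.0 - z)       # push through the PPF derived above

samples = sample_exponential(100_000, tau=2.0)
print(samples.mean())                   # should be close to tau = 2.0
```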

There was another example of a Gaussian PDF/CDF/PPF. Unlike the exponential, its CDF doesn’t have a closed-form solution and has to be approximated.

Ex.: $f(X)$ and $f^{-1}(X)$ need not be continuous (they can have jumps) – Bernoulli $p(X) = \begin{cases} +1 & \text{with probability}\ \pi \in \left[0, 1\right] \\ -1 & \text{with probability}\ 1 - \pi \end{cases}$

To make it continuous, we do an embedding via the $\delta$-distribution: $p(x) = (1 - \pi)\, \delta(x + 1) + \pi\, \delta(x - 1)$

Bernoulli example.

Ex.: spike-and-slab distribution $p(X) = \begin{cases} 0 & \text{with probability}\ \pi & \text{(nothing happened)} \\ \mathcal{N}(\mu, \sigma^2) & \text{with probability}\ 1 - \pi & \text{(something happened)} \end{cases}$

Spike-and-slab example.

Classical approaches

What if we don’t know $p^*(X)$ but we have a $TS = \left\{X_i \sim p^*(X)\right\}_{i=1}^N$?

  1. histogram – define $X_{\text{min}} = \min_{i} X_i$ (smallest value), $h = \frac{2\,\mathrm{IQR}(TS)}{\sqrt[3]{N}}$ (bin width)
    • the $\mathrm{IQR}(TS)$ “inter-quartile range” is obtained by sorting TS and taking the difference between the values $X\left[\frac{3N}{4}\right]$ and $X\left[\frac{N}{4}\right]$ (i.e. a quarter from the left, a quarter from the right)
    • the formula gives a good balance between the number of bins and the number of items per bin: $\mathrm{bin}_l = \left\{x_i: x_{\text{min}} + (l - 1)h \le x_i < x_{\text{min}} + lh\right\}$, $N_l = \# \mathrm{bin}_l$, $\boxed{p(X) = \sum_{l = 1}^{L} \mathbb{1}\left[X \in \mathrm{bin}_l\right] \cdot \frac{N_l}{N \cdot h}}$
  2. generalization of a histogram – mixture distribution
    • idea: express a complicated $p(X)$ as a superposition of $L$ simple $p_l(X)$, i.e. $p(X) = \sum_{l = 1}^{L} \pi_l\, p_l(X)$
      • $\pi_l$ is a weight, $p_l$ is some simple distribution
      • $\pi_l \ge 0$, $\sum_{l = 1}^{L} \pi_l = 1$ (it has to be a probability distribution…)
    • for a histogram, $p_l(X)$ is uniform
    • for Gaussians, we have Gaussian mixture model (GMM); solved with the EM algorithm
  3. kernel density estimation: $p_l(X) = \mathcal{N}(\mu_l, \sigma^2\mathbb{I})$ (or any other simple distribution) – see the sketch after this list
    • $L = N$ (one component per training sample)
    • $\mu_l = X_l$, $\pi_l = \frac{1}{N}$
    • similar to a mixture distribution (but with fewer distinct distributions); here $\sigma^2$ is the hyperparameter
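A minimal numpy sketch of estimators 1 and 3 above (histogram density with the bin-width rule, and a Gaussian KDE); the training set and the bandwidth $\sigma$ are placeholder choices:

```python
import numpy as np

rng = np.random.default_rng(0)
TS = rng.normal(0.0, 1.0, size=1000)           # training set X_i ~ p*(X) (placeholder)
N = len(TS)

# histogram: h = 2*IQR / N^(1/3), p(x) = N_l / (N*h) for the bin containing x
iqr = np.quantile(TS, 0.75) - np.quantile(TS, 0.25)
h = 2 * iqr / N ** (1 / 3)
x_min = TS.min()

def p_hist(x):
    counts = np.bincount(((TS - x_min) / h).astype(int))   # N_l for each bin
    l = np.floor((x - x_min) / h).astype(int)               # bin index of each query point
    inside = (l >= 0) & (l < len(counts))
    return np.where(inside, counts[np.clip(l, 0, len(counts) - 1)] / (N * h), 0.0)

# kernel density estimate: one Gaussian per sample, pi_l = 1/N, sigma is the hyperparameter
def p_kde(x, sigma=0.3):
    diff = x[:, None] - TS[None, :]
    return np.exp(-0.5 * (diff / sigma) ** 2).sum(axis=1) / (N * sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-3, 3, 5)
print(p_hist(xs), p_kde(xs))
```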

Warmed up: more dimensions!

Extensions of classical approaches

  1. multi-variate standard normal $p(X) = \mathcal{N}\left(\mu=0, \Sigma=\mathbb{I}\right) = \frac{1}{\left(2\pi\right)^{D/2}}\exp\left(-\frac{X X^T}{2}\right)$
    • can be expressed as a product of 1-D normal distributions: $X X^T = \sum_{i = 1}^{D} x^2_i \implies \exp\left(-\frac{X X^T}{2}\right) = \exp\left(-\frac{\sum_{i} x_i^2}{2}\right) = \prod_{i = 1}^{D} \exp\left(-\frac{x_i^2}{2}\right)$
    • $\Rightarrow$ sampling in $D$ dimensions boils down to many samplings in 1-D
  2. multi-variate Gaussian distribution with mean $\mu$, covariance $\Sigma$: $p(X) = \mathcal{N}(\mu, \Sigma) = \frac{1}{\sqrt{\det 2\pi \Sigma}} \exp\left(-\frac{\left(X-\mu\right)\Sigma^{-1}\left(X - \mu\right)^T}{2}\right)$
    • to sample, we compute the eigendecomposition $\Sigma = U \Lambda U^T$ (for $U \in \mathbb{R}^{D \times D}$ orthonormal and $\Lambda$ diagonal with the eigenvalues) and do the following:
      • invert: $\Sigma^{-1} = U \Lambda^{-1} U^{T}$ (for an orthogonal matrix, the inverse is the transpose)
      • then $\Lambda^{-1/2} = \begin{pmatrix} \frac{1}{\sqrt{\lambda_1}} & & \\ & \ddots & \\ & & \frac{1}{\sqrt{\lambda_D}} \end{pmatrix}$
      • define new coordinates $\tilde{X} = \left(X - \mu\right) \cdot U \Lambda^{-1/2}$
      • then $\tilde{X}\tilde{X}^T = \left(X-\mu\right) \Sigma^{-1} \left(X - \mu\right)^T$
    • we can sample $D$ 1-D Gaussians, put them in the $\tilde{X}$ vector, and invert the transformation
      • $X = \tilde{X} \Lambda^{1/2} U^T + \mu$: the reparametrization trick (reduces sampling an arbitrary Gaussian to sampling a std. normal)
    • in our language: $Z = \left(X - \mu\right) U \Lambda^{-1/2} \sim \mathcal{N}(0, \mathbb{I}) \implies X = f^{-1}(Z) = Z \Lambda^{1/2} U^T + \mu \sim \mathcal{N}(\mu, \Sigma)$
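A small numpy sketch of this sampling recipe; $\mu$ and $\Sigma$ below are arbitrary placeholders, and rows of the arrays are samples to match the row-vector convention above:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# eigendecomposition Sigma = U Lambda U^T
lam, U = np.linalg.eigh(Sigma)

# sample Z ~ N(0, I) and transform: X = Z Lambda^{1/2} U^T + mu
Z = rng.standard_normal((100_000, 2))
X = Z * np.sqrt(lam) @ U.T + mu

print(X.mean(axis=0))           # ~ mu
print(np.cov(X, rowvar=False))  # ~ Sigma
```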

What if $p(X)$ must be learned?

The mixture model idea naturally generalizes to $D$ dimensions:

  1. histograms are defined by a regular grid with $K$ levels per dimension $\implies L = K^D$ (exponential, does not scale)
    • solution: define bins by recursive subdivision (e.g. a density tree)
    • the quality of these models is only medium – each subdivision doubles the number of bins but looks at only one variable, which is bad if you have 100s of variables… the number of correlations that can be considered is bounded by the tree depth, which must be $\mathcal{O}\left(\log N\right)$
  2. for Gaussians, we have to learn the covariance (instead of the variance) $\implies$ change the EM algorithm accordingly
  3. kernel density estimation is also unchanged, but finding a bandwidth $\sigma^{2}$ that works equally well $\forall X_i$ is very difficult

Reducing $p(X)$ to many 1-D distributions

Measuring model quality

Forward KL divergence

$\mathrm{KL}\left[p^* \mid\mid p\right] = \int p^*(X) \log \frac{p^*(X)}{p(X)}\ dX = \mathbb{E}_{\underbrace{X \sim p^*(X)}_{\text{forward}}} \left[\log \frac{p^*(X)}{p(X)}\right]$

A few useful properties:

It’s a divergence and not a distance (i.e. a metric) – it is not symmetric and the triangle inequality doesn’t hold.

There was an example here with a discrete distribution.

Caveat: if there are points $X$ s.t. $p^*(X) > 0$ but $p(X) = 0$, then $\mathrm{KL} = \infty$

Relationship between forward KL and maximum likelihood training: $\mathrm{KL}\left[p^* \mid \mid p\right] = \underbrace{\int p^*(X) \log p^*(X)\, dX}_{-H\left[p^*\right] \text{ (neg. entropy, independent of } p\text{)}} \underbrace{- \int p^*(X) \log p(X)\, dX}_{\mathbb{E}_{X \sim p^*(X)} \left[-\log p(X)\right]}$

Given $TS = \left\{X_i \sim p^*(X)\right\}_{i=1}^N$: $\boxed{\hat p(X) \approx \argmin_{p(X)} \frac{1}{N} \sum_{i=1}^{N} -\log p(X_i)}$
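As a tiny illustration of this principle, the following numpy sketch fits the exponential distribution from the warm-up by minimizing the empirical NLL over a grid of $\tau$ values; the result should agree with the sample mean (the analytic ML estimate for the exponential):

```python
import numpy as np

rng = np.random.default_rng(0)
TS = rng.exponential(scale=2.0, size=10_000)    # X_i ~ p*(X) with true tau = 2

def nll(tau):
    # (1/N) * sum_i -log p(X_i) with p(x) = (1/tau) * exp(-x/tau)
    return np.mean(np.log(tau) + TS / tau)

taus = np.linspace(0.5, 5.0, 1000)
tau_hat = taus[np.argmin([nll(t) for t in taus])]
print(tau_hat, TS.mean())                        # both close to 2.0
```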

Reverse KL divergence

$\mathrm{KL}\left[p \mid\mid p^*\right] = \int p(X) \log \frac{p(X)}{p^*(X)}\ dX = \mathbb{E}_{\underbrace{X \sim p(X)}_{\text{reverse}}} \left[\log \frac{p(X)}{p^*(X)}\right]$

Comparison (forward x reverse)

Maximum Mean Discrepancy

The idea is to use the kernel trick

Typical kernels (for $h, L$ hyperparameters):
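To make the use of MMD concrete, here is a minimal numpy sketch of the (biased) empirical MMD² estimate between two sample sets, using a Gaussian kernel $k(a, b) = \exp(-\|a-b\|^2 / 2h^2)$ as one typical kernel choice, with the bandwidth $h$ as its hyperparameter:

```python
import numpy as np

def gaussian_kernel(A, B, h=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 h^2)) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h ** 2))

def mmd2(X, Y, h=1.0):
    """Biased empirical estimate of MMD^2(p, q) from samples X ~ p and Y ~ q."""
    return (gaussian_kernel(X, X, h).mean()
            + gaussian_kernel(Y, Y, h).mean()
            - 2 * gaussian_kernel(X, Y, h).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))
X2 = rng.normal(0.0, 1.0, size=(500, 2))
Y = rng.normal(0.5, 1.0, size=(500, 2))
print(mmd2(X, X2), mmd2(X, Y))   # small for the same distribution, larger for different ones
```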

Generative Modeling with NN

Here we have 3 conflicting goals:

  1. small codes: $d = \mathrm{dim}(Z) \ll \mathrm{dim}(X) = D$
  2. accurate reconstruction: $\mathrm{dist}(X, \tilde X) \approx 0$ (for some distance function)
  3. preserve data distribution: $\mathrm{dist}(p^*(X), p(\hat X)) \approx 0$ (for some distance between distributions)

Note that $3 \not\Rightarrow 2$ (shown during lecture).

Autoencoder

Generating data

  1. expert learning of $p_E(Z)$ by a second generative model
    • often simpler than learning $p^*(X)$ directly (e.g. $d < D$)
    • Stable Diffusion (image generation) does this
  2. joint optimization:
    • predefine $q(Z)$ (of the desired code dimension)
    • measure $\mathrm{MMD}\left(p_E(Z), q(Z)\right) \implies$ add a new loss term
      • choosing a kernel is another hyperparameter: $\hat f, \hat g = \argmin_{f, g} \mathbb{E}_{X \sim p^*(X)} \left[||X - g(f(X))||^2\right] + \lambda\, \mathrm{MMD}\left(f_\#p^*, q\right)$
    • paper: variant of InfoVAE [2019]
  3. variational autoencoder (VAE) [2014]
    • idea: replace the deterministic functions $f(X)$ and $g(Z)$ with conditional distributions
      • encoder: $p_E(Z \mid X)$, decoder: $p_D(X \mid Z)$
      • we also have the data distribution $p^*(X)$ and the desired code dist. $q(Z)$
    • this implies two versions of the joint distribution of $X$ and $Z$:
      • encoder: $p_E(X, Z) = p^*(X) \cdot p_E(Z \mid X)$ (Bayes)
      • decoder: $p_D(X, Z) = q(Z) \cdot p_D(X \mid Z)$ (also Bayes)
    • two requirements:
      1. decoder marginal $p(X) = \int p_D(X, Z)\;dZ \approx p^*(X)$
      2. the encoder-decoder pair must be self-consistent: $p_E(X, Z) = p_D(X, Z)$
    • here we will use the $\mathrm{ELBO}$ loss (a minimal implementation sketch follows after this list): $\mathrm{ELBO} = \underbrace{-\mathrm{KL}\left[p_E(Z \mid X) \mid\mid q(Z)\right]}_{\text{how close is } p_E(Z \mid X) \text{ to } q(Z)} + \underbrace{\mathbb{E}_{p_E(Z \mid X)} \left[\log p_D(X \mid Z)\right]}_{\text{reconstruction error}}$
      • trade-off between two objectives:
        • reconstruction error minimized if encoder & decoder are deterministic with perfect reconstruction
        • the first term is minimized if $p_E(Z \mid X) = q(Z)$ – the encoder ignores the data, which is the opposite of perfect reconstruction
    • maximizing the $\mathrm{ELBO}$ enforces (2) (proved during the lecture)
      • in the literature, the $\mathrm{ELBO}$ is usually maximized
      • conceptually, minimizing $-\mathrm{ELBO}$ is simpler
    • maximizing the $\mathrm{ELBO}$ also indirectly maximizes the data likelihood under the model – ML principle: the TS should be a typical model outcome
      • $\iff$ minimize the expected negative log-likelihood (NLL) (proved during the lecture)
      • $-\mathrm{ELBO}$ is an upper bound for the NLL loss
    • most common implementation: encoder and decoder are diagonal Gaussians $p_E(Z \mid X) = \mathcal{N}(Z \mid \mu_E(X), \mathrm{Diag}(\sigma_E(X)^2))$, $p_D(X \mid Z) = \mathcal{N}(X \mid \mu_D(Z), \beta^2 \mathbb{I})$
      • since we only have diagonal Gaussians, we can’t rotate the ellipses
      • since we only have Gaussians, the shape is restricted to circles
      • if $q(Z) = \mathcal{N}(0, \mathbb{I})$ then $\mathrm{KL}\left[p_E \mid\mid q\right]$ can be calculated analytically
      • if $p_D(X \mid Z) = \mathcal{N}(X \mid \mu_D(Z), \beta^2 \mathbb{I})$, the reconstruction error becomes a squared loss
      • $\beta^2 \gg 1$ downscales the squared loss $\implies$ reconstruction error unimportant
      • $\beta^2 \ll 1$ upscales the squared loss $\implies$ reconstruction error dominant
      • generation: $Z \sim q(Z),\ X \sim p_D(X \mid Z) \iff X = \mu_D(Z) + \overbrace{\beta \varepsilon}^{\text{noise}}$ with $\varepsilon \sim \mathcal{N}(0, \mathbb{I})$
      • inference: if $p(X) = p^*(X)$ and $p_E(X, Z) = p_D(X, Z)$ then $p_E(X, Z) = p(X)\, p_E(Z \mid X) = q(Z)\, p_D(X \mid Z) = p_D(X, Z)$ $\implies p(X) = \frac{q(Z)\, p_D(X \mid Z)}{p_E(Z \mid X)}$ must give the same value for all $Z \sim p_E(Z \mid X)$
  4. conditional VAE
    • instead of learning $p(X)$, we learn $p(X \mid Y)$ for some variable $Y$
    • e.g. $Y \in \left\{0, \ldots, 9\right\}$ for the MNIST label: $p(X \mid Y) = \int q(Z) \cdot p_D (X \mid Y, Z)\;dZ \implies p(X) = \sum_{Y} p(X \mid Y)\, p^*(Y)$
    • if encoder & decoder are Gaussians, add $Y$ as an input to the networks
    • we can supply different labels for encoding / decoding – style transfer
      • one digit in style of another / one image in the style of another
    • here we can do operative classification: for a test $X$ with unknown label, try encoding with every label and calculate the corresponding $p(X \mid Y)$, returning the label with maximum probability
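A minimal PyTorch sketch of the diagonal-Gaussian VAE described above (layer sizes and $\beta$ are placeholder choices): it computes $-\mathrm{ELBO}$ as the analytic KL to $q(Z) = \mathcal{N}(0, \mathbb{I})$ plus the $\beta$-weighted squared reconstruction loss, using the reparametrization trick for the expectation:

```python
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    def __init__(self, D=784, d=16, hidden=256, beta=1.0):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(D, hidden), nn.ReLU(), nn.Linear(hidden, 2 * d))
        self.dec = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, D))
        self.beta = beta   # beta^2 is the (fixed) decoder variance

    def neg_elbo(self, x):
        mu_E, log_var = self.enc(x).chunk(2, dim=-1)       # p_E(Z|X) = N(mu_E, Diag(sigma_E^2))
        # analytic KL[N(mu_E, sigma_E^2) || N(0, I)]
        kl = 0.5 * (mu_E ** 2 + log_var.exp() - log_var - 1).sum(-1)
        # reparametrization: z = mu_E + sigma_E * eps, eps ~ N(0, I)
        z = mu_E + torch.randn_like(mu_E) * (0.5 * log_var).exp()
        mu_D = self.dec(z)                                  # p_D(X|Z) = N(mu_D, beta^2 I)
        # -log p_D(X|Z) up to constants: squared loss scaled by 1/(2 beta^2)
        rec = ((x - mu_D) ** 2).sum(-1) / (2 * self.beta ** 2)
        return (kl + rec).mean()

vae = GaussianVAE()
x = torch.rand(8, 784)          # placeholder batch
loss = vae.neg_elbo(x)
loss.backward()
```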

Generative Adversarial Networks (GANs)

Normalized Flows (NF)

| Goals | Autoencoder | VAE | GAN | NF |
| --- | --- | --- | --- | --- |
| small codes | hyperp. | hyperp. | hyperp. | lossless |
| accurate distribution $p(X) \approx p^*(X)$ | bad (doesn’t care) | trade-off | good | good |
| good reconstruction $\hat X = D(E(X)) \approx X$ | good | trade-off | can’t | good |

Idea: generalize the inverse transformation method to arbitrary dimensions:

Since consistency must hold for any $A$, the integrands must be equal, and we get the multi-variate change-of-variables formula $\boxed{p(X) = q(Z = f(X))\; \left| \det \mathcal{J}_f (X) \right|}$

Goal:

Recall the 1-D case: $q(Z) = \text{uniform}(0, 1) \implies f(X) = \mathrm{CDF}_{p^*}(X)$

Two difficult problems in practice, covered in the next sections:

  1. how to ensure that $f(X)$ is invertible & how to efficiently compute $f^{-1}(Z)$
  2. how to (efficiently) calculate $\det \mathcal{J}_f$
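Both problems are addressed by coupling layers (e.g. RealNVP below). As a preview, here is a minimal numpy sketch of a single affine coupling layer, with an arbitrary placeholder network producing $s$ and $t$: the split makes the Jacobian triangular, so the log-determinant is just a sum, and the inverse is explicit:

```python
import numpy as np

def coupling_forward(x, s_t_net):
    """Affine coupling: keep x1, transform x2 elementwise using s(x1), t(x1)."""
    x1, x2 = np.split(x, 2, axis=-1)
    s, t = s_t_net(x1)                  # any (even non-invertible) network is allowed here
    z = np.concatenate([x1, x2 * np.exp(s) + t], axis=-1)
    log_det = s.sum(axis=-1)            # triangular Jacobian: log|det J| = sum(s)
    return z, log_det

def coupling_inverse(z, s_t_net):
    z1, z2 = np.split(z, 2, axis=-1)
    s, t = s_t_net(z1)                  # same network, evaluated on the untouched half
    return np.concatenate([z1, (z2 - t) * np.exp(-s)], axis=-1)

# placeholder "network": a fixed random map producing s and t from x1
rng = np.random.default_rng(0)
W_s, W_t = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
s_t_net = lambda x1: (np.tanh(x1 @ W_s), x1 @ W_t)

x = rng.normal(size=(4, 4))
z, log_det = coupling_forward(x, s_t_net)
print(np.allclose(coupling_inverse(z, s_t_net), x))   # True: exact inversion
```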

(2) Calculating $\det \mathcal{J}_f$

If $f(X)$ is a multi-layer network, $f(X)$ is a composition of functions $f^{(l)}$, so by the chain rule $\det \mathcal{J}_f = \prod_l \det \mathcal{J}_{f^{(l)}}$ and the log-determinants of the layers simply add up.

(1) Ensuring that $f(X)$ is invertible and computable

TODO: add the drawing here

Summary

RealNVP

Spline coupling

Conditional derivatives

Simulation-based inference (SBI)

Main tasks of SBI

  1. surrogate modelling: train a model $p(X \mid Y)$ that emulates the simulation
    • good for speed-up (since $X = \phi(Y, \eta)$ is often slow)
    • forward inference: often, $X = \phi(Y, \eta)$ only defines $p^S(X \mid Y) = \phi_{\#} (Y, \eta)$ implicitly, but doesn’t allow evaluating the likelihood $p^S(X = x \mid Y)$ (“likelihood-free inference”, “implicit likelihood”)
      • approximate the true likelihood by $p(X \mid Y) \approx p^S(X \mid Y)$
  2. inverse inference: run the simulation backwards: $Y = \phi^{-1}(X)$
    • usually intractable (no analytic solution) and/or ill-posed (no inverse)
    • the traditional solution would be to pick constraints & regularization that select one solution
    • the SBI solution: go probabilistic via Bayes’ rule $p^S(Y \mid X) = \frac{p^S(Y)\, p^S(X \mid Y)}{p^S(X)}$
      • define equivalence classes $\mathcal{F}(X) = \left\{Y : \exists \eta \ \text{with}\ \phi(Y, \eta) = X\right\}$
        • all $Y$s that could have produced the given $X$
      • the posterior $p^S(Y \mid X)$ assigns a “possibility” to every $Y \in \mathcal{F}(X)$
      • problems:
        • if the likelihood $p^S(X \mid Y)$ is only implicitly defined, then Bayes’ rule cannot be calculated (hence the surrogate model above)
        • even if $p^S(X \mid Y)$ (or a surrogate) is known, Bayes’ rule is usually intractable
          • $\Rightarrow$ learn a generative model for the posterior, $p(Y \mid X) \approx p^S(Y \mid X)$
  3. model misspecification & outlier detection – a simulation is not reality: $\underbrace{p^S(Y) \cdot p^S(X \mid Y)}_{\text{simulation}} \approx \underbrace{p^*(Y)\, p^*(X \mid Y)}_{\text{reality}}$
    • $\Rightarrow$ use SBI to detect if $p^S(X, Y) \neq p^*(X, Y)$
    • i.e. is an observed outcome $X^{\text{obs}} \sim p^*(X, Y)$ compatible with $p^S(X, Y)$?
      • if not, the simulation is unrealistic – “simulation gap”
      • is a set of outcomes $\left\{X_{i}^{\text{obs}}\right\}_{i=1}^N$ compatible with $p^S(X, Y)$?
  4. model comparison and selection
    • if we have competing theories $X = \phi^{(l)}(Y^{(l)}, \eta^{(l)})$
    • $\Rightarrow$ determine which $l$ describes $X^{\text{obs}}$ best (if any)
  5. digital twins: in a mixed-effects setting, given $\left\{X_i^{\text{obs}}\right\}_{i=1}^N$
    • determine $Y_G$ and $Y_{Li}$ accurately enough to predict $X_i^{\text{future}} = \phi(Y_G, Y_{Li}, \eta)$
      • e.g. whether the same treatment that worked on one patient will work on this one too
    • classical: base treatment decisions mainly on $Y_G$ (“treatment guidelines”) after an appropriate stratification of the population into subgroups
    • desired: “precision medicine” – use both $Y_G$ and $Y_{Li}$
  6. experimental design and active learning: how should we measure $\left\{X_i^{\text{obs}}\right\}_{i=1}^N$ to learn as much as possible about $Y$ with a given experiment budget?

Classical approaches for inverse inference

  1. conjugate priors – choose $p^S(Y)$ and $p^S(X \mid Y)$ such that $p^S(Y \mid X)$ can be analytically calculated and is in the same distribution family as $p^S(Y)$ ($\Rightarrow$ incremental Bayesian updating)
    • common is the Gaussian (surprise, surprise) but conjugate priors exist for many other distributions
    • + efficient and mathematically elegant
    • - very unrealistic $\Rightarrow$ big simulation gap (usually picked for convenience)
  2. likelihood-based inference – $p^S(Y)$ and $p^S(X \mid Y)$ are known (not just $X = \phi(Y, \eta)$)
    • but $p^S(Y \mid X)$ is intractable (so we can’t use Bayes’ rule directly)
    • $\Rightarrow$ create a sample $\left\{Y^{(t)} \sim p^S(Y \mid X=X^{\text{obs}})\right\}_{t=1}^T$ using Markov chain Monte Carlo (MCMC)
      • important relaxation: $p^S(X, Y)$ can be unnormalized (e.g. a Gibbs distribution)
    • + very general, has lots of mathematical theory
    • + implemented in many libraries/languages
    • - not amortized – runs from scratch for each new $X^{\text{obs}}$
    • - expensive – $T$ must be very large for complete “mixing”, i.e. until all modes of $p^S(Y \mid X^{\text{obs}})$ have been covered
    • - often, a long “burn-in” phase is needed to forget a bad initial guess
      • throw away $Y^{(0)} \ldots Y^{(T_0)}$
    • - consecutive samples $Y^{(t)}$ and $Y^{(t-1)}$ are close to each other (the chain moves only slightly away from the previous guess, and they are identical whenever a proposal is rejected) – this may bias derived statistics of the chain
      • remedy: keep only every $k$-th sample of the chain (thinning)
    • - it is often difficult to define a proposal distribution $p(Y' \mid Y^{(t-1)})$ that has a low rejection rate
    • $\Rightarrow$ only applicable when $\mathrm{dim}(Y)$ is not large and $p^S(X \mid Y)$ is not too slow

Algorithm (MCMC):

  1. $X^{\text{obs}}$ given, $Y^{(0)}$ arbitrary initial guess
    • pick a “proposal transition distribution” for the transition prob. $q(Y' \mid Y^{(t-1)}, X^{\text{obs}})$
      • it is a hyperparameter, e.g. a Gaussian – something easy to work with
  2. for $t=1, \ldots, T$:
    • sample a proposal $Y' \sim q(Y' \mid Y^{(t-1)}, X^{\text{obs}})$
    • calculate the acceptance weight $\alpha = \frac{p^S(X^{\text{obs}} \mid Y')\, p^S(Y')}{p^S(X^{\text{obs}} \mid Y^{(t-1)})\, p^S(Y^{(t-1)})} = \underbrace{\frac{p^S(Y' \mid X^{\text{obs}})}{p^S(Y^{(t-1)} \mid X^{\text{obs}})}}_{\text{individually intractable, but the ratio is computable}}$
    • sample $u \sim \text{uniform}(0, 1)$ – acceptance threshold
    • if $u \le \alpha$: accept ($Y^{(t)} = Y'$), else reject ($Y^{(t)} = Y^{(t-1)}$)
      • theory: for $T \rightarrow \infty$, $\left\{Y^{(t)}\right\}_{t=1}^T \sim p^S(Y \mid X^{\text{obs}})$
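A minimal numpy sketch of this algorithm with a symmetric Gaussian random-walk proposal (so the proposal terms cancel in $\alpha$); the unnormalized log-posterior below is an arbitrary placeholder:

```python
import numpy as np

def log_posterior_unnorm(y):
    """Placeholder for log p^S(X_obs | y) + log p^S(y); any unnormalized density works."""
    return -0.5 * ((y - 3.0) ** 2).sum()

def metropolis(T=50_000, step=0.5, burn_in=5_000, thin=10, dim=1, seed=0):
    rng = np.random.default_rng(seed)
    y = np.zeros(dim)                                  # arbitrary initial guess Y^(0)
    chain = []
    for t in range(T):
        y_prop = y + step * rng.standard_normal(dim)   # proposal q(Y' | Y^(t-1))
        log_alpha = log_posterior_unnorm(y_prop) - log_posterior_unnorm(y)
        if np.log(rng.uniform()) <= log_alpha:         # accept with probability min(1, alpha)
            y = y_prop
        chain.append(y)
    return np.array(chain[burn_in::thin])              # discard burn-in, keep every thin-th sample

samples = metropolis()
print(samples.mean(), samples.std())   # ~3.0 and ~1.0 for the placeholder posterior
```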
  3. likelihood-free inference – $p^S(X \mid Y)$ is unknown and only implicitly defined as $\phi_{\#}\, p^S(Y, \eta)$
    • we only have samples $X = \phi(Y, \eta)$ with $Y, \eta \sim p^S(Y, \eta)$
    • Approximate Bayesian Computation (ABC) $\hat{=}$ brute force
    • requires a distance $\mathrm{dist}(X, X^{\text{obs}})$ between outcomes, usually hand-crafted
    • + works when MCMC doesn’t
    • + implemented in many libraries/languages
    • - very slow – if $\varepsilon$ is small, the model is more precise but also slower
    • - not amortized
    • - hard to define $\mathrm{dist}$ if $X$ is complicated or $\mathrm{dim}(X)$ is high
      • remedy: design hand-crafted summary statistics $h(X)$ and compare $\mathrm{dist}(h(X), h(X^{\text{obs}}))$
    • $\Rightarrow$ only applicable when $\mathrm{dim}(Y)$ is not large

Algorithm (ABC):

  1. $t=0$, sample set $\left\{\right\}$ (it will later become $\left\{Y^{(t)} \sim p^S(Y \mid X^{\text{obs}})\right\}_{t=1}^T$)
  2. repeat until $t=T$:
    • sample $Y, \eta \sim p^S(Y, \eta)$, simulate $X = \phi(Y, \eta)$
    • if $\mathrm{dist}(X, X^{\text{obs}}) \le \varepsilon$, add $Y$ to the samples and increase $t$, else reject
      • if we were lucky with YY, we accept; otherwise reject
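A minimal numpy sketch of ABC rejection sampling; the prior, the simulator $\phi$, and the tolerance $\varepsilon$ are placeholder choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def prior():                      # placeholder for p^S(Y)
    return rng.uniform(0.0, 10.0)

def simulate(y):                  # placeholder for X = phi(Y, eta), eta = observation noise
    return y + rng.normal(0.0, 1.0)

x_obs = 4.2
T, eps = 1000, 0.1
accepted = []
while len(accepted) < T:
    y = prior()
    x = simulate(y)
    if abs(x - x_obs) <= eps:     # dist(X, X_obs) <= eps -> accept
        accepted.append(y)

samples = np.array(accepted)
print(samples.mean())             # approximate posterior mean of p^S(Y | X_obs)
```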

Doing it better with NN

Claw.

Validation of generative models (especially SBI)

Validation of SBI

There is a missing lecture here! See slides for what was in it.

Epidemiology: SIR model

New lecture here, we’re doing SBI for epidemiology again.

Making things more complex (because we can)

Checking if the simulation is realistic

Idea: if $X^{\mathrm{obs}}$ is an outlier of the marginal then the networks have not seen data like it during training:

  1. learn a surrogate for $p(X) \approx p^S(X)$ (expensive, big neural network)
  2. get outlier detection almost for free: we have a summary network $h(X)$

Continuing with new lecture here.

Model misspecification detection: is $x^{\mathrm{obs}}$ an outlier?

Model comparison & selection (between competing theories): measuring training error might not be enough because of overfitting.

Some more stuff was here but brain small.

External validation algorithm:

Parameter degeneracy

Lecture talks about SoftFlow here which addresses this issue for cNFs.

This lecture had slides with some interesting use cases of the methods that we previously discussed.

Hierarchical models for SBI

Here we discuss Estimation of a non-linear mixed effects model [Arruda et al 2023].

SINDy-Autoencoders

Two lectures were spent on this.

Physics-informed Neural Networks (PINNs)

Why is this useful?

Example: growth data (real, not simulated)

TODO: image here

The regression problem is now just $\hat f, \hat \lambda = \argmin_{f, \lambda}\ \nu_{\mathrm{data}}\, \mathrm{loss}_{\mathrm{data}} + \nu_{\mathrm{ODE}}\, \mathrm{loss}_{\mathrm{ODE}}$
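A minimal PyTorch sketch of this objective for a hypothetical logistic-growth ODE $\dot f = \lambda f (1 - f)$ (the ODE, the data, and the weights $\nu$ are placeholder assumptions, not necessarily the lecture’s exact example); the ODE residual at collocation points is obtained with autograd:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))   # f(t)
log_lam = torch.zeros(1, requires_grad=True)                         # learnable ODE parameter

t_data = torch.linspace(0, 5, 20).reshape(-1, 1)                     # placeholder measurements
y_data = 1 / (1 + 9 * torch.exp(-t_data))                            # noiseless logistic curve
t_coll = torch.linspace(0, 5, 200).reshape(-1, 1).requires_grad_()   # collocation points

opt = torch.optim.Adam(list(net.parameters()) + [log_lam], lr=1e-3)
nu_data, nu_ode = 1.0, 1.0

for step in range(2000):
    opt.zero_grad()
    loss_data = ((net(t_data) - y_data) ** 2).mean()                 # fit the measurements
    f = net(t_coll)
    df_dt = torch.autograd.grad(f.sum(), t_coll, create_graph=True)[0]
    residual = df_dt - log_lam.exp() * f * (1 - f)                   # ODE residual
    loss_ode = (residual ** 2).mean()
    (nu_data * loss_data + nu_ode * loss_ode).backward()
    opt.step()

print(log_lam.exp().item())   # estimate of lambda; approaches 1.0 with enough training
```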

The lecture talks here about the general case for ODEs.

Benefits of using NNs:

Disadvantages of using NNs:

Tips and tricks:

PINN-related things (training, variants).

State-of-the-art generative modeling

Free-form flows

The lecture shows the derivation for why this is the case here.

$\boxed{\nabla_\Theta\, \ell_{\mathrm{NLL}} \approx \mathbb{E}_{x \sim p(x)} \left[ \nabla_\Theta \left(- \log g(z=f_\Theta(x))\right) - \nabla_\Theta \left(\text{stop-gradient}\left(\text{vjp}(g_\Theta, z=f_\Theta(x), v)\right) \cdot \text{jvp}(f, x, v)\right) \right]}$ $\qquad \ell_{\mathrm{total}} = \ell_{\mathrm{NLL}} + \beta\, \ell_{\mathrm{rec}}$

Injective flows

Fever dream.

Bottleneck Free-form Flows

We can formalize the question:

Results: injective FFFs are competitive on standard benchmarks

FFFs for data on a known manifold $\mathcal{M}$ (we don’t have to learn it)

TODO image here