Generative Neural Networks
- Introduction to generative modeling
- Warm up: 1D generative models
- Warmed up: more dimensions!
- Measuring model quality
- Generative Modeling with NN
- Generative Adversarial Networks (GANs)
- Normalizing Flows (NF)
- Simulation-based inference (SBI)
- Epidemiology: SIR model
- SINDy-Autoencoders
- Physics-informed Neural Networks (PINNs)
- State-of-the-art generative modeling
- Free-form flows
- Injective flows
- Bottleneck Free-form Flows
Note: these notes are in a pretty questionable state (at least after the first few lectures) – there are a lot of TODOs that I won’t do. If you’re taking this class and have better notes, I’d be happy to update the website.
Preface
This website contains my lecture notes from a lecture by Ullrich Köthe from the academic year 2023/2024 (University of Heidelberg). If you find something incorrect/unclear, or would like to contribute, feel free to submit a pull request (or let me know via email).
Introduction to generative modeling
- assumption:
- “nature” works according to some hidden mechanisms (a complicated probability distribution $p^*(x)$)
- the data $x$ is sampled from the true probability / generating process $p^*(x)$ (or $p_{\text{data}}(x)$)
- we have a training set $TS = \{x_i\}_{i=1}^N$ with $x_i \sim p^*(x)$
- use $TS$ to learn
- goal:
- find an approximation $p_\theta(x) \approx p^*(x)$ (for model parameters $\theta$)
- two aspects of generative modeling (neural network can usually do both):
- inference – given some data instance $x$, calculate the value of $p_\theta(x)$
- generation – create synthetic data which is indistinguishable from real data
- downstream benefits of generative modelling
- powerful tool for humans (e.g. chat bot as personal teacher)
- great for things that are well-established
- poor for things that aren’t (tends to hallucinate)
- helps to produce insight: identify important factors influencing the outcome
- ex. “symbolic regression”: find / learn analytic formulas to explain reality (vanilla neural networks are black boxes)
- use for decision making: is treatment $A$ better than treatment $B$ for patient $x$?
What I cannot create, I do not understand. (Richard Feynman)
Warm up: 1D generative models
Basic principle: $p^*(x)$ is complicated $\Rightarrow$ reduce it to something simple.
- introduce a new variable $z$ (“code”) via $z = f(x)$ for a deterministic function $f$ s.t. the distribution $p(z)$ is simple
- we’ll actually be doing it the other way – pick $p(z)$ and learn $f$ ($f$ will be a NN)
- for generative modeling to work, we additionally require that $f$ is invertible
- if $z = f(x)$ – inference direction
- if $x = f^{-1}(z)$ – generative direction
- called “inverse transform sampling”
- a consequence in 1-D is that $f$ must be a monotonic function (increasing by convention)
- (strictly monotonic requires $f'(x) > 0$)
- universal property of monotonic functions (saw proof in the lecture):
Apply to probability distribution:
- define the probability mass in the interval $[x_1, x_2]$ as $\int_{x_1}^{x_2} p^*(x)\,dx$
- for the Gaussian distribution, $[\mu - \sigma, \mu + \sigma]$ gives about $68\%$ ($[\mu - 2\sigma, \mu + 2\sigma]$ gives $\approx 95\%$)
- $f$ is permissible if for every interval $[x_1, x_2]$: $\int_{x_1}^{x_2} p^*(x)\,dx = \int_{f(x_1)}^{f(x_2)} p(z)\,dz$
Now we can substitute $z = f(x)$, $dz = f'(x)\,dx$, getting $\int_{x_1}^{x_2} p^*(x)\,dx = \int_{x_1}^{x_2} p(f(x))\,f'(x)\,dx$, which must hold for any interval $[x_1, x_2]$, meaning $p^*(x) = p(f(x))\,f'(x)$, which is called the change-of-variables formula, allowing us to express the complex distribution $p^*$ in terms of the simple distribution $p$, the transformation $f$ and its derivative $f'$ (this is where the fact that $f'$ is always positive comes in handy, since it’s just a scaling factor).
For 1D, $p(z) = \text{Uniform}(0, 1)$ works well (numerically easy, a generator is available anywhere)
- e.g. Mersenne twister (based on Mersenne primes)
- for that, we get $p^*(x) = p(f(x))\,f'(x) = f'(x)$ and so $f$ is the CDF: $f(x) = \int_{-\infty}^{x} p^*(x')\,dx'$
- if $x \to -\infty$ then $f(x) \to 0$ (integral over nothing)
- if $x \to +\infty$ then $f(x) \to 1$ (integral over everything)
- if $x_1 \le x_2$ then $f(x_1) \le f(x_2)$ because $p^*(x) \ge 0$
Ex.: exponential distribution
$p^*(x) = \lambda\, e^{-\lambda x}$ (for $x \ge 0$) with hyperparameter $\lambda$ and normalization constant $\lambda$, which has to be chosen such that the density integrates to $1$.
To get $f$, we again do the integral with the upper bound being $x$, getting $f(x) = \int_0^x \lambda e^{-\lambda x'}\,dx' = 1 - e^{-\lambda x}$.
To sample from it, we solve for $x$ in the equation above ($z = 1 - e^{-\lambda x}$) and get $x = f^{-1}(z) = -\frac{1}{\lambda}\ln(1 - z)$, which is called either the inverse CDF, the quantile function, or the percent-point function (PPF).
There was another example of a Gaussian PDF/CDF/PPF. Unlike the exponential, its CDF doesn’t have a closed-form solution and has to be approximated.
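A minimal numpy sketch of inverse transform sampling for the exponential example above (the rate $\lambda$ and the sample size are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.5            # rate hyperparameter of the exponential distribution (arbitrary choice)
n = 100_000

# generative direction: sample z ~ Uniform(0, 1), then push it through the inverse CDF (PPF)
z = rng.uniform(0.0, 1.0, size=n)
x = -np.log(1.0 - z) / lam          # x = f^{-1}(z) for f(x) = 1 - exp(-lam * x)

# sanity check: the empirical mean should be close to 1/lam
print(x.mean(), 1.0 / lam)
```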
Ex.: $p^*(x)$ and $f(x)$ need not be continuous (they can have jumps) – Bernoulli distribution
If we want to make it continuous, we do an embedding via the $\delta$-distribution.
- the delta function is very handy: for any function $g$, $\int g(x)\,\delta(x - x_0)\,dx = g(x_0)$
- $\delta$-peaks are atoms – their probability is not zero (unlike for continuous densities like the Gaussian)
Ex.: spike-and-slab distribution
Classical approaches
What if we don’t know $p^*(x)$ but we have a training set $TS$?
- histogram – define $x_{\min}$ (smallest value), $\Delta x$ (bin width)
- the “inter-quartile range” (IQR) is obtained by sorting $TS$ and taking the difference between the values at positions $\tfrac{3}{4}N$ and $\tfrac{1}{4}N$ (i.e. a quarter from the left, a quarter from the right)
- the formula $\Delta x = 2\,\mathrm{IQR}\,N^{-1/3}$ (Freedman–Diaconis) gives a good balance between the number of bins vs. the number of items per bin
- generalization of a histogram – mixture distribution
- idea: express a complicated $p(x)$ as a superposition of simple $p_k(x)$, i.e. $p(x) = \sum_{k=1}^K \pi_k\, p_k(x)$
- $\pi_k$ is a weight, $p_k(x)$ is some simple distribution
- $\pi_k \ge 0$, $\sum_k \pi_k = 1$ (the result has to be a probability distribution…)
- for a histogram, $p_k$ is uniform on bin $k$
- for Gaussians, we have Gaussian mixture model (GMM); solved with the EM algorithm
- kernel density estimation – place a Gaussian (or any other simple distribution) on every training sample
- $p(x) = \frac{1}{N}\sum_{i=1}^N \mathcal{N}(x;\, x_i, \sigma^2)$ (one component per training sample)
- similar to a mixture distribution (which has fewer components); here the bandwidth $\sigma$ is the hyperparameter
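A small numpy sketch of kernel density estimation on a toy 1-D training set, including sampling from it as a mixture (the bandwidth $\sigma$ and the toy data are ad-hoc choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D training set from an unknown bimodal "true" distribution
TS = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 200)])
N = len(TS)

# kernel density estimate: one Gaussian component per training sample,
# the bandwidth sigma is the only hyperparameter (ad-hoc value here)
sigma = 0.3

def kde_pdf(x):
    x = np.atleast_1d(x)[:, None]                     # shape (M, 1), broadcast against TS
    return np.exp(-0.5 * ((x - TS) / sigma) ** 2).sum(axis=1) / (N * sigma * np.sqrt(2 * np.pi))

# sampling from the KDE = sampling from a mixture:
# pick a component (training sample) uniformly, then add Gaussian noise
idx = rng.integers(0, N, size=1000)
samples = TS[idx] + sigma * rng.normal(size=1000)

print(kde_pdf(0.0), samples.mean())
```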
Warmed up: more dimensions!
Extensions of classical approaches
- multi-variate standard normal $p(z) = \mathcal{N}(z;\, 0, I_D)$
- can be expressed as a product of 1-D normal distributions: $\mathcal{N}(z;\, 0, I_D) = \prod_{j=1}^D \mathcal{N}(z_j;\, 0, 1)$
- sampling in $D$ dimensions boils down to $D$ samplings in 1-D
- multi-variate Gaussian distribution with mean $\mu$, covariance $\Sigma$
- to sample, we compute the eigendecomposition $\Sigma = U \Lambda U^T$ (for orthonormal $U$ and diagonal $\Lambda$ with the eigenvalues) and do the following:
- invert $U$ (the inverse is the transpose for orthogonal matrices)
- then $\Sigma^{-1} = U \Lambda^{-1} U^T$
- define new coordinates $z = \Lambda^{-1/2} U^T (x - \mu)$
- then $z \sim \mathcal{N}(0, I_D)$
- we can sample $D$ 1-D standard normal values, put them in the vector $z$, and invert the transformation: $x = \mu + U \Lambda^{1/2} z$
- $x = \mu + U \Lambda^{1/2} z$: reparametrization trick (from an arbitrary Gaussian to the std. normal)
- in our language: $f(x) = \Lambda^{-1/2} U^T (x - \mu)$, $f^{-1}(z) = \mu + U \Lambda^{1/2} z$
- similar for other analytic multi-variate distributions (`scipy.stats` offers ~16)
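A numpy sketch of this sampling recipe with an arbitrary toy $\mu$ and $\Sigma$ (both directions: reparametrization trick and the inverse map back to standard-normal codes):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# eigendecomposition Sigma = U diag(lam) U^T  (U orthonormal, lam the eigenvalues)
lam, U = np.linalg.eigh(Sigma)

# generative direction (reparametrization trick): z ~ N(0, I), x = mu + U lam^{1/2} z
z = rng.standard_normal((100_000, 2))
x = mu + (z * np.sqrt(lam)) @ U.T

# inference direction: z = lam^{-1/2} U^T (x - mu) should be standard normal again
z_back = ((x - mu) @ U) / np.sqrt(lam)

print(np.cov(x.T))        # ~ Sigma
print(np.cov(z_back.T))   # ~ identity
```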
What if $p^*(x)$ must be learned?
Mixture model idea naturally generalizes to $D$ dimensions:
- histograms are defined by a regular grid with $L$ levels per dimension $\Rightarrow$ $L^D$ bins (exponential, does not scale)
- solution: define bins by recursive subdivision (e.g. density tree)
- quality of the models is only medium – each subdivision doubles the number of bins but looks at only one variable, which is bad if you have 100s of variables… the number of correlations that can be considered is bounded by the tree depth, which must stay moderate (each leaf still needs enough training samples)
- for Gaussians, we have to learn covariance matrices (instead of variances) $\Rightarrow$ change the EM algorithm accordingly
- kernel density estimation is also unchanged, but finding a bandwidth that works equally well everywhere is very difficult
- (2.) and (3.) become more expensive for growing $D$
- sampling is simple: first sample a mixture component $k \sim \pi$, then sample $x \sim p_k(x)$
Reducing to many 1-D distributions
- naive: $p(x) = \prod_{j=1}^D p_j(x_j)$, learn a 1-D model for each coordinate
- assumes variable independence, we lose all correlations (highly unrealistic)
- exact: auto-regressive model – decompose by the Bayesian chain rule: $p(x) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_D \mid x_1, \ldots, x_{D-1})$
- any ordering of the chain rule also works (variable order is interchangeable)
- each $p(x_j \mid x_1, \ldots, x_{j-1})$ is a collection of 1-D distributions (one distribution per value of the conditioning variables $x_1, \ldots, x_{j-1}$)
- use the conditional inverse transform method
- then $x_j = f_j^{-1}(z_j;\, x_1, \ldots, x_{j-1})$, which is auto-regressive (values rely on the previous ones)
- problem: $f_j$ is a different 1-D function for each value of $x_1, \ldots, x_{j-1}$
- e.g. if $x$ is defined on a regular grid with $L$ levels per dimension, then we have $L^{j-1}$ different conditioning values $\Rightarrow$ does not scale
- general solution: learn $f_j$ with a neural network
- generalizes from the few values of $x_1, \ldots, x_{j-1}$ seen in the TS to all possible values
- simpler solution: if $x_j \perp x_k$ (independent) then $x_k$ can be dropped from the condition
- also, if $x_j$ is conditionally independent of $x_k$ given the remaining predecessors, then $x_k$ can be dropped as well
- repeat with more complicated conditions, dropping as many as possible
- example: Markov chain: $p(x_j \mid x_1, \ldots, x_{j-1}) = p(x_j \mid x_{j-1})$ – the DAG becomes a chain (see the sampling sketch after this list)
- typical if $j$ is a timestep (the future only depends on the present, not the past)
- easy to learn
- example: causal model – keep only the causal parents of $x_j$ in the condition
- e.g. both earthquake and burglar cause the alarm (not the other way lmao)
- causal DAG has the fewest edges (i.e. easier to learn) BUT finding it is very hard
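A small numpy sketch of ancestral sampling under the chain-rule factorization, using a toy Gaussian Markov chain (the AR coefficients are arbitrary choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 50           # number of dimensions / timesteps
a, s = 0.9, 0.5  # toy AR(1) parameters (arbitrary choices)

# ancestral sampling along the chain-rule factorization
# p(x) = p(x_1) * prod_j p(x_j | x_{j-1})   (Markov assumption)
def sample_chain(n):
    x = np.zeros((n, D))
    x[:, 0] = rng.normal(0.0, 1.0, n)                        # p(x_1) = N(0, 1)
    for j in range(1, D):
        # conditional inverse transform: x_j = a * x_{j-1} + s * z_j with z_j ~ N(0, 1)
        x[:, j] = a * x[:, j - 1] + s * rng.normal(0.0, 1.0, n)
    return x

X = sample_chain(10_000)
# neighbouring dimensions are strongly correlated, distant ones much less
print(np.corrcoef(X[:, 0], X[:, 1])[0, 1], np.corrcoef(X[:, 0], X[:, -1])[0, 1])
```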
Measuring model quality
- we can measure a “distance” between $p^*(x)$ and $p_\theta(x)$ (even if we don’t know $p^*$) to:
- use as objective function for training
- use for validation after training
Forward KL divergence
$\mathrm{KL}(p^* \,\|\, p_\theta) = \mathbb{E}_{x \sim p^*}\!\left[\log \frac{p^*(x)}{p_\theta(x)}\right] = \int p^*(x)\, \log \frac{p^*(x)}{p_\theta(x)}\, dx$
A few useful properties:
It’s a divergence and not a distance (i.e. not a metric) – it is not symmetric and the triangle inequality doesn’t hold.
There was an example here with a discrete distribution.
Caveat: if there are datapoints $x$ in $TS$ s.t. $p^*(x) > 0$ but $p_\theta(x) = 0$, then $\mathrm{KL}(p^* \,\|\, p_\theta) = \infty$
- can’t be used as a training gradient (since it’s infinity)
- use model families s.t. $p_\theta(x) > 0$ everywhere
Relationship between forward KL and maximum likelihood training: $\mathrm{KL}(p^* \,\|\, p_\theta) = \mathbb{E}_{x \sim p^*}[\log p^*(x)] - \mathbb{E}_{x \sim p^*}[\log p_\theta(x)] = -\mathbb{H}[p^*] - \mathbb{E}_{x \sim p^*}[\log p_\theta(x)]$
- the entropy $\mathbb{H}[p^*]$ is independent of $\theta$ and can be dropped, so we get the optimization problem $\hat\theta = \arg\min_\theta\, \mathbb{E}_{x \sim p^*}[-\log p_\theta(x)]$
- minimize KL $\Leftrightarrow$ minimize the NLL $\Leftrightarrow$ maximize $\log p_\theta(x)$ for $x \in TS$
Given $TS = \{x_i\}_{i=1}^N$, $\mathbb{E}_{x \sim p^*}[-\log p_\theta(x)] \approx \frac{1}{N}\sum_{i=1}^N -\log p_\theta(x_i)$
- this does not contain $p^*$ $\Rightarrow$ we can optimize without knowing $p^*$ (samples are sufficient)
- but we need to calculate $p_\theta(x_i)$ during training $\Rightarrow$ the model must be capable of doing “inference”
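A small sketch of this maximum-likelihood view: the empirical NLL below is a Monte Carlo estimate of the forward KL up to the (unknown, constant) entropy of $p^*$; the single-Gaussian model family and the toy data are arbitrary stand-ins for $p_\theta$ and $TS$:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
TS = rng.normal(2.0, 0.7, 1_000)          # samples from the unknown p*(x)

# model family p_theta = N(mu, sigma^2); theta = (mu, log_sigma)
def nll(theta):
    mu, log_sigma = theta
    # empirical NLL = Monte Carlo estimate of E_{x~p*}[-log p_theta(x)]
    # (= forward KL up to the constant entropy of p*)
    return -norm.logpdf(TS, loc=mu, scale=np.exp(log_sigma)).mean()

theta_hat = minimize(nll, x0=np.array([0.0, 0.0])).x
print(theta_hat[0], np.exp(theta_hat[1]))  # ~ (2.0, 0.7)
```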
Reverse KL divergence
$\mathrm{KL}(p_\theta \,\|\, p^*) = \mathbb{E}_{x \sim p_\theta}\!\left[\log \frac{p_\theta(x)}{p^*(x)}\right]$
- empirical approximation: iterate $t = 1, 2, \ldots$:
- current guess $p_{\theta_t}$: draw a batch $\{x_b\}_{b=1}^B \sim p_{\theta_t}$
- $\mathrm{KL}(p_{\theta_t} \,\|\, p^*) \approx \frac{1}{B}\sum_{b=1}^B \left[\log p_{\theta_t}(x_b) - \log p^*(x_b)\right]$
- we need to know $p^*(x)$ … cannot be used in many applications
- useful when we know the distribution but it’s intractable (ex. Gibbs distribution)
Comparison (forward x reverse)
- forward KL tends to smooth out models of $p^*$ (cover all its modes) – mode-covering
- reverse KL tends to focus on single (or few) highest modes – mode-seeking/collapse
Maximum Mean Discrepancy
The idea is to use the kernel trick
- without kernel:
- define a feature mapping $\varphi(x)$
- two data sets: $X = \{x_i\}_{i=1}^N \sim p^*$, $Y = \{y_j\}_{j=1}^M \sim p_\theta$
- calculate $\mathrm{MMD}(X, Y) = \left\| \frac{1}{N}\sum_i \varphi(x_i) - \frac{1}{M}\sum_j \varphi(y_j) \right\|$
- ($\mathrm{MMD}(X, Y) \to 0$ when $p_\theta = p^*$)
- MMD is a metric
- witness function $w$ – certifies that $p^*$ and $p_\theta$ are different when the difference of its expectations under the two distributions is big
- the function family must be chosen s.t. we can’t cheat (i.e. arbitrarily add/multiply to increase the difference – e.g. conditions on the variance)
- now replace the explicit mapping $\varphi$ with a kernel function $k(x, x')$: $\mathrm{MMD}^2(X, Y) = \frac{1}{N^2}\sum_{i,i'} k(x_i, x_{i'}) - \frac{2}{NM}\sum_{i,j} k(x_i, y_j) + \frac{1}{M^2}\sum_{j,j'} k(y_j, y_{j'})$
Typical kernels (for bandwidth hyperparameter $\sigma$):
- squared exponential: $k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$
- inverse multi-quadratic
- multi-scale squared exponential
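A minimal numpy estimator of $\mathrm{MMD}^2$ with the squared exponential kernel (the bandwidth and the toy data sets are arbitrary choices):

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased MMD^2 estimate with kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() - 2.0 * k(X, Y).mean() + k(Y, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (500, 2))          # "real" samples
Y_good = rng.normal(0.0, 1.0, (500, 2))     # model samples from the same distribution
Y_bad = rng.normal(1.0, 1.0, (500, 2))      # model samples from a shifted distribution

print(mmd2(X, Y_good), mmd2(X, Y_bad))      # small vs. clearly larger
```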
Generative Modeling with NN
- relationship between generative modeling and compression: both essentially encode data into codes (and decode back)
- lossless compression (e.g. ZIP): the data is reconstructed exactly
- idea: use short codes for frequent symbols and longer codes for rarer symbols
- lossy compression (e.g. JPEG): the data is only reconstructed approximately
- idea: decompose into smaller parts, drop the unimportant stuff, use lossless compression for the important stuff
Here we have 3 conflicting goals:
- small codes: $\dim(z) \ll \dim(x)$
- accurate reconstruction: $d(x, \mathrm{decode}(\mathrm{encode}(x)))$ small (for some distance function $d$)
- preserve the data distribution: decoded samples should follow $p^*(x)$ (measured by some distribution distance)
Note that (shown during lecture).
Autoencoder
- learned compression: the encoder $f$ (with code $z = f(x)$) and the decoder $g$ (with reconstruction $\hat x = g(z)$) are neural networks
- lossy compression because usually $\dim(z) \ll \dim(x)$
- the “bottleneck” dimension $\dim(z)$ is a hyperparameter
- train by minimizing the reconstruction error $\frac{1}{N}\sum_i d(x_i, g(f(x_i)))$
- distance functions for the loss:
- the squared loss $\|x - \hat x\|_2^2$ is most common
- the $L_1$ loss $\|x - \hat x\|_1$ tends to be less blurry for images (better preserves small details)
- multi-resolution loss: compute an image pyramid (apply the loss at different image sizes)
- denoising autoencoder: add more noise to the data to teach the network what noise is
- better, deep mathematical properties later
- autoencoder is not a generative model (according to our definition):
- we haven’t learned the distribution – no generative capability
- no inference, i.e. no way to calculate $p(x)$
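A minimal PyTorch sketch of such an autoencoder with a squared reconstruction loss (layer sizes and the bottleneck dimension are arbitrary, and the batch is a random placeholder standing in for real data):

```python
import torch
import torch.nn as nn

D, d = 784, 32          # data dimension (e.g. flattened MNIST) and bottleneck size (hyperparameter)

encoder = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, d))
decoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, D))

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def train_step(x):                       # x: batch of shape (B, D)
    z = encoder(x)                       # lossy code, dim(z) << dim(x)
    x_hat = decoder(z)
    loss = ((x - x_hat) ** 2).mean()     # squared reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

x = torch.rand(64, D)                    # placeholder batch standing in for real data
print(train_step(x))
```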
Generating data
- ex-post learning of $p(z)$ by a second generative model (after the autoencoder is trained)
- often simpler than learning $p(x)$ directly (the code space is lower-dimensional and better behaved)
- Stable Diffusion (image generation) does this
- joint optimization:
- predefine $p(z)$ (and the desired code dimension)
- measure the mismatch between the code distribution and $p(z)$ with MMD $\Rightarrow$ add it as a new loss term
- choosing a kernel is another hyperparameter
- paper: a variant of InfoVAE [2019]
- variational autoencoder (VAE) [2014]
- idea: replace the deterministic functions $f$ and $g$ with conditional distributions
- encoder: $q_\phi(z \mid x)$, decoder: $p_\theta(x \mid z)$
- we also have the data distribution $p^*(x)$ and the desired code dist. $p(z)$
- this implies two versions of the joint distribution of $x$ and $z$:
- encoder: $q(x, z) = q_\phi(z \mid x)\, p^*(x)$ (Bayes)
- decoder: $p(x, z) = p_\theta(x \mid z)\, p(z)$ (also Bayes)
- two requirements:
- the decoder marginal should reproduce the data: $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz \approx p^*(x)$
- the encoder-decoder pair must be self-consistent: $q(x, z) = p(x, z)$
- here we will use the loss: $\mathcal{L} = \mathbb{E}_{x \sim p^*}\!\left[\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) - \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]\right]$ (the negative ELBO)
- trade-off between two objectives:
- the reconstruction error (second term) is minimized if encoder & decoder are deterministic with perfect reconstruction
- the first term is minimized if $q_\phi(z \mid x) = p(z)$ – the encoder ignores the data, which is the opposite of perfect reconstruction
- minimizing this loss (equivalently, maximizing the ELBO) enforces (2), i.e. self-consistency (proved during the lecture)
- in the literature, the ELBO is usually maximized
- conceptually, minimizing the negative ELBO is simpler
- maximizing the ELBO also indirectly maximizes the data likelihood under the model – ML principle: the TS should be a typical model outcome
- minimize the expected negative log-likelihood (NLL) (proved during the lecture)
- the negative ELBO is an upper bound for the NLL loss
- most common implementation: encoder and decoder are diagonal Gaussians (see the sketch at the end of this section):
- since we only have diagonal Gaussians, we can’t rotate the ellipses
- since we only have Gaussians, the shape is restricted to circles
- if $p(z) = \mathcal{N}(0, I)$ then the KL term can be calculated analytically
- if the decoder is a Gaussian with fixed variance $\sigma^2$, the reconstruction error becomes a squared loss (scaled by $\frac{1}{2\sigma^2}$)
- a large $\sigma$ downscales the squared loss $\Rightarrow$ reconstruction error unimportant
- a small $\sigma$ upscales the squared loss $\Rightarrow$ reconstruction error dominant
- generation: sample $z \sim p(z)$, then $x \sim p_\theta(x \mid z)$ (or take the decoder mean)
- inference: if the model is self-consistent, then $p_\theta(x) = \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}$ must give the same value for all $z$
- conditional VAE
- instead of learning $p(x)$, we learn $p(x \mid y)$ for some variable $y$
- e.g. $y$ is the MNIST label
- if encoder & decoder are Gaussians, add $y$ as an input to both networks
- we can supply different labels for encoding / decoding – style transfer
- one digit in style of another / one image in the style of another
- here we can do generative classification: given a test $x$ with unknown label, try encoding with every label $y$ and calculate the corresponding $p(x \mid y)$, returning the label with maximum probability
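A minimal PyTorch sketch of the diagonal-Gaussian VAE objective described above, with the reparametrization trick and the analytic KL to a standard-normal prior; layer sizes, the bottleneck dimension and the fixed decoder std `sigma_x` are arbitrary illustration choices, and the batch is a random placeholder:

```python
import torch
import torch.nn as nn

D, d = 784, 16
enc = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, 2 * d))   # outputs (mu, log sigma^2)
dec = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, D))
sigma_x = 0.1          # fixed decoder std; scales the squared reconstruction loss

def vae_loss(x):
    mu, logvar = enc(x).chunk(2, dim=-1)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)     # reparametrization trick
    x_hat = dec(z)
    # analytic KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over code dimensions
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1.0).sum(-1)
    # Gaussian decoder with fixed variance -> scaled squared reconstruction error
    rec = ((x - x_hat) ** 2).sum(-1) / (2 * sigma_x ** 2)
    return (kl + rec).mean()

x = torch.rand(64, D)                  # placeholder batch
print(vae_loss(x).item())
```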
Generative Adversarial Networks (GANs)
- dominant generative model 2014-2019/20
- now diffusion models and transformers are better
- idea: learn quality function (instead of hand-crafted formula like MMD)
- new NN “discriminator/critic” – a classifier of real vs. fake samples
- train the decoder (“generator”) jointly with the critic
- training becomes a game:
- critic becomes better at distinguishing reals and fakes
- decoder becomes better at fooling the critic
- training objective: $\min_G \max_D\; \mathbb{E}_{x \sim p^*}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$
- at optimal convergence, the generated distribution equals the data distribution (~proven during the lecture)
- assumes that we have a perfect critic and proves the claim from there
- in practice, we don’t have a perfect critic (and it wouldn’t work anyway: for a bad decoder, recognizing fakes is easy, so the gradient will be ~zero and we won’t train anything)
- instead, train the decoder $G$ and the critic $D$ jointly (both random initially) via alternating optimization – one step for $G$, one/more steps for $D$
- also use the non-saturating loss – replace $\log(1 - D(G(z)))$ with $-\log D(G(z))$ (see the training-loop sketch at the end of this section)
- the global optimum is preserved, and it should work better for training
- see the graphs for $\log(1 - D)$ vs. $-\log D$
- GANs defined state-of-the-art up to 2019/20:
- WassersteinGAN: uses alternative loss for critic (not really better)
- CycleGAN: replace $p(z)$ with the true distribution of another variant of the real data
- e.g. the dist. of daylight photos and the dist. of night photos
- two cases:
- paired dataset (the same instances in both variants) – supervised learning
- unpaired dataset (no overlap between instances)
- we have new “cycle losses” – mapping to the other domain and back should reconstruct the input (in both directions)
- for paired data, we additionally have direct supervised losses in both directions
- we still need a critic, otherwise both mappings could just be the identity (must combine with the cycle loss)
- as opposed to a GAN, we have no bottleneck – both domains have ~the same dimension
- InvertibleGAN: add an encoder to original GAN
- recreating codes and distinguishing reals from fakes can use the same image features $\Rightarrow$ the same network to encode and decode
- many more additional losses (similar to CycleGAN)
- hyperparameter optimization is hard
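A minimal PyTorch sketch of alternating GAN training with the non-saturating generator loss mentioned above; network sizes, learning rates and the "real" data are toy placeholders:

```python
import torch
import torch.nn as nn

D, d = 2, 8                                   # data dim and code dim (toy sizes)
G = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, D))   # generator / decoder
C = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, 1))   # critic (outputs a logit)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(C.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(x_real):
    B = x_real.shape[0]
    # 1) critic step: real -> label 1, fake -> label 0
    x_fake = G(torch.randn(B, d)).detach()
    loss_c = bce(C(x_real), torch.ones(B, 1)) + bce(C(x_fake), torch.zeros(B, 1))
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    # 2) generator step with the non-saturating loss: minimize -log D(G(z))
    x_fake = G(torch.randn(B, d))
    loss_g = bce(C(x_fake), torch.ones(B, 1))   # -log D(G(z)) instead of log(1 - D(G(z)))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_c.item(), loss_g.item()

x_real = torch.randn(128, D) * 0.5 + 1.0        # placeholder "real" data
print(train_step(x_real))
```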
Normalizing Flows (NF)
- one of the major recent approaches
| Goals | Autoencoder | VAE | GAN | NF |
|---|---|---|---|---|
| small codes | hyperp. | hyperp. | hyperp. | lossless |
| accurate distribution | bad (doesn’t care) | trade-off | good | good |
| good reconstruction | good | trade-off | can’t | good |
Idea: generalize the inverse transformation method to arbitrary dimensions:
- $p^*(x)$ a high-dimensional density, $R \subseteq \mathbb{R}^D$ a region
- define the probability mass $\Pr(x \in R) = \int_R p^*(x)\,dx$
- let $f: \mathbb{R}^D \to \mathbb{R}^D$ be an invertible encoding, $z = f(x)$
- the image of $R$ in $z$-space: $f(R) = \{f(x) : x \in R\}$
- we want consistency: $\int_R p^*(x)\,dx = \int_{f(R)} p(z)\,dz$ for all regions $R$
- apply the multi-dimensional substitution rule: $\int_{f(R)} p(z)\,dz = \int_R p(f(x))\,\left|\det J_f(x)\right| dx$ for $J_f(x)$ the Jacobian (matrix of partial derivatives) of $f$
Since consistency must hold for any $R$, the integrands must be equal, and we get the multi-variate change-of-variables formula $p^*(x) = p(f(x))\,\left|\det J_f(x)\right|$
Goal:
- pre-define $p(z)$, e.g. $p(z) = \mathcal{N}(0, I_D)$
- learn $f$ s.t. $p(f(x))\,\left|\det J_f(x)\right| \approx p^*(x)$ ($f$ is a neural network)
Recall the 1-D case:
- for learning, $p(z) = \mathcal{N}(0, 1)$ is better than the uniform:
- it has non-zero gradients for gradient descent
- it is supported on all of $\mathbb{R}$ (unlike the uniform – what to do with points outside?)
Two difficult problems in practice, covered in the next sections:
- how to ensure that $f$ is invertible & how to efficiently compute $f^{-1}$
- how to (efficiently) calculate $\det J_f$
(2) Calculating $\det J_f$
- the 2-D case is easy: $\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$
- for higher dimensions, we can calculate determinants recursively (Laplace expansion), but the effort grows exponentially
- general solution with the SVD:
- effort of $\mathcal{O}(D^3)$, which is better than exponential
- we can also exploit the special case where $J_f$ is triangular (and the diagonal is non-zero, otherwise the determinant is zero), in which case the determinant is just the product of the diagonal elements
- we already talked about an instance of this: auto-regressive models – since each output relies only on the previous variables, the Jacobian is triangular and the determinant is very easy
If $f$ is a multi-layer network, $f$ is a composition of functions $f = f_L \circ \cdots \circ f_1$
- the Jacobian of a composition is the product of all the Jacobians (consequence of the chain rule): $J_f = J_{f_L} \cdots J_{f_1}$
- the determinant of a product is the product of the determinants (consequence of linear algebra)
- the determinant of a multi-layer network is easy when the layer determinants are easy
- popular architecture is to define all as auto-regressive functions
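A quick numeric check of the two facts used above, with random triangular matrices standing in for layer Jacobians:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5

# two random lower-triangular "layer Jacobians" with positive diagonal
J1 = np.tril(rng.normal(size=(D, D)))
np.fill_diagonal(J1, np.abs(np.diag(J1)) + 0.1)
J2 = np.tril(rng.normal(size=(D, D)))
np.fill_diagonal(J2, np.abs(np.diag(J2)) + 0.1)

# det of a triangular matrix = product of the diagonal entries
print(np.linalg.det(J1), np.prod(np.diag(J1)))

# Jacobian of a composition = product of Jacobians  =>  det factorizes over layers
print(np.linalg.det(J2 @ J1), np.linalg.det(J1) * np.linalg.det(J2))

# in practice one sums log|diag| per layer for numerical stability
print(np.log(np.abs(np.linalg.det(J2 @ J1))),
      np.sum(np.log(np.abs(np.diag(J1)))) + np.sum(np.log(np.abs(np.diag(J2)))))
```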
(1) Ensuring that $f$ is invertible and $f^{-1}$ is computable
- trick: choose $f$ auto-regressive and easy to invert
- major architecture: the coupling-layer network “Real NVP” (non-volume-preserving, because the determinants are not $1$)
- an auto-regressive model
- in each layer, change only half of the dimensions of $x$
- pass the remaining dimensions on unchanged (“skip connection”)
- no surprise, since it’s a variant of the auto-regressive model
- simplest invertible functions – affine functions: split $x = (x_1, x_2)$ and set $z_1 = x_1$, $z_2 = s(x_1) \odot x_2 + t(x_1)$
- $s$ (scale) and $t$ (translation) are neural networks that we train on the unchanged half $x_1$ (same for both)
TODO: add the drawing here
- for the layer to be invertible, we need $s(x_1) \neq 0$
- we actually learn $\tilde s(x_1)$ and set $s = \exp(\tilde s)$
- $\exp$ also takes care of the sign of the Jacobian determinant – no abs needed!
- in practice, it is numerically more stable to set
Summary
- forward: $z_1 = x_1$, $z_2 = \exp(\tilde s(x_1)) \odot x_2 + t(x_1)$
- inverse: $x_1 = z_1$, $x_2 = \big(z_2 - t(z_1)\big) \oslash \exp(\tilde s(z_1))$ (element-wise division)
- an invertible layer constructed from non-invertible $\tilde s$ and $t$
- choose $\tilde s$ and $t$ according to the architecture (fully-connected, convolutional, etc.)
- determinant of an affine coupling layer: $\det J = \prod_j \exp\big(\tilde s_j(x_1)\big) = \exp\Big(\sum_j \tilde s_j(x_1)\Big)$
RealNVP
- an invertible NN with auto-regressive form, i.e. $\det J$ is easy
- after each coupling layer apply an orthonormal transformation $Q$, e.g. a permutation (so that a different half of the dimensions gets changed in the next layer)
- it has $\det Q = \pm 1$ and $Q^{-1} = Q^T$, so it is easy to work with
- it turned out experimentally that learning $Q$ is not necessary, no one knows why
- fixed (random) matrices are sufficient / this may change with new learning algorithms
- final architecture: alternate coupling layers and orthonormal transformations, $f = Q_L \circ c_L \circ \cdots \circ Q_1 \circ c_1$
- training algorithm: minimize the negative log-likelihood of the data (derivation in the slides): $\mathcal{L}_{\mathrm{NLL}} = \frac{1}{N}\sum_i \Big[-\log p\big(f(x_i)\big) - \log\big|\det J_f(x_i)\big|\Big]$
- the slides here explain why affine coupling works
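A minimal PyTorch sketch of a single affine coupling block and the flow NLL for a standard-normal latent; the dimension, the subnetwork sizes and the use of only one block are illustration choices (a real RealNVP stacks several blocks with permutations in between):

```python
import torch
import torch.nn as nn

D = 4                                        # data dimension (even, so it splits in half)
half = D // 2

class AffineCoupling(nn.Module):
    def __init__(self):
        super().__init__()
        # s_tilde and t are ordinary (non-invertible) networks acting on the untouched half
        self.net = nn.Sequential(nn.Linear(half, 64), nn.ReLU(), nn.Linear(64, 2 * half))

    def forward(self, x):                    # returns z and log|det J|
        x1, x2 = x[:, :half], x[:, half:]
        s_tilde, t = self.net(x1).chunk(2, dim=-1)
        z2 = torch.exp(s_tilde) * x2 + t     # affine transform of the second half
        logdet = s_tilde.sum(-1)             # det J = prod exp(s_tilde) => log det = sum s_tilde
        return torch.cat([x1, z2], dim=-1), logdet

    def inverse(self, z):
        z1, z2 = z[:, :half], z[:, half:]
        s_tilde, t = self.net(z1).chunk(2, dim=-1)
        x2 = (z2 - t) * torch.exp(-s_tilde)
        return torch.cat([z1, x2], dim=-1)

layer = AffineCoupling()
x = torch.randn(8, D)
z, logdet = layer(x)
print((layer.inverse(z) - x).abs().max())    # ~ 0: the layer really is invertible

# NLL for p(z) = N(0, I):  -log p(x) = 0.5 ||f(x)||^2 - log|det J| + const
nll = (0.5 * z.pow(2).sum(-1) - logdet).mean()
print(nll)
```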
Spline coupling
- better than affine coupling for low dimensions (small $D$)
- uses Hermite splines, which require the locations of the knots, the function values at the knots and the derivatives at the knots (with interpolation in between)
- most popular: rational quadratic spline [Neural Spline Flows, 2019]
Conditional distributions
- e.g. $y$ is the digit label, $x$ is the MNIST image
- $p(x)$: sample any digit
- $p(x \mid y = 3)$: sample only 3s
- typical setup for supervised learning:
- traditional networks give point estimates (regression or classification)
- conditional NFs give a distribution over the estimate $\Rightarrow$ uncertainty of the prediction
- an auto-regressive function is easy to generalize to the conditional case:
- the condition $y$ can simply be added as an input to all nested networks
- this works because $y$ is known for both the forward and the backward network execution
- if the condition is complicated (e.g. a high-dimensional image), we use a shared preprocessing network (feature detector / summary network), which can
- use the architecture of an existing regression network minus the last layer
- use a foundational model trained by the big guys on big data
Simulation-based inference (SBI)
- setting:
- $x$: observables, i.e. variables we can measure
- $\theta$: hidden properties, i.e. variables we’d like to know but can’t measure
- assumptions:
- the hidden variables are more fundamental, e.g. $x$ is caused by $\theta$
- we have a scientific theory of how the $x$ arise from the $\theta$ (forward process)
- the theory is implemented as an algorithm $\Rightarrow$ computer simulation
- we can do “in-silico experiments” (as opposed to “in-vivo” and “in-vitro”)
- three types of variables:
- $\theta$: inputs to the simulation (pretend we know the hidden state)
- $x$: outputs
- optionally $\eta$: random variables for a non-deterministic simulation
- deterministic: $x = g(\theta)$ (the simulation program $g$)
- non-deterministic: $x = g(\theta, \eta)$ – “noise outsourcing”
- simulation paradigm: if we knew $\theta$, we could predict $x$
- since we don’t know $\theta$, try multiple $\theta$s and generate alternative scenarios
- during covid: make various assumptions about the virus and about prevention measures ($\theta$) and simulate what happens ($x$), getting various scenarios
- it’s hard to select a good set of $\theta$s – SBI improves upon this
- two important special cases of the structure of $g$:
- mixed effects model
- sample shared global parameters once, then for each individual $i$ sample its individual parameters and the observation $x_i$
- look at the group first, then differentiate for each individual
- the $x_i$ are not independent, but they are conditionally independent given the global parameters
- dynamical systems (time-dependent behavior)
- for timesteps $t$: $x_t = g(x_{t-1}, \theta, \eta_t)$
- if $t$ is continuous, we get a stochastic differential equation
- if $t$ is discrete, we get a hidden Markov model
- the $x_t$ are not independent, but $x_{t+1}$ only depends on $x_t$ (if the Markov property is fulfilled)
Main tasks of SBI
- surrogate modelling: train a model that emulates the simulation $g$
- good for speed-up (since the simulation is often slow)
- forward inference: often, $g$ only defines the likelihood $p(x \mid \theta)$ implicitly, but doesn’t allow us to calculate it (“likelihood-free inference”, “implicit likelihood”)
- approximate the true likelihood by a surrogate model
- inverse inference: run the simulation backwards: $x \to \theta$
- usually intractable (no analytic solution) and/or ill-posed (no unique inverse)
- the traditional solution would be to pick constraints & regularization that select one solution
- SBI solution: pick the probabilistic inverse via Bayes’ rule: $p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$
- define equivalence classes
- all $\theta$s that could have produced the given $x$
- the posterior assigns a “possibility” to every $\theta$ in the class
- problems:
- if the likelihood is only implicitly defined, then Bayes’ rule cannot be calculated ($\Rightarrow$ use the surrogate model above)
- even if the likelihood (or a surrogate) is known, Bayes’ rule is usually intractable (the evidence $p(x)$ requires a high-dimensional integral)
- $\Rightarrow$ learn a generative model for the posterior $p(\theta \mid x)$
- model misspecification & outlier detection – a simulation is not reality:
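A small numpy sketch of the setting above with noise outsourcing – the prior and the `simulator` function are invented placeholders, not the lecture’s examples; the point is only that running the simulation on prior draws yields $(\theta, x)$ pairs from the joint $p(x \mid \theta)\,p(\theta)$, which could later train e.g. a conditional generative model of the posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, eta):
    """Toy non-deterministic forward process x = g(theta, eta) ("noise outsourcing")."""
    # placeholder mechanism: a noisy nonlinear observation of the hidden parameters
    return np.sin(theta[..., :1]) + theta[..., 1:] ** 2 + 0.1 * eta

# prior over the hidden properties theta (placeholder choice)
n = 10_000
theta = rng.uniform(-2.0, 2.0, size=(n, 2))
eta = rng.standard_normal((n, 1))
x = simulator(theta, eta)                 # "in-silico experiments"

# the (theta, x) pairs are samples from the joint p(theta, x) = p(x | theta) p(theta);
# a conditional generative model trained on them approximates the posterior p(theta | x)
print(theta.shape, x.shape)
```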