Introduction to Generative Modelling
In neural computation, an image is treated as a high-dimensional random variable $\mathbf{x}$. There exists a "complicated" underlying distribution of real images, $q(\mathbf{x})$, which we want to understand.
The goal of a Generative Model is to use a neural network to approximate this distribution: $p_\theta(\mathbf{x}) \approx q(\mathbf{x})$.
Once trained, we can "sample" from $p_\theta(\mathbf{x})$ to create new images that look like they belong to the original dataset. Currently, diffusion models are considered the state-of-the-art, having outperformed Generative Adversarial Networks (GANs) in image synthesis.
Denoising Diffusion Probabilistic Models (DDPM)
A DDPM operates via two distinct processes: Forward Diffusion and Reverse Denoising.
A. The Forward Diffusion (Fixed)
This process takes a clean data sample $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ and gradually "diffuses" it by adding random Gaussian noise over $T$ time steps.
- Transition Step: Each step adds a small amount of noise governed by a variance schedule $\beta_t$: $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$
- The Shortcut (Diffusion Kernel): We do not need to calculate every step. We can sample $\mathbf{x}_t$ directly from the starting image $\mathbf{x}_0$.
- Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$
- Formula: $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$
- Sampling Equation: $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
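The shortcut above can be sketched numerically. This is a minimal illustration, assuming the linear $\beta_t$ schedule ($10^{-4}$ to $0.02$ over $T = 1000$ steps) used as a default in the DDPM paper; the "image" is a random stand-in array.

```python
import numpy as np

# Forward diffusion "shortcut": sample x_t directly from x_0 via q(x_t | x_0),
# without simulating the t intermediate transition steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # variance schedule beta_t (assumed linear)
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

x0 = np.random.randn(3, 32, 32)      # stand-in "clean image"
eps = np.random.randn(*x0.shape)     # eps ~ N(0, I)
x_t = q_sample(x0, t=500, eps=eps)
```

Note that at $t = T$ the coefficient $\sqrt{\bar{\alpha}_T}$ is nearly zero, so $\mathbf{x}_T$ is almost pure noise, which is exactly what the reverse process will start from.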
B. The Reverse Denoising Process (Generative)
This is the βgenerativeβ part where the model learns to undo the noise.
- It starts with pure noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iteratively produces a slightly cleaner image $\mathbf{x}_{t-1}$ from a noisy image $\mathbf{x}_t$
- The model learns the transition $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$
Model Training and Architecture
The network $\boldsymbol{\epsilon}_\theta$ is trained as a denoising autoencoder, typically implemented as a U-Net.
Network Design
- Architecture: Typically uses a U-Net with ResNet blocks and self-attention layers.
- Input: The noisy image $\mathbf{x}_t$ and the current time step $t$
- Time Representation: $t$ is fed into the model via fully-connected layers so the network knows the current level of noise it is dealing with.
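Before reaching those fully-connected layers, DDPM-style models commonly map the scalar $t$ to a sinusoidal embedding (as in Transformer positional encodings). A sketch of that common choice, with an arbitrary embedding dimension of 128:

```python
import numpy as np

# Sinusoidal embedding of a scalar time step t. The resulting vector is what
# would typically be passed through the fully-connected layers mentioned above.
def timestep_embedding(t, dim=128):
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down to ~1/10000
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(500)   # distinct vectors for distinct time steps
```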
Training Algorithm
The goal is to predict the noise $\boldsymbol{\epsilon}$ that was added to the clean image.
- Pick a clean image $\mathbf{x}_0 \sim q(\mathbf{x}_0)$
- Pick a random time step $t \sim \mathrm{Uniform}\{1, \dots, T\}$
- Pick random noise $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
- Loss Function: Minimise the difference between the actual noise and the predicted noise: $L = \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t\right) \right\|^2$
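One training step can be sketched as follows. Here `eps_theta` is a hypothetical placeholder for the U-Net (it just returns zeros), so the snippet shows the structure of the objective, not a trainable model.

```python
import numpy as np

# One DDPM training step: noise a clean image with the closed-form kernel,
# then measure how well the network recovers the injected noise.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    # Placeholder: a real U-Net would predict the noise from (x_t, t).
    return np.zeros_like(x_t)

def training_loss(x0):
    t = np.random.randint(0, T)                      # random time step
    eps = np.random.randn(*x0.shape)                 # true noise eps ~ N(0, I)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)   # MSE between noises

loss = training_loss(np.random.randn(3, 32, 32))
```

Because the shortcut gives $\mathbf{x}_t$ in closed form, each training step needs only one forward pass of the network, regardless of $t$.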
Sampling (Image Generation)
To generate a new image, the model follows an iterative process starting from pure noise.
The Sampling Algorithm
- Start with $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
- For $t = T, \dots, 1$:
    - Sample $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (if $t > 1$, else $\mathbf{z} = \mathbf{0}$)
    - Calculate the previous (cleaner) step: $\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right) + \sigma_t \mathbf{z}$
- Return $\mathbf{x}_0$
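The loop above can be sketched as follows. `eps_theta` is again a hypothetical zero-returning stand-in for the trained network, and $\sigma_t = \sqrt{\beta_t}$ is assumed (one of the two variance choices in the DDPM paper), so this demonstrates the arithmetic of the update, not real image generation.

```python
import numpy as np

# Ancestral sampling: start from pure noise and iteratively denoise.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(x_t, t):
    return np.zeros_like(x_t)   # placeholder for the trained U-Net

def sample(shape):
    x = np.random.randn(*shape)                            # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        z = np.random.randn(*shape) if t > 0 else np.zeros(shape)
        coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
        # x_{t-1} = (x_t - coef * predicted_noise) / sqrt(alpha_t) + sigma_t * z
        x = (x - coef * eps_theta(x, t)) / np.sqrt(alphas[t]) \
            + np.sqrt(betas[t]) * z
    return x

img = sample((3, 8, 8))
```

The sequential dependence of each $\mathbf{x}_{t-1}$ on $\mathbf{x}_t$ is what makes sampling slow: all $T$ network evaluations must run one after another.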
The Generative Trilemma
Generative models face three competing requirements:
- High-Quality Samples: Sharp, realistic images
- Mode Coverage/Diversity: Capturing all variations in the data
- Fast Sampling: Speed of generation
Tip
Key Insight: Diffusion models excel at High Quality and Diversity but struggle with Fast Sampling because they require hundreds of sequential iterations to produce one image.
Extensions and Latent Diffusion
To overcome limitations like sampling speed, several extensions exist:
Conditional Diffusion
Conditions (such as text or labels) are incorporated into the U-Net:
- Scalar/Class: Encoded as vector embeddings.
- Image: Channel-wise concatenation (used for colorisation or super-resolution).
- Text: Cross-attention mechanism (e.g., Stable Diffusion)
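The cross-attention idea can be sketched as below: image feature tokens act as queries attending over text-embedding tokens. This is a bare illustration with random arrays; the learned query/key/value projection matrices of a real model are omitted for brevity, and the dimensions are arbitrary.

```python
import numpy as np

# Minimal cross-attention: each image token takes a softmax-weighted
# combination of the text tokens, scaled by sqrt(d) for stability.
def cross_attention(img_tokens, txt_tokens):
    d = img_tokens.shape[-1]
    scores = img_tokens @ txt_tokens.T / np.sqrt(d)   # (n_img, n_txt)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over text tokens
    return weights @ txt_tokens                       # attended text features

img_tokens = np.random.randn(64, 32)   # e.g. an 8x8 feature map, 32 channels
txt_tokens = np.random.randn(10, 32)   # e.g. 10 text-embedding tokens
out = cross_attention(img_tokens, txt_tokens)
```

In a U-Net, the attended output would be added back into the image feature map, letting every spatial location consult the text prompt.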
Latent Diffusion Models (Stable Diffusion)
LDMs solve the speed problem by performing the diffusion process in a compressed latent space.
- Encoder $\mathcal{E}$: Compress the pixel-space image $\mathbf{x}$ into a lower-dimensional latent vector $\mathbf{z} = \mathcal{E}(\mathbf{x})$
- Latent Diffusion: The forward and reverse processes are applied only to $\mathbf{z}$.
- Decoder $\mathcal{D}$: Decompress the final denoised latent back into a high-resolution image $\mathbf{x} = \mathcal{D}(\mathbf{z})$.
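A shape-level sketch of this pipeline, with toy down/upsampling functions standing in for the trained VAE encoder and decoder (a 4x spatial reduction is assumed here for illustration; the diffusion steps themselves are omitted):

```python
import numpy as np

# Toy stand-ins for E and D: the point is only that diffusion would run on
# the much smaller z, making every U-Net evaluation far cheaper.
def encode(x):
    # 4x spatial downsampling by average pooling (stand-in for encoder E)
    c, h, w = x.shape
    return x.reshape(c, h // 4, 4, w // 4, 4).mean(axis=(2, 4))

def decode(z):
    # Nearest-neighbour 4x upsampling (stand-in for decoder D)
    return z.repeat(4, axis=1).repeat(4, axis=2)

x = np.random.randn(3, 64, 64)   # pixel-space image
z = encode(x)                    # latent: (3, 16, 16), 16x fewer positions
x_rec = decode(z)                # back to pixel-space resolution (3, 64, 64)
```

Running hundreds of denoising steps on the $16 \times 16$ latent instead of the $64 \times 64$ image is exactly where LDMs recover their speed.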