Introduction to Generative Modelling
In neural computation, an image is treated as a high-dimensional random variable $\mathbf{x}$. There exists a "complicated" underlying distribution of real images, $q(\mathbf{x})$, which we want to understand.
The goal of a Generative Model is to use a neural network to approximate this distribution: $p_\theta(\mathbf{x}) \approx q(\mathbf{x})$.
Once trained, we can "sample" from $p_\theta(\mathbf{x})$ to create new images that look like they belong to the original dataset. Currently, diffusion models are considered the state-of-the-art, having outperformed Generative Adversarial Networks (GANs) in image synthesis.
Denoising Diffusion Probabilistic Models (DDPM)
A DDPM operates via two distinct processes: Forward Diffusion and Reverse Denoising.
A. The Forward Diffusion (Fixed)
This process takes a clean data sample $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ and gradually "diffuses" it by adding random Gaussian noise over $T$ time steps.
- Transition Step: Each step adds a small amount of noise governed by a variance schedule $\beta_t$: $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$
- The Shortcut (Diffusion Kernel): We do not need to calculate every step. We can sample $\mathbf{x}_t$ directly from the starting image $\mathbf{x}_0$.
- Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$
- Formula: $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$
- Sampling Equation: $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
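The shortcut above can be sketched numerically. This is a minimal illustration, assuming the linear $\beta_t$ schedule ($10^{-4}$ to $0.02$ over $T = 1000$ steps) used as a default in the DDPM paper; the "image" is a random stand-in array.

```python
import numpy as np

# Forward diffusion "shortcut": sample x_t directly from x_0 via q(x_t | x_0),
# without simulating the t intermediate transition steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # variance schedule beta_t (assumed linear)
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

x0 = np.random.randn(3, 32, 32)      # stand-in "clean image"
eps = np.random.randn(*x0.shape)     # eps ~ N(0, I)
x_t = q_sample(x0, t=500, eps=eps)
```

Note that at $t = T$ the coefficient $\sqrt{\bar{\alpha}_T}$ is nearly zero, so $\mathbf{x}_T$ is almost pure noise, which is exactly what the reverse process will start from.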
B. The Reverse Denoising Process (Generative)
This is the βgenerativeβ part where the model learns to undo the noise.
- It starts with pure noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iteratively produces a slightly cleaner image $\mathbf{x}_{t-1}$ from a noisy image $\mathbf{x}_t$
- The model learns the transition $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$
Model Training and Architecture
The network $\boldsymbol{\epsilon}_\theta$ is trained as a denoising autoencoder, typically implemented as a U-Net.
Network Design
- Architecture: Typically uses a U-Net with ResNet blocks and self-attention layers.
- Input: The noisy image $\mathbf{x}_t$ and the current time step $t$
- Time Representation: $t$ is fed into the model via fully-connected layers so the network knows the current level of noise it is dealing with.
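Before reaching those fully-connected layers, DDPM-style models commonly map the scalar $t$ to a sinusoidal embedding (as in Transformer positional encodings). A sketch of that common choice, with an arbitrary embedding dimension of 128:

```python
import numpy as np

# Sinusoidal embedding of a scalar time step t. The resulting vector is what
# would typically be passed through the fully-connected layers mentioned above.
def timestep_embedding(t, dim=128):
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down to ~1/10000
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(500)   # distinct vectors for distinct time steps
```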
Training Algorithm
The goal is to predict the noise $\boldsymbol{\epsilon}$ that was added to the clean image.
- Pick a clean image $\mathbf{x}_0 \sim q(\mathbf{x}_0)$
- Pick a random time step $t \sim \mathrm{Uniform}\{1, \dots, T\}$
- Pick random noise $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
- Loss Function: Minimise the difference between the actual noise and the predicted noise: $L = \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t\right) \right\|^2$
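One training step can be sketched as follows. Here `eps_theta` is a hypothetical placeholder for the U-Net (it just returns zeros), so the snippet shows the structure of the objective, not a trainable model.

```python
import numpy as np

# One DDPM training step: noise a clean image with the closed-form kernel,
# then measure how well the network recovers the injected noise.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    # Placeholder: a real U-Net would predict the noise from (x_t, t).
    return np.zeros_like(x_t)

def training_loss(x0):
    t = np.random.randint(0, T)                      # random time step
    eps = np.random.randn(*x0.shape)                 # true noise eps ~ N(0, I)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)   # MSE between noises

loss = training_loss(np.random.randn(3, 32, 32))
```

Because the shortcut gives $\mathbf{x}_t$ in closed form, each training step needs only one forward pass of the network, regardless of $t$.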
Sampling (Image Generation)
To generate a new image, the model follows an iterative process starting from pure noise.
The Sampling Algorithm
- Start with $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
- For $t = T, \dots, 1$:
    - Sample $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (if $t > 1$, else $\mathbf{z} = \mathbf{0}$)
    - Calculate the previous (cleaner) step: $\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right) + \sigma_t \mathbf{z}$
- Return $\mathbf{x}_0$
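The loop above can be sketched as follows. `eps_theta` is again a hypothetical zero-returning stand-in for the trained network, and $\sigma_t = \sqrt{\beta_t}$ is assumed (one of the two variance choices in the DDPM paper), so this demonstrates the arithmetic of the update, not real image generation.

```python
import numpy as np

# Ancestral sampling: start from pure noise and iteratively denoise.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(x_t, t):
    return np.zeros_like(x_t)   # placeholder for the trained U-Net

def sample(shape):
    x = np.random.randn(*shape)                            # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        z = np.random.randn(*shape) if t > 0 else np.zeros(shape)
        coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
        # x_{t-1} = (x_t - coef * predicted_noise) / sqrt(alpha_t) + sigma_t * z
        x = (x - coef * eps_theta(x, t)) / np.sqrt(alphas[t]) \
            + np.sqrt(betas[t]) * z
    return x

img = sample((3, 8, 8))
```

The sequential dependence of each $\mathbf{x}_{t-1}$ on $\mathbf{x}_t$ is what makes sampling slow: all $T$ network evaluations must run one after another.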
The Generative Trilemma
Generative models face three competing requirements:
- High-Quality Samples: Sharp, realistic images
- Mode Coverage/Diversity: Capturing all variations in the data
- Fast Sampling: Speed of generation
Tip
Key Insight: Diffusion models excel at High Quality and Diversity but struggle with Fast Sampling because they require hundreds of sequential iterations to produce one image.
Extensions and Latent Diffusion
To overcome limitations like sampling speed, several extensions exist:
Conditional Diffusion
Conditions (such as text or labels) are incorporated into the U-Net:
- Scalar/Class: Encoded as vector embeddings.
- Image: Channel-wise concatenation (used for colorisation or super-resolution).
- Text: Cross-attention mechanism (e.g., Stable Diffusion)
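The cross-attention idea can be sketched as below: image feature tokens act as queries attending over text-embedding tokens. This is a bare illustration with random arrays; the learned query/key/value projection matrices of a real model are omitted for brevity, and the dimensions are arbitrary.

```python
import numpy as np

# Minimal cross-attention: each image token takes a softmax-weighted
# combination of the text tokens, scaled by sqrt(d) for stability.
def cross_attention(img_tokens, txt_tokens):
    d = img_tokens.shape[-1]
    scores = img_tokens @ txt_tokens.T / np.sqrt(d)   # (n_img, n_txt)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over text tokens
    return weights @ txt_tokens                       # attended text features

img_tokens = np.random.randn(64, 32)   # e.g. an 8x8 feature map, 32 channels
txt_tokens = np.random.randn(10, 32)   # e.g. 10 text-embedding tokens
out = cross_attention(img_tokens, txt_tokens)
```

In a U-Net, the attended output would be added back into the image feature map, letting every spatial location consult the text prompt.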
Latent Diffusion Models (Stable Diffusion)
LDMs solve the speed problem by performing the diffusion process in a compressed latent space.
- Encoder $\mathcal{E}$: Compress the pixel-space image $\mathbf{x}$ into a lower-dimensional latent vector $\mathbf{z} = \mathcal{E}(\mathbf{x})$
- Latent Diffusion: The forward and reverse processes are applied only to $\mathbf{z}$.
- Decoder $\mathcal{D}$: Decompress the final denoised latent back into a high-resolution image $\mathbf{x} = \mathcal{D}(\mathbf{z})$.
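A shape-level sketch of this pipeline, with toy down/upsampling functions standing in for the trained VAE encoder and decoder (a 4x spatial reduction is assumed here for illustration; the diffusion steps themselves are omitted):

```python
import numpy as np

# Toy stand-ins for E and D: the point is only that diffusion would run on
# the much smaller z, making every U-Net evaluation far cheaper.
def encode(x):
    # 4x spatial downsampling by average pooling (stand-in for encoder E)
    c, h, w = x.shape
    return x.reshape(c, h // 4, 4, w // 4, 4).mean(axis=(2, 4))

def decode(z):
    # Nearest-neighbour 4x upsampling (stand-in for decoder D)
    return z.repeat(4, axis=1).repeat(4, axis=2)

x = np.random.randn(3, 64, 64)   # pixel-space image
z = encode(x)                    # latent: (3, 16, 16), 16x fewer positions
x_rec = decode(z)                # back to pixel-space resolution (3, 64, 64)
```

Running hundreds of denoising steps on the $16 \times 16$ latent instead of the $64 \times 64$ image is exactly where LDMs recover their speed.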