🌀 Module 2: Mechanics of Diffusion Models — Deconstructing the Process

2.1 The Diffusion Process: Adding Noise to Learn to Remove It

The fundamental principle behind diffusion models is simple yet powerful: if you can learn to undo a destructive process (adding noise), then you can use that knowledge to build something new from chaos.

The process consists of two phases:

➤ Forward (Diffusion) Process

In this phase, a real image ( x_0 ) is gradually corrupted by adding Gaussian noise over ( T ) discrete steps until the image becomes pure noise ( x_T \sim \mathcal{N}(0, I) ). Each step ( t ) is defined as:

[ x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_t ]

where ( \epsilon_t \sim \mathcal{N}(0, I) ) is Gaussian noise, and ( \alpha_t ) is a variance scheduling factor controlling how much noise is added at each step. The sequence ( \alpha_1, \alpha_2, ..., \alpha_T ) is designed so that noise increases gradually.
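
As a minimal sketch of a single forward step (in NumPy; the linear schedule with ( \alpha_t = 1 - \beta_t ) used here is an illustrative, commonly used choice, not something specified above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative noise schedule (assumed, not given in the text): alpha_t = 1 - beta_t, beta_t rising linearly
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas

def forward_step(x_prev, alpha_t):
    """One forward step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps_t, with eps_t ~ N(0, I)."""
    eps_t = rng.standard_normal(x_prev.shape)
    return np.sqrt(alpha_t) * x_prev + np.sqrt(1.0 - alpha_t) * eps_t

x = rng.standard_normal((32, 32, 3))      # stand-in for a real image x_0
for t in range(T):
    x = forward_step(x, alphas[t])         # after T steps, x is close to pure Gaussian noise
```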

A convenient property of this process is that, given ( x_0 ), any ( x_t ) can be sampled directly in closed form, without iterating through all previous steps, because sums of independent Gaussian variables are themselves Gaussian:

[ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon ]

where ( \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i ).
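
Continuing the sketch above, the cumulative products ( \bar{\alpha}_t ) can be precomputed so that any ( x_t ) is obtained in a single call (the schedule is again an assumed example):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)         # assumed schedule, as above
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)             # \bar{alpha}_t = prod_{i<=t} alpha_i

def sample_xt(x0, t):
    """Jump straight to step t: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps, eps

x0 = rng.standard_normal((32, 32, 3))       # stand-in for a real image
x_500, eps = sample_xt(x0, t=500)           # no need to iterate through steps 1..499
```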

➤ Reverse (Denoising) Process

This is where the neural network comes in. A model ( \epsilon_\theta(x_t, t) ) is trained to predict the noise ( \epsilon ) added at step ( t ), given ( x_t ) and timestep ( t ). Once trained, this model can be used to invert the process: starting from pure noise ( x_T ), iterate backward, removing the predicted noise at each step:

[ x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z ]

where ( z ) is random noise (except at the final step, where ( z = 0 )) and ( \sigma_t ) controls the amount of stochastic noise added during reversal.
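
A minimal sampling loop might look like the sketch below (PyTorch-style). Here `model` is a placeholder for the trained noise predictor ( \epsilon_\theta(x_t, t) ), `alphas` and `alpha_bars` are 1-D tensors of length ( T ), and ( \sigma_t = \sqrt{1 - \alpha_t} ) is one common choice, assumed for illustration:

```python
import torch

@torch.no_grad()
def sample(model, shape, alphas, alpha_bars, device="cpu"):
    """Reverse (denoising) process: start from pure noise x_T and iterate back to x_0."""
    T = len(alphas)
    x = torch.randn(shape, device=device)                      # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_pred = model(x, t_batch)                           # predicted noise eps_theta(x_t, t)
        coef = (1.0 - alphas[t]) / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_pred) / torch.sqrt(alphas[t])
        if t > 0:
            sigma_t = torch.sqrt(1.0 - alphas[t])              # sigma_t = sqrt(beta_t), a common choice (assumed)
            x = mean + sigma_t * torch.randn_like(x)           # stochastic term sigma_t * z
        else:
            x = mean                                           # final step: z = 0
    return x
```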

Although slow, this iterative process enables highly controlled, high-quality generation.

2.2 Conditioning: Guiding Generation with Text, Sketches, or Masks

One of the most significant advances in diffusion models is the ability to condition generation on an external signal — for example, a text prompt, sketch, segmentation mask, or reference image. This transforms the model from a random generator into an assisted creation tool.

In text-to-image generation, a text encoder (such as CLIP or T5) transforms the prompt into an embedding vector. This vector is injected into the U-Net (the model predicting noise) via cross-attention mechanisms, allowing each denoising step to “know” which concepts to preserve or emphasize.
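
As an illustrative sketch (not the exact architecture of any particular model), a cross-attention block inside the U-Net lets the flattened image feature map (queries) attend to the prompt embeddings produced by the text encoder (keys and values); the dimensions below are placeholders:

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Image features attend to text-encoder outputs via cross-attention."""
    def __init__(self, img_dim, txt_dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=img_dim, num_heads=n_heads,
                                          kdim=txt_dim, vdim=txt_dim, batch_first=True)
        self.norm = nn.LayerNorm(img_dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, H*W, img_dim) flattened U-Net feature map
        # txt_tokens: (B, L, txt_dim) prompt embeddings from the text encoder
        attended, _ = self.attn(query=img_tokens, key=txt_tokens, value=txt_tokens)
        return self.norm(img_tokens + attended)                # residual connection

block = CrossAttentionBlock(img_dim=320, txt_dim=768)          # placeholder dimensions
img = torch.randn(1, 64 * 64, 320)                             # fake feature-map tokens
txt = torch.randn(1, 77, 768)                                  # fake prompt embedding
out = block(img, txt)                                          # same shape as img
```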

The loss function is modified to include this conditioning:

[ \mathcal{L} = \mathbb{E}_{x_0, c, \epsilon, t} \left[ \| \epsilon - \epsilon_\theta(x_t, t, c) \|^2 \right] ]

where ( c ) is the condition (e.g., text embedding).
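
In code, one training step of this conditional objective might look like the following sketch; `model` and `text_encoder` are placeholders rather than a specific library API, and `x0` is assumed to be a batch of images shaped (B, C, H, W):

```python
import torch
import torch.nn.functional as F

def training_step(model, text_encoder, x0, prompt_tokens, alpha_bars):
    """One step of the conditional denoising objective: predict the noise added to x_0."""
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,), device=x0.device)   # random timestep per sample
    eps = torch.randn_like(x0)                                      # target noise
    abar = alpha_bars[t].view(B, 1, 1, 1)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps              # forward closed form
    c = text_encoder(prompt_tokens)                                 # condition: text embedding
    eps_pred = model(x_t, t, c)                                     # eps_theta(x_t, t, c)
    return F.mse_loss(eps_pred, eps)                                # || eps - eps_theta ||^2
```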

