🖼️ Module 3: Stable Diffusion — Architecture, Workflow, and Ecosystem
3.1 What is Stable Diffusion?
Stable Diffusion is a latent diffusion model created by the CompVis group at LMU Munich together with Runway, with training compute provided by Stability AI. Unlike models such as DALL·E 2 or Imagen, which run the diffusion process directly in pixel space, Stable Diffusion works in a compressed latent space, drastically reducing computational cost without sacrificing visual quality.
The key lies in using a Variational Autoencoder (VAE):
- The VAE encoder compresses an image ( x ) into a lower-dimensional latent space ( z = \mathcal{E}(x) ).
- The decoder reconstructs the image from the latent: ( \hat{x} = \mathcal{D}(z) ).
The diffusion process is applied not to pixels, but to this latent space ( z ). This allows the U-Net to operate on much smaller tensors (e.g., 64x64 instead of 512x512), greatly accelerating both training and inference.
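The savings from working in latent space can be made concrete with a small shape calculation. The downsampling factor of 8 and the 4 latent channels below match SD 1.x; treat the exact numbers as illustrative:

```python
# Sketch: how the VAE's downsampling shrinks the tensors the U-Net processes.
# Factor 8 per spatial dimension and 4 latent channels match SD 1.x.

def latent_shape(height, width, downsample_factor=8, latent_channels=4):
    """Return the (channels, h, w) shape of the latent for a given image size."""
    assert height % downsample_factor == 0 and width % downsample_factor == 0
    return (latent_channels, height // downsample_factor, width // downsample_factor)

# A 512x512 RGB image becomes a 4x64x64 latent:
print(latent_shape(512, 512))  # (4, 64, 64)

# Elements per tensor shrink by 48x (3*512*512 pixels vs 4*64*64 latents):
pixels = 3 * 512 * 512
latents = 4 * 64 * 64
print(pixels / latents)  # 48.0
```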
3.2 Key Components of the Architecture
➤ Latent Diffusion Model (LDM)
- Latent space: 8x smaller in each spatial dimension (VAE downsampling factor 8), with 4 channels.
- U-Net: Convolutional neural network with residual blocks and cross-attention, predicting noise in latent space.
- Text Encoder: a CLIP text model (newer versions add encoders such as OpenCLIP or T5) that converts the prompt into embeddings.
- Scheduler: Defines noise dynamics (DDPM, DDIM, Euler, etc.) and controls inference steps.
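The quantity a scheduler manages can be sketched with the standard DDPM forward-noising relation, ( z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon ). The beta range below follows common SD configurations, but a plain linear schedule is used here for simplicity (SD actually uses a "scaled linear" variant), so treat it as illustrative:

```python
import numpy as np

# Illustrative DDPM-style noise schedule: betas rise linearly, and the
# cumulative product of (1 - beta) tells us how much of the clean latent
# survives at each timestep.
T = 1000
betas = np.linspace(0.00085, 0.012, T)
alphas_bar = np.cumprod(1.0 - betas)

def add_noise(z0, eps, t):
    """Noise a clean latent z0 to timestep t with Gaussian noise eps."""
    return np.sqrt(alphas_bar[t]) * z0 + np.sqrt(1.0 - alphas_bar[t]) * eps

# Early timesteps keep almost all signal; late ones are almost pure noise:
print(alphas_bar[0], alphas_bar[-1])
```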
➤ Cross-Attention: The Bridge Between Text and Image
Within the U-Net, cross-attention blocks allow visual latent features to interact with text embeddings. At each layer, keys and values come from text, while queries come from the visual latent. This enables the model to “attend” to relevant prompt words while reconstructing the image.
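The mechanism above can be sketched as single-head cross-attention in a few lines of numpy. The dimensions and random projection matrices are illustrative stand-ins, not the real U-Net weights; the point is the flow of queries, keys, and values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_tokens, Wq, Wk, Wv):
    """Queries from the visual latent; keys/values from the text embeddings."""
    q = latent_tokens @ Wq                     # (n_latent, d)
    k = text_tokens @ Wk                       # (n_text, d)
    v = text_tokens @ Wv                       # (n_text, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_latent, n_text)
    weights = softmax(scores, axis=-1)         # each latent position attends over prompt tokens
    return weights @ v                         # (n_latent, d)

rng = np.random.default_rng(0)
d = 8
lat = rng.normal(size=(64, d))   # e.g. an 8x8 latent flattened to 64 tokens
txt = rng.normal(size=(77, d))   # 77 text tokens, as in CLIP's context length
out = cross_attention(lat, txt, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (64, 8)
```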
3.3 Stable Diffusion Generation Workflow
- Prompt tokenization: Text is split into tokens and passed through the text encoder.
- Initial latent sampling: A Gaussian noise tensor is generated in latent space.
- Denoising iteration: Over ( T ) steps (typically 20-50), the U-Net predicts residual noise, and the scheduler updates the latent.
- Decoding: The final latent is passed through the VAE decoder to obtain the pixel image.
- Post-processing (optional): Upscaling, color correction, artifact removal.
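The steps above can be tied together in a toy end-to-end loop. Every component here is a placeholder: a random "text encoder", a dummy "unet", and a deliberately simplified scheduler update, so the sketch shows only the control flow, not real model behavior:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_prompt(prompt):
    """Stand-in for the text encoder: 77 tokens x 8 dims of random embeddings."""
    return rng.normal(size=(77, 8))

def unet(z, t, cond):
    """Stand-in noise predictor; a real U-Net conditions on t and cond."""
    return 0.1 * z

def generate(prompt, steps=20, shape=(4, 64, 64)):
    cond = encode_prompt(prompt)      # 1. tokenize/encode the prompt
    z = rng.normal(size=shape)        # 2. sample an initial Gaussian latent
    for t in range(steps):            # 3. denoising iteration
        eps = unet(z, t, cond)        #    predict residual noise
        z = z - eps / steps           #    simplified scheduler update
    return z                          # 4. would go to the VAE decoder

latent = generate("a watercolor fox")
print(latent.shape)  # (4, 64, 64)
```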