🔧 Module 5: Fine-Tuning Diffusion Models — Customize Your Generator
5.1 Why Fine-Tuning?
Although pretrained models like Stable Diffusion are incredibly versatile, you will often want to generate content that reflects a personal style, a specific object, brand, or character, or a unique concept that is poorly represented in the original training data. Fine-tuning lets you adapt the model to these specific use cases.
5.2 Dreambooth: Inject Your Concept into the Model
Dreambooth is a fine-tuning technique that allows “teaching” the model a new concept (e.g., an object, person, or style) using only 3-5 images. It does so by using a unique identifier (a rare word or special token) associated with the concept during training.
Process:
- Select a unique identifier (e.g., sks).
- Prepare 3-5 images of the concept with varied backgrounds and angles.
- Train the model to associate the prompt “a photo of a [sks] dog” with the actual dog images (a minimal training sketch follows at the end of this subsection).
- The model learns to generalize the concept and can generate it in new contexts: “a [sks] dog on the moon”, “a [sks] dog dressed as a pirate”.
Advantages: High fidelity, contextual generalization.
Disadvantages: Requires more VRAM than lighter methods; risk of overfitting without regularization (e.g., a prior-preservation set of generic images of the concept’s class).
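To make the process concrete, below is a minimal, illustrative training-loop sketch in the spirit of Dreambooth, built on PyTorch and Hugging Face diffusers. It is not the official training script: the model identifier, the ./instance_images folder, and the hyperparameters are placeholder assumptions, and the prior-preservation regularization mentioned above is omitted for brevity.

```python
# Minimal Dreambooth-style fine-tuning sketch (illustrative, not the official script).
# Assumes a folder ./instance_images with 3-5 photos of the concept and a CUDA GPU.
import torch
from pathlib import Path
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"        # placeholder base checkpoint
instance_prompt = "a photo of a sks dog"           # rare identifier "sks" tags the concept
device = "cuda"

tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Only the U-Net is updated here; the VAE and text encoder stay frozen.
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.train()

class InstanceDataset(Dataset):
    def __init__(self, folder):
        self.paths = sorted(Path(folder).glob("*.jpg"))   # adjust the extension as needed
        self.tf = transforms.Compose([
            transforms.Resize(512), transforms.CenterCrop(512),
            transforms.ToTensor(), transforms.Normalize([0.5], [0.5]),
        ])
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, i):
        return self.tf(Image.open(self.paths[i]).convert("RGB"))

loader = DataLoader(InstanceDataset("./instance_images"), batch_size=1, shuffle=True)
optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-6)

# The instance prompt is encoded once, since the text encoder is frozen.
ids = tokenizer(instance_prompt, padding="max_length",
                max_length=tokenizer.model_max_length,
                truncation=True, return_tensors="pt").input_ids.to(device)
with torch.no_grad():
    prompt_embeds = text_encoder(ids)[0]

for epoch in range(100):           # with 3-5 images this yields a few hundred update steps
    for pixels in loader:
        pixels = pixels.to(device)
        latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
        noise = torch.randn_like(latents)
        t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                          (latents.shape[0],), device=device).long()
        noisy_latents = noise_scheduler.add_noise(latents, noise, t)
        pred = unet(noisy_latents, t, encoder_hidden_states=prompt_embeds).sample
        loss = torch.nn.functional.mse_loss(pred, noise)  # standard noise-prediction loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The maintained Dreambooth example script in the diffusers repository adds prior preservation, mixed precision, and other practical details that this sketch leaves out.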
5.3 LoRA for Images: Efficient and Modular Tuning
As with LLMs, LoRA (Low-Rank Adaptation) has been adapted to diffusion models. Instead of updating all weights of the U-Net or text encoder, low-rank matrices are trained and added to the original weights.
How it works:
- Low-rank matrices A and B are inserted into key U-Net layers (typically the cross-attention projections, and sometimes convolutional layers); see the sketch after this list.
- Only A and B are trained, leaving the base model frozen.
- The result is a small file (.safetensors, ~2-200 MB) that can be dynamically loaded over any base checkpoint.
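As a concrete illustration, the sketch below wraps a single frozen linear layer with a LoRA adapter in PyTorch, following the common formulation W' = W + (alpha / r) · B·A. The rank, scaling, and initialization values are illustrative; in practice, libraries such as peft and diffusers inject these adapters into the U-Net’s attention projections automatically.

```python
# Conceptual LoRA adapter around a frozen linear layer (PyTorch).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # original weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # A projects down into the low-rank space, B projects back up.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # Equivalent to applying W + scale * (B @ A) without materializing the full delta.
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

# Example: wrap a projection the size of a typical cross-attention layer.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # only A and B, a tiny fraction of 768*768
```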
Advantages:
- Fast, lightweight training.
- Small, portable files.
- Combinable: multiple LoRAs can be mixed in a single generation (see the loading example below).
- Ideal for artistic styles, poses, facial features, aesthetic modifiers.
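At inference time, loading and combining LoRAs over a base checkpoint might look like the sketch below with diffusers. The directory, file names, and adapter weights are placeholders, and set_adapters assumes a recent diffusers version with the PEFT backend.

```python
# Loading LoRA weights over a base checkpoint with diffusers (paths are placeholders).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a LoRA .safetensors file on top of the frozen base model.
pipe.load_lora_weights("path/to/loras", weight_name="style_lora.safetensors",
                       adapter_name="style")

# Optionally combine several LoRAs with individual strengths.
pipe.load_lora_weights("path/to/loras", weight_name="character_lora.safetensors",
                       adapter_name="character")
pipe.set_adapters(["style", "character"], adapter_weights=[0.8, 0.6])

image = pipe("a portrait in watercolor style", num_inference_steps=30).images[0]
image.save("lora_result.png")
```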
Common Applications:
- Style of a specific artist (e.g., “Van Gogh style LoRA”).
- Consistent facial features (for characters).
- Design elements (typography, patterns, textures).
5.4 Textual Inversion: Learn New Words Without Changing the Model
Textual Inversion trains custom text embeddings instead of modifying model weights. A special token (e.g., <cat-toy>) is assigned to a set of images, and the embedding vector associated with that token is optimized so that, when used in a prompt, the model generates images consistent with the concept.
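Using a trained embedding only requires loading it on top of a standard pipeline, as in the sketch below with diffusers; the sd-concepts-library/cat-toy repository and the <cat-toy> token are illustrative examples, and the base model identifier is a placeholder.

```python
# Loading a learned textual-inversion embedding with diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Registers the new token and its learned embedding; the model weights are untouched.
pipe.load_textual_inversion("sd-concepts-library/cat-toy", token="<cat-toy>")

image = pipe("a photo of a <cat-toy> on a wooden table").images[0]
image.save("textual_inversion_result.png")
```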
Advantages:
- Very lightweight (only saves a ~10KB vector).
- Does not modify the model, only the embedding space.
- Easy to share and integrate.
Disadvantages:
- Lower generalization capacity than Dreambooth.
- Sensitive to quality and variety of training images.