🔧 Module 5: Fine-Tuning Diffusion Models — Customize Your Generator
5.1 Why Fine-Tuning?
Although pretrained models like Stable Diffusion are incredibly versatile, you will often want to generate content that reflects a personal style, a specific object, brand, or character, or a unique concept that is poorly represented in the original training data. Fine-tuning lets you adapt the model to these specific use cases.
5.2 Dreambooth: Inject Your Concept into the Model
Dreambooth is a fine-tuning technique that allows “teaching” the model a new concept (e.g., an object, person, or style) using only 3-5 images. It does so by using a unique identifier (a rare word or special token) associated with the concept during training.
Process:
- Select a unique identifier (e.g., sks).
- Prepare 3-5 images of the concept with varied backgrounds and angles.
- Train the model to associate the prompt “a photo of a [sks] dog” with the actual dog images (a minimal training sketch follows at the end of this subsection).
- The model learns to generalize the concept and can generate it in new contexts: “a [sks] dog on the moon”, “a [sks] dog dressed as a pirate”.
Advantages: High fidelity, contextual generalization.
Disadvantages: Requires more VRAM than lighter methods; risk of overfitting without regularization (e.g., a prior-preservation set of generic images of the concept’s class).
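To make the process concrete, below is a minimal, illustrative training-loop sketch in the spirit of Dreambooth, built on PyTorch and Hugging Face diffusers. It is not the official training script: the model identifier, the ./instance_images folder, and the hyperparameters are placeholder assumptions, and the prior-preservation regularization mentioned above is omitted for brevity.

```python
# Minimal Dreambooth-style fine-tuning sketch (illustrative, not the official script).
# Assumes a folder ./instance_images with 3-5 photos of the concept and a CUDA GPU.
import torch
from pathlib import Path
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"        # placeholder base checkpoint
instance_prompt = "a photo of a sks dog"           # rare identifier "sks" tags the concept
device = "cuda"

tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Only the U-Net is updated here; the VAE and text encoder stay frozen.
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.train()

class InstanceDataset(Dataset):
    def __init__(self, folder):
        self.paths = sorted(Path(folder).glob("*.jpg"))   # adjust the extension as needed
        self.tf = transforms.Compose([
            transforms.Resize(512), transforms.CenterCrop(512),
            transforms.ToTensor(), transforms.Normalize([0.5], [0.5]),
        ])
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, i):
        return self.tf(Image.open(self.paths[i]).convert("RGB"))

loader = DataLoader(InstanceDataset("./instance_images"), batch_size=1, shuffle=True)
optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-6)

# The instance prompt is encoded once, since the text encoder is frozen.
ids = tokenizer(instance_prompt, padding="max_length",
                max_length=tokenizer.model_max_length,
                truncation=True, return_tensors="pt").input_ids.to(device)
with torch.no_grad():
    prompt_embeds = text_encoder(ids)[0]

for epoch in range(100):           # with 3-5 images this yields a few hundred update steps
    for pixels in loader:
        pixels = pixels.to(device)
        latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
        noise = torch.randn_like(latents)
        t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                          (latents.shape[0],), device=device).long()
        noisy_latents = noise_scheduler.add_noise(latents, noise, t)
        pred = unet(noisy_latents, t, encoder_hidden_states=prompt_embeds).sample
        loss = torch.nn.functional.mse_loss(pred, noise)  # standard noise-prediction loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The maintained Dreambooth example script in the diffusers repository adds prior preservation, mixed precision, and other practical details that this sketch leaves out.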
5.3 LoRA for Images: Efficient and Modular Tuning
As with LLMs, LoRA (Low-Rank Adaptation) has been adapted to diffusion models. Instead of updating all weights of the U-Net or text encoder, low-rank matrices are trained and added to the original weights.
How it works:
- Low-rank matrices A and B are inserted into key U-Net layers (typically the cross-attention projections, and sometimes convolutional layers); see the sketch after this list.
- Only A and B are trained, leaving the base model frozen.
- The result is a small file (.safetensors, ~2-200 MB) that can be dynamically loaded over any base checkpoint.
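As a concrete illustration, the sketch below wraps a single frozen linear layer with a LoRA adapter in PyTorch, following the common formulation W' = W + (alpha / r) · B·A. The rank, scaling, and initialization values are illustrative; in practice, libraries such as peft and diffusers inject these adapters into the U-Net’s attention projections automatically.

```python
# Conceptual LoRA adapter around a frozen linear layer (PyTorch).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # original weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # A projects down into the low-rank space, B projects back up.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # Equivalent to applying W + scale * (B @ A) without materializing the full delta.
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

# Example: wrap a projection the size of a typical cross-attention layer.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # only A and B, a tiny fraction of 768*768
```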
Advantages:
- Fast, lightweight training.
- Small, portable files.
- Combinable: multiple LoRAs can be mixed in a single generation (see the loading example below).
- Ideal for artistic styles, poses, facial features, aesthetic modifiers.
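At inference time, loading and combining LoRAs over a base checkpoint might look like the sketch below with diffusers. The directory, file names, and adapter weights are placeholders, and set_adapters assumes a recent diffusers version with the PEFT backend.

```python
# Loading LoRA weights over a base checkpoint with diffusers (paths are placeholders).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a LoRA .safetensors file on top of the frozen base model.
pipe.load_lora_weights("path/to/loras", weight_name="style_lora.safetensors",
                       adapter_name="style")

# Optionally combine several LoRAs with individual strengths.
pipe.load_lora_weights("path/to/loras", weight_name="character_lora.safetensors",
                       adapter_name="character")
pipe.set_adapters(["style", "character"], adapter_weights=[0.8, 0.6])

image = pipe("a portrait in watercolor style", num_inference_steps=30).images[0]
image.save("lora_result.png")
```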
Common Applications:
- Style of a specific artist (e.g., “Van Gogh style LoRA”).
- Consistent facial features (for characters).
- Design elements (typography, patterns, textures).
5.4 Textual Inversion: Learn New Words Without Changing the Model
Textual Inversion trains custom text embeddings instead of modifying model weights. A special token (e.g., <cat-toy>) is assigned to a set of images, and the embedding vector associated with that token is optimized so that, when used in a prompt, the model generates images consistent with the concept.
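Using a trained embedding only requires loading it on top of a standard pipeline, as in the sketch below with diffusers; the sd-concepts-library/cat-toy repository and the <cat-toy> token are illustrative examples, and the base model identifier is a placeholder.

```python
# Loading a learned textual-inversion embedding with diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Registers the new token and its learned embedding; the model weights are untouched.
pipe.load_textual_inversion("sd-concepts-library/cat-toy", token="<cat-toy>")

image = pipe("a photo of a <cat-toy> on a wooden table").images[0]
image.save("textual_inversion_result.png")
```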
Advantages:
- Very lightweight (only saves a ~10KB vector).
- Does not modify the model, only the embedding space.
- Easy to share and integrate.
Disadvantages:
- Lower generalization capacity than Dreambooth.
- Sensitive to quality and variety of training images.