LoRA was introduced in the paper “LoRA: Low-Rank Adaptation of Large Language Models” (Hu et al., Microsoft Research, 2021). The main motivation was to resolve the inefficiency of full fine-tuning without sacrificing performance. The authors observed that during fine-tuning, model weights do not change arbitrarily but tend to move along low-rank directions in the parameter space.
In other words: the weight updates of a model can be approximated by a low-rank matrix.
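To make this concrete, here is a minimal NumPy sketch (an illustration, not from the paper; the synthetic update matrix and the rank r = 8 are assumptions) showing how an update matrix that lies mostly in a low-rank subspace can be factored into two small matrices:

import numpy as np

d_out, d_in, r = 128, 4096, 8

# Synthetic weight update that is intrinsically low-rank, plus a little noise
delta_W = np.random.randn(d_out, r) @ np.random.randn(r, d_in) \
          + 0.01 * np.random.randn(d_out, d_in)

# Keep only the top-r singular directions (truncated SVD)
U, S, Vt = np.linalg.svd(delta_W, full_matrices=False)
B = U[:, :r] * S[:r]   # shape (d_out, r)
A = Vt[:r, :]          # shape (r, d_in)

rel_error = np.linalg.norm(delta_W - B @ A) / np.linalg.norm(delta_W)
print(f"Relative error of the rank-{r} approximation: {rel_error:.4f}")

The two factors together hold far fewer numbers than the full matrix, yet reproduce it almost exactly when the update really is close to low-rank.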
In a Transformer model, key operations (such as projections in attention layers) are performed via matrix multiplication. For example, the query projection is computed as:
Q = X * W_Q
Where:
- X is the input (activations from the previous layer)
- W_Q is the trainable weight matrix for the query projection

During full fine-tuning, W_Q is updated directly, meaning all of its elements are modified.
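As a quick illustration (the dimensions are assumptions, borrowed from the worked example later in this section), the projection is just a matrix multiplication:

import torch

d_model, d_head, seq_len = 4096, 128, 10

X = torch.randn(seq_len, d_model)    # activations from the previous layer
W_Q = torch.randn(d_model, d_head)   # query projection weights (all updated in full fine-tuning)

Q = X @ W_Q                          # Q = X * W_Q, shape (seq_len, d_head)
print(Q.shape)                       # torch.Size([10, 128])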
In LoRA, instead, W_Q is kept frozen, and a low-rank update is introduced:
Q = X * (W_Q + B * A)
Where:
- B is a matrix of size (d_model, r)
- A is a matrix of size (r, d_head)
- r is the rank (a small hyperparameter, typically between 4 and 64)
- d_model and d_head are the original dimensions of W_Q

The matrix B * A has rank at most r, much lower than the full rank of W_Q. This means that instead of updating millions of parameters, only the parameters of A and B are trained, totaling:
Number of LoRA parameters = r * (d_model + d_head)
For example, if d_model = 4096, d_head = 128, and r = 8, then:
LoRA parameters = 8 * (4096 + 128) = 8 * 4224 = 33,792
Meanwhile, the original matrix W_Q has 4096 * 128 = 524,288 parameters. In other words, only about 6.4% of the original parameters are trained.
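The following minimal PyTorch sketch (an illustration, not the peft implementation) puts the pieces together for these dimensions and confirms the parameter count above:

import torch
import torch.nn as nn

d_model, d_head, r = 4096, 128, 8

W_Q = nn.Parameter(torch.randn(d_model, d_head), requires_grad=False)  # frozen pretrained weight
B = nn.Parameter(torch.zeros(d_model, r))        # trainable; zero init so B @ A starts at zero
A = nn.Parameter(torch.randn(r, d_head) * 0.01)  # trainable, small random init

def lora_query(X: torch.Tensor) -> torch.Tensor:
    # B @ A has shape (d_model, d_head), the same as W_Q, but rank at most r
    return X @ (W_Q + B @ A)

X = torch.randn(2, d_model)
print(lora_query(X).shape)        # torch.Size([2, 128])
print(B.numel() + A.numel())      # 33792 trainable LoRA parameters

The key hyperparameters of a LoRA configuration are: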
- r (rank): Controls the number of trainable parameters. Low values (4–8) suffice for simple tasks; higher values (32–64) for complex ones. Too high an r eliminates the efficiency gains.
- lora_alpha: Scaling factor controlling the magnitude of the update B * A. Typically set as a multiple of r (e.g., alpha = 16 if r = 8). Acts as an “implicit learning rate” for the LoRA updates.
- lora_dropout: Dropout applied to the LoRA layer outputs for regularization. Typical values: 0.0 to 0.1.
- target_modules: List of model modules where LoRA is applied. In Transformers, it is commonly applied to the query and value projections (q_proj, v_proj) in the attention layers, and sometimes also to k_proj, o_proj, or dense MLP layers (fc1, fc2).

Hugging Face’s peft library allows applying LoRA to any model compatible with the transformers interface in just a few lines of code:
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model (the checkpoint name below is just a placeholder)
model = AutoModelForCausalLM.from_pretrained("your-base-model")

# LoRA configuration
lora_config = LoraConfig(
r=8,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
# The model now trains only LoRA parameters
model.print_trainable_parameters() # Shows how many parameters are trainable
This transforms the original model so that only the A and B matrices in the specified layers are updated during training. The rest remain frozen.
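As a follow-up, once training is done, saving a peft model writes only the small adapter weights (the A and B matrices), which can later be re-attached to the same frozen base model. A minimal sketch, where the directory and checkpoint names are placeholders:

# Save only the LoRA adapter (a few megabytes, not the full model)
model.save_pretrained("lora-adapter")

# Later: re-attach the adapter to the same base model
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("your-base-model")
model = PeftModel.from_pretrained(base_model, "lora-adapter")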