LoRA was introduced in the paper “LoRA: Low-Rank Adaptation of Large Language Models” (Hu et al., Microsoft Research, 2021). The main motivation was to resolve the inefficiency of full fine-tuning without sacrificing performance. The authors observed that during fine-tuning, model weights do not change arbitrarily but tend to move along low-rank directions in the parameter space.
In other words: the weight updates of a model can be approximated by a low-rank matrix.
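To make this concrete, here is a minimal NumPy sketch (an illustration, not from the paper; the synthetic update matrix and the rank r = 8 are assumptions) showing how an update matrix that lies mostly in a low-rank subspace can be factored into two small matrices:

import numpy as np

d_out, d_in, r = 128, 4096, 8

# Synthetic weight update that is intrinsically low-rank, plus a little noise
delta_W = np.random.randn(d_out, r) @ np.random.randn(r, d_in) \
          + 0.01 * np.random.randn(d_out, d_in)

# Keep only the top-r singular directions (truncated SVD)
U, S, Vt = np.linalg.svd(delta_W, full_matrices=False)
B = U[:, :r] * S[:r]   # shape (d_out, r)
A = Vt[:r, :]          # shape (r, d_in)

rel_error = np.linalg.norm(delta_W - B @ A) / np.linalg.norm(delta_W)
print(f"Relative error of the rank-{r} approximation: {rel_error:.4f}")

The two factors together hold far fewer numbers than the full matrix, yet reproduce it almost exactly when the update really is close to low-rank.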
In a Transformer model, key operations (such as projections in attention layers) are performed via matrix multiplication. For example, the query projection is computed as:
Q = X * W_Q
Where:
- X is the input (activations from the previous layer)
- W_Q is the trainable weight matrix for the query projection

During full fine-tuning, W_Q is updated directly, meaning all of its elements are modified.
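As a quick illustration (the dimensions are assumptions, borrowed from the worked example later in this section), the projection is just a matrix multiplication:

import torch

d_model, d_head, seq_len = 4096, 128, 10

X = torch.randn(seq_len, d_model)    # activations from the previous layer
W_Q = torch.randn(d_model, d_head)   # query projection weights (all updated in full fine-tuning)

Q = X @ W_Q                          # Q = X * W_Q, shape (seq_len, d_head)
print(Q.shape)                       # torch.Size([10, 128])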
In LoRA, instead, W_Q is kept frozen, and a low-rank update is introduced:
Q = X * (W_Q + B * A)
Where:
- B is a matrix of size (d_model, r)
- A is a matrix of size (r, d_head)
- r is the rank (a small hyperparameter, typically between 4 and 64)
- d_model and d_head are the original dimensions of W_Q

The matrix B * A has rank at most r, much lower than the full rank of W_Q. This means that instead of updating millions of parameters, only the parameters of A and B are trained, totaling:
Number of LoRA parameters = r * (d_model + d_head)
For example, if d_model = 4096, d_head = 128, and r = 8, then:
LoRA parameters = 8 * (4096 + 128) = 8 * 4224 = 33,792
Meanwhile, the original matrix W_Q has 4096 * 128 = 524,288 parameters. In other words, only about 6.4% of the original parameters are trained.
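The following minimal PyTorch sketch (an illustration, not the peft implementation) puts the pieces together for these dimensions and confirms the parameter count above:

import torch
import torch.nn as nn

d_model, d_head, r = 4096, 128, 8

W_Q = nn.Parameter(torch.randn(d_model, d_head), requires_grad=False)  # frozen pretrained weight
B = nn.Parameter(torch.zeros(d_model, r))        # trainable; zero init so B @ A starts at zero
A = nn.Parameter(torch.randn(r, d_head) * 0.01)  # trainable, small random init

def lora_query(X: torch.Tensor) -> torch.Tensor:
    # B @ A has shape (d_model, d_head), the same as W_Q, but rank at most r
    return X @ (W_Q + B @ A)

X = torch.randn(2, d_model)
print(lora_query(X).shape)        # torch.Size([2, 128])
print(B.numel() + A.numel())      # 33792 trainable LoRA parameters

The key hyperparameters of a LoRA configuration are: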
- r (rank): Controls the number of trainable parameters. Low values (4–8) suffice for simple tasks; higher values (32–64) for complex ones. Too high an r eliminates the efficiency gains.
- lora_alpha: Scaling factor controlling the magnitude of the update B * A. Typically set as a multiple of r (e.g., alpha = 16 if r = 8). Acts as an “implicit learning rate” for the LoRA updates.
- lora_dropout: Dropout applied to the LoRA layer outputs for regularization. Typical values: 0.0 to 0.1.
- target_modules: List of model modules where LoRA is applied. In Transformers, it is commonly applied to the query and value projections (q_proj, v_proj) in the attention layers, and sometimes also to k_proj, o_proj, or dense MLP layers (fc1, fc2).

Hugging Face’s peft library allows applying LoRA to any model compatible with the transformers interface in just a few lines of code:
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model (the checkpoint name below is just a placeholder)
model = AutoModelForCausalLM.from_pretrained("your-base-model")

# LoRA configuration
lora_config = LoraConfig(
r=8,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
# The model now trains only LoRA parameters
model.print_trainable_parameters() # Shows how many parameters are trainable
This transforms the original model so that only the A and B matrices in the specified layers are updated during training. The rest remain frozen.
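As a follow-up, once training is done, saving a peft model writes only the small adapter weights (the A and B matrices), which can later be re-attached to the same frozen base model. A minimal sketch, where the directory and checkpoint names are placeholders:

# Save only the LoRA adapter (a few megabytes, not the full model)
model.save_pretrained("lora-adapter")

# Later: re-attach the adapter to the same base model
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("your-base-model")
model = PeftModel.from_pretrained(base_model, "lora-adapter")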