QLoRA (Quantized Low-Rank Adaptation) is a natural and revolutionary extension of LoRA, presented in the paper “QLoRA: Efficient Finetuning of Quantized LLMs” (Dettmers et al., University of Washington, 2023). While LoRA reduces the number of trainable parameters, QLoRA goes one step further: it also reduces the numerical precision of the base model weights, so the model fits into low-memory GPUs while maintaining performance nearly on par with full fine-tuning.
QLoRA enables training models up to 70 billion parameters on a single 48GB GPU, and 30–40B models on 24GB GPUs. Moreover, 7B–13B models can be trained on 16GB GPUs — such as those freely offered by Google Colab.
Quantization is a compression technique that reduces the precision of the numbers used to represent neural network weights. Instead of storing each weight as a 32-bit (FP32) or 16-bit (FP16/BF16) floating-point number, it is stored as an 8-bit integer (INT8) or even a 4-bit value (INT4).
For example, at 4 bytes per weight in FP32 and half a byte per weight in INT4, a 7B-parameter model shrinks from ~28 GB to just ~3.5 GB, an 8x reduction.
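To make the idea concrete, here is a toy absmax quantization sketch in PyTorch. It is not the exact scheme bitsandbytes uses internally, but it shows the essential mechanism: scale the weights by their maximum absolute value, round to a low-bit integer, and keep the scale as a quantization constant for dequantization.

```python
import torch

def absmax_quantize_int8(w: torch.Tensor):
    # Scale by the largest magnitude so every value maps into [-127, 127],
    # then round to 8-bit integers. The scale is the "quantization constant".
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    # Recover approximate full-precision values from the codes and the scale.
    return q.to(torch.float32) * scale

w = torch.randn(8)                 # stand-in for a block of model weights
q, scale = absmax_quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print((w - w_hat).abs().max())     # small reconstruction error
```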
QLoRA does not apply quantization naively. It employs advanced techniques to minimize the loss of precision:
QLoRA introduces a novel data type: 4-bit NormalFloat (NF4). Unlike standard INT4, whose quantization levels are uniformly spaced, NF4 spaces its levels according to a normal distribution, which is what neural network weights approximately follow. This gives a more faithful representation of values near zero, where most of the weights are concentrated.
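With the bitsandbytes integration in transformers, NF4 is selected when building the quantization config. A minimal sketch (the compute dtype shown is just a typical choice):

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 instead of plain FP4
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16 after dequantization
)
```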
QLoRA applies quantization at two levels (double quantization): first, the model weights are quantized to 4-bit NF4 in small blocks, each with its own quantization constant; second, those quantization constants are themselves quantized to 8 bits.
This saves approximately 0.375 bits per parameter, equivalent to an extra 315 MB saved in a 7B model.
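A back-of-the-envelope check of that figure, assuming the block sizes reported in the QLoRA paper (blocks of 64 weights with FP32 constants at the first level, and blocks of 256 constants quantized to 8 bits at the second level):

```python
params = 7e9                               # 7B-parameter model

bits_single = 32 / 64                      # one FP32 constant per 64 weights -> 0.5 bits/param
bits_double = 8 / 64 + 32 / (64 * 256)     # 8-bit constants plus FP32 second-level constants

saved_bits = bits_single - bits_double     # ~0.37 bits per parameter
saved_bytes = params * saved_bits / 8      # ~326 MB (~311 MiB), in line with the figure above
print(f"{saved_bits:.3f} bits/param, ~{saved_bytes / 1e6:.0f} MB saved")
```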
QLoRA uses “paged” optimizers, which keep optimizer states in paged memory that can be moved between GPU and CPU RAM on demand. This avoids Out-of-Memory (OOM) errors during the memory spikes caused by long sequences or large batches, and is especially useful in memory-constrained environments.
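In practice, a paged optimizer can be selected through the optim argument of the transformers Trainer; a minimal sketch (all other values are illustrative):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qlora-out",            # example output directory
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",          # paged 8-bit AdamW from bitsandbytes
)
```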
QLoRA combines three key components: a base model that is frozen and quantized to 4-bit NF4, low-rank LoRA adapters kept in 16-bit precision that receive all the gradient updates, and the memory optimizations described above (double quantization and paged optimizers).
This approach ensures that very large models fit in the memory of a single GPU while the quality of the resulting model stays close to that of full 16-bit fine-tuning.
Running QLoRA with the Hugging Face stack requires:
- bitsandbytes ≥ 0.41.0 (the library that implements 4-bit quantization)
- transformers ≥ 4.30.0
- accelerate and peft

The quantized base model is then loaded through bitsandbytes and AutoModelForCausalLM.
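Putting the pieces together, here is a minimal end-to-end setup sketch; the model name, LoRA rank, and target modules are illustrative assumptions rather than values prescribed here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"       # example base model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Freeze the quantized weights and prepare the model for k-bit training.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                   # illustrative LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # typical choice for LLaMA-style attention
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trainable
```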