🔢 Part 3: Quantization — Fewer bits, same power
What is quantization?
Quantization is a technique that reduces the numerical precision of the model’s weights and activations. Instead of using 32-bit floating-point numbers (FP32), lower-precision representations are used: 16 bits (FP16), 8 bits (INT8), or even 4 bits (INT4).
The impact is immediate:
- An FP32 weight occupies 4 bytes.
- The same weight in INT8 occupies 1 byte.
- In INT4, it occupies only 0.5 bytes.
This means a 1B-parameter model goes from occupying ~4 GB in FP32 to just ~0.5 GB in INT4 — an 8x reduction!
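As a sanity check, that arithmetic can be reproduced in a few lines of Python:

```python
# Back-of-the-envelope weight-memory estimate at each precision.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

num_params = 1_000_000_000  # the 1B-parameter example above

for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt}: ~{num_params * nbytes / 1e9:.1f} GB")
```

Running it prints ~4.0 GB for FP32 down to ~0.5 GB for INT4, matching the 8x figure (weights only; activations and KV caches add overhead on top).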
Types of quantization
1. Post-Training Quantization (PTQ)
- Applied after the model is already trained.
- Does not require retraining.
- Faster and simpler, but typically loses more accuracy than QAT.
- Ideal for rapid prototyping or when training data is unavailable (see the sketch below).
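A minimal PTQ sketch using PyTorch's built-in `quantize_dynamic`, applied to a toy placeholder model standing in for a real trained network (this particular flavor is the "dynamic" variant described in point 3 below):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for an already-trained network.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# PTQ, dynamic flavor: Linear weights are converted to INT8 up front;
# activations are quantized on-the-fly at inference. No retraining needed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface as the original model
```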
2. Quantization-Aware Training (QAT)
- Quantization is simulated during training or fine-tuning.
- The model learns to compensate for errors introduced by low precision.
- Requires more time and resources, but produces more accurate models.
- Ideal for critical production environments (see the sketch below).
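A minimal eager-mode QAT sketch in PyTorch, assuming a toy placeholder model; the actual training loop is elided:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Placeholder network; QuantStub/DeQuantStub mark the quantized region."""
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.body = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.body(self.quant(x)))

model = ToyModel().train()
model.qconfig = torch.ao.quantization.get_default_qat_qconfig("fbgemm")
torch.ao.quantization.prepare_qat(model, inplace=True)  # inserts fake-quant ops

# ... run the usual training / fine-tuning loop here; the fake-quant ops
# let the model learn to compensate for INT8 rounding error ...

model.eval()
quantized = torch.ao.quantization.convert(model)  # real INT8 modules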
3. Dynamic vs Static Quantization
- Dynamic: quantization ranges for activations are computed on-the-fly during inference. More flexible (no calibration data needed), but adds per-inference overhead.
- Static: ranges are computed once, using a calibration set. Faster at inference, but requires that extra calibration step (sketched below).
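The calibration step is what makes static quantization static. A minimal sketch of static PTQ in PyTorch's eager mode, with random tensors standing in for a real calibration set:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(512, 10)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = ToyModel().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
torch.ao.quantization.prepare(model, inplace=True)  # inserts range observers

# Calibration: run representative inputs so the observers can record
# activation ranges; random tensors stand in for a real calibration set.
with torch.no_grad():
    for _ in range(100):
        model(torch.randn(8, 512))

quantized = torch.ao.quantization.convert(model)  # ranges are now fixed (static)
```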
Common precision levels
| Format | Bits | Typical Use | Size Reduction |
| --- | --- | --- | --- |
| FP32 | 32 | Standard training | 1x (baseline) |
| FP16 / BF16 | 16 | Accelerated training/inference | 2x |
| INT8 | 8 | CPU/GPU inference | 4x |
| INT4 / NF4 | 4 | Inference on memory-constrained devices | 8x |
(NF4: NormalFloat4, a 4-bit data type optimized for the normally distributed weights typical of language models)
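In practice, NF4 is most often reached through the bitsandbytes integration in Hugging Face transformers. A sketch (the model ID is illustrative; any causal LM on the Hub loads the same way):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 config for bitsandbytes, via Hugging Face transformers.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 weights
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls still run in bf16
)

# The model ID below is only an example.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```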
How is it done?
1. Select the quantization scheme (INT8, INT4, dynamic, static, etc.).
2. (QAT only) Modify the model graph to simulate quantization during training.
3. (Static PTQ only) Run a calibration set through the model to compute quantization ranges.
4. Convert weights and activations to the new format.
5. Evaluate the quantized model for accuracy and speed (a size-check helper is sketched below).
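For the last step, here is a quick helper covering the size half of that evaluation (the accuracy half needs your own test set; `model` and `quantized` refer to whatever pair the earlier sketches produced):

```python
import os
import tempfile

import torch

def state_dict_mb(model: torch.nn.Module) -> float:
    """Serialize the weights to disk and report the file size in MB."""
    path = os.path.join(tempfile.gettempdir(), "quant_size_check.pt")
    torch.save(model.state_dict(), path)
    size_mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return size_mb

# Example: compare an FP32 model against its quantized counterpart.
# print(f"FP32: {state_dict_mb(model):.1f} MB")
# print(f"INT8: {state_dict_mb(quantized):.1f} MB")
```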
Advantages and disadvantages
✅ Massive reduction in size and memory usage.
✅ Significant acceleration on compatible hardware (modern GPUs, TPUs, CPUs with AVX-512 VNNI instructions).
✅ Supported out of the box by most major frameworks (PyTorch, TensorFlow, ONNX Runtime, TensorRT).
❌ Loss of accuracy (though often minimal with proper techniques).
❌ Requires hardware and software that support low-precision operations.
❌ QAT requires retraining; PTQ may be insufficient for highly sensitive models.