🔢 Part 3: Quantization — Fewer bits, same power
What is quantization?
Quantization is a technique that reduces the numerical precision of the model’s weights and activations. Instead of using 32-bit floating-point numbers (FP32), lower-precision representations are used: 16 bits (FP16), 8 bits (INT8), or even 4 bits (INT4).
The impact is immediate:
- An FP32 weight occupies 4 bytes.
- The same weight in INT8 occupies 1 byte.
- In INT4, it occupies only 0.5 bytes.
This means a 1B-parameter model goes from occupying ~4 GB in FP32 to just ~0.5 GB in INT4 — an 8x reduction!
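As a sanity check, that arithmetic can be reproduced in a few lines of Python:

```python
# Back-of-the-envelope weight-memory estimate at each precision.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

num_params = 1_000_000_000  # the 1B-parameter example above

for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt}: ~{num_params * nbytes / 1e9:.1f} GB")
```

Running it prints ~4.0 GB for FP32 down to ~0.5 GB for INT4, matching the 8x figure (weights only; activations and KV caches add overhead on top).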
Types of quantization
1. Post-Training Quantization (PTQ)
- Applied after the model is already trained.
- Does not require retraining.
- Faster and simpler, but typically loses more accuracy than QAT.
- Ideal for rapid prototyping or when training data is unavailable (see the sketch below).
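A minimal PTQ sketch using PyTorch's built-in `quantize_dynamic`, applied to a toy placeholder model standing in for a real trained network (this particular flavor is the "dynamic" variant described in point 3 below):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for an already-trained network.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# PTQ, dynamic flavor: Linear weights are converted to INT8 up front;
# activations are quantized on-the-fly at inference. No retraining needed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface as the original model
```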
2. Quantization-Aware Training (QAT)
- Quantization is simulated during training or fine-tuning.
- The model learns to compensate for errors introduced by low precision.
- Requires more time and resources, but produces more accurate models.
- Ideal for critical production environments (see the sketch below).
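A minimal eager-mode QAT sketch in PyTorch, assuming a toy placeholder model; the actual training loop is elided:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Placeholder network; QuantStub/DeQuantStub mark the quantized region."""
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.body = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.body(self.quant(x)))

model = ToyModel().train()
model.qconfig = torch.ao.quantization.get_default_qat_qconfig("fbgemm")
torch.ao.quantization.prepare_qat(model, inplace=True)  # inserts fake-quant ops

# ... run the usual training / fine-tuning loop here; the fake-quant ops
# let the model learn to compensate for INT8 rounding error ...

model.eval()
quantized = torch.ao.quantization.convert(model)  # real INT8 modules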
3. Dynamic vs Static Quantization
- Dynamic: quantization ranges for activations are computed on-the-fly during inference. More flexible (no calibration data needed), but adds per-inference overhead.
- Static: ranges are computed once, using a calibration set. Faster at inference, but requires that extra calibration step (sketched below).
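The calibration step is what makes static quantization static. A minimal sketch of static PTQ in PyTorch's eager mode, with random tensors standing in for a real calibration set:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(512, 10)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = ToyModel().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
torch.ao.quantization.prepare(model, inplace=True)  # inserts range observers

# Calibration: run representative inputs so the observers can record
# activation ranges; random tensors stand in for a real calibration set.
with torch.no_grad():
    for _ in range(100):
        model(torch.randn(8, 512))

quantized = torch.ao.quantization.convert(model)  # ranges are now fixed (static)
```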
Common precision levels
| Format | Bits | Typical Use | Size Reduction |
| --- | --- | --- | --- |
| FP32 | 32 | Standard training | 1x (baseline) |
| FP16 / BF16 | 16 | Accelerated training/inference | 2x |
| INT8 | 8 | CPU/GPU inference | 4x |
| INT4 / NF4 | 4 | Inference on memory-constrained devices | 8x |
(NF4: NormalFloat4, a 4-bit data type optimized for the normally distributed weights typical of language models)
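In practice, NF4 is most often reached through the bitsandbytes integration in Hugging Face transformers. A sketch (the model ID is illustrative; any causal LM on the Hub loads the same way):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 config for bitsandbytes, via Hugging Face transformers.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 weights
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls still run in bf16
)

# The model ID below is only an example.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```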
How is it done?
1. Select the quantization scheme (INT8, INT4, dynamic, static, etc.).
2. (QAT only) Modify the model graph to simulate quantization during training.
3. (Static PTQ only) Run a calibration set through the model to compute quantization ranges.
4. Convert weights and activations to the new format.
5. Evaluate the quantized model for accuracy and speed (a size-check helper is sketched below).
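For the last step, here is a quick helper covering the size half of that evaluation (the accuracy half needs your own test set; `model` and `quantized` refer to whatever pair the earlier sketches produced):

```python
import os
import tempfile

import torch

def state_dict_mb(model: torch.nn.Module) -> float:
    """Serialize the weights to disk and report the file size in MB."""
    path = os.path.join(tempfile.gettempdir(), "quant_size_check.pt")
    torch.save(model.state_dict(), path)
    size_mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return size_mb

# Example: compare an FP32 model against its quantized counterpart.
# print(f"FP32: {state_dict_mb(model):.1f} MB")
# print(f"INT8: {state_dict_mb(quantized):.1f} MB")
```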
Advantages and disadvantages
✅ Massive reduction in size and memory usage.
✅ Significant acceleration on compatible hardware (modern GPUs, TPUs, CPUs with AVX-512 VNNI instructions).
✅ Supported out of the box by most major frameworks (PyTorch, TensorFlow, ONNX Runtime, TensorRT).
❌ Loss of accuracy (though often minimal with proper techniques).
❌ Requires hardware and software that support low-precision operations.
❌ QAT requires retraining; PTQ may be insufficient for highly sensitive models.