Take a small language model (e.g., distilbert-base-uncased or TinyLlama-1.1B), apply 8-bit and 4-bit quantization, and compare model size, inference latency, memory usage, and accuracy.
```bash
pip install transformers optimum[onnxruntime] torch onnx onnxruntime bitsandbytes accelerate
```
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the FP32 baseline model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
Record the baseline's on-disk size, inference latency, and memory footprint (e.g., `torch.cuda.memory_allocated()` on GPU, or process memory on CPU).
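One way to collect these numbers is sketched below. This is a minimal sketch, not a library API: `dir_size_mb` and `mean_latency_ms` are illustrative helpers, the `./fp32_model` directory is an assumption, and the size of the saved checkpoint stands in for model size.

```python
import os
import time

import torch

def dir_size_mb(path: str) -> float:
    """Total size of all files under `path`, in megabytes (illustrative helper)."""
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    ) / 1e6

def mean_latency_ms(model, tokenizer, text: str, n_runs: int = 50) -> float:
    """Average latency of a single forward pass, in milliseconds (illustrative helper)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        model(**inputs)  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**inputs)
    return (time.perf_counter() - start) / n_runs * 1000

# Baseline measurements (assumed checkpoint directory: ./fp32_model)
model.save_pretrained("./fp32_model")
print("Size (MB):   ", round(dir_size_mb("./fp32_model"), 1))
print("Latency (ms):", round(mean_latency_ms(model, tokenizer, "A surprisingly good movie."), 1))
if torch.cuda.is_available():
    print("VRAM (MB):  ", torch.cuda.memory_allocated() / 1e6)
```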
```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# First export the model to ONNX
model_ort = ORTModelForSequenceClassification.from_pretrained(
    model_name, export=True  # older optimum versions use from_transformers=True
)
model_ort.save_pretrained("./onnx_model")

# Configure dynamic INT8 quantization (this preset targets AVX512-VNNI CPUs;
# AutoQuantizationConfig also provides e.g. avx2 and arm64 presets)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

quantizer = ORTQuantizer.from_pretrained("./onnx_model")
quantizer.quantize(save_dir="./quantized_model", quantization_config=dqconfig)
```
```python
from optimum.onnxruntime import ORTModelForSequenceClassification

# Load the quantized model and repeat the size, latency, and memory measurements
model_quant = ORTModelForSequenceClassification.from_pretrained("./quantized_model")
```
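Assuming the illustrative helpers from the baseline sketch above are still defined, they can be reused directly on the quantized ONNX model (ORT models accept PyTorch tensors as input); note that `dir_size_mb` measures everything saved in the directory.

```python
# Reuse the illustrative helpers from the baseline sketch above
print("INT8 size (MB):   ", round(dir_size_mb("./quantized_model"), 1))
print("INT8 latency (ms):", round(mean_latency_ms(model_quant, tokenizer, "A surprisingly good movie."), 1))
```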
```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization via bitsandbytes (typically requires a CUDA GPU)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_4bit = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Measure size, time, memory, and accuracy again
```
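The accuracy column in the table below can be estimated on the SST-2 validation split. The sketch below assumes the `datasets` library is installed (`pip install datasets`); the `accuracy` helper is illustrative, and the same loop can be pointed at the INT8 model as well.

```python
import torch
from datasets import load_dataset

# SST-2 validation split (872 labeled sentences)
dataset = load_dataset("glue", "sst2", split="validation")

def accuracy(model, tokenizer, dataset, batch_size=32):
    """Percentage of correctly classified examples (illustrative helper)."""
    correct = 0
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i : i + batch_size]
        inputs = tokenizer(batch["sentence"], padding=True, truncation=True, return_tensors="pt")
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.no_grad():
            preds = model(**inputs).logits.argmax(dim=-1)
        correct += (preds.cpu() == torch.tensor(batch["label"])).sum().item()
    return 100 * correct / len(dataset)

print("FP32 accuracy (%):", round(accuracy(model, tokenizer, dataset), 1))
print("NF4 accuracy (%): ", round(accuracy(model_4bit, tokenizer, dataset), 1))
```

Exact numbers will vary with hardware, library versions, and the chosen quantization configuration.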
| Model | Size (MB) | Latency (ms) | VRAM Usage (MB) | Accuracy (%) |
|---|---|---|---|---|
| Original FP32 | 267 | 45 | 1024 | 91.2 |
| Quantized INT8 | 67 | 28 | 256 | 90.8 |
| Quantized NF4 | 34 | 35* | 128 | 90.1 |
(* 4-bit quantization may be slower on some hardware because weights must be dequantized on the fly during inference.)