⚖️ Part 4: Trade-offs — Balancing size, accuracy, and speed

No compression technique is magical. There’s always a trade-off: you gain in one aspect, but lose (or must compensate) in another.

The compression triangle

          Accuracy
             /\
            /  \
           /    \
          /      \
     Size ———— Speed
  • Reduce size (via pruning or quantization) → you may lose some accuracy, but you gain speed and save memory.
  • Improve speed (via structured pruning or quantization) → you may have to give up some accuracy, or accept less compression.
  • Preserve accuracy → you may need a larger model or longer inference time.

How to choose the right technique?

It depends on the use case:

  • For mobile or IoT: prioritize size and speed → quantization + structured pruning.
  • For high-concurrency servers: prioritize speed and efficiency → INT8 quantization + distillation.
  • For critical tasks where accuracy is vital: use QAT + distillation, avoid aggressive pruning.
  • For rapid prototyping: use PTQ + dynamic quantization (see the sketch after this list).
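
To make the rapid-prototyping option concrete, here is a minimal sketch of post-training dynamic quantization in PyTorch. The toy model and layer sizes are placeholders chosen for this example, not a recipe from the course:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for whatever you are prototyping with.
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()

# Post-training dynamic quantization: weights of the listed module types
# are stored in INT8; activations are quantized on the fly at inference time,
# so no calibration data or retraining is needed.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized_model)
```

Because it needs no calibration set and no retraining, this is usually the fastest way to get a first compressed baseline to compare against.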

Metrics to evaluate success

It’s not enough to say “the model is smaller.” You must measure (a small measurement sketch follows this list):

  • Model size (MB or GB).
  • Inference latency (ms per prediction).
  • Throughput (predictions per second).
  • RAM/VRAM memory usage.
  • Accuracy/Recall/F1 on the validation set.
  • Energy consumption (if measurable).
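
As one way to capture the first two of those metrics, here is a minimal sketch assuming a plain PyTorch model running on CPU; the helper names model_size_mb and mean_latency_ms are made up for this example:

```python
import os
import time

import torch
import torch.nn as nn


def model_size_mb(model: nn.Module, path: str = "tmp_model.pt") -> float:
    """Serialize the state dict to disk and report the file size in MB."""
    torch.save(model.state_dict(), path)
    size_mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return size_mb


def mean_latency_ms(model: nn.Module, example: torch.Tensor, runs: int = 100) -> float:
    """Average wall-clock latency per forward pass, in milliseconds."""
    model.eval()
    with torch.no_grad():
        for _ in range(10):  # warm-up iterations, not timed
            model(example)
        start = time.perf_counter()
        for _ in range(runs):
            model(example)
        elapsed = time.perf_counter() - start
    return elapsed / runs * 1000


# Placeholder model and input; substitute your own.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
example = torch.randn(1, 512)

print(f"Size: {model_size_mb(model):.2f} MB")
print(f"Latency: {mean_latency_ms(model, example):.3f} ms/prediction")
```

Run the same measurements on the original and the compressed model so the comparison uses identical inputs and hardware.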

Tools for measurement

  • torchinfo or tensorflow_model_analysis for size and parameters.
  • time or torch.utils.benchmark for latency (see the sketch after this list).
  • nvidia-smi or htop for memory usage.
  • Hugging Face evaluate for accuracy metrics.
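
To show the benchmarking tool in action, a minimal sketch using torch.utils.benchmark; the model is just a placeholder:

```python
import torch
import torch.nn as nn
import torch.utils.benchmark as benchmark

# Placeholder model; substitute the model you actually want to profile.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
example = torch.randn(1, 512)

# Timer handles warm-up and (on GPU) synchronization, so the numbers are
# more trustworthy than a hand-rolled timing loop.
timer = benchmark.Timer(
    stmt="model(example)",
    globals={"model": model, "example": example},
)
measurement = timer.timeit(100)  # run the forward pass 100 times
print(f"Mean latency: {measurement.mean * 1000:.3f} ms per prediction")
```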