🧹 Part 1: Pruning — Eliminating the unnecessary

What is pruning?

Pruning is a technique that removes parts of a model that contribute little to its performance. The analogy is simple: just as a gardener cuts away dead or unnecessary branches so a tree grows stronger and healthier, in deep learning we remove weights, neurons, attention heads, or even entire layers that are redundant or irrelevant.

Types of pruning

1. Unstructured Pruning

  • Individual weights scattered throughout the model are removed.
  • Results in sparse matrices (many zeros).
  • Although it reduces the number of parameters, it doesn’t always improve speed, because hardware isn’t optimized to efficiently operate with sparse matrices.
  • Example: Magnitude Pruning — remove weights with the smallest absolute value.
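A minimal sketch of magnitude pruning, assuming NumPy; the helper `magnitude_prune` is illustrative, not a library API. Note that the weights are only zeroed, not removed, which is why unstructured pruning alone rarely speeds up inference:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the smallest
    absolute value. Ties at the threshold are also zeroed."""
    k = int(sparsity * weights.size)       # number of weights to remove
    if k == 0:
        return weights.copy()
    # The k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold     # keep only weights above it
    return weights * mask

W = np.array([[0.90, -0.05, 0.40],
              [0.01, -0.80, 0.02]])
W_pruned = magnitude_prune(W, sparsity=0.5)
# The three smallest-magnitude entries are now zero; the matrix shape is unchanged.
```

In practice, frameworks such as PyTorch provide this behind mask-based utilities, but the thresholding logic is the same.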

2. Structured Pruning

  • Entire components are removed: neurons, filters, attention heads, layers.
  • Produces smaller and faster models, since the remaining tensors stay dense and map directly onto standard hardware.
  • Example: Remove entire attention heads in a Transformer, or entire MLP layers in Llama.
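A sketch of the structural case, again assuming NumPy: in a toy two-layer MLP we drop entire hidden neurons (rows of the first weight matrix and the matching columns of the second). The L2-norm criterion and the function name `prune_hidden_neurons` are illustrative choices:

```python
import numpy as np

def prune_hidden_neurons(W1, b1, W2, n_keep):
    """Structured-pruning sketch for a 2-layer MLP: drop the hidden neurons
    whose incoming weight rows in W1 have the smallest L2 norm.
    Shapes: W1 (hidden, in), b1 (hidden,), W2 (out, hidden)."""
    norms = np.linalg.norm(W1, axis=1)            # importance score per neuron
    keep = np.sort(np.argsort(norms)[-n_keep:])   # strongest neurons, in order
    # The matrices physically shrink -- this is what makes inference faster.
    return W1[keep], b1[keep], W2[:, keep]

W1 = np.array([[1.0, 2.0], [0.1, 0.1], [3.0, 1.0], [0.2, 0.0]])
b1 = np.zeros(4)
W2 = np.ones((1, 4))
W1p, b1p, W2p = prune_hidden_neurons(W1, b1, W2, n_keep=2)
# W1p has shape (2, 2): only the two highest-norm neurons remain.
```

Removing attention heads in a Transformer follows the same pattern, just with the head dimension of the projection matrices instead of MLP rows.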

3. Global vs Local Pruning

  • Local: pruning is performed within each layer separately.
  • Global: the entire model is considered, and the least important weights across the entire graph are removed.
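The difference matters in practice: a global threshold can wipe out an entire layer whose weights are uniformly small, while local pruning keeps every layer partially intact. A hedged NumPy sketch (function names are illustrative):

```python
import numpy as np

def prune_local(layers, sparsity):
    """Local: each layer removes its own smallest-|w| fraction."""
    pruned = []
    for W in layers:
        k = int(sparsity * W.size)
        t = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
        pruned.append(W * (np.abs(W) > t))
    return pruned

def prune_global(layers, sparsity):
    """Global: one threshold computed over all weights of the model."""
    all_w = np.concatenate([np.abs(W).ravel() for W in layers])
    k = int(sparsity * all_w.size)
    t = np.partition(all_w, k - 1)[k - 1]
    return [W * (np.abs(W) > t) for W in layers]

small = np.array([0.01, 0.02, 0.03, 0.04])   # layer with tiny weights
big   = np.array([1.0, 2.0, 3.0, 4.0])       # layer with large weights
loc = prune_local([small, big], sparsity=0.5)
glo = prune_global([small, big], sparsity=0.5)
# Local keeps half of each layer; global zeroes the small layer entirely.
```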

How do you decide what to prune?

Multiple criteria exist:

  • Magnitude: weights close to zero are candidates.
  • Gradient: weights that consistently receive small gradients have little effect on the loss, so removing them is less harmful.
  • Sensitivity: measure how much performance is affected by removing a weight or neuron.
  • Advanced methods: such as Movement Pruning (observe how weights change during fine-tuning) or MAMA (analyze magnitude and movement).
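The criteria above can be compared side by side on a single snapshot of weights and gradients. This is a simplified, single-step illustration (assuming NumPy, with made-up values); real movement pruning accumulates its score over many fine-tuning steps:

```python
import numpy as np

# Toy weight matrix and its gradient at the current training step.
W = np.array([[0.90, -0.05],
              [0.01, -0.80]])
G = np.array([[0.001, 0.5],
              [0.3,   0.002]])   # dLoss/dW (illustrative values)

magnitude_score = np.abs(W)      # magnitude: prune small |w|
taylor_score    = np.abs(W * G)  # first-order sensitivity: |w * dL/dw|
movement_score  = -W * G         # movement: positive when |w| is growing
# With SGD (w -= lr * grad), -w*grad > 0 means the weight is moving
# away from zero, i.e. training "wants" to keep it.
```

Each score induces a different ranking: a large weight with a near-zero gradient scores high on magnitude but low on the Taylor criterion.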

Typical pruning process

  1. Train the original model until convergence.
  2. Identify components to prune using a criterion (magnitude, sensitivity, etc.).
  3. Remove the components (set weights to zero or delete layers).
  4. Retrain or fine-tune the pruned model to recover performance.
  5. Evaluate the final model on accuracy, size, and speed metrics.
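The five steps above can be sketched end to end on a toy problem. This is a deliberately tiny stand-in, assuming NumPy: a linear model trained by gradient descent plays the role of the network, and a frozen zero-mask plays the role of the pruned weights during fine-tuning:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": linear regression with a sparse ground truth.
w_true = np.array([2.0, 0.0, -3.0, 0.0, 1.0, 0.0])
X = rng.normal(size=(200, 6))
y = X @ w_true

def train(w, mask, steps=300, lr=0.05):
    """Gradient descent on MSE; `mask` keeps pruned weights frozen at zero."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

# 1. Train the original (dense) model until convergence.
w = train(np.zeros(6), mask=np.ones(6))
# 2.-3. Criterion: magnitude. Zero out the 50% smallest weights.
k = 3
threshold = np.partition(np.abs(w), k - 1)[k - 1]
mask = (np.abs(w) > threshold).astype(float)
# 4. Fine-tune the pruned model to recover performance.
w_pruned = train(w * mask, mask)
# 5. Evaluate: the sparse model should still fit the data well.
mse = float(np.mean((X @ w_pruned - y) ** 2))
```

Because the true solution is itself sparse, fine-tuning fully recovers the fit here; in a real network some accuracy gap usually remains and step 5 decides whether the size/speed trade-off is acceptable.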

Advantages and disadvantages

✅ Reduces number of parameters and model size.
✅ With structured pruning, improves inference speed.
✅ Maintains overall architecture, facilitating implementation.

❌ May require retraining, consuming time and resources.
❌ If pruned too aggressively, performance may degrade beyond what fine-tuning can recover.
❌ Unstructured pruning doesn’t always improve latency on standard hardware.

Course Info

Course: AI-course4

Language: EN

Lesson: Module1