🧹 Part 1: Pruning — Eliminating the unnecessary
What is pruning?
Pruning is a technique that removes the parts of a model that contribute little to its performance. The analogy is simple: just as a gardener prunes dead or unnecessary branches so a tree grows stronger and healthier, in AI we remove weights, neurons, attention heads, or even entire layers that are redundant or irrelevant.
Types of pruning
1. Unstructured Pruning
- Individual weights scattered throughout the model are removed.
- Results in sparse matrices (many zeros).
- Although it reduces the number of parameters, it doesn't always improve speed, because commodity GPUs and CPUs aren't optimized for arbitrary sparse-matrix operations.
- Example: Magnitude Pruning — remove the weights with the smallest absolute value (sketched below).
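As a minimal sketch, PyTorch ships pruning utilities in `torch.nn.utils.prune`; the example below applies magnitude pruning to a single linear layer (the layer sizes and the 30% amount are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)

# Zero out the 30% of weights with the smallest absolute value (L1 criterion).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# PyTorch keeps the original weights plus a binary mask; "weight" is now
# computed as weight_orig * weight_mask on every forward pass.
sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity: {sparsity:.0%}")  # ~30% zeros

# Make the pruning permanent (drops the mask and the reparametrization).
prune.remove(layer, "weight")
```

Note that the weights are only zeroed, not removed: the tensor keeps its dense shape, which is exactly why unstructured pruning rarely speeds up inference on standard hardware.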
2. Structured Pruning
- Entire components are removed: neurons, filters, attention heads, layers.
- Produces smaller and faster models, as hardware can leverage the reduced structure.
- Example: Remove entire attention heads in a Transformer, or entire MLP layers in Llama (a toy sketch of neuron-level structured pruning follows).
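A hedged sketch of the structured variant with the same PyTorch utilities: `ln_structured` zeroes entire rows of a linear layer's weight matrix, i.e., whole output neurons (the 50% amount and layer sizes are illustrative, not a recommendation):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)

# Zero 50% of the output neurons, ranked by the L2 norm of their row:
# dim=0 selects rows of the weight matrix, i.e., entire neurons.
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)

zero_rows = (layer.weight.abs().sum(dim=1) == 0).sum().item()
print(f"Zeroed neurons: {zero_rows} / {layer.out_features}")  # 32 / 64
```

The call itself only zeroes the rows; to realize the speedup, a follow-up step would rebuild the layer with 32 output features (and shrink the next layer's input to match).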
3. Global vs Local Pruning
- Local: pruning is performed within each layer separately.
- Global: the model is considered as a whole, and the least important weights anywhere in the network are removed.
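The local/global difference is easy to see with PyTorch's `global_unstructured`, which ranks weights from several layers in a single pool (the toy model and the 20% amount are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Global pruning: pool the weights of both linear layers, rank them
# together, and zero the smallest 20% overall.
parameters_to_prune = [(model[0], "weight"), (model[2], "weight")]
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)

# Unlike local pruning, per-layer sparsity is generally unequal.
for i in (0, 2):
    s = (model[i].weight == 0).float().mean().item()
    print(f"Layer {i} sparsity: {s:.0%}")
```

Because the ranking is global, a layer whose weights are small overall can end up far sparser than the others, which local (per-layer) pruning would never do.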
How do you decide what to prune?
Multiple criteria exist:
- Magnitude: weights close to zero are candidates.
- Gradient: weights that receive small gradients during training have little influence on the loss, so removing them is less disruptive.
- Sensitivity: measure how much performance is affected by removing a weight or neuron.
- Advanced methods: such as Movement Pruning (track how weights move during fine-tuning) or MAMA (analyze both magnitude and movement); a toy sketch below contrasts the magnitude and movement scores.
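To make the first two criteria concrete, here is a toy sketch (not any paper's reference implementation) that computes a magnitude score and a one-step movement-style score for the same layer; real Movement Pruning accumulates the movement term over many fine-tuning steps:

```python
import torch
import torch.nn as nn

layer = nn.Linear(16, 8)
x, target = torch.randn(4, 16), torch.randn(4, 8)

# One backward pass to obtain gradients for the movement term.
loss = nn.functional.mse_loss(layer(x), target)
loss.backward()

# Criterion 1: magnitude. Small |w| marks a pruning candidate.
magnitude_score = layer.weight.abs()

# Criterion 2: movement (single-step illustration). -w * grad is positive
# when gradient descent is pushing a weight away from zero and negative
# when it is shrinking toward zero, so low scores mark weights that
# training itself is trying to remove.
movement_score = -layer.weight * layer.weight.grad

# The 10 lowest-scoring weights under each criterion.
mag_candidates = magnitude_score.flatten().topk(10, largest=False).indices
mov_candidates = movement_score.flatten().topk(10, largest=False).indices
```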
Typical pruning process
- Train the original model until convergence.
- Identify components to prune using a criterion (magnitude, sensitivity, etc.).
- Remove the components (set weights to zero or delete layers).
- Retrain or fine-tune the pruned model to recover performance.
- Evaluate the final model on accuracy, size, and speed metrics.
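Putting the steps together, a hedged end-to-end sketch; the toy model, random data, and 40% pruning amount all stand in for a real task:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
loss_fn = nn.CrossEntropyLoss()

def train(steps):
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

train(200)  # 1. train the original model (toy stand-in for convergence)

# 2-3. identify and remove: magnitude criterion, 40% of each linear layer
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.4)

train(100)  # 4. fine-tune; the masks keep pruned weights at zero

# Make the pruning permanent before deployment.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")

with torch.no_grad():  # 5. evaluate (here only on the training data)
    acc = (model(x).argmax(dim=1) == y).float().mean().item()
print(f"Accuracy after pruning: {acc:.2%}")
```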
Advantages and disadvantages
✅ Reduces number of parameters and model size.
✅ With structured pruning, improves inference speed.
✅ Maintains overall architecture, facilitating implementation.
❌ May require retraining, consuming time and resources.
❌ If pruned too aggressively, accuracy can degrade beyond what fine-tuning is able to recover.
❌ Unstructured pruning doesn’t always improve latency on standard hardware.