🧹 Part 1: Pruning — Eliminating the unnecessary
What is pruning?
Pruning is a technique that removes the parts of a model that contribute little to its performance. The analogy is simple: just as a gardener prunes dead or unnecessary branches so a tree grows stronger and healthier, in AI we remove weights, neurons, attention heads, or even entire layers that are redundant or irrelevant.
Types of pruning
1. Unstructured Pruning
- Individual weights scattered throughout the model are removed.
- Results in sparse matrices (many zeros).
- Although it reduces the number of parameters, it doesn't always improve speed, because commodity GPUs and CPUs aren't optimized for arbitrary sparse-matrix operations.
- Example: Magnitude Pruning — remove the weights with the smallest absolute value (sketched below).
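As a minimal sketch, PyTorch ships pruning utilities in `torch.nn.utils.prune`; the example below applies magnitude pruning to a single linear layer (the layer sizes and the 30% amount are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)

# Zero out the 30% of weights with the smallest absolute value (L1 criterion).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# PyTorch keeps the original weights plus a binary mask; "weight" is now
# computed as weight_orig * weight_mask on every forward pass.
sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity: {sparsity:.0%}")  # ~30% zeros

# Make the pruning permanent (drops the mask and the reparametrization).
prune.remove(layer, "weight")
```

Note that the weights are only zeroed, not removed: the tensor keeps its dense shape, which is exactly why unstructured pruning rarely speeds up inference on standard hardware.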
2. Structured Pruning
- Entire components are removed: neurons, filters, attention heads, layers.
- Produces smaller and faster models, as hardware can leverage the reduced structure.
- Example: Remove entire attention heads in a Transformer, or entire MLP layers in Llama (a toy sketch of neuron-level structured pruning follows).
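A hedged sketch of the structured variant with the same PyTorch utilities: `ln_structured` zeroes entire rows of a linear layer's weight matrix, i.e., whole output neurons (the 50% amount and layer sizes are illustrative, not a recommendation):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)

# Zero 50% of the output neurons, ranked by the L2 norm of their row:
# dim=0 selects rows of the weight matrix, i.e., entire neurons.
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)

zero_rows = (layer.weight.abs().sum(dim=1) == 0).sum().item()
print(f"Zeroed neurons: {zero_rows} / {layer.out_features}")  # 32 / 64
```

The call itself only zeroes the rows; to realize the speedup, a follow-up step would rebuild the layer with 32 output features (and shrink the next layer's input to match).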
3. Global vs Local Pruning
- Local: pruning is performed within each layer separately.
- Global: the model is considered as a whole, and the least important weights anywhere in the network are removed.
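The local/global difference is easy to see with PyTorch's `global_unstructured`, which ranks weights from several layers in a single pool (the toy model and the 20% amount are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Global pruning: pool the weights of both linear layers, rank them
# together, and zero the smallest 20% overall.
parameters_to_prune = [(model[0], "weight"), (model[2], "weight")]
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)

# Unlike local pruning, per-layer sparsity is generally unequal.
for i in (0, 2):
    s = (model[i].weight == 0).float().mean().item()
    print(f"Layer {i} sparsity: {s:.0%}")
```

Because the ranking is global, a layer whose weights are small overall can end up far sparser than the others, which local (per-layer) pruning would never do.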
How do you decide what to prune?
Multiple criteria exist:
- Magnitude: weights close to zero are candidates.
- Gradient: weights that receive small gradients during training have little influence on the loss, so removing them is less disruptive.
- Sensitivity: measure how much performance is affected by removing a weight or neuron.
- Advanced methods: such as Movement Pruning (track how weights move during fine-tuning) or MAMA (analyze both magnitude and movement); a toy sketch below contrasts the magnitude and movement scores.
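To make the first two criteria concrete, here is a toy sketch (not any paper's reference implementation) that computes a magnitude score and a one-step movement-style score for the same layer; real Movement Pruning accumulates the movement term over many fine-tuning steps:

```python
import torch
import torch.nn as nn

layer = nn.Linear(16, 8)
x, target = torch.randn(4, 16), torch.randn(4, 8)

# One backward pass to obtain gradients for the movement term.
loss = nn.functional.mse_loss(layer(x), target)
loss.backward()

# Criterion 1: magnitude. Small |w| marks a pruning candidate.
magnitude_score = layer.weight.abs()

# Criterion 2: movement (single-step illustration). -w * grad is positive
# when gradient descent is pushing a weight away from zero and negative
# when it is shrinking toward zero, so low scores mark weights that
# training itself is trying to remove.
movement_score = -layer.weight * layer.weight.grad

# The 10 lowest-scoring weights under each criterion.
mag_candidates = magnitude_score.flatten().topk(10, largest=False).indices
mov_candidates = movement_score.flatten().topk(10, largest=False).indices
```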
Typical pruning process
- Train the original model until convergence.
- Identify components to prune using a criterion (magnitude, sensitivity, etc.).
- Remove the components (set weights to zero or delete layers).
- Retrain or fine-tune the pruned model to recover performance.
- Evaluate the final model on accuracy, size, and speed metrics.
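Putting the steps together, a hedged end-to-end sketch; the toy model, random data, and 40% pruning amount all stand in for a real task:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
loss_fn = nn.CrossEntropyLoss()

def train(steps):
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

train(200)  # 1. train the original model (toy stand-in for convergence)

# 2-3. identify and remove: magnitude criterion, 40% of each linear layer
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.4)

train(100)  # 4. fine-tune; the masks keep pruned weights at zero

# Make the pruning permanent before deployment.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")

with torch.no_grad():  # 5. evaluate (here only on the training data)
    acc = (model(x).argmax(dim=1) == y).float().mean().item()
print(f"Accuracy after pruning: {acc:.2%}")
```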
Advantages and disadvantages
✅ Reduces number of parameters and model size.
✅ With structured pruning, improves inference speed.
✅ Maintains overall architecture, facilitating implementation.
❌ May require retraining, consuming time and resources.
❌ If pruned too aggressively, accuracy can degrade beyond what fine-tuning is able to recover.
❌ Unstructured pruning doesn’t always improve latency on standard hardware.