Distillation is a technique in which a large, powerful model (the “teacher”) transfers its knowledge to a smaller, lighter model (the “student”). The goal is not for the student to copy only the teacher’s final answers, but to learn to mimic the teacher’s full output behavior, including how it weighs every possible answer.
The key idea: the teacher doesn’t just know the correct answer, but also knows how confident it is in each possible answer. That additional information (the “soft labels”) is much richer than a simple categorical label.
Imagine a teacher model that classifies texts into 5 categories. For a given input, instead of simply saying “Category 3,” the teacher says:
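“Category 1: 2%, Category 2: 15%, Category 3: 68%, Category 4: 12%, Category 5: 3%” (illustrative probabilities).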
The student learns not only that the correct answer is category 3, but also that categories 2 and 4 are plausible alternatives, while 1 and 5 are very unlikely. This extra signal helps it generalize better.
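In practice, this is usually implemented as a combined loss: a KL divergence between the temperature-softened teacher and student distributions, plus the ordinary cross-entropy on the hard labels. Here is a minimal sketch in PyTorch, assuming a standard classification setup; the function name, the temperature of 2.0, and the 0.5 weighting are illustrative choices, not prescribed values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Illustrative distillation loss: soft-label term + hard-label term."""
    # Soften both distributions with the same temperature so that the
    # teacher's relative confidences ("soft labels") become visible.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the teacher's and the student's softened
    # distributions; the T^2 factor keeps gradients comparable across
    # different temperatures.
    soft_loss = F.kl_div(soft_student, soft_targets,
                         reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth categorical labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    # alpha balances how much the student listens to the teacher vs. the labels.
    return alpha * soft_loss + (1 - alpha) * hard_loss
```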
✅ Produces small, fast, and efficient models.
✅ The student can outperform a same-sized model trained from scratch.
✅ Does not require access to the teacher’s original training dataset (only its predictions on any suitable transfer set; see the sketch after this list).
❌ Requires access to a pre-trained teacher model (can be expensive or restricted).
❌ The distillation process can be slow and requires hyperparameter tuning (e.g., the softmax temperature and the weighting between soft and hard losses).
❌ Final quality depends heavily on the quality of the teacher and of the distillation dataset.
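To illustrate the point about not needing the teacher’s original training data, here is a hypothetical training loop in PyTorch in which the student learns purely from a frozen teacher’s predictions on an unlabeled transfer set. The function name, the `unlabeled_loader`, and the hyperparameter values are placeholders, not part of any specific library.

```python
import torch
import torch.nn.functional as F

def distill(teacher, student, unlabeled_loader,
            epochs=3, temperature=2.0, lr=1e-4):
    """Train the student against the teacher's soft labels only."""
    teacher.eval()  # the teacher is frozen; we only read its predictions
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)

    for _ in range(epochs):
        for inputs in unlabeled_loader:  # batches of raw inputs, no labels
            with torch.no_grad():
                teacher_logits = teacher(inputs)

            student_logits = student(inputs)

            # Match the teacher's softened distribution (soft labels only,
            # no ground-truth labels involved).
            loss = F.kl_div(
                F.log_softmax(student_logits / temperature, dim=-1),
                F.softmax(teacher_logits / temperature, dim=-1),
                reduction="batchmean",
            ) * temperature ** 2

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Because the only supervision signal is the teacher’s output, the transfer set can be any unlabeled data from a similar distribution, which is exactly why distillation does not require the original training set.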