🧑‍🏫 Part 2: Knowledge Distillation — The art of teaching

What is distillation?

Distillation is a technique in which a large, powerful model (the “teacher”) transfers its knowledge to a smaller, lighter model (the “student”). The goal is not for the student to copy only the teacher’s final answers, but to mimic its overall “behavior”: how it distributes confidence across the possible outputs.

The key idea: the teacher doesn’t just know the correct answer, but also knows how confident it is in each possible answer. That additional information (the “soft labels”) is much richer than a simple categorical label.

How does it work?

  1. Train the teacher model on the desired task (can be a state-of-the-art pretrained model).
  2. Generate teacher predictions on a dataset (can be the same training set or a new one).
  3. Train the student model using two objectives:
    • Minimize the difference with the true labels (hard labels).
    • Minimize the difference with the teacher’s predictions (soft labels), typically using a loss function like KL-Divergence.
  4. Apply a temperature to the softmax used for distillation (on both the teacher’s and the student’s logits) to smooth the distributions and make them more informative, as shown in the sketch after this list.
  5. Evaluate the student model on accuracy, size, and speed.
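Putting steps 3 and 4 together, a single training step might look like the minimal sketch below. This assumes PyTorch; `teacher`, `student`, `x`, `y`, `T`, and `alpha` are illustrative names and defaults, not values prescribed by the lesson.

```python
# Minimal sketch of one distillation training step (assumes PyTorch).
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, y, optimizer, T=4.0, alpha=0.5):
    """Combine the hard-label loss and the soft-label (teacher) loss."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(x)        # the teacher's "soft labels" come from here
    student_logits = student(x)

    # Hard-label objective: standard cross-entropy against the true labels.
    hard_loss = F.cross_entropy(student_logits, y)

    # Soft-label objective: KL divergence between temperature-softened distributions.
    # The T**2 factor keeps the gradient scale comparable (Hinton et al.).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    loss = alpha * hard_loss + (1 - alpha) * soft_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `alpha` balances the two objectives and `T` is the distillation temperature; both are typical hyperparameters to tune before the final evaluation in step 5.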

Practical example

Imagine a teacher model that classifies texts into 5 categories. For a given input, instead of simply saying “Category 3,” the teacher says:

  • Category 1: 0.05
  • Category 2: 0.10
  • Category 3: 0.70
  • Category 4: 0.10
  • Category 5: 0.05

The student learns not only that the correct answer is 3, but also that categories 2 and 4 are “almost possible,” and that 1 and 5 are very unlikely. This additional information allows it to generalize better.
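To see why the temperature from step 4 matters, the short sketch below (assuming NumPy; the logits are made up so that T=1 roughly reproduces the probabilities above) shows how a higher temperature spreads the teacher’s confidence across the “almost possible” categories:

```python
# Numeric illustration of soft labels and temperature (assumes NumPy;
# the logits are hypothetical, chosen to match the 5-category example).
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([0.0, 0.7, 2.6, 0.7, 0.0])

print(softmax(teacher_logits, T=1.0))  # ~[0.05, 0.10, 0.69, 0.10, 0.05]: peaked on category 3
print(softmax(teacher_logits, T=4.0))  # ~[0.16, 0.19, 0.30, 0.19, 0.16]: softer, more informative
```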

Advanced variants

  • Distill-then-Prune: first distill, then prune the student model.
  • Task-Specific Distillation: teacher and student are focused on a specific task (e.g., summarization, translation).
  • Cross-Architecture Distillation: the student can have a completely different architecture than the teacher (e.g., a Transformer teaching a CNN).

Advantages and disadvantages

✅ Produces small, fast, and efficient models.
✅ The student can outperform a same-sized model trained from scratch.
✅ Does not require access to the teacher’s original training dataset (only its predictions).

❌ Requires access to a pre-trained teacher model (can be expensive or restricted).
❌ The distillation process can be slow and requires hyperparameter tuning.
❌ Final quality heavily depends on teacher quality and distillation dataset.
