📚 Module 1: What is Fine-Tuning and Why Is It So Expensive?

1.1 Conceptual Definition of Fine-Tuning

Fine-tuning is the process by which a machine learning model, previously trained on a general task using a large corpus of data (pretraining), is adapted or specialized for a specific task through additional training on a smaller, targeted dataset.

In the context of LLMs, this means taking a model such as GPT-2, Llama 3, Mistral, or Qwen — trained on billions of web pages of text — and adjusting it to answer technical support questions, generate product descriptions, translate legal jargon into plain language, or even adopt a specific humorous or formal tone.

Fine-tuning does not start from scratch. The model already possesses general knowledge of language, grammar, basic reasoning, and some inferential ability. Fine-tuning “teaches” it to apply that knowledge in a specific, contextualized manner aligned with a concrete objective.
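To make this concrete, the sketch below shows what fine-tuning typically looks like in code: a pretrained model and tokenizer are loaded, a small task-specific dataset is tokenized, and training simply continues from the existing weights. This is only a minimal sketch; it assumes the Hugging Face transformers and datasets libraries, and the file support_tickets.jsonl (a JSONL file with a "text" field) is a hypothetical example dataset.

```python
# Minimal full fine-tuning sketch (assumes transformers + datasets installed
# and a hypothetical "support_tickets.jsonl" with a "text" field).
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "gpt2"  # any pretrained causal LM; gpt2 is small enough for a demo
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token            # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)  # weights already pretrained

# Small, task-specific dataset: the model is NOT trained from scratch.
dataset = load_dataset("json", data_files="support_tickets.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-support", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()                   # continues training from the pretrained weights
trainer.save_model("ft-support")  # save the specialized checkpoint
```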

1.2 Why Is Traditional Fine-Tuning Prohibitive?

Despite its power, full fine-tuning (updating every parameter) of modern LLMs presents technical and economic challenges that make it unfeasible for most developers, independent researchers, and small businesses. These challenges fall into three critical dimensions:

a) Memory Requirements (VRAM)

Current LLMs range in size from hundreds of millions to hundreds of billions of parameters. During training, each parameter must be stored in GPU memory (VRAM) as a 32-bit floating-point (FP32) or 16-bit (FP16/BF16) number. This implies:

  • A 7-billion-parameter (7B) model in FP16 requires approximately 14 GB just to store the weights.
  • During training, in addition to the weights, you must also store the gradients (the same size as the weights) and the optimizer states (Adam, for example, keeps two extra values per parameter: the first-moment estimate, or momentum, and the second-moment estimate of the gradient). This triples or quadruples the memory requirement.

Thus, fully fine-tuning a 7B model can easily require 56 GB of VRAM before activations are even counted, necessitating multiple high-performance GPUs (e.g., A100 or H100) running in parallel.
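The arithmetic behind that figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes weights, gradients, and both Adam states are all held in 16-bit precision; real mixed-precision setups that keep FP32 master weights and optimizer states need even more.

```python
# Back-of-the-envelope VRAM estimate for full fine-tuning (activations excluded).
# Assumes weights, gradients and both Adam states are all kept in 16-bit precision.
params = 7e9                     # 7B-parameter model
bytes_per_value = 2              # FP16/BF16 = 2 bytes per value

weights     = params * bytes_per_value        # ~14 GB
gradients   = params * bytes_per_value        # ~14 GB
adam_states = 2 * params * bytes_per_value    # momentum + second moment ≈ 28 GB

total_gb = (weights + gradients + adam_states) / 1e9
print(f"≈ {total_gb:.0f} GB of VRAM before activations")   # ≈ 56 GB
```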

b) Computational Cost and Training Time

Fully training or fine-tuning an LLM involves performing trillions of floating-point operations per optimization step. Even with accelerated hardware, training time can stretch over days or weeks, depending on dataset and model size. This translates into cloud costs that can reach hundreds or even thousands of dollars per experiment.
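For a rough sense of scale, the sketch below estimates training compute with the widely used heuristic of roughly 6 FLOPs per parameter per training token (forward plus backward pass). The token count, sustained GPU throughput, and hourly price are illustrative assumptions, not measurements.

```python
# Rough training-compute estimate: ~6 FLOPs per parameter per training token.
# Token count, GPU throughput and hourly price are illustrative assumptions.
params = 7e9                    # 7B model
tokens = 1e9                    # hypothetical 1B-token fine-tuning corpus
flops  = 6 * params * tokens    # ≈ 4.2e19 FLOPs

effective_flops = 150e12        # assumed sustained throughput of one A100-class GPU
gpu_hours = flops / effective_flops / 3600
cost = gpu_hours * 3.0          # assumed ~$3 per GPU-hour in the cloud

print(f"≈ {gpu_hours:.0f} GPU-hours, ≈ ${cost:.0f}")   # ≈ 78 GPU-hours, ≈ $233
```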

c) Risk of Catastrophic Forgetting

When all model parameters are updated during fine-tuning, there is a significant risk that the model will “forget” general knowledge acquired during pretraining, especially when the fine-tuning dataset is small or narrowly focused. This phenomenon, known as catastrophic forgetting, can drastically degrade the model’s performance on tasks unrelated to the specialization.
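A simple way to check for this in practice is to compare perplexity on a general-domain held-out set before and after fine-tuning. The sketch below is only illustrative: the evaluation file is hypothetical, and "ft-support" refers to the fine-tuned checkpoint saved in the earlier example.

```python
# Sketch: detect catastrophic forgetting by comparing perplexity on a
# general-domain held-out set before and after fine-tuning.
import math, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_path, texts, tokenizer):
    model = AutoModelForCausalLM.from_pretrained(model_path).eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
            out = model(**enc, labels=enc["input_ids"])  # causal LM loss on the text
            losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

tokenizer = AutoTokenizer.from_pretrained("gpt2")
general_texts = open("general_heldout.txt").read().splitlines()  # hypothetical file

ppl_before = perplexity("gpt2", general_texts, tokenizer)        # pretrained baseline
ppl_after  = perplexity("ft-support", general_texts, tokenizer)  # fine-tuned checkpoint
print(ppl_before, ppl_after)  # a large increase after fine-tuning signals forgetting
```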