Fine-tuning is the process by which a machine learning model, previously trained on a general task using a large corpus of data (pretraining), is adapted or specialized for a specific task through additional training on a smaller, targeted dataset.
In the context of LLMs, this means taking a model such as GPT-2, Llama 3, Mistral, or Qwen — trained on text from billions of web pages — and adjusting it to answer technical support questions, generate product descriptions, translate legal jargon into plain language, or even adopt a specific humorous or formal tone.
Fine-tuning does not start from scratch. The model already possesses general knowledge of language, grammar, basic reasoning, and some inferential ability. Fine-tuning “teaches” it to apply that knowledge in a specific, contextualized manner aligned with a concrete objective.
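To make this concrete, here is a minimal sketch of what a fine-tuning run looks like with the Hugging Face Transformers library. The model choice, the hypothetical `support_tickets.jsonl` file (assumed to have a `text` column), and every hyperparameter are illustrative placeholders, not a recommended recipe:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # a small model, purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical domain-specific dataset with a "text" column.
dataset = load_dataset("json", data_files="support_tickets.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-gpt2",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Note that nothing here is reinitialized: the pretrained weights are the starting point, and the additional training merely nudges them toward the target domain.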
Despite its power, full fine-tuning (updating every parameter of the model) presents technical and economic challenges that make it infeasible for most developers, independent researchers, and small businesses. These challenges can be summarized in three critical dimensions:
Current LLMs range in size from hundreds of millions to hundreds of billions of parameters. During training, each parameter must be stored in GPU memory (VRAM) as a 32-bit floating-point (FP32) or 16-bit (FP16/BF16) number. And the weights are only part of the story: the optimizer must also keep a gradient for every parameter and, in the case of Adam, two moment estimates, so the training footprint is roughly four times the size of the weights alone, before even counting activations.
Thus, training a 7B model may easily require 56 GB of VRAM or more, necessitating multiple high-performance GPUs (e.g., A100 or H100) running in parallel.
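The 56 GB figure follows from a simple back-of-the-envelope calculation, sketched below. The four-copies layout (weights, gradients, and Adam's two moments, all at the same precision) is the classic assumption; activations would add substantially more on top:

```python
def full_finetune_vram_gb(n_params: float, bytes_per_value: int = 2) -> float:
    """Rough lower bound on training VRAM under the classic Adam layout:
    weights + gradients + two optimizer moments, all at one precision.
    Activations are ignored, so real usage is higher."""
    copies = 4  # weights, gradients, Adam first and second moments
    return n_params * bytes_per_value * copies / 1e9

print(full_finetune_vram_gb(7e9))     # FP16/BF16: ~56 GB
print(full_finetune_vram_gb(7e9, 4))  # FP32: ~112 GB
```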
Training a complete LLM involves an enormous number of floating-point operations per optimization step. Even with accelerated hardware, training time can extend over days or weeks, depending on dataset and model size. This translates into cloud costs that can run to hundreds or even thousands of dollars per experiment.
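A rough way to see where those costs come from is the widely cited approximation of ~6 FLOPs per parameter per training token. The corpus size, GPU throughput, utilization, and hourly rate below are all assumptions chosen for illustration, not measurements:

```python
# Back-of-the-envelope training-time estimate for full fine-tuning,
# using the common ~6 FLOPs per parameter per token approximation.
n_params = 7e9   # 7B-parameter model
n_tokens = 1e9   # hypothetical 1B-token fine-tuning corpus
flops_needed = 6 * n_params * n_tokens

a100_peak_flops = 312e12  # A100 BF16 dense peak, ~312 TFLOP/s
utilization = 0.4         # optimistic sustained utilization
price_per_hour = 3.0      # assumed on-demand A100 price in USD

seconds = flops_needed / (a100_peak_flops * utilization)
hours = seconds / 3600
print(f"~{hours:.0f} GPU-hours, roughly ${hours * price_per_hour:,.0f}")
# ≈ 94 GPU-hours, i.e. a few hundred dollars for a single run
```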
When all model parameters are updated during fine-tuning, there is a significant risk that the model will “forget” general knowledge acquired during pretraining, especially if the fine-tuning dataset is small or unrepresentative of general knowledge. This phenomenon, known as catastrophic forgetting, can drastically degrade the model’s performance on tasks unrelated to the specialization.
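One common mitigation is "rehearsal" (also called replay): mixing a small slice of general-domain data back into the fine-tuning set so the model keeps seeing the kind of text it was pretrained on. The sketch below assumes both datasets expose a single `text` column; `support_tickets.jsonl` is again a hypothetical domain file, and the 10% replay ratio is a tunable choice, not a prescription:

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical domain dataset; keep only the "text" column so its schema
# matches the general corpus and the two can be concatenated.
domain = load_dataset("json", data_files="support_tickets.jsonl")["train"]
domain = domain.remove_columns([c for c in domain.column_names if c != "text"])

# Public general-domain corpus; WikiText-103 also exposes a "text" column.
general = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

n_replay = max(1, len(domain) // 10)  # ~10% replay relative to the domain set
replay = general.shuffle(seed=42).select(range(n_replay))

mixed = concatenate_datasets([domain, replay]).shuffle(seed=42)
print(f"{len(domain)} domain + {len(replay)} replay = {len(mixed)} examples")
```

These three constraints (memory, cost, and forgetting) are precisely what motivates the parameter-efficient alternatives to full fine-tuning.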