📚 Module 6: Dataset Preparation and Instruction Format

6.1 Dataset Format for Instruction Fine-Tuning

For text generation tasks (chat, instruction following, QA), the most common format is the Alpaca format: one JSON object per example, with three fields:

{
  "instruction": "Write a short description for a technology product.",
  "input": "Product: Wireless headphones with noise cancellation. Price: $129.99.",
  "output": "Enjoy your music without distractions with these high-fidelity wireless headphones. With active noise cancellation and up to 30 hours of battery life, they’re ideal for travel, work, or simply relaxing. Just $129.99."
}
  • instruction: The task the model must perform.
  • input: Additional context or input (optional).
  • output: The desired response.
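
In practice, a dataset in this format is stored as a JSON (or JSONL) file with one such object per example. A minimal loading sketch using the Hugging Face datasets library (the filename alpaca_data.json is illustrative):

from datasets import load_dataset

# One Alpaca-style JSON object per example
dataset = load_dataset("json", data_files="alpaca_data.json", split="train")
print(dataset[0]["instruction"])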

6.2 Tokenization and Packing

Before training, this format must be converted into tensors the model can consume. The model’s tokenizer maps text to token IDs, and a chat template is applied if the model requires one (as with Qwen or Llama 3).
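
For chat-tuned models, the tokenizer ships a chat template that wraps each turn in the model’s special tokens. A minimal sketch with transformers’ apply_chat_template (the model name is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # illustrative

messages = [
    {"role": "user", "content": "Write a short description for a technology product."},
    {"role": "assistant", "content": "Enjoy your music without distractions..."},
]

# Renders the conversation with the model's own control tokens
# (e.g. <|im_start|> / <|im_end|> for Qwen)
text = tokenizer.apply_chat_template(messages, tokenize=False)

Models without a chat template typically use a plain prompt format such as the Alpaca-style one below: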

def format_instruction(example):
    # Build the Alpaca-style prompt; the Input section is omitted when empty
    if example.get("input"):
        return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    return f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""

# Tokenization (tokenizer is the model's previously loaded tokenizer)
def tokenize_function(example):
    text = format_instruction(example)
    tokenized = tokenizer(
        text,
        truncation=True,        # cut sequences longer than max_length
        max_length=512,
        padding="max_length",   # pad shorter sequences up to max_length
    )
    # For causal LM training, labels start as a copy of input_ids;
    # the model shifts them internally when computing the loss
    # (pad positions can also be masked with -100, see the note below)
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized
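
Applied to the whole dataset, tokenize_function runs once per example. A sketch using datasets.map, assuming the dataset loaded in 6.1:

tokenized_dataset = dataset.map(
    tokenize_function,
    remove_columns=dataset.column_names,  # keep only input_ids, attention_mask, labels
)

Packing, the second half of this section’s title, is the alternative to padding: several tokenized examples are concatenated into one token stream, separated by EOS, and the stream is cut into fixed-size blocks so no compute is spent on pad tokens. A minimal greedy sketch (the block size is illustrative; pass tokenizer.eos_token_id as the separator):

def pack_examples(tokenized_examples, eos_token_id, block_size=512):
    # Concatenate all examples into one stream, separated by EOS
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids + [eos_token_id])
    # Cut the stream into full blocks; a trailing partial block is dropped
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]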

Important: when fine-tuning instruct models, it is common to mask the prompt tokens (instruction + input) in labels, so the loss is computed only on the output. This is done by setting those label positions to -100, the default ignore_index of PyTorch’s CrossEntropyLoss, so they are ignored when the loss is computed (pad positions are usually masked the same way).
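
A minimal sketch of this masking, extending tokenize_function above. It finds the prompt/response split by tokenizing the prompt alone; this is one common approach, and since some BPE tokenizers can shift the boundary by a token, offset mappings are a more robust alternative:

def tokenize_with_masking(example):
    full_text = format_instruction(example)
    # Everything up to and including the Response header is prompt
    header = "### Response:\n"
    prompt_text = full_text[: full_text.index(header) + len(header)]

    tokenized = tokenizer(full_text, truncation=True, max_length=512)
    prompt_len = min(len(tokenizer(prompt_text)["input_ids"]),
                     len(tokenized["input_ids"]))

    labels = tokenized["input_ids"].copy()
    labels[:prompt_len] = [-100] * prompt_len  # loss ignores these positions
    tokenized["labels"] = labels
    return tokenized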