📚 Module 5: Practical Configuration — Hyperparameters, target_modules, and Environment

5.1 Environment Setup (Google Colab)

To run QLoRA on Google Colab (free), follow these steps:

# Install dependencies
!pip install -q bitsandbytes transformers accelerate peft trl

# Verify GPU
import torch
print(f"GPU available: {torch.cuda.is_available()}")
print(f"GPU name: {torch.cuda.get_device_name(0)}")

Note: In Colab, ensure you select a T4 GPU (Runtime → Change runtime type → GPU).
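
If you want to confirm how much VRAM the runtime provides (a T4 offers roughly 15 GB usable), a quick optional check is:

# Total GPU memory, reported in GiB
props = torch.cuda.get_device_properties(0)
print(f"Total VRAM: {props.total_memory / 1024**3:.1f} GiB")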

5.2 Loading the Model with 4-bit Quantization

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # Note: T4 GPUs have no native bf16 support; use torch.float16 there
    bnb_4bit_use_double_quant=True,
)

# Load model and tokenizer
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # Automatically distributes layers across GPU/CPU
    trust_remote_code=True  # Needed for some custom architectures; recent transformers versions support Qwen2.5 natively
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
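
Before attaching the LoRA adapters in the next section, many QLoRA recipes also run PEFT's prepare_model_for_kbit_training helper (it enables gradient checkpointing and casts a few layers for training stability) and make sure the tokenizer has a padding token. A minimal optional sketch:

from peft import prepare_model_for_kbit_training

# Prepare the quantized model for k-bit fine-tuning
model = prepare_model_for_kbit_training(model)

# Many causal LM tokenizers ship without a pad token; reuse EOS if it is missing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token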

5.3 Configuring LoRA/QLoRA

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                         # Low-rank matrix dimension
    lora_alpha=16,               # Scaling factor (typically 2x r)
    target_modules=["q_proj", "v_proj"],  # Modules to apply LoRA to
    lora_dropout=0.05,           # Regularization dropout
    bias="none",                 # Do not train biases
    task_type="CAUSAL_LM"        # Task type: causal language modeling
)

# Apply PEFT to the model
from peft import get_peft_model

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Expected output (exact figures may vary slightly with model and library versions):
trainable params: 327,680 || all params: 510,550,016 || trainable%: 0.0642

This confirms that only about 0.06% of the parameters are trainable, while the remaining ~510 million stay frozen.
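
Each targeted module adds r × (d_in + d_out) adapter parameters (the two low-rank matrices A and B). If you want to double-check the reported figure, you can count the adapter weights directly; a small sketch:

# Count only the LoRA adapter parameters (their names contain "lora_")
lora_params = sum(p.numel() for n, p in model.named_parameters() if "lora_" in n)
print(f"LoRA parameters: {lora_params:,}")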

5.4 Choosing target_modules

target_modules lists the submodules (typically linear projection layers) into which the LoRA matrices are injected. Choosing them well is crucial for both quality and parameter count.

In standard architectures (Llama, Mistral, Qwen):

  • Recommended: ["q_proj", "v_proj"] — Query and Value projections in attention layers.
  • Alternative: ["q_proj", "k_proj", "v_proj", "o_proj"] — All attention projections (more parameters, possible improvement on complex tasks).
  • Optional: Also target the MLP layers: ["gate_proj", "up_proj", "down_proj"] (Llama, Mistral, Qwen) or ["fc1", "fc2"] (e.g., OPT); see the sketch after this list.
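
As an illustration, a broader configuration covering all attention and MLP projections might look like this (a sketch for a Llama/Mistral/Qwen-style model; it trains noticeably more parameters and uses more GPU memory):

from peft import LoraConfig

# Broader coverage: every attention projection plus the MLP block
wide_lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)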

How to discover module names in your model:

# Option 1: PEFT's default target modules for this architecture
from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING
print(TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING.get(model.config.model_type))

# Option 2: inspect the model manually (ideally on the base model, before get_peft_model);
# after 4-bit loading, most linear layers are bitsandbytes Linear4bit, not torch.nn.Linear
import bitsandbytes as bnb

for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Linear, bnb.nn.Linear4bit)):
        print(name)

Tip: Start with ["q_proj", "v_proj"]. If performance is insufficient, experiment by adding more modules. Recent PEFT versions also accept target_modules="all-linear" to target every linear layer except the output head.

5.5 Other Key Hyperparameters

  • r (rank): Start with 8. If the model doesn't learn well, try 16 or 32; if overfitting occurs, reduce to 4 (see the example configurations after this list).
  • lora_alpha: Usually set to 2 * r (if r=8, alpha=16). The LoRA update is scaled by lora_alpha / r, so alpha controls the "strength" of the adapter's contribution.
  • lora_dropout: 0.05 or 0.1 for small datasets. 0.0 for large datasets.
  • bias: "none" is most common. "all" or "lora_only" rarely improve performance.
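
Putting these guidelines together, two illustrative starting points (the values are suggestions to tune, not requirements):

from peft import LoraConfig

# Conservative default: small footprint, good first attempt
base_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none", task_type="CAUSAL_LM",
)

# Higher-capacity variant if the model underfits with the default
high_capacity_config = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none", task_type="CAUSAL_LM",
)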