Even with QLoRA, you can still run out of memory in Colab. Strategies for reducing the footprint:

| Strategy | Details |
| --- | --- |
| Lower `per_device_train_batch_size` | Start with 1 or 2 and compensate with `gradient_accumulation_steps` to preserve the effective batch size (see the combined sketch below the table). |
| Lower `max_seq_length` | Drop from 512 to 256 or 384 if your content permits. |
| `torch.compile` (experimental) | `model = torch.compile(model)` may accelerate training and reduce memory, but it is not always stable. |
| `torch.cuda.empty_cache()` | Flushes PyTorch's cached GPU memory; useful after loading the model or between experiments. |
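A minimal sketch of how these settings combine, assuming trl's `SFTTrainer` stack (its `SFTConfig` accepts `max_seq_length`); the output path and exact values are illustrative assumptions, not tuned recommendations:

```python
import gc
import torch
from trl import SFTConfig

# Illustrative memory-lean settings; tune for your GPU and data.
args = SFTConfig(
    output_dir="outputs",              # assumed output path
    per_device_train_batch_size=1,     # smallest per-step footprint
    gradient_accumulation_steps=16,    # effective batch size = 1 * 16
    max_seq_length=384,                # shorter sequences, smaller activations
    optim="paged_adamw_8bit",          # 8-bit paged optimizer states
    fp16=True,                         # mixed precision on T4-class GPUs
)

# Between experiments: drop references to the old model, then flush the cache.
# del model, trainer
gc.collect()
torch.cuda.empty_cache()
```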
Common error messages and what they mean:

| Message | Meaning / fix |
| --- | --- |
| `CUDA out of memory` | Reduce `per_device_train_batch_size`, increase `gradient_accumulation_steps`, and/or lower `max_seq_length`. |
| `Some weights of the model checkpoint ... were not used` | Normal when loading with `trust_remote_code=True` or when using PEFT. Not critical. |
| `ValueError: Attempting to unscale FP16 gradients.` | Use `optim="adamw_bnb_8bit"` or `optim="paged_adamw_8bit"` in `TrainingArguments`. |
| "The model is not in eval mode" | Safe to ignore; `Trainer` switches between train and eval modes automatically. |
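When chasing `CUDA out of memory`, it helps to see how much GPU memory is actually in use at each step. A small helper using standard PyTorch APIs (the function name and tags are assumptions for illustration):

```python
import torch

def report_gpu_memory(tag: str = "") -> None:
    """Print currently allocated vs. reserved CUDA memory in GiB (hypothetical helper)."""
    if not torch.cuda.is_available():
        print(f"[{tag}] no CUDA device available")
        return
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated: {allocated:.2f} GiB | reserved: {reserved:.2f} GiB")

# Call at checkpoints, e.g. right after loading the model and after the first step.
report_gpu_memory("after model load")
```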