One of PEFT’s greatest benefits is that you only need to save the adapter parameters (LoRA), not the full base model. Because the adapter is typically only a few megabytes while the base model weighs hundreds of megabytes or more, this has enormous practical implications for storage, versioning, and distribution:
After training, the LoRA adapter is saved as additional weights. The base model remains untouched.
# Save the LoRA adapter
model.save_pretrained("./lora_adapter")
# Save tokenizer (if modified, though rare)
tokenizer.save_pretrained("./lora_adapter")
This creates a directory ./lora_adapter with files like:
- adapter_config.json — LoRA configuration (r, alpha, target_modules, etc.)
- adapter_model.bin — LoRA weights (the A and B matrices); newer PEFT versions save adapter_model.safetensors instead
- README.md (optional) — metadata

Important: The base model is not saved here. You must retain access to the original base model (e.g., from the Hugging Face Hub) to later load the adapter.
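To make the size difference concrete, here is a minimal sketch that sums the files in the adapter directory; the exact figure depends on your rank r and target_modules, but it is typically a few megabytes versus roughly 1 GB for the FP16 base model:
# Compare the adapter's on-disk size with the base model (~1 GB for Qwen2.5-0.5B in FP16)
import os

adapter_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, names in os.walk("./lora_adapter")
    for name in names
)
print(f"Adapter size: {adapter_bytes / 1024**2:.1f} MB")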
To use the trained adapter:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch
# Quantization config (optional for efficient inference)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
# Load base model
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Load LoRA adapter
model = PeftModel.from_pretrained(model, "./lora_adapter")
# Model now has specialized behavior
model.eval() # Set to evaluation mode for inference
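As a quick sanity check, you can run a prompt through the adapted model. The prompt below is only a placeholder; replace it with something from your fine-tuning domain:
# Quick inference with the adapter applied (placeholder prompt)
prompt = "Summarize what a LoRA adapter does in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))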
Although dynamic adapter loading is flexible, merging the LoRA weights into the base model is often preferable for production deployment or faster inference. The result is a complete, specialized model that requires no PEFT infrastructure during inference.
# Merge LoRA adapter with base model
model = model.merge_and_unload()
# Now the model is a complete model with updated weights
# Save as a standard Hugging Face model
model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")
Warning:
- Once merged, you cannot reload another adapter without reloading the original base model.
- The merged model occupies the same disk space as the original base model (~1GB for Qwen2.5-0.5B in FP16).
- Merging cleanly requires the base weights in full precision (FP16/BF16). If the model was loaded in 4-bit, either reload the base model in full precision before attaching the adapter, or dequantize it first (which requires more memory):
# If model is 4-bit, first dequantize (requires more VRAM)
model = model.dequantize() # Converts weights to BF16/FP16
# Then merge
model = model.merge_and_unload()
# Save
model.save_pretrained("./merged_model_full_precision")
Once merged and saved, the model behaves like any standard Hugging Face model:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"./merged_model",
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("./merged_model", trust_remote_code=True)
# Ready for inference without PEFT!
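For instance, the merged model works with the standard text-generation pipeline, with no peft import anywhere (the prompt is again a placeholder):
# Serve the merged model through a plain transformers pipeline
from transformers import pipeline

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = generator("Summarize what a LoRA adapter does in one sentence.", max_new_tokens=64)
print(result[0]["generated_text"])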