Understand why a model performing well on training data may fail in the real world. Learn to detect, prevent, and measure overfitting using robust validation techniques.
Overfitting occurs when a model learns the training data too well, including noise and random patterns, instead of the underlying general patterns. The result: excellent performance on the training data, but poor performance on new, unseen data.
Imagine a student who memorizes exam answers instead of understanding concepts. On the real exam, they fail.
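To make the idea concrete, here is a minimal sketch (using a synthetic dataset from scikit-learn's make_classification, an assumption purely for illustration) in which an unconstrained decision tree fits the training set almost perfectly but scores noticeably lower on held-out data:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Synthetic data for illustration only
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
for depth in [None, 3]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.3f}, test={tree.score(X_te, y_te):.3f}")
# The unconstrained tree (max_depth=None) reaches ~1.0 training accuracy, with a visible train/test gap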
Simplest method: compare performance on training vs. validation/test.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
modelo = LogisticRegression()
modelo.fit(X_train, y_train)
# Training performance
y_train_pred = modelo.predict(X_train)
train_acc = accuracy_score(y_train, y_train_pred)
# Test performance
y_test_pred = modelo.predict(X_test)
test_acc = accuracy_score(y_test, y_test_pred)
print(f"Training Accuracy: {train_acc:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")
# If train_acc >> test_acc → Overfitting!
⚠️ Typical pattern: training accuracy close to 1.00 while test accuracy is noticeably lower; a large train/test gap is the classic overfitting signal.
Regularization adds a penalty to the loss function to prevent the coefficients from becoming too large.
# Logistic regression with L2 regularization (Ridge)
modelo_l2 = LogisticRegression(penalty='l2', C=1.0) # smaller C = more regularization
# With L1 (Lasso) for automatic feature selection
modelo_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
✅ L1 (Lasso): Can zero out coefficients → feature selection.
✅ L2 (Ridge): Shrinks coefficients but doesn’t zero them → more stable.
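As a quick check (a sketch, assuming X_train and y_train from the earlier split), fit the L1 model and count how many coefficients it drives to exactly zero:
import numpy as np
modelo_l1.fit(X_train, y_train)
n_zero = np.sum(modelo_l1.coef_ == 0)
print(f"Zeroed coefficients with L1: {n_zero} of {modelo_l1.coef_.size}")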
Getting more data is simple but powerful: more representative data helps the model learn general patterns rather than memorizing noise.
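One way to check whether more data would actually help is a learning curve; this is a sketch using scikit-learn's learning_curve and the X_train/y_train assumed above. If the validation score is still rising as the training size grows, collecting more data is likely to pay off.
from sklearn.model_selection import learning_curve
import numpy as np
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_train, y_train,
    cv=5, scoring='roc_auc', train_sizes=np.linspace(0.1, 1.0, 5))
print("Training sizes:", train_sizes)
print("Mean validation AUC per size:", val_scores.mean(axis=1))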
Cross-validation is the best tool to evaluate a model's ability to generalize before it ever touches the test data.
Simple validation (train/test split) can be misleading if the split is lucky or unlucky.
Cross-validation (CV) splits data into K folds, trains K times, each time using a different fold as validation.
Result: a more robust, reliable performance estimate.
from sklearn.model_selection import cross_val_score
modelo = LogisticRegression()
# 5-fold cross-validation
scores = cross_val_score(modelo, X_train, y_train, cv=5, scoring='roc_auc')
print(f"AUC-ROC per fold: {scores}")
print(f"Average AUC-ROC: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
✅ Advantages: every observation is used for both training and validation, the estimate does not depend on a single lucky or unlucky split, and the fold-to-fold spread shows how stable the model's performance is.
K-Fold: splits the data into K equal parts; each fold is used exactly once as validation.
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
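A quick sketch of how the folds are consumed (assuming X_train can be indexed positionally); you can also pass the splitter directly to cross_val_score:
for fold, (train_idx, val_idx) in enumerate(kf.split(X_train), start=1):
    print(f"Fold {fold}: {len(train_idx)} training rows, {len(val_idx)} validation rows")
# Equivalent shortcut: cross_val_score(modelo, X_train, y_train, cv=kf, scoring='roc_auc')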
Stratified K-Fold: maintains the class proportions in each fold. Especially important for imbalanced datasets.
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(modelo, X_train, y_train, cv=skf, scoring='roc_auc')
Leave-One-Out (LOO): each fold is a single observation. Very computationally expensive; only practical for very small datasets.
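For reference, a minimal sketch with scikit-learn's LeaveOneOut (reusing modelo, X_train and y_train from above); accuracy is used because ROC-AUC is undefined on a single-observation fold:
from sklearn.model_selection import LeaveOneOut, cross_val_score
loo = LeaveOneOut()
# One model fit per observation, so expect this to be slow on anything but tiny datasets
loo_scores = cross_val_score(modelo, X_train, y_train, cv=loo, scoring='accuracy')
print(f"LOO accuracy: {loo_scores.mean():.4f} over {len(loo_scores)} fits")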
Finally, compare the cross-validation estimate on the training set with the score on the held-out test set:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
# Train and evaluate with simple train/test
modelo.fit(X_train, y_train)
test_score = roc_auc_score(y_test, modelo.predict_proba(X_test)[:,1])
# Evaluate with CV on training set
cv_scores = cross_val_score(modelo, X_train, y_train, cv=5, scoring='roc_auc')
plt.figure(figsize=(8,5))
plt.axhline(y=test_score, color='red', linestyle='--', label=f'Test Score: {test_score:.4f}')
plt.plot(range(1,6), cv_scores, 'bo-', label='CV Scores per Fold')
plt.axhline(y=cv_scores.mean(), color='blue', linestyle='-', label=f'CV Average: {cv_scores.mean():.4f}')
plt.title("Comparison: Cross-Validation vs Final Test")
plt.xlabel("Fold")
plt.ylabel("AUC-ROC")
plt.legend()
plt.grid()
plt.show()
Dataset: fraud_features.csv (preprocessed with selected features)
Tasks:
Train a more strongly regularized logistic regression (e.g., C=0.01) and compare its performance with the default model.
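A possible starting point (a sketch only: the file name comes from the lab statement, but the target column 'is_fraud' is a placeholder; adjust it to your dataset):
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
df = pd.read_csv('fraud_features.csv')
X = df.drop(columns=['is_fraud'])   # 'is_fraud' is an assumed target column name
y = df['is_fraud']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for C in [1.0, 0.01]:   # default vs. stronger regularization
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X_train, y_train, cv=skf, scoring='roc_auc')
    print(f"C={C}: mean AUC-ROC = {scores.mean():.4f} (+/- {scores.std()*2:.4f})")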