📉 MODULE 4: “Metrics That Matter: Beyond Accuracy”

Objective:

Understand that accuracy isn’t everything. Learn to choose, calculate, and interpret appropriate metrics based on the problem—especially in imbalanced contexts like fraud detection.


4.1 The Accuracy Trap

Imagine a fraud dataset:

  • 99% legitimate transactions
  • 1% fraudulent transactions

A model that always predicts “NO FRAUD” will have 99% accuracy… but it’s useless! It detects ZERO fraud.
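
To see the trap in numbers, here is a minimal sketch using scikit-learn's DummyClassifier with the most_frequent strategy, i.e., a model that always predicts the majority class. The variables X_train, y_train, X_test, y_test are assumed to come from the train/test split of the previous module.

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Baseline that always predicts the majority class ("no fraud")
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_dummy = dummy.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_dummy))  # ~0.99 on a 99/1 split
print("Recall:  ", recall_score(y_test, y_dummy))    # 0.0: no fraud detected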


4.2 Confusion Matrix: Your Best Friend

The confusion matrix shows:

  • True Positives (TP): Fraud correctly detected.
  • False Negatives (FN): Fraud the model labeled as legitimate (serious error!).
  • False Positives (FP): Legitimate transactions flagged as fraud (customer annoyance).
  • True Negatives (TN): Legitimate transactions correctly identified.

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Predict on the test set and build the confusion matrix
y_pred = modelo.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# Visualize the counts in each cell
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Legitimate', 'Fraud'],
            yticklabels=['Legitimate', 'Fraud'])
plt.title("Confusion Matrix")
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()

4.3 Precision, Recall, and F1-Score

➤ Precision

Of all transactions I flagged as fraud, how many were actually fraud?

Precision = TP / (TP + FP)

✅ Important when false positive cost is high (e.g., blocking a legitimate card).


➤ Recall (Sensitivity, True Positive Rate)

Of all actual frauds, how many did I detect?

Recall = TP / (TP + FN)

CRITICAL in fraud detection. You want to minimize FN (undetected fraud).


➤ F1-Score

Harmonic mean of precision and recall. Useful when seeking balance.

F1 = 2 * (Precision * Recall) / (Precision + Recall)

from sklearn.metrics import precision_score, recall_score, f1_score

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1-Score: ", f1_score(y_test, y_pred))
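
To connect these formulas with the confusion matrix from 4.2, the sketch below recomputes the same metrics by hand from the TP/FP/FN/TN counts. It assumes the cm matrix computed earlier; for binary labels, scikit-learn orders it as [[TN, FP], [FN, TP]].

# Unpack the binary confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("Precision (manual):", precision_manual)
print("Recall (manual):   ", recall_manual)
print("F1-Score (manual): ", f1_manual)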

4.4 AUC-ROC: The Gold Standard for Binary Classification

The ROC curve shows the trade-off between True Positive Rate (Recall) and False Positive Rate at various decision thresholds.

AUC (Area Under Curve) is a number between 0 and 1:

  • 0.5 = random model
  • 1.0 = perfect model

from sklearn.metrics import roc_auc_score, roc_curve

y_proba = modelo.predict_proba(X_test)[:, 1]  # probability of positive class
auc = roc_auc_score(y_test, y_proba)
print("AUC-ROC:", auc)

# Plot curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc:.2f})')
plt.plot([0,1], [0,1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid()
plt.show()

AUC advantage: it does not depend on a single decision threshold and, unlike accuracy, stays informative under class imbalance, which makes it well suited to fraud detection. For very rare positives, the precision-recall curve (see the notes below) is a useful complement.


4.5 Full Classification Report

Scikit-learn provides a professional summary:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred,
      target_names=['Legitimate', 'Fraud']))

Typical output:

              precision    recall  f1-score   support

  Legitimate      0.99      1.00      0.99     19800
       Fraud      0.85      0.50      0.63       200

    accuracy                           0.99     20000
   macro avg       0.92      0.75      0.81     20000
weighted avg       0.99      0.99      0.99     20000

📝 Exercise 4.1: Deep Evaluation on an Imbalanced Dataset

Dataset: fraud_preprocessed.csv (the preprocessed file from the previous module)

Tasks:

  1. Train a Logistic Regression model.
  2. Calculate and display the confusion matrix.
  3. Calculate precision, recall, F1-score, and AUC-ROC.
  4. Interpret results: Is the model good? Which metric matters most here and why?
  5. Adjust the decision threshold (default 0.5) to improve recall (even if precision worsens). Use predict_proba and a 0.3 threshold (see the sketch after this list).
  6. Compare metrics before and after threshold adjustment.
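
One possible approach to task 5, as a sketch: it assumes the Logistic Regression from task 1 is stored in modelo (matching the earlier snippets) and that X_test, y_test come from your train/test split; the 0.3 cut-off is simply the value suggested above, not a tuned threshold.

from sklearn.metrics import precision_score, recall_score, f1_score

# Probabilities of the positive (fraud) class
y_proba = modelo.predict_proba(X_test)[:, 1]

# Default threshold (0.5) vs. a lower threshold that favors recall
y_pred_default = (y_proba >= 0.5).astype(int)
y_pred_low = (y_proba >= 0.3).astype(int)

for name, y_hat in [("threshold 0.5", y_pred_default), ("threshold 0.3", y_pred_low)]:
    print(name,
          "| precision:", round(precision_score(y_test, y_hat), 3),
          "| recall:", round(recall_score(y_test, y_hat), 3),
          "| f1:", round(f1_score(y_test, y_hat), 3))

Lowering the threshold turns more borderline transactions into fraud alerts, which typically raises recall at the cost of precision.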

💡 Additional Notes:

  • In fraud, recall > precision. You’d rather manually review some legitimate cases (FP) than let a fraud slip (FN).
  • AUC-ROC is your best ally for comparing models on imbalanced datasets.
  • Never use accuracy as the main metric in imbalanced problems.
  • Scikit-learn has precision_recall_curve to plot the precision-recall trade-off (especially useful when the positive class is rare, as in fraud); see the sketch below.
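
A minimal sketch of that precision-recall curve, reusing the y_proba scores computed in 4.4; average_precision_score (the area under the precision-recall curve) is included as a single summary number.

from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)

plt.plot(recall, precision, label=f'PR Curve (AP = {ap:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid()
plt.show()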

Course Info

Course: AI-course1

Language: EN

Lesson: Module4