🎯 MODULE 6: “Final Project: Building a Fraud Detection System with Real Business Metrics”

Objective:

Apply EVERYTHING learned in a realistic integrative project: from raw data loading to a model evaluated with business metrics, including preprocessing, encoding, scaling, feature selection, and robust validation.


6.1 Project Dataset: transacciones_fraude.csv

Simulated features (a sketch for generating a stand-in dataset follows the list):

  • monto_transaccion (float)
  • edad_cliente (int)
  • tipo_tarjeta (categorical: “Visa”, “Mastercard”, “Amex”)
  • pais_origen (categorical)
  • hora_del_dia (int 0-23)
  • dias_desde_ultima_transaccion (int)
  • es_fraude (bool: 0 or 1) → only 1.5% fraud!
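
Since transacciones_fraude.csv is simulated, you can build a stand-in file with the same schema if you want to run the project end to end. A minimal sketch; the distributions, category values, missing-value rates, and the ~1.5% fraud rate are illustrative assumptions, not the real dataset:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 50_000  # illustrative size

df_sim = pd.DataFrame({
    'monto_transaccion': rng.lognormal(mean=3.5, sigma=1.0, size=n),
    'edad_cliente': rng.integers(18, 85, size=n).astype(float),
    'tipo_tarjeta': rng.choice(['Visa', 'Mastercard', 'Amex'], size=n, p=[0.6, 0.3, 0.1]),
    'pais_origen': rng.choice(['MX', 'US', 'ES', 'CO', 'AR'], size=n),
    'hora_del_dia': rng.integers(0, 24, size=n),
    'dias_desde_ultima_transaccion': rng.integers(0, 365, size=n),
    'es_fraude': (rng.random(n) < 0.015).astype(int),  # ~1.5% fraud
})

# Inject a few missing values so the imputation step in Phase 1 has work to do
df_sim.loc[df_sim.sample(frac=0.02, random_state=1).index, 'edad_cliente'] = np.nan
df_sim.loc[df_sim.sample(frac=0.01, random_state=2).index, 'tipo_tarjeta'] = np.nan
df_sim.to_csv("transacciones_fraude.csv", index=False)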

6.2 Phase 1: Initial Diagnosis and Cleaning

import pandas as pd

# Load and explore
df = pd.read_csv("transacciones_fraude.csv")
print(df.info())
print(df.isnull().sum())

# Impute age with the median, card type with the mode
df['edad_cliente'] = df['edad_cliente'].fillna(df['edad_cliente'].median())
df['tipo_tarjeta'] = df['tipo_tarjeta'].fillna(df['tipo_tarjeta'].mode()[0])

# Cap transaction amount using the IQR rule
Q1 = df['monto_transaccion'].quantile(0.25)
Q3 = df['monto_transaccion'].quantile(0.75)
IQR = Q3 - Q1
lim_inf, lim_sup = Q1 - 1.5*IQR, Q3 + 1.5*IQR
df['monto_transaccion'] = df['monto_transaccion'].clip(lim_inf, lim_sup)
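
Before moving on, it is worth confirming the class imbalance described above (roughly 1.5% fraud), since it drives every later decision about metrics and thresholds:

# Quick check of the class balance
print(df['es_fraude'].value_counts(normalize=True))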

6.3 Phase 2: Encoding and Scaling

# Encoding and scaling
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# No ordinal variables here (OrdinalEncoder would handle those);
# One-Hot Encoding covers the nominal categorical variables
categorical_features = ['tipo_tarjeta', 'pais_origen']
numeric_features = ['monto_transaccion', 'edad_cliente', 'hora_del_dia', 'dias_desde_ultima_transaccion']

# Note: combining drop='first' with handle_unknown='ignore' requires a recent scikit-learn
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features)
    ])

X = df.drop('es_fraude', axis=1)
y = df['es_fraude']

X_processed = preprocessor.fit_transform(X)
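
To keep track of which columns come out of the ColumnTransformer (useful when interpreting the feature selection in Phase 3), recent scikit-learn versions expose get_feature_names_out. A minimal sketch, assuming scikit-learn >= 1.0:

# Names of the transformed columns, in the same order as X_processed
feature_names = preprocessor.get_feature_names_out()
print(len(feature_names), "features after preprocessing")
print(feature_names[:10])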

6.4 Phase 3: Feature Selection and Training

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Split data (stratify to preserve the 1.5% fraud rate in both sets)
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, stratify=y, random_state=42)

# Feature selection: keep the 10 features most associated with the target
selector = SelectKBest(score_func=f_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Train model
modelo = LogisticRegression(penalty='l2', C=1.0, max_iter=1000, random_state=42)
modelo.fit(X_train_selected, y_train)
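
One caveat with the flow above: the preprocessor was fitted on the full dataset before the split, so the scaler also sees the test rows. A cleaner pattern is to chain preprocessing, selection, and the model in a Pipeline so everything is fitted only on training data. A minimal sketch of that alternative; the step names and variables (pipeline_fraude, X_tr, ...) are illustrative:

from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score

pipeline_fraude = Pipeline(steps=[
    ('preprocesamiento', preprocessor),
    ('seleccion', SelectKBest(score_func=f_classif, k=10)),
    ('modelo', LogisticRegression(penalty='l2', C=1.0, max_iter=1000, random_state=42))
])

# Split the raw features first, then let the pipeline fit everything on the training set only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
pipeline_fraude.fit(X_tr, y_tr)
print("Pipeline test AUC-ROC:", round(roc_auc_score(y_te, pipeline_fraude.predict_proba(X_te)[:, 1]), 4))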

6.5 Phase 4: Deep Evaluation with Business Metrics

from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

y_pred = modelo.predict(X_test_selected)
y_proba = modelo.predict_proba(X_test_selected)[:,1]

print("=== CLASSIFICATION REPORT ===")
print(classification_report(y_test, y_pred, target_names=['Legítimo', 'Fraude']))

print(f"\nAUC-ROC: {roc_auc_score(y_test, y_proba):.4f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Legítimo', 'Fraude'],
            yticklabels=['Legítimo', 'Fraude'])
plt.title("Confusion Matrix - Fraud Detection")
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()

# Cross-validation for confidence
cv_scores = cross_val_score(modelo, X_train_selected, y_train, cv=StratifiedKFold(5), scoring='roc_auc')
print(f"\nCross-Validation AUC-ROC: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

6.6 Phase 5: Threshold Tuning to Maximize Recall

import numpy as np
from sklearn.metrics import precision_recall_curve

# Find the threshold that maximizes F1
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

# precision and recall have one more entry than thresholds, so align them;
# the small epsilon avoids division by zero when precision + recall == 0
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1_scores)]

print(f"Best threshold for F1: {best_threshold:.3f}")

# Predict with new threshold
y_pred_new = (y_proba >= best_threshold).astype(int)

print("\n=== WITH ADJUSTED THRESHOLD ===")
print(classification_report(y_test, y_pred_new, target_names=['Legítimo', 'Fraude']))
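
For the comparison table requested in the deliverables below (0.5 threshold vs. tuned threshold), a minimal sketch that collects the fraud-class metrics into a small DataFrame; the row and column labels are illustrative:

from sklearn.metrics import precision_score, recall_score, f1_score

comparacion = pd.DataFrame({
    'threshold_0.5': [precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred)],
    'threshold_tuned': [precision_score(y_test, y_pred_new), recall_score(y_test, y_pred_new), f1_score(y_test, y_pred_new)],
}, index=['precision', 'recall', 'f1'])

print(comparacion.round(3))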

📝 Project Deliverables:

  1. Jupyter Notebook with all phases documented.
  2. Graphs: initial distribution, confusion matrix, ROC curve, Precision-Recall curve.
  3. Comparison table: metrics at 0.5 threshold vs. optimal threshold.
  4. Written conclusion: How good is the model? Which metric would you prioritize in production and why? What would you improve in the next iteration?

💡 Final Course Notes:

  • Preprocessing isn't a necessary evil; it's your superpower.
  • Never trust accuracy on imbalanced problems. Use AUC-ROC, recall, and F1.
  • Stratified cross-validation is your ally for building robust models.
  • Document every step. Your future self (and your team) will thank you.
  • In production, monitor not just model performance but also your data distribution; it can drift over time (see the sketch below).
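
As a concrete starting point for that last note, a rough drift-check sketch for a single continuous feature using the Population Stability Index (PSI); the binning, smoothing constant, and the usual <0.1 / 0.1-0.25 / >0.25 rule of thumb are illustrative conventions, not part of the course material:

import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a new sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch values outside the reference range
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Example: compare the training amounts against a hypothetical new batch of transactions
# psi(df['monto_transaccion'], df_nuevo['monto_transaccion'])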
