🧩 MODULE 2: “Translating the Real World: Encoding Categorical Variables”

Objective:

Learn to convert text, labels, and categories into numbers that models can understand—without introducing bias or distortion.


2.1 Why Encode?

ML algorithms (regression, SVM, neural networks) only understand numbers. They cannot process “Visa”, “Mastercard”, or “Spain” directly.

But… beware! It’s not just about assigning arbitrary numbers. How you encode affects model performance and interpretation.


2.2 Label Encoding

Assigns a unique integer to each category.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['tipo_tarjeta_encoded'] = le.fit_transform(df['tipo_tarjeta'])

# Example: ["Visa", "Mastercard", "Amex"] → [2, 1, 0]
# (LabelEncoder sorts classes alphabetically: Amex=0, Mastercard=1, Visa=2)

⚠️ Serious problem: It introduces an artificial order. The model may interpret 2 > 1 > 0, as if "Visa" (2) carried more weight than "Amex" (0). This is wrong whenever no real order exists among the categories.

When to use: Only for ORDINAL variables (e.g., "Low", "Medium", "High") or for tree-based models, which split on thresholds and don't assume a linear scale. For ordinal variables, see the sketch below.
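
When the order is real, scikit-learn's OrdinalEncoder lets you state it explicitly instead of relying on alphabetical sorting. A minimal sketch using the nivel_riesgo column from the exercise dataset ("Bajo" < "Medio" < "Alto"); df_demo is a toy frame for illustration:

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Toy frame standing in for fraud_clean.csv (illustration only)
df_demo = pd.DataFrame({'nivel_riesgo': ['Bajo', 'Alto', 'Medio', 'Bajo']})

# Pass the order explicitly so 0 < 1 < 2 matches Bajo < Medio < Alto
oe = OrdinalEncoder(categories=[['Bajo', 'Medio', 'Alto']])
df_demo['nivel_riesgo_encoded'] = oe.fit_transform(df_demo[['nivel_riesgo']]).ravel()

print(df_demo)  # Bajo→0.0, Medio→1.0, Alto→2.0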


2.3 One-Hot Encoding

Creates a binary column (0/1) for each category.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

ohe = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' avoids multicollinearity
categorias_encoded = ohe.fit_transform(df[['tipo_tarjeta']])

# Convert to DataFrame with names
df_ohe = pd.DataFrame(categorias_encoded, columns=ohe.get_feature_names_out(['tipo_tarjeta']))
df = pd.concat([df.reset_index(drop=True), df_ohe], axis=1)

Advantages:

  • Doesn’t introduce false order.
  • Ideal for linear models, SVM, neural networks.

⚠️ Disadvantages:

  • Greatly increases dimensionality (50 countries → 50 new columns, or 49 with drop='first').
  • Can cause the “curse of dimensionality.”
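
For quick exploration, pandas offers the same expansion in a single call via get_dummies. A minimal sketch (df_demo is a toy frame for illustration); note that get_dummies does not remember the category set between train and test, so for modeling pipelines prefer OneHotEncoder:

import pandas as pd

df_demo = pd.DataFrame({'pais': ['Spain', 'France', 'Italy', 'Spain']})

# drop_first=True mirrors OneHotEncoder's drop='first'
dummies = pd.get_dummies(df_demo, columns=['pais'], drop_first=True)
print(dummies.shape)  # k categories → k-1 dummy columns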

2.4 Target Encoding

Replaces each category with the mean of the target variable for that category.

# Example: tipo_tarjeta → mean of "es_fraude" for that card type
# (computed on the full df here for illustration only — in practice,
#  compute the means on the training split; see the risks below)
target_mean = df.groupby('tipo_tarjeta')['es_fraude'].mean()
df['tipo_tarjeta_target_encoded'] = df['tipo_tarjeta'].map(target_mean)

Advantages:

  • Captures predictive information directly.
  • Doesn’t increase dimensionality.

⚠️ Risks:

  • Data leakage if computed before splitting train/test.
  • Overfitting if a category has few examples.

Solution: Compute the means out-of-fold (cross-validation), smooth them toward the global mean, or add noise. A sketch follows below.
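
A minimal sketch of out-of-fold target encoding: each row is encoded using means computed from the other folds, so a row never "sees" its own label. The helper target_encode_oof is illustrative, not a library function:

import numpy as np
from sklearn.model_selection import KFold

def target_encode_oof(df, col, target, n_splits=5, seed=42):
    encoded = np.zeros(len(df))
    global_mean = df[target].mean()  # fallback for categories missing from a fold
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        # Means computed only on the other folds → no self-leakage
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded[val_idx] = df.iloc[val_idx][col].map(fold_means).fillna(global_mean)
    return encoded

df['tipo_tarjeta_te_oof'] = target_encode_oof(df, 'tipo_tarjeta', 'es_fraude')

For rare categories, an additional smoothing step (blending the category mean with the global mean, weighted by category count) further reduces overfitting.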


📊 Visualization: Comparing Encodings

import matplotlib.pyplot as plt
import seaborn as sns

# Assumes the encoded columns created in 2.2 and 2.3 are present
fig, ax = plt.subplots(1, 3, figsize=(15, 5))

# Original
sns.countplot(data=df, x='tipo_tarjeta', ax=ax[0])
ax[0].set_title("Original")

# Label Encoded
sns.histplot(df['tipo_tarjeta_encoded'], bins=3, ax=ax[1])
ax[1].set_title("Label Encoding")

# One-Hot (show one column)
sns.histplot(df['tipo_tarjeta_Mastercard'], bins=2, ax=ax[2])
ax[2].set_title("One-Hot (e.g., Mastercard)")

plt.tight_layout()
plt.show()

📝 Exercise 2.1: Intelligent Encoding

Dataset: fraud_clean.csv (from previous module)

Tasks:

  1. Apply an ordinal encoding to the nivel_riesgo column (it's ordinal: "Bajo" < "Medio" < "Alto"). Careful: LabelEncoder sorts categories alphabetically and would break this order, so prefer OrdinalEncoder with explicit categories (see 2.2).
  2. Apply One-Hot Encoding to tipo_tarjeta and pais. Use drop='first'.
  3. Apply Target Encoding to ciudad (use only the training set to compute means—avoid data leakage!).
  4. Compare DataFrame shape before and after. How many new columns were created?
  5. Remove original text columns (keep only encoded ones).

💡 Additional Notes:

  • Always encode AFTER splitting train/test to avoid data leakage (see the sketch after these notes).
  • One-Hot is standard for most models.
  • Target Encoding is powerful but dangerous. Use cautiously with cross-validation.
  • Scikit-learn has OrdinalEncoder for ordinal variables; unlike LabelEncoder (which is designed for target labels), it handles multiple feature columns and lets you specify the category order.
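
To make the first note concrete, here is a minimal sketch of fit-on-train / transform-on-both; with handle_unknown='ignore', categories unseen during fit are encoded as an all-zero row:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe.fit(train_df[['tipo_tarjeta']])                   # learn categories from train only
train_enc = ohe.transform(train_df[['tipo_tarjeta']])
test_enc = ohe.transform(test_df[['tipo_tarjeta']])   # unseen categories → all zeros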


Course Info

Course: AI-course1

Language: EN

Lesson: Module2