Understand why preprocessing is not optional—it’s the core of any successful ML model. Learn to diagnose dataset health before touching any algorithm.
“Data science is 80% data cleaning, 20% complaining about cleaning data.” — Anonymous
In the real world, data never comes clean, ordered, and ready to use. It arrives with missing values (NaN, None, "", ?), among other problems.

Direct consequence: Feeding dirty data into a model causes it to learn incorrect patterns. Garbage In → Garbage Out.
Before doing anything, explore your dataset like a detective.
import pandas as pd
# Load dataset
df = pd.read_csv("datos_fraude.csv")
# Quick preview
print(df.head())
print(df.info()) # Data types and non-null counts
print(df.describe()) # Descriptive statistics (numeric only)
# View unique values in categorical columns
print(df['tipo_transaccion'].unique())
print(df['pais'].value_counts())
# Check for null values
print(df.isnull().sum())
import seaborn as sns
import matplotlib.pyplot as plt
# Histogram of a numeric variable
sns.histplot(df['monto_transaccion'], bins=50, kde=True)
plt.title("Transaction Amount Distribution")
plt.show()
# Boxplot to detect outliers
sns.boxplot(x=df['monto_transaccion'])
plt.title("Boxplot: Finding Outliers in Amounts")
plt.show()
# Bar plot for categorical variables
sns.countplot(data=df, x='tipo_tarjeta')
plt.title("Distribution of Card Types")
plt.xticks(rotation=45)
plt.show()
Option 1: Delete rows or columns
# Delete rows with any NaN
df_clean = df.dropna()
# Delete columns with more than 50% NaN (keep columns with at least 50% non-null values)
df_clean = df.dropna(axis=1, thresh=int(len(df) * 0.5))
⚠️ Caution: Only advisable if you lose few data points. If you delete 30% of your rows, your model may become biased.
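Before choosing, it helps to measure how much data you would actually lose. A minimal sketch (the 30% cutoff is just the rule of thumb mentioned above, not a hard rule):

# Check what fraction of rows dropna() would discard
rows_before = len(df)
rows_after = len(df.dropna())
pct_lost = 1 - rows_after / rows_before
print(f"Rows lost by dropna(): {pct_lost:.1%}")

# If the loss is large, prefer imputation over deletion
if pct_lost > 0.30:
    print("Too many rows would be lost; consider imputation instead.")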
Option 2: Imputation (fill with values)
from sklearn.impute import SimpleImputer
# Impute mean for numeric variables
imputer_num = SimpleImputer(strategy='mean')
df[['edad', 'monto_transaccion']] = imputer_num.fit_transform(df[['edad', 'monto_transaccion']])
# Impute mode for categorical variables
imputer_cat = SimpleImputer(strategy='most_frequent')
df[['tipo_tarjeta']] = imputer_cat.fit_transform(df[['tipo_tarjeta']])
✅ Best practice: Use ColumnTransformer to apply different strategies to different columns (we’ll cover this in detail in Module 3).
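As a preview, here is a minimal sketch of that idea, reusing the column names from the examples above (the full pipeline is covered in Module 3):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Apply one imputation strategy per group of columns in a single step
preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='mean'), ['edad', 'monto_transaccion']),
    ('cat', SimpleImputer(strategy='most_frequent'), ['tipo_tarjeta']),
])

imputed = preprocessor.fit_transform(df)  # note: returns a NumPy array, not a DataFrame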
An outlier is a value that significantly deviates from the rest of the data. It may be a legitimate extreme observation or the result of an error in the data.
Q1 = df['monto_transaccion'].quantile(0.25)
Q3 = df['monto_transaccion'].quantile(0.75)
IQR = Q3 - Q1
limite_inferior = Q1 - 1.5 * IQR
limite_superior = Q3 + 1.5 * IQR
outliers = df[(df['monto_transaccion'] < limite_inferior) | (df['monto_transaccion'] > limite_superior)]
print(f"Outliers detected: {len(outliers)}")
# Capping: clip values at the IQR fences
df['monto_transaccion'] = df['monto_transaccion'].clip(lower=limite_inferior, upper=limite_superior)

# Log transform: compress the heavy right tail
import numpy as np
df['log_monto'] = np.log1p(df['monto_transaccion'])  # log(1+x) avoids log(0)
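To check whether the transformation actually tamed the long tail, you can reuse the plotting tools from the exploration step. A small sketch, assuming the seaborn/matplotlib imports from above:

# Compare the capped and log-transformed distributions side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(df['monto_transaccion'], bins=50, kde=True, ax=axes[0])
axes[0].set_title("Capped amounts")
sns.histplot(df['log_monto'], bins=50, kde=True, ax=axes[1])
axes[1].set_title("Log-transformed amounts")
plt.tight_layout()
plt.show()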
Suggested dataset: fraud_data.csv (simulated, with columns: user_id, monto, edad, pais, tipo_tarjeta, hora_dia, es_fraude)
Tasks:
1. Explore the dataset with .info() and .describe().
2. Detect outliers in monto and apply IQR-based capping.
3. Review the edad distribution. Are there impossible values (e.g., age < 0 or > 120)? Correct them.
4. Save the cleaned dataset as fraud_clean.csv.

A starter sketch for these tasks appears below.
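A minimal sketch of the workflow, under the assumptions above (fraud_data.csv with the listed columns; median imputation for the corrected ages is just one reasonable choice):

import pandas as pd

df = pd.read_csv("fraud_data.csv")

# 1. Initial exploration
print(df.info())
print(df.describe())

# 2. IQR-based capping of the transaction amount
Q1, Q3 = df['monto'].quantile(0.25), df['monto'].quantile(0.75)
IQR = Q3 - Q1
df['monto'] = df['monto'].clip(lower=Q1 - 1.5 * IQR, upper=Q3 + 1.5 * IQR)

# 3. Mark impossible ages as missing, then impute with the median
df['edad'] = df['edad'].mask((df['edad'] < 0) | (df['edad'] > 120))
df['edad'] = df['edad'].fillna(df['edad'].median())

# 4. Save the cleaned dataset
df.to_csv("fraud_clean.csv", index=False)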