Master how to transform, select, and create features so models learn efficiently, stably, and without numerical bias. Learn why not all variables are useful—and how to make the useful ones shine.
Many ML algorithms (SVM, KNN, logistic regression, neural networks) are sensitive to feature scales.
Imagine:
edad → range 18 to 90
ingreso_anual → range 20,000 to 500,000

Without scaling, the algorithm will give MUCH more weight to ingreso_anual simply because its numbers are larger, even if edad is more predictive!
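As a quick illustration (made-up numbers, computing a raw Euclidean distance the way KNN would): an income gap of 1,000 completely drowns out an age gap of 72 years.
import numpy as np

# Hypothetical customers: [edad, ingreso_anual]
cliente_a = np.array([18, 50_000])
cliente_b = np.array([90, 51_000])

diff = cliente_b - cliente_a
print("Difference per feature:", diff)              # [72 1000]
print("Euclidean distance:", np.linalg.norm(diff))  # ~1002.6, driven almost entirely by income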
Transforms data to have mean = 0 and standard deviation = 1.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['edad', 'ingreso_anual']] = scaler.fit_transform(df[['edad', 'ingreso_anual']])
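A quick sanity check on the result (assuming the same df as above): each scaled column should now have a mean of roughly 0 and a standard deviation of roughly 1.
# Means should be ~0 and standard deviations ~1 after scaling
# (pandas .std() uses ddof=1, so the value will sit slightly above 1)
print(df[['edad', 'ingreso_anual']].mean().round(3))
print(df[['edad', 'ingreso_anual']].std().round(3))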
✅ When to use: When data approximately follows a normal distribution. Ideal for linear models, SVM, neural networks.
Transforms data to a fixed range, typically [0, 1].
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['monto_transaccion', 'antiguedad_cliente']] = scaler.fit_transform(df[['monto_transaccion', 'antiguedad_cliente']])
✅ When to use: When you know min/max bounds, or when using neural networks with activation functions like sigmoid or tanh.
⚠️ Beware of outliers: A single extreme value can compress the entire rest of the range.
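A minimal sketch of that effect with made-up transaction amounts: one extreme value pushes every normal amount into a tiny slice near 0.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up amounts: five typical values plus one extreme outlier
montos = np.array([[20], [35], [50], [80], [120], [100_000]], dtype=float)

scaled = MinMaxScaler().fit_transform(montos)
print(scaled.ravel().round(4))
# [0.     0.0002 0.0003 0.0006 0.001  1.    ]  -> the normal values are squashed together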
Uses the median and interquartile range (IQR). Robust to outliers.
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df[['monto_transaccion']] = scaler.fit_transform(df[['monto_transaccion']])
✅ When to use: When you have many outliers and don’t want to remove or transform them.
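Under the hood, RobustScaler computes (x - median) / IQR for each column; a small sketch with the same made-up amounts confirms the formula and shows that the outlier no longer dictates everyone else's scale.
import numpy as np
from sklearn.preprocessing import RobustScaler

montos = np.array([[20], [35], [50], [80], [120], [100_000]], dtype=float)

# RobustScaler centers on the median and divides by the interquartile range
scaled = RobustScaler().fit_transform(montos)

mediana = np.median(montos)
q1, q3 = np.percentile(montos, [25, 75])
manual = (montos - mediana) / (q3 - q1)

print(np.allclose(scaled, manual))  # True: same formula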
import matplotlib.pyplot as plt
import seaborn as sns
# Original data (flattened to 1D so each panel plots a single distribution)
original = df_original['monto_transaccion'].values.reshape(-1, 1)
standard_scaled = StandardScaler().fit_transform(original).ravel()
minmax_scaled = MinMaxScaler().fit_transform(original).ravel()
robust_scaled = RobustScaler().fit_transform(original).ravel()
fig, ax = plt.subplots(2, 2, figsize=(12, 8))
sns.histplot(original.ravel(), bins=30, ax=ax[0,0], kde=True)
ax[0,0].set_title("Original")
sns.histplot(standard_scaled, bins=30, ax=ax[0,1], kde=True)
ax[0,1].set_title("StandardScaler")
sns.histplot(minmax_scaled, bins=30, ax=ax[1,0], kde=True)
ax[1,0].set_title("MinMaxScaler")
sns.histplot(robust_scaled, bins=30, ax=ax[1,1], kde=True)
ax[1,1].set_title("RobustScaler")
plt.tight_layout()
plt.show()
Not all variables are useful. Some are redundant, irrelevant, or noisy. Keeping them adds noise, increases the risk of overfitting, and slows down training.
If a variable barely changes (e.g., 99% of values are 0), it adds no information.
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01) # removes columns with variance < 0.01
X_high_variance = selector.fit_transform(X)
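VarianceThreshold returns a plain NumPy array, so the column names are lost. A small sketch (assuming X is a DataFrame) to recover which features survived and wrap the result back into a DataFrame:
import pandas as pd

# get_support() gives a boolean mask of the columns that passed the threshold
kept_columns = X.columns[selector.get_support()]
X_high_variance = pd.DataFrame(X_high_variance, columns=kept_columns, index=X.index)
print("Kept features:", kept_columns.tolist())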
Uses statistical tests to measure the relationship between each feature and the target variable.
from sklearn.feature_selection import SelectKBest, f_classif
# Select top 10 features using ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
# View scores
scores = selector.scores_
feature_names = X.columns
plt.figure(figsize=(10,6))
sns.barplot(x=scores, y=feature_names)
plt.title("Feature Importance (ANOVA F-test)")
plt.show()
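To list the 10 features SelectKBest actually kept (the same get_support() pattern, assuming X is a DataFrame):
# Boolean mask of the k features with the highest F-scores
selected_kbest = X.columns[selector.get_support()]
print("Selected features:", selected_kbest.tolist())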
Trains a model and iteratively removes the least important features.
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
modelo_base = RandomForestClassifier(n_estimators=10, random_state=42)
rfe = RFE(estimator=modelo_base, n_features_to_select=8)
X_rfe = rfe.fit_transform(X, y)
# View selected features
selected_features = X.columns[rfe.support_]
print("Selected features:", selected_features.tolist())
Instead of selecting, create new features from existing ones.
Reduces dimensionality by transforming original variables into a smaller set of uncorrelated variables (principal components).
from sklearn.decomposition import PCA
pca = PCA(n_components=5) # reduce to 5 components
X_pca = pca.fit_transform(X_scaled)
# View variance explained per component
plt.figure(figsize=(8,5))
plt.plot(range(1,6), pca.explained_variance_ratio_.cumsum(), marker='o')
plt.title("Cumulative Variance Explained by PCA Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Variance")
plt.grid()
plt.show()
✅ When to use PCA: When you have many correlated variables, or for visualization (reduce to 2D/3D).
⚠️ Disadvantage: You lose interpretability. “Component 1” has no clear business meaning.
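Rather than hard-coding 5 components, you can pass a variance fraction and let scikit-learn keep however many components are needed to reach it (a sketch reusing X_scaled from above):
from sklearn.decomposition import PCA

# Keep as many components as needed to explain 95% of the variance
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X_scaled)
print("Components kept:", pca_95.n_components_)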
Dataset: fraud_encoded.csv (from previous module, already encoded)
Tasks:
1. Separate the features (X) from the target variable (y = es_fraude).
2. Apply StandardScaler to continuous numeric variables (e.g., edad, ingreso, monto).
3. Apply VarianceThreshold to remove features with variance < 0.01.
4. Apply SelectKBest with f_classif to select the 12 most relevant features.
5. Use transform() (not fit_transform()) on test, as shown in the sketch below.
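That last point is the classic source of data leakage: fit every scaler and selector on the training split only, then reuse the fitted objects on test. A minimal skeleton of the pattern (for brevity it scales every column, whereas the exercise asks you to scale only the continuous ones; the same fit/transform split applies to VarianceThreshold as well):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit on the training split only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...then reuse the already-fitted scaler on test: transform(), not fit_transform()
X_test_scaled = scaler.transform(X_test)

# The same pattern applies to VarianceThreshold and SelectKBest
selector = SelectKBest(score_func=f_classif, k=12)
X_train_sel = selector.fit_transform(X_train_scaled, y_train)
X_test_sel = selector.transform(X_test_scaled)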