📘 Lesson 2: The Treasure Map - Understanding the Workflow of a Machine Learning Project

\"Don't start cooking without a recipe. Don't build a model without a plan.\"


โฑ๏ธ Estimated duration of this lesson: 60-75 minutes


🧭 Why is this lesson so important?

Many beginners make the same mistake: they want to jump directly to the model.

\"I want to train an AI like ChatGPT NOW!\"

But that's like wanting to build an airplane without knowing what a screw is.

In this lesson, you won't just learn the steps of an ML project… you'll learn why each step exists, what happens if you skip it, and how to think like a data scientist from minute one.

By the end, you'll have a clear mental map that you can apply to any project: spam classification, price prediction, fraud detection, medical diagnosis, anything!


๐Ÿ—บ๏ธ The Treasure Map: The 5 Golden Steps of an ML Project

Imagine you're a pirate looking for buried treasure. You can't just start digging anywhere. You need:

  1. A map → Define the problem.
  2. Look for clues → Get and explore the data.
  3. Prepare your tools → Clean and transform the data.
  4. Dig in the right place → Train the model.
  5. Verify it's real gold → Evaluate the model.

That's exactly what we'll do in ML!

Let's break down each step in detail, with real examples, common mistakes, and expert tips.


🧩 Step 1: Define the Problem - What do you want to predict? For whom? Why?

\"A well-defined problem is half solved.\"

Before touching code, before looking for data… stop and think.

๐Ÿ” Key questions you should ask yourself:

  • What do I want to predict?

    • A category? → Classification (spam/ham, cat/dog, pass/fail).
    • A number? → Regression (house price, temperature, monthly sales).
    • A sequence? → Generation (text, music, code).
  • Who will use this prediction?

    • A doctor? → You need high precision, explainability.
    • An app user? → You need speed, simplicity.
    • A CEO? → You need clear business metrics.
  • Why is it important to solve this?

    • Does it save money? Save lives? Improve user experience?
    • If you can't answer this… maybe it's not worth doing.

🎯 Example 1: Spam Classifier

  • What: Predict if a message is "spam" or "ham" (not spam).
  • Who: Email or SMS users (they want less spam).
  • Why: Reduce time wasted, prevent fraud, improve productivity.

🎯 Example 2: House Price Predictor

  • What: Predict the price (number) of a house given its characteristics.
  • Who: Buyers, sellers, real estate agents.
  • Why: Help set fair prices, speed up sales, reduce uncertainty.

โŒ Common Error #1: Skipping this step

\"I'm going to use this Titanic dataset because it's cool.\"

No! The dataset is not the goal. The problem is the goal. The dataset is just a tool to solve it.


📦 Step 2: Get and Explore the Data - The Gold Mine (or Coal Mine)

\"Data is the new oilโ€ฆ but sometimes it comes full of mud.\"

Once the problem is defined, you need data. Without data, there's no ML.

๐Ÿ” Where to get data?

  • Public datasets: Kaggle, Hugging Face, UCI Machine Learning Repository, Google Dataset Search.
  • Your own data: Surveys, app logs, sales history, etc.
  • APIs: Twitter, Reddit, Google Trends, etc.
  • Web scraping (carefully and ethically): BeautifulSoup, Scrapy.

🎯 Example: SMS Spam Dataset

We'll use this dataset throughout the course. It's available on Kaggle, and it's small, clean, and perfect for getting started.

import pandas as pd

# Load the SMS spam dataset: tab-separated, two columns (label, message)
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv"
data = pd.read_csv(url, sep='\t', names=['label', 'message'])

๐Ÿ” Initial Exploration: Get to know your data!

Never assume the data is clean. Always explore it first.

Ask yourself these questions:

  • How many rows and columns does it have?
print(data.shape)  # (5572, 2) → 5572 messages, 2 columns
  • How do the first rows look?
print(data.head())
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
  • How many unique values are in the label?
print(data['label'].value_counts())
ham     4825
spam     747
Name: label, dtype: int64

→ We have many more "ham" than "spam" messages! This is important (we'll come back to it when we evaluate).

  • Are there null values?
print(data.isnull().sum())

→ In this case, no. But in real life, there almost always are!

  • How is the message length distributed?
data['length'] = data['message'].apply(len)
print(data['length'].describe())
count    5572.000000
mean       80.489052
std        59.942492
min         2.000000
25%        36.000000
50%        61.000000
75%       111.000000
max       910.000000

→ There are messages of up to 910 characters! Are those spam, or just unusually long normal messages?

import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(data=data, x='length', hue='label', bins=50)
plt.title(\"Message length distribution by type\")
plt.show()

→ You'll see that spam messages tend to be longer. That's a valuable clue!
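
You can confirm this numerically as well. A quick optional check using pandas (nothing here is required for the rest of the lesson):

# Compare length statistics per label ('ham' vs. 'spam')
print(data.groupby('label')['length'].describe())

# Expect the mean (and max) length of spam messages to be noticeably higher than ham.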


🧹 Step 3: Prepare the Data - Clean, Transform, Split

\"Garbage in, garbage out.\" โ€” Garbage In, Garbage Out (GIGO) Law

ML models are like Formula 1 engines: very powerful, but very sensitive to fuel quality.

🔧 Common preparation tasks:

1. Handling null values

# If there were nulls, you could:
# data = data.dropna()  # Remove rows with nulls
# or
# data['column'] = data['column'].fillna(data['column'].mean())  # Fill with mean

2. Label encoding (if it's classification)

# Convert 'ham'/'spam' to 0/1
data['label'] = data['label'].map({'ham': 0, 'spam': 1})

3. Split into train and test

Never, never, never train and evaluate with the same data!

from sklearn.model_selection import train_test_split

X = data['message']  # Features
y = data['label']    # Target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,   # 20% for test
    random_state=42  # For reproducible results
)

print(f\"Train: {len(X_train)} messages\")
print(f\"Test: {len(X_test)} messages\")

4. Vectorization (convert text to numbers)

Models don't understand text. They understand numbers.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)  # Learn vocabulary + transform
X_test_vec = vectorizer.transform(X_test)        # Only transform (don't learn!)

print(f\"Vocabulary: {len(vectorizer.vocabulary_)} unique words\")
print(f\"Shape of X_train_vec: {X_train_vec.shape}\")  # (4457, 7358) โ†’ 4457 messages, 7358 words

📌 What does CountVectorizer do?

  • Creates a vocabulary with all unique words in X_train.
  • For each message, counts how many times each word appears.
  • Converts each message into a vector of numbers (frequencies).

Example:

Message: \"free money now\"
Vocabulary: ['free', 'money', 'now', 'click', 'here', ...]

Vector: [1, 1, 1, 0, 0, ...] โ†’ \"free\" appears 1 time, \"money\" 1 time, etc.
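
If you want to see this in action, here is a tiny standalone sketch; the three messages are made up purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer

tiny_corpus = ["free money now", "call me now", "free free prize"]  # made-up examples

tiny_vec = CountVectorizer()
tiny_counts = tiny_vec.fit_transform(tiny_corpus)

print(tiny_vec.get_feature_names_out())
# ['call' 'free' 'me' 'money' 'now' 'prize']  ← vocabulary learned from the corpus (alphabetical)

print(tiny_counts.toarray())
# [[0 1 0 1 1 0]    ← "free money now"
#  [1 0 1 0 1 0]    ← "call me now"
#  [0 2 0 0 0 1]]   ← "free free prize" ("free" counted twice)

Each column is one word of the vocabulary, each row is one message: exactly what happens to our SMS data, just with thousands of words instead of six.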

🤖 Step 4: Train the Model - The Magic (that isn't magic)

\"This is where the computer learnsโ€ฆ but you give it the tools.\"

Now, it's time to train!

🔧 Choosing an algorithm

For text classification, a good starting point is Multinomial Naive Bayes.

Why?

  • It's simple, fast, and works surprisingly well with text.
  • It doesn't need much computing power.
  • It handles imbalanced data reasonably well (like our dataset, with more "ham" than "spam").

from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_vec, y_train)  # Train the model!

📌 What does fit() do?

  • Learns the probabilities:
    • Which words appear more in spam?
    • Which words appear more in ham?
  • Saves those probabilities internally.

That's it! Your model now "knows" how to distinguish spam from ham.
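
If you're curious, you can peek at what it learned. A small optional sketch (feature_log_prob_ and classes_ are standard attributes of MultinomialNB; the exact words you see will depend on your data split):

import numpy as np

# classes_ is [0, 1], so row 1 of feature_log_prob_ corresponds to class 1 ("spam")
words = vectorizer.get_feature_names_out()
spam_log_probs = model.feature_log_prob_[1]

# The 10 words with the highest estimated probability inside spam messages
top_spam = np.argsort(spam_log_probs)[-10:]
print(words[top_spam])

Expect a mix of "spammy" vocabulary and very common words: that mix is one reason people later experiment with stop-word removal or TF-IDF.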


📈 Step 5: Evaluate the Model - Does it work… or does it just think it works?

\"Don't trust your model. Put it to the test.\"

Training is easy. Evaluating well is what separates amateurs from professionals.

๐Ÿ” Predict on the test set

y_pred = model.predict(X_test_vec)

📊 Calculate metrics

Accuracy (overall fraction of correct predictions)

from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f}")  # E.g.: 0.9825 → 98.25% correct

→ Looks excellent! But is it enough?
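
Not necessarily. Remember the class imbalance: a lazy model that always predicts "ham" already gets roughly 86% on this test set (955 of 1115 messages are ham). Here is a quick sanity check, sketched with scikit-learn's DummyClassifier:

from sklearn.dummy import DummyClassifier

# Baseline: always predict the majority class ("ham")
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_vec, y_train)
print(f"Baseline accuracy: {baseline.score(X_test_vec, y_test):.4f}")  # roughly 0.86 here

So 98% is genuinely better than the baseline, but accuracy alone still hides where the errors are. That's why we look at the confusion matrix next.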

Confusion matrix

from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Ham', 'Spam'], 
            yticklabels=['Ham', 'Spam'])
plt.title(\"Confusion Matrix\")
plt.ylabel(\"True\")
plt.xlabel(\"Predicted\")
plt.show()

→ It will show you something like:

          Predicted
          Ham  Spam
True
Ham       950    5
Spam       10   150

→ What does this mean?

  • True Negatives (TN): 950 ham messages correctly classified.
  • False Positives (FP): 5 ham messages classified as spam (a legitimate message lands in the spam folder: serious error!).
  • False Negatives (FN): 10 spam messages classified as ham (spam slips into the inbox: serious error!).
  • True Positives (TP): 150 spam messages correctly classified.
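
You can also unpack these four numbers directly from the matrix instead of reading them off the plot (a small optional snippet):

# For a binary problem with labels [0, 1], ravel() returns tn, fp, fn, tp in this order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")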

Classification report

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))
              precision    recall  f1-score   support

         Ham       0.99      0.99      0.99       955
        Spam       0.97      0.94      0.95       160

    accuracy                           0.98      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.98      0.98      0.98      1115

📌 What do these metrics mean?

  • Precision: Of all the messages I flagged as spam, how many really were spam?
    → Spam: 0.97 → 97% of the messages marked as spam really were spam. Good!

  • Recall (Sensitivity): Of all the spam that existed, how much did I detect?
    → Spam: 0.94 → I detected 94% of the spam. Very good!

  • F1-Score: Harmonic mean of precision and recall. Ideal for imbalanced data (see the quick check below).
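
Quick check: you can recompute the spam row of the report by hand from the confusion matrix above (small rounding differences versus the printed report are normal):

# Spam metrics recomputed from the confusion matrix above (TP=150, FP=5, FN=10)
tp, fp, fn = 150, 5, 10

precision = tp / (tp + fp)                                  # 150 / 155 ≈ 0.97
recall    = tp / (tp + fn)                                  # 150 / 160 ≈ 0.94
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.95

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.97 0.94 0.95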


🔄 The Improvement Cycle: It doesn't end here

Don't stop at the first version!

Now that you have a base, you can:

  • Try another vectorizer (TfidfVectorizer).
  • Try another model (LogisticRegression, SVM).
  • Add more data.
  • Improve text cleaning (remove stop words, lemmatize).
  • Tune hyperparameters.

Data science is iterative. There's never a "final version": there's always room for improvement. Below is a sketch of what one next iteration could look like.
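
The sketch swaps in TfidfVectorizer and LogisticRegression via a scikit-learn Pipeline. It is one possible second iteration, not "the" answer, and your exact numbers will differ:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# One possible second iteration: TF-IDF features + logistic regression
pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),  # TF-IDF down-weights very common words; also drops English stop words
    LogisticRegression(max_iter=1000)
)

pipeline.fit(X_train, y_train)   # raw text goes in; the pipeline vectorizes internally
y_pred_v2 = pipeline.predict(X_test)
print(f"Accuracy (v2): {accuracy_score(y_test, y_pred_v2):.4f}")

Compare its confusion matrix against the Naive Bayes one, not just the accuracy, before deciding whether the change actually helps.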


โŒ Common Errors in this Step (Avoid Them!)

  1. Evaluating on training data → Gives you a false sense of success.
  2. Not splitting train/test → You don't know if your model generalizes.
  3. Using only accuracy on imbalanced data → Can hide serious errors.
  4. Not exploring data first → You miss patterns, errors, opportunities.
  5. Not documenting what you do → Tomorrow you won't remember why you did something.

✅ Checklist for this lesson - What should you understand now?

โ˜ The 5 steps of an ML project and why each is crucial.
โ˜ How to explore a dataset before using it.
โ˜ Why splitting into train/test is mandatory.
โ˜ How to convert text to numbers (vectorization).
โ˜ How to train a model with fit().
โ˜ How to evaluate it with accuracy, confusion matrix, and classification report.
โ˜ That the first model is never the lastโ€ฆ there's always room for improvement!


🎯 Quote to remember:

\"In ML, the most important thing is not the modelโ€ฆ it's the process.\"


โ† Previous: Lesson 1: Welcome to AI | Next: Lesson 3: Data Exploration โ†’

Course Info

Course: AI-course0

Language: EN

Lesson: 2 ml workflow