Estimated duration of this module: 1.5–2 hours
Objective: Understand how the initial representation of words is built, why knowing their position in the sentence is crucial, and how the Transformer "views" text from multiple perspectives simultaneously.
Before the Transformer can apply the attention mechanism, it needs to convert words into something computers understand: numbers.
But not just any numbers. It's not about assigning arbitrary IDs (like "cat = 1", "dog = 2"). That doesn't capture meaning.
This is where embeddings come in.
🔹 Simple definition:
An embedding is a dense vector representation of a word (or token), where words with similar meanings or uses have similar vectors.
Example:
- "king" → [0.85, -0.2, 0.67, ...]
- "queen" → [0.82, -0.18, 0.65, ...]
- "table" → [-0.3, 0.9, 0.1, ...]
"king" and "queen" are close in vector space. "table" is far away.
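To see what "close" means in practice, here is a tiny Python sketch. The 4-dimensional vectors are made up for illustration (real embeddings have hundreds of learned dimensions); the standard measure of closeness is cosine similarity:

```python
import numpy as np

# Toy 4-dimensional vectors, invented for illustration only;
# real embeddings have hundreds of learned dimensions.
embeddings = {
    "king":  np.array([0.85, -0.20, 0.67, 0.10]),
    "queen": np.array([0.82, -0.18, 0.65, 0.12]),
    "table": np.array([-0.30, 0.90, 0.10, -0.40]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 = similar direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # near 1.0: close
print(cosine_similarity(embeddings["king"], embeddings["table"]))  # much lower: far
```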
Imagine all words in the language are located on a giant map of meanings.
An embedding is like the GPS coordinates of a word on that map.
When the model sees the word "cat," it doesn't see the letter "c" or "a". It sees its vector: a point on that semantic map. And thanks to that, it can reason:
"If 'cat' is close to 'dog', and 'dog' often goes with 'leash', maybe 'cat' also relates to 'leash'... though not exactly the same."
This map isn't hand-coded. It's learned automatically during training!
There are two main ways to build embeddings:
1. Static embeddings, like Word2Vec or GloVe. Trained once on large text corpora and then used as a fixed layer. Limitation: each word has only one vector, regardless of context.
"bank" always has the same vector, whether it means "financial institution" or "river bank."
2. Contextual embeddings. Here's where the Transformer shines! Instead of assigning a fixed vector, the model generates an embedding specific to each context.
In "I went to the bank to deposit money," "bank" gets a vector close to "money."
In "We sat on the bank of the river," it gets a vector close to "water" or "shore."
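A minimal sketch of this effect, using the Hugging Face Transformers library with a pre-trained BERT model (one possible choice among many; this also assumes "bank" is kept as a single token, which holds for bert-base-uncased):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    """Return the contextual embedding of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

financial = vector_for("I went to the bank to deposit money.", "bank")
river = vector_for("We sat on the bank of the river.", "bank")

# Same word, two different vectors. A static embedding would print
# exactly 1.0 here, because "bank" would always map to one vector.
print(torch.nn.functional.cosine_similarity(financial, river, dim=0))
```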
This is achieved by combining the pieces this module walks through: the word's base embedding, a signal for its position, and the attention mechanism that lets every vector absorb its surrounding context.
Here's a key challenge:
If the Transformer processes all words simultaneously... how does it know which comes first, which is in the middle, and which is last?
Positions matter!
"The dog bit the man" ≠ "The man bit the dog"
In RNNs, order was implicit in the processing sequence. In the Transformer, it's not.
Solution: Positional Encoding
The idea is simple: add to each word's embedding a numerical signal indicating its position in the sequence.
But not just any signal. A simple counter (position 1, 2, 3...) doesn't scale well and doesn't capture relative relationships ("word 5 is close to word 6").
The original Transformer's solution: sinusoidal functions.
🔹 Formula (for reference only):
For position pos and dimension i of the vector:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where d_model is the embedding dimension (e.g., 512 or 768).
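Here is a direct NumPy translation of that formula: a minimal sketch, not the optimized version found in real implementations.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Build the (max_len, d_model) matrix of sinusoidal position signals."""
    pos = np.arange(max_len)[:, np.newaxis]        # positions 0..max_len-1
    dims = np.arange(0, d_model, 2)                # the 2i in the formula
    angle = pos / np.power(10000, dims / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                    # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)    # (50, 512): one "position wave" per row
print(pe[5, :4])   # the first few values of the wave for position 5
```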
Imagine each position in the sentence broadcasts a "unique frequency," like a radio station.
Each word receives its semantic embedding + its "position wave."
When the model adds both, it gets a vector that says:
"I am the word 'bank', and I'm in position 5 of the sentence."
And the smartest part: these sinusoidal functions allow the model to learn relative relationships.
"The word in position 5 can learn that the word in position 6 is 'close', even if it's never seen a 100-word sentence during training."
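We can check this numerically with the positional_encoding sketch above: the similarity between two position vectors depends only on the distance between them, not on where they sit in the sequence.

```python
pe = positional_encoding(max_len=200, d_model=512)

def pos_similarity(a, b):
    """Cosine similarity between the encodings of positions a and b."""
    return np.dot(pe[a], pe[b]) / (np.linalg.norm(pe[a]) * np.linalg.norm(pe[b]))

print(pos_similarity(5, 6))      # high: neighbors "sound" alike
print(pos_similarity(5, 50))     # lower: distant positions diverge
print(pos_similarity(100, 101))  # same value as (5, 6): only the offset matters
```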
Imagine a team of critics analyzing a theater play: one focuses on the dialogue, another on the staging, another on the actors' emotions.
They're all watching the same play... but from different perspectives. And all are valid.
That's what Multi-Head Attention does.
Instead of computing a single attention matrix, the Transformer computes multiple "attention heads" in parallel, each with its own learned Q, K, V matrices.
The input is projected into h different subspaces (e.g., h = 8 heads).
🔹 Result: The model doesn't have just one way to "look" at the sentence. It has multiple specialized lenses.
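A from-scratch NumPy sketch of the mechanism (simplified: self-attention only, no masking, no biases, and random matrices standing in for learned weights):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Minimal multi-head self-attention over X of shape (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv  # one big projection each
    # Split into heads: (n_heads, seq_len, d_head)
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # one matrix per head
    weights = softmax(scores, axis=-1)   # each head's own attention pattern
    heads = weights @ V                  # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                   # final projection mixes the heads

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 512, 8, 10
W = lambda: rng.normal(0, 0.02, size=(d_model, d_model))
out = multi_head_attention(rng.normal(size=(seq_len, d_model)),
                           W(), W(), W(), W(), n_heads)
print(out.shape)  # (10, 512): same shape as the input, enriched per word
```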
Take the sentence:
âThe scientist who discovered the vaccine received an award.â
One head might connect "scientist" with "discovered" (who did what), another "scientist" with "received" (who got the award), another "discovered" with "vaccine" (what was found). Each head contributes a piece of the puzzle. Together, they yield full understanding.
Take the sentence: "The musician who played the violin moved the audience."
Imagine three distinct "attention heads." Describe which words each would connect and why. Use categories like: syntax, semantics, emotion, instrument, etc.
Word: "scientist"
Embedding: [0.7, -0.3, 0.5, ...] ← "base meaning"
Position 3: [0.1, 0.05, -0.2, ...] ← "sinusoidal wave for pos=3"
Initial vector = Embedding + Position → [0.8, -0.25, 0.3, ...]
Projected into 3 heads:
Head 1: Q1, K1, V1 → attention to verbs
Head 2: Q2, K2, V2 → attention to objects
Head 3: Q3, K3, V3 → attention to awards/achievements
Head outputs:
Head 1: [0.6, 0.1, ...]
Head 2: [-0.2, 0.8, ...]
Head 3: [0.4, 0.5, ...]
Concatenated: [0.6, 0.1, -0.2, 0.8, 0.4, 0.5, ...]
Final projection: [0.55, 0.3, 0.45, ...] → enriched final representation
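The same walkthrough as a shape trace in code, with deliberately tiny dimensions and random numbers standing in for learned values, just to follow where each piece goes:

```python
import numpy as np

rng = np.random.default_rng(42)
d_model, n_heads = 6, 3        # toy sizes; real models use 512+ and 8+ heads
d_head = d_model // n_heads    # each head works in a 2-dimensional subspace

embedding = rng.normal(size=d_model)  # "base meaning" of "scientist"
position = rng.normal(size=d_model)   # stand-in for the sinusoidal wave, pos=3
x = embedding + position              # initial vector fed to attention

# Pretend each head has already produced its output for this word:
head_outputs = [rng.normal(size=d_head) for _ in range(n_heads)]
concatenated = np.concatenate(head_outputs)  # heads glued back together: (6,)
W_o = rng.normal(size=(d_model, d_model))    # final output projection
final = concatenated @ W_o                   # enriched final representation

print(x.shape, concatenated.shape, final.shape)  # (6,) (6,) (6,)
```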
The Transformer doesnât start from scratch. It builds understanding in layers:
- Embeddings give initial semantic meaning.
- Positional encoding gives awareness of order.
- Multi-head attention allows viewing text from multiple angles simultaneously.
It's like a team of experts analyzing a text: each contributes their perspective, resulting in a richer, more nuanced understanding than any single analyst could achieve.
Now that we understand the fundamental pieces, it's time to assemble them: How are these pieces organized to form a complete Transformer? What's the difference between an encoder and a decoder? Why do BERT and GPT, though both use Transformers, work so differently?
Weâll explore that in the next module.