Estimated duration of this module: 1.5–2 hours
Objective: Understand how the initial representation of words is built, why knowing their position in the sentence is crucial, and how the Transformer "views" text from multiple perspectives simultaneously.
Before the Transformer can apply the attention mechanism, it needs to convert words into something computers understand: numbers.
But not just any numbers. It's not about assigning arbitrary IDs (like "cat = 1", "dog = 2"). That doesn't capture meaning.
This is where embeddings come in.
🔹 Simple definition:
An embedding is a dense vector representation of a word (or token), where words with similar meanings or uses have similar vectors.
Example:
- "king" → [0.85, -0.2, 0.67, ...]
- "queen" → [0.82, -0.18, 0.65, ...]
- "table" → [-0.3, 0.9, 0.1, ...]
"king" and "queen" are close in vector space. "table" is far away.
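To see what "close" means in practice, here is a tiny Python sketch. The 4-dimensional vectors are made up for illustration (real embeddings have hundreds of learned dimensions); the standard measure of closeness is cosine similarity:

```python
import numpy as np

# Toy 4-dimensional vectors, invented for illustration only;
# real embeddings have hundreds of learned dimensions.
embeddings = {
    "king":  np.array([0.85, -0.20, 0.67, 0.10]),
    "queen": np.array([0.82, -0.18, 0.65, 0.12]),
    "table": np.array([-0.30, 0.90, 0.10, -0.40]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 = similar direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # near 1.0: close
print(cosine_similarity(embeddings["king"], embeddings["table"]))  # much lower: far
```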
Imagine all words in the language are located on a giant map of meanings.
An embedding is like the GPS coordinates of a word on that map.
When the model sees the word "cat," it doesn't see the letter "c" or "a". It sees its vector: a point on that semantic map. And thanks to that, it can reason:
"If 'cat' is close to 'dog', and 'dog' often goes with 'leash', maybe 'cat' also relates to 'leash'... though not exactly the same."
This map isn't hand-coded. It's learned automatically during training!
There are two main ways to build embeddings:
1. Static embeddings, like Word2Vec or GloVe. Trained once on large text corpora and then used as a fixed layer. Limitation: each word has only one vector, regardless of context.
"bank" always has the same vector, whether it means "financial institution" or "river bank."
2. Contextual embeddings. Here's where the Transformer shines! Instead of assigning a fixed vector, the model generates an embedding specific to each context.
In "I went to the bank to deposit money," "bank" gets a vector close to "money."
In "We sat on the bank of the river," it gets a vector close to "water" or "shore."
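A minimal sketch of this effect, using the Hugging Face Transformers library with a pre-trained BERT model (one possible choice among many; this also assumes "bank" is kept as a single token, which holds for bert-base-uncased):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    """Return the contextual embedding of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

financial = vector_for("I went to the bank to deposit money.", "bank")
river = vector_for("We sat on the bank of the river.", "bank")

# Same word, two different vectors. A static embedding would print
# exactly 1.0 here, because "bank" would always map to one vector.
print(torch.nn.functional.cosine_similarity(financial, river, dim=0))
```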
This is achieved by combining the pieces this module walks through: the word's base embedding, a signal for its position, and the attention mechanism that lets every vector absorb its surrounding context.
Here's a key challenge:
If the Transformer processes all words simultaneously... how does it know which comes first, which is in the middle, and which is last?
Positions matter!
"The dog bit the man" ≠ "The man bit the dog"
In RNNs, order was implicit in the processing sequence. In the Transformer, it's not.
Solution: Positional Encoding
The idea is simple: add to each word's embedding a numerical signal indicating its position in the sequence.
But not just any signal. A simple counter (position 1, 2, 3...) doesn't scale well and doesn't capture relative relationships ("word 5 is close to word 6").
The original Transformer's solution: sinusoidal functions.
🔹 Formula (for reference only):
For position pos and dimension i of the vector:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where d_model is the embedding dimension (e.g., 512 or 768).
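Here is a direct NumPy translation of that formula: a minimal sketch, not the optimized version found in real implementations.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Build the (max_len, d_model) matrix of sinusoidal position signals."""
    pos = np.arange(max_len)[:, np.newaxis]        # positions 0..max_len-1
    dims = np.arange(0, d_model, 2)                # the 2i in the formula
    angle = pos / np.power(10000, dims / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                    # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)    # (50, 512): one "position wave" per row
print(pe[5, :4])   # the first few values of the wave for position 5
```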
Imagine each position in the sentence broadcasts a "unique frequency," like a radio station.
Each word receives its semantic embedding + its "position wave."
When the model adds both, it gets a vector that says:
"I am the word 'bank', and I'm in position 5 of the sentence."
And the smartest part: these sinusoidal functions allow the model to learn relative relationships.
"The word in position 5 can learn that the word in position 6 is 'close', even if it's never seen a 100-word sentence during training."
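We can check this numerically with the positional_encoding sketch above: the similarity between two position vectors depends only on the distance between them, not on where they sit in the sequence.

```python
pe = positional_encoding(max_len=200, d_model=512)

def pos_similarity(a, b):
    """Cosine similarity between the encodings of positions a and b."""
    return np.dot(pe[a], pe[b]) / (np.linalg.norm(pe[a]) * np.linalg.norm(pe[b]))

print(pos_similarity(5, 6))      # high: neighbors "sound" alike
print(pos_similarity(5, 50))     # lower: distant positions diverge
print(pos_similarity(100, 101))  # same value as (5, 6): only the offset matters
```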
Imagine a team of critics analyzing a theater play: one focuses on the dialogue, another on the staging, another on the actors' emotions.
They're all watching the same play... but from different perspectives. And all are valid.
That's what Multi-Head Attention does.
Instead of computing a single attention matrix, the Transformer computes multiple "attention heads" in parallel, each with its own learned Q, K, V matrices.
The input is projected into h different subspaces (e.g., h = 8 heads).
🔹 Result: The model doesn't have just one way to "look" at the sentence. It has multiple specialized lenses.
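A from-scratch NumPy sketch of the mechanism (simplified: self-attention only, no masking, no biases, and random matrices standing in for learned weights):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Minimal multi-head self-attention over X of shape (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv  # one big projection each
    # Split into heads: (n_heads, seq_len, d_head)
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # one matrix per head
    weights = softmax(scores, axis=-1)   # each head's own attention pattern
    heads = weights @ V                  # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                   # final projection mixes the heads

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 512, 8, 10
W = lambda: rng.normal(0, 0.02, size=(d_model, d_model))
out = multi_head_attention(rng.normal(size=(seq_len, d_model)),
                           W(), W(), W(), W(), n_heads)
print(out.shape)  # (10, 512): same shape as the input, enriched per word
```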
Take the sentence:
âThe scientist who discovered the vaccine received an award.â
One head might connect "scientist" with "discovered" (who did what), another "scientist" with "received" (who got the award), another "discovered" with "vaccine" (what was found). Each head contributes a piece of the puzzle. Together, they yield full understanding.
Take the sentence: "The musician who played the violin moved the audience."
Imagine three distinct "attention heads." Describe which words each would connect and why. Use categories like: syntax, semantics, emotion, instrument, etc.
Word: "scientist"
Embedding: [0.7, -0.3, 0.5, ...] ← "base meaning"
Position 3: [0.1, 0.05, -0.2, ...] ← "sinusoidal wave for pos=3"
Initial vector = Embedding + Position → [0.8, -0.25, 0.3, ...]
Projected into 3 heads:
Head 1: Q1, K1, V1 → attention to verbs
Head 2: Q2, K2, V2 → attention to objects
Head 3: Q3, K3, V3 → attention to awards/achievements
Head outputs:
Head 1: [0.6, 0.1, ...]
Head 2: [-0.2, 0.8, ...]
Head 3: [0.4, 0.5, ...]
Concatenated: [0.6, 0.1, -0.2, 0.8, 0.4, 0.5, ...]
Final projection: [0.55, 0.3, 0.45, ...] → enriched final representation
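The same walkthrough as a shape trace in code, with deliberately tiny dimensions and random numbers standing in for learned values, just to follow where each piece goes:

```python
import numpy as np

rng = np.random.default_rng(42)
d_model, n_heads = 6, 3        # toy sizes; real models use 512+ and 8+ heads
d_head = d_model // n_heads    # each head works in a 2-dimensional subspace

embedding = rng.normal(size=d_model)  # "base meaning" of "scientist"
position = rng.normal(size=d_model)   # stand-in for the sinusoidal wave, pos=3
x = embedding + position              # initial vector fed to attention

# Pretend each head has already produced its output for this word:
head_outputs = [rng.normal(size=d_head) for _ in range(n_heads)]
concatenated = np.concatenate(head_outputs)  # heads glued back together: (6,)
W_o = rng.normal(size=(d_model, d_model))    # final output projection
final = concatenated @ W_o                   # enriched final representation

print(x.shape, concatenated.shape, final.shape)  # (6,) (6,) (6,)
```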
The Transformer doesnât start from scratch. It builds understanding in layers:
- Embeddings give initial semantic meaning.
- Positional encoding gives awareness of order.
- Multi-head attention allows viewing text from multiple angles simultaneously.
It's like a team of experts analyzing a text: each contributes their perspective, resulting in a richer, more nuanced understanding than any single analyst could achieve.
Now that we understand the fundamental pieces, it's time to assemble them: How are these pieces organized to form a complete Transformer? What's the difference between an encoder and a decoder? Why do BERT and GPT, though both use Transformers, work so differently?
Weâll explore that in the next module.