Estimated duration of this module: 1.5 - 2 hours
Objective: Build a complete extractive question-answering system that takes a context text and a question, and returns the most probable answer extracted directly from the text.
Requirements: Only what you learned in previous modules + a code editor (or Google Colab).
Before we start, let’s clarify the type of QA we’ll build.
🔹 Generative QA:
The model invents a new answer, in its own words.
Question: “What is a Transformer?”
Generated Answer: “A Transformer is a neural network architecture that uses attention mechanisms to process sequences...”
🔹 Extractive QA (what we’ll do):
The model extracts a literal fragment from the context text.
Question: “What is a Transformer?”
Context: “...the Transformer, introduced in 2017, is an architecture based on attention that processes all words simultaneously...”
Extracted Answer: “an architecture based on attention that processes all words simultaneously”
✅ Advantages of Extractive QA:
- The answer always comes verbatim from the source text, so the model cannot invent facts.
- It is verifiable: the start/end positions point to exactly where the answer appears.
- It works well with small, fast encoder-only models; no text generation is needed.
For this project, we’ll use an encoder-only model, fine-tuned specifically for extractive QA.
🔹 Chosen Model: deepset/roberta-base-squad2
🌐 You can view it on the Model Hub: https://huggingface.co/deepset/roberta-base-squad2
First, install the required libraries (if you haven’t already: pip install transformers torch) and load the model and tokenizer.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
# Model name
model_name = "deepset/roberta-base-squad2"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
# Optional: create a pipeline to simplify
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)
We’ll use an example text. It can be anything: an article, a book excerpt, a manual, etc.
context = """
Transformers are a neural network architecture introduced in 2017 by Vaswani et al.
in the paper "Attention Is All You Need". Unlike recurrent networks,
Transformers process all words in a sequence simultaneously,
using a mechanism called "attention" that allows each word to relate
to any other in the sentence. This architecture is the basis for models like
BERT, GPT, T5, and many others dominating natural language processing today.
"""
question = "What mechanism do Transformers use to relate words?"
result = qa_pipeline(question=question, context=context)
print("Question:", question)
print("Answer:", result['answer'])
print("Score:", result['score'])
print("Start:", result['start'])
print("End:", result['end'])
Expected output (your exact score and character offsets may differ slightly):
Question: What mechanism do Transformers use to relate words?
Answer: attention
Score: 0.9321
Start: 234
End: 242
It works! The model extracted the word “attention” as the answer.
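The start and end fields are character offsets into the context string, so you can slice the context yourself to recover the answer:
# start/end are character positions in `context`, so this slice
# reproduces the extracted answer
print(context[result['start']:result['end']])  # attention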
Now, let’s do it step by step, as in Module 5, to see what happens internally.
import torch

# Tokenize question + context (together)
inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
# Pass through the model
outputs = model(**inputs)
# Get start and end logits
start_logits = outputs.start_logits
end_logits = outputs.end_logits
# Find positions with highest probability
start_index = torch.argmax(start_logits)
end_index = torch.argmax(end_logits)
# Convert tokens back to text
answer_tokens = inputs.input_ids[0][start_index:end_index + 1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True).strip()
print("Answer (manual):", answer)
Output:
Answer (manual): attention
🔹 What does the model do?
It predicts two things:
- start_logits: for each token, a score for being the start of the answer.
- end_logits: for each token, a score for being the end of the answer.
Then, it extracts all tokens between those two positions.
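You can confirm this yourself: each logits tensor holds one score per token of the (question + context) input from the previous code block:
# Each logits tensor has shape (batch_size, sequence_length)
print(start_logits.shape)  # one "start" score per input token
print(end_logits.shape)    # one "end" score per input token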
Sometimes, the model can make mistakes if it only takes the start_index and end_index with the highest score. A better practice is to consider valid combinations (start <= end) and choose the one with the highest combined score.
import torch

def get_best_answer(start_logits, end_logits, input_ids, tokenizer, top_k=5):
    # Work with the first (and only) item in the batch
    start_logits = start_logits[0]
    end_logits = end_logits[0]
    # Get top_k candidate indices for start and end
    start_probs, start_indices = torch.topk(start_logits, top_k)
    end_probs, end_indices = torch.topk(end_logits, top_k)
    best_score = -float('inf')
    best_answer = ""
    # Test valid combinations
    for i in range(top_k):
        for j in range(top_k):
            start = start_indices[i].item()
            end = end_indices[j].item()
            if start <= end:  # valid span
                score = (start_probs[i] + end_probs[j]).item()
                if score > best_score:
                    best_score = score
                    answer_tokens = input_ids[0][start:end + 1]
                    best_answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    return best_answer, best_score
# Use the function
answer, score = get_best_answer(start_logits, end_logits, inputs.input_ids, tokenizer)
print("Best answer:", answer)
print("Best score:", score)
This makes the system more robust against occasional errors.
Now it’s your turn to experiment! Try:
context2 = """
Generative artificial intelligence enables the creation of new content: text, images, music, code.
Models like DALL-E, Stable Diffusion, and GPT-4 are popular examples.
These models learn patterns from large datasets and then generate original outputs
based on prompts or instructions given by the user.
"""
question2 = "What kind of content can generative AI create?"
result2 = qa_pipeline(question=question2, context=context2)
print(result2['answer']) # Expected: "text, images, music, code"
Or with another example:
context_es = """
Barcelona is a city located on the Mediterranean coast of Spain.
It is known for its unique architecture, especially the works of Antoni Gaudí,
such as the Sagrada Familia and Park Güell. It is also famous for its cuisine,
beaches, and vibrant cultural life.
"""
question_en = "Which architect is famous in Barcelona?"
result_en = qa_pipeline(question=question_en, context=context_es)
print(result_en['answer']) # Expected: "Antoni Gaudí"
🔹 Limitation 1: The model can only answer if the answer is in the text.
If you ask “What is the capital of France?” and the text doesn’t mention Paris, the model will still extract some fragment from the text, usually with a low confidence score, and the answer will be wrong.
🔹 Solution:
Pass handle_impossible_answer=True to the pipeline (supported by models trained on SQuAD 2.0, like this one), and/or reject low-confidence answers with a score threshold:
if result['score'] < 0.1:
    print("I couldn't find a reliable answer in the text.")
else:
    print("Answer:", result['answer'])
🔹 Limitation 2: Context has a length limit (~512 tokens).
If the text is very long, it gets truncated and information is lost.
🔹 Solution:
Split the long document into overlapping chunks, run the QA system on each chunk, and keep the highest-scoring answer (see the sketch below). The pipeline can also handle this splitting for you through its max_seq_len and doc_stride parameters.
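A minimal manual version, assuming the qa_pipeline defined earlier (the window and stride sizes here are arbitrary illustration values):
def answer_long_text(question, long_context, window_words=250, stride_words=125):
    # Slide an overlapping window of words over the text and keep the best answer
    words = long_context.split()
    best = {"answer": "", "score": 0.0}
    for start in range(0, len(words), stride_words):
        chunk = " ".join(words[start:start + window_words])
        result = qa_pipeline(question=question, context=chunk)
        if result["score"] > best["score"]:
            best = result
    return best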
Now, a final exercise:
1. Take a Wikipedia article (or a book chapter) you like.
2. Copy 3-4 paragraphs as context.
3. Formulate 5 different questions (easy, hard, ambiguous).
4. Run the QA system and evaluate:
- How many answers are correct?
- Where does it fail?
- How could you improve it?
[Question + Context] → Tokenizer → input_ids → QA Model → start_logits + end_logits
→ Select best start-end combination → Extract tokens → Decode → [Final Answer]

- Question + Context: plain text
- Tokenizer: converts to tokens + separators
- input_ids: token IDs (question + context)
- QA Model: predicts start and end positions
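To tie the whole flow together, here is a minimal sketch of a single helper that goes from question + context to the final answer, reusing the tokenizer, model, and get_best_answer defined earlier:
def answer_question(question, context):
    # Plain text -> token IDs (question + context, with separator tokens)
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
    # Token IDs -> one start logit and one end logit per token
    with torch.no_grad():
        outputs = model(**inputs)
    # Logits -> best valid start-end span -> decoded answer text
    return get_best_answer(outputs.start_logits, outputs.end_logits, inputs.input_ids, tokenizer)

answer, score = answer_question("What mechanism do Transformers use to relate words?", context)
print("Answer:", answer)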
Congratulations! You’ve just built a functional artificial intelligence system, based on one of the world’s most advanced architectures (the Transformer), without training anything, without expensive GPUs, and in less than 50 lines of code.
This system can be the foundation for:
- A chatbot for technical manuals.
- An assistant for reading scientific articles.
- A tutor that answers questions about a book.
And best of all: you now know how it works inside! It’s not a black box. You know what an embedding is, what attention does, how position is encoded, and how the model chooses the answer.