Why These 12 Concepts Decide Your Interview

AI engineering interviews in 2026 have shifted. Panels no longer ask you to implement a linked list or solve a coin-change problem. Instead, they ask: "Explain how attention works and why we scale by √d_k." Or: "Walk me through how you'd design a RAG pipeline for a legal document search tool." These questions have one thing in common — they require deep conceptual fluency, not just surface-level awareness.

This guide covers the 12 concepts that appear most frequently in senior AI engineering interviews at companies like Google DeepMind, Anthropic, OpenAI, Cohere, and fast-growing AI product companies. For each concept, we explain what it is, how it works mechanically, a concrete example, and the exact interview angle interviewers use to test depth. We've also added two bonus concepts that many candidates miss entirely.

Before You Read
If you're just starting your AI engineering journey, visit our GenAI Introduction page first for foundational context. If you're ready to benchmark your current skills, try our Skills Gap Analyser — it'll show you exactly which of these concepts you're weakest on.

1. Embeddings: How Machines Understand Meaning

An embedding is a dense, fixed-length numerical vector that represents the semantic meaning of a piece of text, an image, an audio clip, or any other data. The core idea is that semantically similar things should live close together in vector space — measurable by cosine similarity or dot product.
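
To make "close together in vector space" concrete, here is a minimal sketch of cosine similarity. The three-dimensional vectors are made-up toy values, not the output of any real embedding model, and the variable names are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values chosen for illustration only)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # → True
```

When vectors are normalised to unit length, the dot product and cosine similarity give the same ranking, which is why many vector stores let you choose either.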

Early approaches like Word2Vec and GloVe created word-level embeddings trained on co-occurrence statistics. The famous analogy king − man + woman ≈ queen demonstrates that arithmetic in embedding space corresponds to real-world relationships. Modern systems use transformer-based models like Sentence-BERT (SBERT) and OpenAI's text-embedding-3-small to produce sentence and document embeddings — capturing meaning at a much richer level.

How Embeddings Are Created

A neural network is trained (usually with contrastive loss or masked language modelling) so that its intermediate layer activations form a useful representation. For text, you pass the input through a transformer, then pool the last hidden states — typically by taking the mean of all token vectors, or by using the [CLS] token specifically added for classification. The resulting vector might be 384, 768, or 1536 dimensions.
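
The pooling step can be sketched in a few lines of NumPy. This assumes you already have the last-layer hidden states as an array; the numbers below are invented, and `mean_pool` is a hypothetical helper name rather than a library function.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors, ignoring padding positions.

    hidden_states: (seq_len, dim) array of last-layer token vectors
    attention_mask: (seq_len,) array, 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=0)
    count = max(mask.sum(), 1.0)  # avoid division by zero on an all-padding input
    return summed / count

# Hypothetical 4-token sequence with 3-dim hidden states; the last token is padding
hidden = np.array([[1.0, 0.0, 2.0],
                   [3.0, 2.0, 0.0],
                   [2.0, 4.0, 1.0],
                   [9.0, 9.0, 9.0]])
mask = np.array([1, 1, 1, 0])

sentence_vec = mean_pool(hidden, mask)
cls_vec = hidden[0]  # the alternative: take the first ([CLS]) position
print(sentence_vec)  # → [2. 2. 1.]
```

Masking matters in batched inference: without it, padding vectors would be averaged into the sentence embedding.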

Interview Insight
"What's the difference between sparse and dense embeddings?" Dense embeddings (Word2Vec, SBERT) represent meaning in a compact continuous space — every dimension contributes. Sparse embeddings (TF-IDF, BM25) are high-dimensional vectors where most values are zero, and each non-zero dimension maps directly to a vocabulary term. Sparse excels at exact keyword matching; dense excels at semantic similarity. Hybrid retrieval systems use both.

| Model | Dimensions | Best For | Cost |
|---|---|---|---|
| text-embedding-3-small | 1536 | General purpose, RAG | $0.02 / 1M tokens |
| text-embedding-3-large | 3072 | High-accuracy retrieval | $0.13 / 1M tokens |
| SBERT (all-MiniLM-L6) | 384 | On-device, fast | Free / open source |
| Cohere embed-v3 | 1024 | Multilingual | $0.10 / 1M tokens |
| BGE-M3 | 1024 | Hybrid dense+sparse | Free / open source |

2. Tokenization: The Hidden Step Before Every LLM Call

Before any text reaches a language model, it is converted into tokens — integer IDs from a fixed vocabulary. A token is roughly 4 characters of English text on average, but this varies significantly: "the" is 1 token, "ChatGPT" might be 3 tokens, and a rare word like "xylophone" could be 4 tokens split at subword boundaries.

The dominant algorithm is Byte Pair Encoding (BPE), used by GPT-4, LLaMA, and Mistral. BPE starts with individual characters (or raw bytes, in byte-level variants) and iteratively merges the most frequent adjacent pair into a new token until it reaches the desired vocabulary size (typically 32k–200k tokens). Google's models use SentencePiece, a tokenizer library that trains BPE or unigram models directly on raw text and treats whitespace as an ordinary symbol, so it handles any language without language-specific pre-tokenization.
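
The merge loop is easier to see on a toy corpus. The sketch below is a simplified character-level BPE trainer written for illustration; production tokenizers add byte-level fallback, pre-tokenization rules, and far larger corpora. The function names and the four-word corpus are assumptions.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # three merges; real vocabularies use tens of thousands
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

After a couple of merges the frequent suffix "est" becomes a single symbol, which is the same effect that lets real tokenizers represent common word pieces in one token.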

Why Tokenization Matters for Engineers

Context windows are measured in tokens, not words or characters. GPT-4o's 128k context = roughly 96,000 words or about 300 pages of text. If you're processing code, tokens are spent more liberally — Python code is about 1 token per 3 characters. This affects cost, latency, and chunking strategy in RAG pipelines.

Python — inspect tokens with tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = "AI engineers who understand tokenization write better prompts."
tokens = enc.encode(text)

print(len(tokens))           # number of tokens (exact count depends on the encoding)
print(enc.decode(tokens))    # → reconstructed text
# Each token decoded:
for t in tokens:
    print(repr(enc.decode([t])))
Interview Insight
"Why does tokenisation affect model performance on non-English languages?" English text is well-represented in most tokenizer vocabularies, so a typical word maps to 1–2 tokens. Languages like Hindi, Chinese, or Arabic are less well covered, so they need more tokens per word: the same sentence in Hindi might cost around 3× as many tokens as its English equivalent. Non-English prompts therefore consume the context window faster, which raises cost and reduces how much content fits in a single call.

3. The Attention Mechanism: Why Transformers Changed Everything

The attention mechanism, introduced in the 2017 paper "Attention Is All You Need", is the mathematical engine inside every modern language model. It allows the model to weigh how relevant each word in an input is to every other word simultaneously — something recurrent networks (RNNs/LSTMs) could not do efficiently because they processed text sequentially.

Query, Key, Value — the Core Formula

Each token is projected into three vectors: a Query (Q), a Key (K), and a Value (V). The attention score between two positions is computed as:

Attention Formula
Attention(Q, K, V) = softmax( Q × Kᵀ / √d_k ) × V

# Q × Kᵀ  → dot product: how much each query "matches" each key
# / √d_k  → scale by the square root of the key dimension (keeps softmax from saturating)
# softmax → convert raw scores to probabilities (sum = 1)
# × V     → weighted sum of value vectors

Think of it like a search engine for each word: the query is "what am I looking for?", the keys are "what does each other word offer?", and the values are "what information do I actually extract?" The word "bank" in "river bank" attends strongly to "river" and "water" — but the same word in "bank account" attends to "money" and "deposit". This is how transformers resolve ambiguity that stumped previous NLP approaches.
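
The formula translates almost line for line into NumPy. This is a minimal single-head sketch with random inputs, with no masking, batching, or learned projections; the function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v), weights (n_q, n_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how well each query matches each key
    weights = softmax(scores, axis=-1)   # one probability distribution per query
    return weights @ V, weights          # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)          # → (4, 8)
print(weights.sum(axis=-1))  # each row sums to 1 (up to floating-point error)
```

In a real transformer, Q, K, and V come from multiplying the token representations by three learned weight matrices, and a causal mask is added to the scores for decoder-only models.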

Multi-Head Attention

Instead of computing one attention function, transformers run h parallel attention heads (typically 8–96), each learning a different relationship pattern. One head might track syntactic dependencies, another coreference, another semantic similarity. Their outputs are concatenated and projected back to the model dimension. This parallel structure is also why transformers train so efficiently on GPUs.

Interview Insight
"Why do we divide by √d_k in the attention formula?" When d_k is large, dot products grow in magnitude, pushing softmax into regions with extremely small gradients (the "vanishing gradient in softmax" problem). Dividing by √d_k keeps dot products in a stable range, ensuring softmax produces meaningful probability distributions rather than nearly one-hot outputs. This was a key insight that made transformers trainable at scale.

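The scaling factor follows from a short variance argument. Assuming, for illustration, that the components of a query q and a key k are independent with mean 0 and variance 1 (an idealisation, not a property guaranteed in trained models):

```latex
q \cdot k = \sum_{i=1}^{d_k} q_i k_i,
\qquad
\mathbb{E}[q \cdot k] = 0,
\qquad
\operatorname{Var}(q \cdot k) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k
\quad\Longrightarrow\quad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
```

Without the division, the typical score magnitude grows like √d_k, so for d_k = 128 the logits would be roughly 11 times larger and softmax would sit much closer to a one-hot output.
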
4. Fine-Tuning: Teaching an LLM Your Specific Domain

Fine-tuning adapts a pre-trained foundation model to a specific task, domain, or style by continuing training on a curated dataset. The result is a model that retains the broad knowledge from pre-training but performs significantly better on the target distribution. A legal LLM, a medical coding assistant, and a code-review bot are all examples of fine-tuned models.

Full Fine-Tuning vs Parameter-Efficient Methods

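Full fine-tuning updates all model weights. For a 7B parameter model, that means training 7 billion floats. The weights alone occupy about 14 GB in FP16, but training also stores gradients and optimizer states: with mixed-precision Adam that is roughly 16 bytes per parameter, or about 112 GB before activations, which is why 140GB+ of GPU memory is a realistic budget. This is expensive and risks catastrophic forgetting, where the model loses general capabilities while gaining task-specific ones.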

LoRA (Low-Rank Adaptation) is the dominant efficient alternative. Instead of updating the weight matrices directly, LoRA freezes them and adds a pair of small low-rank matrices (A and B) alongside selected weight matrices, typically the attention projections in each layer. Only A and B are trained, typically less than 1% of total parameters. At inference, the product BA can be merged into the base weights, so the adapter adds no latency. QLoRA goes further by quantizing the frozen base model to 4-bit, allowing 70B model fine-tuning on a single 48GB GPU.
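
A minimal NumPy sketch of the idea for a single weight matrix follows. The layer size, rank, and `alpha` scaling value are hypothetical, and the random update to B merely stands in for gradient steps; no actual training happens here.

```python
import numpy as np

d_in, d_out, r, alpha = 512, 512, 16, 32    # hypothetical layer size and LoRA settings
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init so training starts exactly at W

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x); only A and B would receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

def merged_weight():
    """Fold the adapter into W so inference uses a single matmul."""
    return W + (alpha / r) * (B @ A)

B += rng.normal(scale=0.01, size=B.shape)   # stand-in for a few gradient updates

x = rng.normal(size=d_in)
trainable = A.size + B.size
print(trainable / W.size)                                  # → 0.0625 for this small layer
print(np.allclose(lora_forward(x), merged_weight() @ x))   # → True
```

The ratio is 6.25% here only because the toy layer is small. For a 4096 × 4096 projection at r = 16 it is 2 × 16 / 4096, about 0.8%, and it drops further across a whole model because only some matrices get adapters.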

| Method | Params Trained | GPU Memory (7B) | Best For |
|---|---|---|---|
| Full Fine-Tuning | 100% | ~140 GB | Maximum accuracy, large budget |
| LoRA (r=16) | ~0.5% | ~20 GB | Most production use cases |
| QLoRA (4-bit) | ~0.5% | ~6 GB | Consumer GPU fine-tuning |
| Prompt Tuning | <0.01% | ~16 GB | Style/format adaptation only |

Interview Insight
"When should you fine-tune vs use RAG?" Fine-tune when you need the model to change its behaviour, style, or format — e.g., always respond in legal language, generate structured JSON, avoid certain topics. Use RAG when you need the model to access fresh or private knowledge — e.g., your latest product docs, last week's news. Combining both is often the production answer: fine-tune for behaviour, RAG for knowledge.

5. Quantization: Running Bigger Models on Smaller Hardware

Quantization reduces the numerical precision of model weights, converting 32-bit floating point (FP32) values into lower-precision formats like FP16, INT8, or INT4. The primary benefit is reduced memory usage and faster inference; the cost is a small drop in accuracy. In practice, well-quantized 8-bit models lose less than 1% on most benchmarks and 4-bit models around 1–3%, making quantization standard for production deployment.
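
The simplest scheme, symmetric "absmax" INT8 quantization of one tensor, can be sketched as below. This is a baseline for intuition only; it is not GPTQ, AWQ, or any specific library's implementation, and the weight matrix is random toy data.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)  # toy weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)        # → 0.25 (INT8 storage vs FP32)
print(np.abs(w - w_hat).max())    # worst-case rounding error, about half a quantization step
```

A single outlier weight inflates `scale` and wastes resolution for every other value, which is the motivation for per-channel or per-group scales in practical methods.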

Precision Formats at a Glance

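| Format | Bits | 7B Model Size | 70B Model Size | Quality Loss |
|---|---|---|---|---|
| FP32 | 32 | 28 GB | 280 GB | None (baseline) |
| BF16 / FP16 | 16 | 14 GB | 140 GB | Negligible |
| INT8 | 8 | 7 GB | 70 GB | ~0.1–0.5% |
| INT4 (GPTQ) | 4 | 3.5 GB | 35 GB | ~1–3% |
| INT4 (GGUF Q4_K_M) | 4–5 avg | ~4.1 GB | ~41 GB | ~1–2% |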
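
The size columns are plain arithmetic: parameters × bits ÷ 8. A small helper (hypothetical name, weights only, ignoring KV cache, activations, and per-group scale overhead) reproduces them:

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes); excludes KV cache and activations."""
    return n_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(name, weight_memory_gb(7e9, bits), weight_memory_gb(70e9, bits))
```

Formats such as Q4_K_M land slightly above the pure 4-bit figure because they store scales alongside the weights and keep some tensors at higher precision.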

Key Quantization Methods

GPTQ is a post-training quantization method that minimises the reconstruction error layer by layer using a small calibration dataset. AWQ (Activation-aware Weight Quantization), also post-training, protects the most important weight channels from aggressive quantization and often achieves better accuracy than GPTQ at the same bit-width. GGUF is the file format used by llama.cpp for CPU and mixed CPU/GPU inference. It supports mixed-precision quantization: Q4_K_M, for example, quantizes most tensors to 4-bit but keeps a subset of sensitive attention and feed-forward tensors at higher precision.

Interview Insight
"What's the difference between PTQ and QAT?" Post-Training Quantization (PTQ) applies quantization after training is complete — fast and no retraining required, but slightly lower accuracy. Quantization-Aware Training (QAT) simulates quantization during training so the model learns to be robust to low-precision arithmetic — much higher accuracy but requires the full training loop. For LLMs, PTQ (GPTQ, AWQ) is the standard due to retraining cost.

6. Vector Databases: Long-Term Memory for AI Applications

A vector database stores embeddings alongside their source data and metadata, and specialises in Approximate Nearest Neighbour (ANN) search — finding the most semantically similar items to a query vector in milliseconds, even across billions of entries. This is the infrastructure layer that makes RAG pipelines and semantic search scalable.

How ANN Search Works: HNSW

The most widely used ANN algorithm is HNSW (Hierarchical Navigable Small World). It builds a multi-layer graph where each node is connected to its closest neighbours. At query time, the search starts at the top (sparse) layer and greedily moves to whichever neighbour is closest to the query vector, then descends a layer and repeats, so only a small fraction of the dataset is ever visited. With well-tuned parameters this typically gives millisecond-scale retrieval at 95–99% recall, whereas a brute-force exhaustive search over billions of vectors would take seconds.
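
HNSW itself is too long to reproduce here, but the exhaustive search it approximates fits in a few lines and is what you would use to measure an index's recall. The sketch below is that brute-force baseline, not HNSW; the data is random and the function names are illustrative.

```python
import numpy as np

def exact_top_k(query, vectors, k):
    """Exhaustive cosine-similarity search: O(N * d) work per query."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that an approximate index returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64))
query = vectors[42] + 0.01 * rng.normal(size=64)  # a slightly perturbed copy of item 42

ids, scores = exact_top_k(query, vectors, k=5)
print(ids[0])  # → 42
```

To evaluate an ANN index, run the same queries through both paths and report `recall_at_k` alongside latency; tuning the index's build and search parameters trades one against the other.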

| Database | Hosting | Best For | Standout Feature |
|---|---|---|---|
| Pinecone | Managed cloud | Production at scale | Serverless, namespace isolation |
| Qdrant | Cloud + self-hosted | Filtered search | Best payload filtering performance |
| Weaviate | Cloud + self-hosted | Hybrid search | Built-in BM25 + vector hybrid |
| pgvector | Postgres extension | Existing PG stack | Zero new infra, ACID transactions |
| FAISS | In-process (library) | Research / local dev | Extremely fast, no server |
| ChromaDB | Embedded / cloud | Prototyping | Simplest API for getting started |

Interview Insight
"What is the trade-off between pre-filtering and post-filtering in vector search?" Pre-filtering applies metadata filters before the ANN search. Every result is guaranteed to satisfy the filter, but a very restrictive filter can fragment the index graph and hurt recall or speed. Post-filtering retrieves top-K by vector similarity first, then applies filters. It leaves the ANN search untouched but may return fewer than K results. Most production vector DBs now support filter-aware indexes (for example Qdrant's payload indexes) or partitioning (Pinecone namespaces), which largely mitigate this trade-off.

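The "fewer than K results" problem with post-filtering is easy to reproduce. The toy index below uses invented similarity scores and a hypothetical `lang` metadata field, and the ranking is a plain sort rather than a real ANN search.

```python
# Toy index: (doc_id, similarity_to_query, metadata); all values are made up
index = [
    ("a", 0.95, {"lang": "en"}),
    ("b", 0.90, {"lang": "de"}),
    ("c", 0.85, {"lang": "de"}),
    ("d", 0.60, {"lang": "en"}),
    ("e", 0.40, {"lang": "en"}),
]

def top_k(items, k):
    """Return the k items with the highest similarity score."""
    return sorted(items, key=lambda item: item[1], reverse=True)[:k]

def post_filter(items, k, predicate):
    """Rank first, then drop non-matching hits: may return fewer than k."""
    return [item for item in top_k(items, k) if predicate(item[2])]

def pre_filter(items, k, predicate):
    """Filter first, then rank only the matching subset."""
    return top_k([item for item in items if predicate(item[2])], k)

def is_english(meta):
    return meta["lang"] == "en"

print([doc for doc, _, _ in post_filter(index, 3, is_english)])  # → ['a']
print([doc for doc, _, _ in pre_filter(index, 3, is_english)])   # → ['a', 'd', 'e']
```

A common workaround for post-filtering is to over-fetch (for example, retrieve 5 × K) before filtering, at the cost of extra latency.
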
7. Prompt Engineering: The Art of Talking to LLMs

Prompt engineering is the discipline of designing inputs to language models to elicit desired outputs reliably, accurately, and efficiently. Done well, it can close most of the gap between a base model and a fine-tuned one — without touching model weights. Done poorly, it's the most common source of LLM failures in production.

Core Techniques

Zero-shot prompting asks the model to perform a task with no examples: "Classify this review as positive or negative." Few-shot prompting provides 2–5 demonstrations before the actual input, significantly improving accuracy for structured tasks. Chain-of-Thought (CoT) adds the phrase "Let's think step by step" or provides reasoning examples — it forces the model to externalise intermediate reasoning, dramatically improving performance on multi-step logic and maths. Google's 2022 paper showed CoT improved arithmetic accuracy from 18% to 57% on GSM8K with PaLM.
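
A few-shot prompt is just careful string assembly. The helper below is a hypothetical sketch; the instruction text, field labels, and example reviews are invented for illustration.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labelled demonstrations, and the new input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)

examples = [
    ("Arrived quickly and works perfectly.", "positive"),
    ("Broke after two days.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify each review as positive or negative.",
    examples,
    "Battery life is far better than I expected.",
)
print(prompt)
```

Ending the prompt on the same label prefix used in the demonstrations ("Sentiment:") nudges the model to answer in the same format, which makes the output easier to parse.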

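Python: structured output with JSON mode
from openai import OpenAI
import json

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        # JSON mode requires the word "JSON" in the prompt; naming the keys keeps the shape predictable
        {"role": "system", "content": "Extract job details as JSON with keys: title, company, location, salary."},
        {"role": "user", "content": "Senior ML Engineer at Anthropic, London, £180k"}
    ],
    response_format={"type": "json_object"}  # guarantees syntactically valid JSON, not a schema
)
data = json.loads(response.choices[0].message.content)
# Expected shape (exact values may vary between runs):
# {"title": "Senior ML Engineer", "company": "Anthropic",
#  "location": "London", "salary": "£180k"}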
Interview Insight
"What are the main failure modes of prompt engineering?" The four most common are: (1) Ambiguity — vague instructions that the model interprets differently across runs. (2) Prompt injection — user input that overrides system instructions. (3) Over-length context — the "lost in the middle" problem where models ignore information buried in the centre of a long prompt. (4) Fragility to paraphrasing — prompts that work perfectly in one phrasing but fail with minor wording changes.

For hands-on practice with prompt engineering, our AI Interview Prep tool has 150+ practice questions specifically on prompt design, covering system prompts, CoT, structured outputs, and safety techniques used at leading AI labs.

8. RAG Pipelines: Giving LLMs Access to Fresh Knowledge

Retrieval-Augmented Generation (RAG) combines an LLM with an external knowledge store, allowing the model to answer questions about information that was not in its training data — your internal documents, last week's news, product knowledge bases. The model is not fine-tuned; instead, relevant context is retrieved and injected into the prompt at query time.

The Seven Stages of a Production RAG Pipeline

A naive RAG implementation is straightforward. A production RAG system has seven distinct stages, each with its own failure modes:

  1. Ingest — Load source documents (PDFs, web pages, databases, APIs).
  2. Parse — Extract clean text, preserving structure (tables, headings, code blocks).
  3. Chunk — Split documents into overlapping segments. Strategy matters: fixed-size (simple), sentence (natural), semantic (highest quality, highest cost).
  4. Embed — Convert each chunk to a vector using an embedding model.
  5. Index — Store vectors in a vector database alongside source text and metadata.
  6. Retrieve — At query time, embed the user question and fetch the top-K most similar chunks.
  7. Generate — Inject retrieved chunks into the LLM prompt as context and generate the final answer.
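
Stages 3 to 7 can be sketched end to end in plain Python. To stay self-contained, this uses a toy bag-of-words "embedding" and an in-memory list as the index; a real system would swap in an embedding model and a vector database. All function names, the sample document, and the prompt wording are assumptions, and the final LLM call is left out.

```python
import math
import re
from collections import Counter

def chunk(text, size=12, overlap=4):
    """Fixed-size word chunks with overlap (stage 3)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy bag-of-words 'embedding' (stage 4); stands in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, index, k=2):
    """Embed the question and return the k most similar chunks (stage 6)."""
    q = embed(question)
    return sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)[:k]

def build_prompt(question, hits):
    """Inject retrieved chunks as context; this string is the input to stage 7."""
    context = "\n".join(f"- {text}" for text, _ in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

document = (
    "A refund is available within 30 days of purchase. "
    "Shipping to Europe takes five to seven business days. "
    "Support is open Monday to Friday from 9am to 5pm."
)
index = [(c, embed(c)) for c in chunk(document)]  # stages 3 to 5
question = "How long is the refund window after purchase?"
hits = retrieve(question, index, k=1)
print(build_prompt(question, hits))
```

Even in this toy version the chunk boundaries cut sentences in half, which hints at why chunking strategy has such a large effect on retrieval quality.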

Advanced Retrieval: Hybrid Search and Re-Ranking

Simple dense retrieval misses keyword-critical queries such as exact product codes or rare names. Hybrid retrieval combines dense (embedding) similarity with sparse (BM25 keyword) scoring; a 70/30 dense-to-sparse weighting is a common starting point, tuned per dataset. After retrieval, a cross-encoder re-ranker (e.g., Cohere Rerank, BGE-Reranker) scores each retrieved chunk jointly with the full query and reorders the results before they are passed to the LLM. This two-stage approach (fast ANN retrieval → accurate cross-encoder re-ranking) is a widely used pattern for production RAG.
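
One simple way to combine the two score lists is a weighted sum after min-max normalisation, sketched below. The scores are invented, the helper names are hypothetical, and other fusion methods (such as reciprocal rank fusion) are equally common; this is one option, not the canonical one.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1] so the two scorers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(dense, sparse, dense_weight=0.7):
    """Weighted sum of normalised dense and sparse scores; a missing score counts as 0."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    docs = set(dense_n) | set(sparse_n)
    fused = {
        doc: dense_weight * dense_n.get(doc, 0.0) + (1 - dense_weight) * sparse_n.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical scores for one query: cosine similarities and BM25 scores
dense = {"doc1": 0.82, "doc2": 0.79, "doc3": 0.40}
sparse = {"doc2": 12.5, "doc3": 9.0, "doc4": 1.0}

print(hybrid_rank(dense, sparse))  # → ['doc2', 'doc1', 'doc3', 'doc4']
```

Here doc2 overtakes doc1 because it scores well on both signals, while doc1 has the best embedding match but no keyword hit.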

Interview Insight
"What are the main failure modes in a RAG system?" (1) Retrieval failures — wrong chunks retrieved because the query embedding doesn't match the document embedding style. Fix: query rewriting, HyDE (hypothetical document embeddings). (2) Lost-in-the-middle — LLMs pay less attention to context in the middle of long prompts. Fix: put most relevant chunks at the start and end. (3) Faithfulness failures — LLM hallucinates facts not in the retrieved context. Fix: add a faithfulness instruction in the system prompt; use LLM-as-judge evaluation.

9. AI Agents: LLMs That Can Take Action

An AI agent is an LLM-powered system that can perceive its environment, reason about a goal, select actions (tools), execute them, observe results, and repeat — until the goal is reached or a stopping condition is hit. The shift from an LLM answering a single question to an agent completing a multi-step task is one of the most consequential transitions in applied AI engineering. See our full Agentic AI Roadmap for a complete learning path.

The ReAct Pattern

The most widely used agent pattern is ReAct (Reason + Act). The LLM alternates between two steps: Thought (reasoning about what to do next) and Action (calling a tool). After each action, the agent receives an Observation (tool result) and decides whether to continue or finish. This loop continues until the agent produces a final answer.

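ReAct Agent Loop (Python-style pseudocode)
# `llm` and `tools` stand for your own model wrapper and tool registry
def run_agent(goal, llm, tools, max_steps=10):
    history = []
    for _ in range(max_steps):                      # hard cap so the loop always terminates
        thought = llm.think(goal, history)          # "I need to search for this"
        action = llm.choose_tool(thought)           # web_search("Anthropic funding 2026")
        result = tools[action.name](action.args)    # Observation returned by the tool
        history.append((thought, action, result))   # one tuple per step

        if llm.is_final_answer(result):
            return llm.format_answer(result)
    return None                                     # step budget exhausted without an answer

answer = run_agent("What is Anthropic's latest funding round amount?", llm, tools)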

Agent Memory Types

Agents use four types of memory: Short-term (the current context window, i.e. what the agent can see right now), Long-term (a vector store of past interactions and facts), Episodic (a record of past task completions used for planning), and Semantic (general world knowledge from the base model's training). Frameworks like LangGraph and CrewAI provide explicit abstractions for the stores that live outside the model; semantic knowledge stays in the model weights.

Interview Insight
"What's the difference between a chain and an agent?" A chain is a fixed, deterministic sequence of LLM calls and tool invocations — the flow is hardcoded by the developer. An agent is dynamic — the LLM itself decides what action to take next based on its observations. Chains are more predictable and safer for regulated use cases. Agents are more capable for open-ended tasks but introduce non-determinism and require more robust evaluation and guardrails.

10. Mixture of Experts: How the Largest Models Scale Efficiently

Mixture of Experts (MoE) is an architecture that replaces each dense feed-forward layer in a transformer with a set of expert sub-networks, plus a router that selects which experts to activate for each token. The key insight: you can have a model with 47 billion total parameters but activate only 13 billion per token — getting the knowledge capacity of a large model at the inference cost of a smaller one.

How the Router Works

The router is a learned linear layer that takes a token's hidden state as input and outputs a probability distribution over all experts. Top-K routing (K=1 or K=2) selects the highest-scoring experts per token. Mistral's Mixtral 8x7B uses 8 experts with top-2 routing — each token activates 2 of the 8 experts, giving 13B active parameters (close to a standard 13B dense model) while the full model has 47B parameters stored in memory.
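
Top-2 routing for a single token can be sketched as follows. The experts here are random linear maps standing in for real feed-forward blocks, the sizes are tiny, and load-balancing losses and batching are omitted; names such as `top_k_route` are illustrative.

```python
import numpy as np

def top_k_route(hidden, router_weights, k=2):
    """Return the chosen expert indices and their renormalised gate weights for one token."""
    logits = router_weights @ hidden               # one logit per expert
    top = np.argsort(-logits)[:k]                  # indices of the k highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    return top, gates / gates.sum()                # softmax over the selected experts only

def moe_layer(hidden, router_weights, experts, k=2):
    """Weighted sum of the outputs of the k selected experts; the others are never run."""
    top, gates = top_k_route(hidden, router_weights, k)
    return sum(g * experts[i](hidden) for i, g in zip(top, gates))

rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
router_weights = rng.normal(size=(n_experts, d_model))
# Each "expert" is a stand-in linear map; real experts are feed-forward blocks
expert_mats = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
experts = [lambda h, M=M: M @ h for M in expert_mats]

token = rng.normal(size=d_model)
top, gates = top_k_route(token, router_weights)
print(len(top), gates.sum())                              # 2 experts selected; gates sum to 1
print(moe_layer(token, router_weights, experts).shape)    # → (16,)
```

Only two of the eight expert matrices are multiplied for this token, yet all eight have to exist in memory, which is the compute-versus-memory trade-off in miniature.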

| Model | Total Params | Active Params/Token | Notes |
|---|---|---|---|
| Mixtral 8x7B | 47B | 13B | Beats LLaMA 2 70B on most benchmarks |
| Mixtral 8x22B | 141B | 39B | Near GPT-4 on coding tasks |
| GPT-4 (rumoured) | ~1.8T | ~220B | 16 experts, top-2 routing |
| DeepSeek-V3 | 671B | 37B | State-of-the-art open MoE, 2025 |

Interview Insight
"Why do MoE models use more memory but similar compute to dense models?" All expert weights must be loaded into GPU memory (VRAM) even though only K experts are active per token, so Mixtral 8x7B needs ~90GB of VRAM in 16-bit precision to serve, despite only computing with 13B parameters per token. This is the core MoE trade-off: compute-efficient but memory-hungry. It's why MoE models are typically distributed across multiple GPUs (expert parallelism) in production.

11. RLHF & Alignment: How Models Learn to Be Helpful

Reinforcement Learning from Human Feedback (RLHF) is the training technique that turns a capable but raw pre-trained language model into a helpful, harmless, and honest AI assistant. Without RLHF, a model trained purely on internet text will confidently produce harmful content, hallucinations, and responses that don't follow user intent. RLHF is why ChatGPT, Claude, and Gemini behave the way they do.

The Three-Stage RLHF Pipeline

  1. Supervised Fine-Tuning (SFT) — Fine-tune the base model on high-quality human-written demonstrations of desired behaviour. This produces the SFT model — helpful but not yet aligned.
  2. Reward Model (RM) Training — Present human labellers with pairs of model responses to the same prompt and ask them which is better. Train a separate neural network (the reward model) to predict human preference scores. This encodes "what humans like" into a differentiable signal.
  3. RL Optimisation (PPO) — Use the reward model as a reward signal to fine-tune the SFT model via Proximal Policy Optimisation (PPO). The model learns to generate responses that score highly on the reward model, with a KL-divergence penalty to prevent it from drifting too far from the SFT model (which prevents reward hacking).
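
Stages 2 and 3 are usually written as the two objectives below. The notation is an assumption introduced here, not taken from the text above: r_φ is the reward model, y_w and y_l are the preferred and rejected responses, π_θ is the policy being trained, π_SFT is the frozen SFT model, σ is the logistic function, and β sets the strength of the KL penalty.

```latex
% Stage 2: reward model trained on preference pairs (Bradley-Terry form)
\mathcal{L}_{\mathrm{RM}}(\phi)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big]

% Stage 3: maximise reward while staying close to the SFT model
\max_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \big[ r_\phi(x, y) \big]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \big]
```

A larger β keeps the policy closer to the SFT model and limits reward hacking, at the price of smaller gains in reward.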

DPO: The Simpler Alternative

Direct Preference Optimisation (DPO), introduced in 2023, targets the same objective as RLHF without a separate reward model or RL optimisation loop. It reformulates the RLHF objective as a supervised classification problem directly on the preference pairs. DPO is simpler, more stable, and often nearly as effective; many open-source fine-tuned models (for example Zephyr and Llama 3 Instruct) use DPO or its variants somewhere in their post-training.
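
For a single preference pair, the DPO loss needs only four numbers: the summed log-probability of the chosen and rejected responses under the policy and under a frozen reference model. The sketch below computes it with invented log-probabilities; `beta=0.1` is an assumed, commonly used default rather than a value from this article.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed log-probabilities of each response."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp        # how much the policy up-weights the winner
    rejected_margin = policy_rejected_logp - ref_rejected_logp  # and the loser
    logits = beta * (chosen_margin - rejected_margin)
    return math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits))

# Hypothetical log-probabilities for one (chosen, rejected) pair
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)       # policy identical to the reference
after_update = dpo_loss(-10.0, -18.0, -12.0, -15.0)  # policy now favours the chosen answer

print(round(at_init, 4))       # → 0.6931 (log 2)
print(after_update < at_init)  # → True
```

The reference log-probabilities play the role of the KL penalty in PPO-based RLHF: the loss rewards moving probability toward the chosen response only relative to where the reference model already was.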

Interview Insight
"What is reward hacking and how do you prevent it?" Reward hacking occurs when the model finds ways to score highly on the reward model without actually producing better responses, e.g., generating very long responses (if human raters preferred verbosity) or using flattery. Prevention: (1) A KL-divergence penalty keeps the policy close to the SFT model. (2) Constitutional AI (Anthropic's approach) uses a set of written principles to generate AI feedback, complemented by red-teaming to surface reward model failures. (3) Regular reward model updates as hacking patterns are discovered.

12. Evaluation & Hallucination: Knowing When Your AI Is Actually Working

Evaluation is widely considered the hardest unsolved problem in LLM engineering. Unlike traditional software where a test either passes or fails, LLM outputs exist on a spectrum of quality — and that quality depends on the user's intent, the task, and often the cultural context. Getting evaluation right is what separates AI products that ship reliably from those that require constant firefighting.

Reference-Based Metrics

BLEU (Bilingual Evaluation Understudy) and ROUGE measure n-gram overlap between the model output and a human reference. They're cheap and automatic, making them useful for machine translation and summarisation benchmarks. Their weakness: a response can score 0 BLEU while being semantically identical to the reference (just paraphrased), or score high BLEU while being factually wrong.
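
Both weaknesses show up even with the simplest overlap metric. The function below is a ROUGE-1-style unigram F1 written for illustration (not the official BLEU or ROUGE implementation), applied to three invented sentences.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """ROUGE-1-style F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # Counter intersection clips repeated words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the meeting was postponed until friday"
paraphrase = "they delayed it to the end of the week"  # same meaning, different words
wrong = "the meeting was postponed until monday"       # one word changes the fact

print(unigram_f1(paraphrase, reference))
print(unigram_f1(wrong, reference))
```

The factually wrong sentence scores far higher than the faithful paraphrase, which is why overlap metrics are usually paired with semantic or judge-based evaluation.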

LLM-as-Judge

The dominant modern approach: use a powerful LLM (GPT-4o, Claude Opus) to score model outputs on dimensions like correctness, helpfulness, conciseness, and safety. This scales to thousands of samples without human raters and correlates well with human judgement (published comparisons typically report around 80–85% agreement). The key risk is judge bias: LLM judges tend to prefer longer answers, answers shown first in a pairwise comparison, and responses from their own model family.

RAG-Specific Evaluation: RAGAs

RAGAs (Retrieval Augmented Generation Assessment) is a widely used evaluation framework for RAG systems. It measures four dimensions, mostly without human labels: Faithfulness (does the answer contain only facts from the retrieved context?), Answer Relevancy (does the answer address the question?), Context Precision (how relevant are the retrieved chunks to the question?), and Context Recall (did we retrieve all the relevant chunks? This one needs a reference answer to compare against).

Interview Insight
"How would you detect and reduce hallucinations in a production RAG system?" Three-layer approach: (1) Detect — use RAGAs Faithfulness score to flag responses that introduce facts not in the retrieved context. (2) Reduce — improve retrieval recall so the model has the right information; add a system prompt instruction like "Only use information explicitly present in the provided context. If unsure, say you don't know." (3) Monitor — log all flagged responses, sample for human review weekly, and retrain or update prompts based on failure patterns.
"Understanding these 12 concepts deeply — not just being able to recite definitions — is what separates the engineers who get offers from those who don't."

What to Study Next

Beyond the 12 concepts above, strong candidates in 2026 are expected to have working knowledge of context window management (KV cache, sliding window attention, positional encoding variants like RoPE and ALiBi), model serving infrastructure (vLLM, TGI, ONNX Runtime for inference optimisation), and AI safety basics (red-teaming, jailbreaking patterns, constitutional constraints). These topics appear in senior interviews at AI-first companies and frontier labs.

Practice These Concepts
Our AI Interview Prep tool contains 200+ questions specifically covering all 12 concepts in this article, with model answers and difficulty ratings. Use our Skills Gap Analyser to identify which concepts you're weakest on and get a personalised study plan. If you want a 1:1 mock interview session with detailed feedback, book a free session here.

Frequently Asked Questions

How long does it take to become comfortable with all 12 concepts?
With a structured study plan of 2 hours per day, most engineers reach interview-ready depth on all 12 in 6–8 weeks. The concepts cluster naturally: start with embeddings + tokenization + attention (foundational), then fine-tuning + quantization + vector databases (infrastructure), then RAG + agents + evaluation (application layer), then MoE + RLHF (advanced theory). Our GenAI Engineer Roadmap sequences this exactly.
Do I need to implement these from scratch or just understand them conceptually?
It depends on the role. For senior AI engineer roles at product companies, conceptual depth plus practical API-level experience is sufficient. For ML research or inference engineering roles at frontier labs (Anthropic, DeepMind, OpenAI), you will be asked to implement attention from scratch, derive the RLHF objective, or debug quantization error at the weight level. Use the job description as a guide — if it mentions CUDA, Triton, or training infrastructure, expect deeper implementation questions.
Which concept trips up candidates the most in interviews?
Evaluation — by a wide margin. Most candidates can explain RAG and attention clearly but struggle when asked "how do you know your AI system is working correctly?" Interviewers at senior levels want to hear about hallucination detection, faithfulness scoring, LLM-as-judge design, and production monitoring — not just "I check accuracy on a test set." Building or contributing to an evaluation system is the fastest way to differentiate yourself.
Are these concepts relevant for MLOps and data engineering roles too?
Yes, increasingly so. MLOps engineers building LLM serving infrastructure need deep knowledge of quantization, KV-cache management, and evaluation pipelines. Data engineers building RAG systems need to understand embeddings, chunking strategies, and vector database indexing. Even product managers at AI companies are expected to understand tokenization, context windows, and hallucination patterns to write accurate specs and interpret model behaviour reports.
Is RAG being replaced by longer context windows?
Not yet, and probably not fully. While models now support 1M+ token contexts, simply stuffing all your documents into the context window has problems: cost scales linearly with context length, latency increases, and the "lost in the middle" effect means the model ignores relevant information that isn't near the start or end. RAG retrieves only what's needed — it's faster, cheaper, and often more accurate for large knowledge bases. The practical answer in 2026: use RAG for large corpora (>500 documents), and direct context injection for small, focused knowledge sets.