Why These 12 Concepts Decide Your Interview
AI engineering interviews in 2026 have shifted. Panels no longer ask you to implement a linked list or solve a coin-change problem. Instead, they ask: "Explain how attention works and why we scale by √d_k." Or: "Walk me through how you'd design a RAG pipeline for a legal document search tool." These questions have one thing in common — they require deep conceptual fluency, not just surface-level awareness.
This guide covers the 12 concepts that appear most frequently in senior AI engineering interviews at companies like Google DeepMind, Anthropic, OpenAI, Cohere, and fast-growing AI product companies. For each concept, we explain what it is, how it works mechanically, a concrete example, and the exact interview angle interviewers use to test depth. We've also added two bonus concepts that many candidates miss entirely.
1. Embeddings: How Machines Understand Meaning
An embedding is a dense, fixed-length numerical vector that represents the semantic meaning of a piece of text, an image, an audio clip, or any other data. The core idea is that semantically similar things should live close together in vector space — measurable by cosine similarity or dot product.
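To make "close together in vector space" concrete, here is a minimal sketch of cosine similarity. The three-dimensional vectors are hand-picked, hypothetical values for illustration only; real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", chosen by hand for illustration.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.9, 0.15]
invoice = [0.1, 0.2, 0.95]

sim_close = cosine_similarity(cat, kitten)
sim_far = cosine_similarity(cat, invoice)
print(f"cat vs kitten:  {sim_close:.3f}")
print(f"cat vs invoice: {sim_far:.3f}")
```

The related pair scores close to 1 and the unrelated pair scores much lower, which is the property retrieval systems rely on.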
Early approaches like Word2Vec and GloVe created word-level embeddings trained on co-occurrence statistics. The famous analogy king − man + woman ≈ queen demonstrates that arithmetic in embedding space corresponds to real-world relationships. Modern systems use transformer-based models like Sentence-BERT (SBERT) and OpenAI's text-embedding-3-small to produce sentence and document embeddings — capturing meaning at a much richer level.
How Embeddings Are Created
A neural network is trained (usually with contrastive loss or masked language modelling) so that its intermediate layer activations form a useful representation. For text, you pass the input through a transformer, then pool the last hidden states — typically by taking the mean of all token vectors, or by using the [CLS] token specifically added for classification. The resulting vector might be 384, 768, or 1536 dimensions.
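Mean pooling is simple enough to sketch in a few lines. The version below assumes the transformer has already produced one vector per token and that an attention mask marks which positions are real tokens (1) versus padding (0); the tiny two-dimensional vectors are made up for illustration.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [t / count for t in total]

# Toy last hidden states: 4 positions x 2 dimensions; the final position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]

sentence_vector = mean_pool(hidden, mask)
print(sentence_vector)  # [3.0, 4.0]
```

Skipping the masked position matters: without the mask, the padding vector would drag the sentence embedding far from the real tokens.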
| Model | Dimensions | Best For | Cost |
|---|---|---|---|
| text-embedding-3-small | 1536 | General purpose, RAG | $0.02 / 1M tokens |
| text-embedding-3-large | 3072 | High-accuracy retrieval | $0.13 / 1M tokens |
| SBERT (all-MiniLM-L6) | 384 | On-device, fast | Free / open source |
| Cohere embed-v3 | 1024 | Multilingual | $0.10 / 1M tokens |
| BGE-M3 | 1024 | Hybrid dense+sparse | Free / open source |
2. Tokenization: The Hidden Step Before Every LLM Call
Before any text reaches a language model, it is converted into tokens — integer IDs from a fixed vocabulary. A token is roughly 4 characters of English text on average, but this varies significantly: "the" is 1 token, "ChatGPT" might be 3 tokens, and a rare word like "xylophone" could be 4 tokens split at subword boundaries.
The dominant algorithm is Byte Pair Encoding (BPE), used by GPT-4, LLaMA, and Mistral. BPE starts with individual characters (or raw bytes, in the byte-level variant used by GPT models) and iteratively merges the most frequent adjacent pair into a new token until reaching the desired vocabulary size (typically 32k to 200k tokens). Google's models typically use SentencePiece, a tokenizer library that trains BPE or unigram models directly on raw text and treats whitespace as an ordinary symbol, so it handles any language without language-specific pre-tokenization.
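The merge loop at the heart of BPE can be sketched on a toy corpus. This is a simplified illustration under stated assumptions: it works on characters rather than bytes, skips end-of-word markers, and the four-word corpus with its frequencies is invented for the example. Production tokenizers add pre-tokenization rules and far larger corpora.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word as a tuple of characters, mapped to its frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # three merge steps; a real vocabulary needs tens of thousands
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)
print(list(corpus))
```

After a few merges the frequent suffix "est" becomes a single symbol, which is how common subwords end up as one token while rare words stay split into several.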
Why Tokenization Matters for Engineers
Context windows are measured in tokens, not words or characters. GPT-4o's 128k context = roughly 96,000 words or about 300 pages of text. If you're processing code, tokens are spent more liberally — Python code is about 1 token per 3 characters. This affects cost, latency, and chunking strategy in RAG pipelines.
```python
import tiktoken

# Load the tokenizer that matches the target model.
enc = tiktoken.encoding_for_model("gpt-4o")

text = "AI engineers who understand tokenization write better prompts."
tokens = enc.encode(text)

print(len(tokens))         # number of tokens (the exact count depends on the encoding)
print(enc.decode(tokens))  # → reconstructed text

# Each token decoded:
for t in tokens:
    print(repr(enc.decode([t])))
```
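When you cannot run the real tokenizer (for example, in a quick capacity estimate), the rules of thumb above can be turned into a back-of-the-envelope calculator. The sketch below assumes the ratios quoted in this section (about 4 characters per token for English, about 3 for Python); the function names are our own, and the price passed in is whatever your provider charges.

```python
import math

def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate from character count; exact counts need the tokenizer."""
    return math.ceil(len(text) / chars_per_token)

def estimate_cost_usd(n_tokens, usd_per_million_tokens):
    """Cost of processing n_tokens at a given price per million tokens."""
    return n_tokens / 1_000_000 * usd_per_million_tokens

prose = "x" * 4000  # stand-in for 4,000 characters of English prose
code = "x" * 4000   # stand-in for 4,000 characters of Python source

prose_tokens = estimate_tokens(prose, chars_per_token=4.0)
code_tokens = estimate_tokens(code, chars_per_token=3.0)
print(prose_tokens, code_tokens)  # 1000 1334
```

The same number of characters costs roughly a third more tokens when the content is code, which is worth factoring into chunk sizes and budgets.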
3. The Attention Mechanism: Why Transformers Changed Everything
The attention mechanism, introduced in the 2017 paper "Attention Is All You Need", is the mathematical engine inside every modern language model. It allows the model to weigh how relevant each word in an input is to every other word simultaneously — something recurrent networks (RNNs/LSTMs) could not do efficiently because they processed text sequentially.
Query, Key, Value — the Core Formula
Each token is projected into three vectors: a Query (Q), a Key (K), and a Value (V). The attention score between two positions is computed as:
```
Attention(Q, K, V) = softmax( Q × Kᵀ / √d_k ) × V

# Q × Kᵀ   → dot product: how much each query "matches" each key
# / √d_k   → scale by key dimension (keeps softmax out of saturation, where gradients vanish)
# softmax  → convert raw scores to probabilities (each row sums to 1)
# × V      → weighted sum of value vectors
```
Think of it like a search engine for each word: the query is "what am I looking for?", the keys are "what does each other word offer?", and the values are "what information do I actually extract?" The word "bank" in "river bank" attends strongly to "river" and "water" — but the same word in "bank account" attends to "money" and "deposit". This is how transformers resolve ambiguity that stumped previous NLP approaches.
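The formula translates almost line for line into code. Below is a minimal pure-Python sketch of single-head scaled dot-product attention, written for readability rather than speed. The Q, K, and V values are made up; in a real model they come from learned projections of the token embeddings.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs: 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is one token's attention distribution over all tokens. The first query matches the first and third keys better than the second, so those positions receive more weight.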
Multi-Head Attention
Instead of computing one attention function, transformers run h parallel attention heads (typically 8–96), each learning a different relationship pattern. One head might track syntactic dependencies, another coreference, another semantic similarity. Their outputs are concatenated and projected back to the model dimension. This parallel structure is also why transformers train so efficiently on GPUs.
4. Fine-Tuning: Teaching an LLM Your Specific Domain
Fine-tuning adapts a pre-trained foundation model to a specific task, domain, or style by continuing training on a curated dataset. The result is a model that retains the broad knowledge from pre-training but performs significantly better on the target distribution. A legal LLM, a medical coding assistant, and a code-review bot are all examples of fine-tuned models.
Full Fine-Tuning vs Parameter-Efficient Methods
Full fine-tuning updates all model weights. For a 7B parameter model, that means training 7 billion floats. The weights alone occupy about 14 GB in FP16, but gradients, optimizer states, and activations push the total to roughly 140 GB or more of GPU memory. This is expensive and risks catastrophic forgetting: the model can lose general capabilities while gaining task-specific ones.
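The memory figure is easier to remember once you can derive it. The sketch below uses one common accounting for mixed-precision training with Adam, which is an assumption on our part: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 master weights plus the two Adam moment estimates. Activations are excluded, so real usage is higher.

```python
def training_memory_gb(n_params, bytes_weights=2, bytes_grads=2, bytes_optimizer=12):
    """Memory for weights, gradients, and optimizer state (activations excluded)."""
    return n_params * (bytes_weights + bytes_grads + bytes_optimizer) / 1e9

def inference_memory_gb(n_params, bytes_weights=2):
    """Memory for the weights alone, as needed for inference."""
    return n_params * bytes_weights / 1e9

n = 7e9  # 7B parameters
print(f"weights only:   {inference_memory_gb(n):.0f} GB")
print(f"training state: {training_memory_gb(n):.0f} GB (before activations)")
```

Under these assumptions the training state alone is about 112 GB for a 7B model; activation memory, which depends on batch size and sequence length, accounts for the rest.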
LoRA (Low-Rank Adaptation) is the dominant efficient alternative. Instead of modifying the weight matrices directly, LoRA adds a pair of small rank-decomposed matrices (A and B) to each attention layer. Only A and B are trained, typically less than 1% of total parameters. At inference, the LoRA weights can be merged into the base model with zero added latency. QLoRA goes further by quantizing the base model to 4-bit, allowing a 65B model to be fine-tuned on a single 48GB GPU.
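Two things about LoRA are worth being able to show on a whiteboard: how the adapter merges into the base weight, and why it trains so few parameters. The sketch below follows the usual formulation W' = W + (alpha / r) · B · A with tiny made-up matrices; the 4096 × 4096 layer size is an illustrative assumption, not a specific model's configuration.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_merge(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight used at inference."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def lora_param_count(d_out, d_in, r):
    """Trainable parameters in one adapter: B is d_out x r, A is r x d_in."""
    return r * (d_in + d_out)

# Toy 2x2 base weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # d_out x r
A = [[0.5, 0.5]]     # r x d_in
merged = lora_merge(W, A, B, alpha=2, r=1)
print(merged)  # [[2.0, 1.0], [2.0, 3.0]]

full = 4096 * 4096
adapter = lora_param_count(4096, 4096, r=16)
print(adapter, f"{adapter / full:.2%} of the full matrix")
```

Because the merged matrix has the same shape as the original, the adapted model runs exactly like the base model at inference time.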
| Method | Params Trained | GPU Memory (7B) | Best For |
|---|---|---|---|
| Full Fine-Tuning | 100% | ~140 GB | Maximum accuracy, large budget |
| LoRA (r=16) | ~0.5% | ~20 GB | Most production use cases |
| QLoRA (4-bit) | ~0.5% | ~6 GB | Consumer GPU fine-tuning |
| Prompt Tuning | <0.01% | ~16 GB | Style/format adaptation only |
5. Quantization: Running Bigger Models on Smaller Hardware
Quantization reduces the numerical precision of model weights by converting 32-bit floating point (FP32) values into lower-precision formats like FP16, INT8, or INT4. The primary benefit is reduced memory usage and faster inference; the cost is a small drop in accuracy. In practice, well-quantized 8-bit models lose less than 1% on most benchmarks and 4-bit models typically lose a few percent, making quantization standard for production deployment.
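The simplest scheme to reason about is symmetric absmax quantization to INT8, sketched below on a handful of made-up weights. This is an illustration of the basic idea only; production methods quantize per channel or per group and handle outliers more carefully.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # toy values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"scale={scale:.4f}, max round-trip error={max_error:.6f}")
```

The round-trip error is bounded by half the scale, which is why a single large outlier weight (which inflates the scale) hurts the precision of every other weight in the group.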
Precision Formats at a Glance
| Format | Bits | 7B Model Size | 70B Model Size | Quality Loss |
|---|---|---|---|---|
| FP32 | 32 | 28 GB | 280 GB | None (baseline) |
| BF16 / FP16 | 16 | 14 GB | 140 GB | Negligible |
| INT8 | 8 | 7 GB | 70 GB | ~0.1–0.5% |
| INT4 (GPTQ) | 4 | 3.5 GB | 35 GB | ~1–3% |
| INT4 (GGUF Q4_K_M) | 4–5 avg | ~4.1 GB | ~41 GB | ~1–2% |
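The round-number sizes in the table follow directly from parameters × bits ÷ 8. A short sketch that reproduces them (weights only; real files carry some extra metadata, and mixed-precision formats such as Q4_K_M average slightly more than 4 bits):

```python
def model_size_gb(n_params, bits_per_weight):
    """Weight storage in GB: parameters x bits per weight / 8 bits per byte."""
    return n_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    size_7b = model_size_gb(7e9, bits)
    size_70b = model_size_gb(70e9, bits)
    print(f"{name:5s} 7B: {size_7b:5.1f} GB   70B: {size_70b:6.1f} GB")
```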
Key Quantization Methods
GPTQ (post-training quantization) minimises the reconstruction error layer by layer using a small calibration dataset. AWQ (Activation-aware Weight Quantization) protects the most important weight channels from aggressive quantization, and often achieves better accuracy than GPTQ at the same bit-width. GGUF is the file format used by llama.cpp for CPU and mixed CPU/GPU inference. It supports mixed-precision quantization (e.g., Q4_K_M quantizes most tensors to 4-bit but keeps some of the more sensitive attention and feed-forward tensors at higher precision).
6. Vector Databases: Long-Term Memory for AI Applications
A vector database stores embeddings alongside their source data and metadata, and specialises in Approximate Nearest Neighbour (ANN) search — finding the most semantically similar items to a query vector in milliseconds, even across billions of entries. This is the infrastructure layer that makes RAG pipelines and semantic search scalable.
How ANN Search Works: HNSW
The most widely used ANN algorithm is HNSW (Hierarchical Navigable Small World). It builds a multi-layer graph where each node is connected to its closest neighbours. At query time, the search starts at the top (sparse) layer and greedily navigates down toward the query vector, pruning the search space at each step. With well-tuned parameters this typically delivers high recall (often 95–99%) at millisecond or sub-millisecond latency, whereas a brute-force exhaustive search at the same scale would be orders of magnitude slower.
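The layered descent is easier to explain with a toy you can trace by hand. The sketch below is a deliberately simplified illustration, not HNSW itself: it uses one-dimensional points, two hand-built layers, and a single greedy walker, whereas real HNSW assigns layers randomly and keeps a candidate list of configurable size. All data here is hypothetical.

```python
def greedy_search(points, neighbours, query, entry):
    """Move to whichever neighbour is closer to the query until none is closer."""
    current = entry
    evaluated = 1  # distance computations performed so far
    while True:
        best = current
        for n in neighbours[current]:
            evaluated += 1
            if abs(points[n] - query) < abs(points[best] - query):
                best = n
        if best == current:
            return current, evaluated
        current = best

# Toy dataset: 20 points at 0.0, 1.0, ..., 19.0.
points = [float(i) for i in range(20)]

# Bottom layer: every point is linked to its immediate neighbours.
layer0 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 20] for i in range(20)}

# Top layer: only every fifth point, linked to the adjacent "express stops".
top_nodes = [0, 5, 10, 15]
layer1 = {i: [j for j in (i - 5, i + 5) if j in top_nodes] for i in top_nodes}

query = 12.3
coarse, cost_top = greedy_search(points, layer1, query, entry=0)         # long hops
nearest, cost_bottom = greedy_search(points, layer0, query, entry=coarse)  # refinement

print(coarse, nearest)  # 10 12
print("distance evaluations:", cost_top + cost_bottom)
```

The sparse top layer covers most of the distance in two hops, and the dense bottom layer only has to refine locally. That is the intuition behind HNSW's logarithmic-style scaling.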
| Database | Hosting | Best For | Standout Feature |
|---|---|---|---|
| Pinecone | Managed cloud | Production at scale | Serverless, namespace isolation |
| Qdrant | Cloud + self-hosted | Filtered search | Best payload filtering performance |
| Weaviate | Cloud + self-hosted | Hybrid search | Built-in BM25 + vector hybrid |
| pgvector | Postgres extension | Existing PG stack | Zero new infra, ACID transactions |
| FAISS | In-process (library) | Research / local dev | Extremely fast, no server |
| ChromaDB | Embedded / cloud | Prototyping | Simplest API for getting started |
7. Prompt Engineering: The Art of Talking to LLMs
Prompt engineering is the discipline of designing inputs to language models to elicit desired outputs reliably, accurately, and efficiently. Done well, it can close most of the gap between a base model and a fine-tuned one — without touching model weights. Done poorly, it's the most common source of LLM failures in production.
Core Techniques
Zero-shot prompting asks the model to perform a task with no examples: "Classify this review as positive or negative." Few-shot prompting provides 2–5 demonstrations before the actual input, significantly improving accuracy for structured tasks. Chain-of-Thought (CoT) adds the phrase "Let's think step by step" or provides reasoning examples — it forces the model to externalise intermediate reasoning, dramatically improving performance on multi-step logic and maths. Google's 2022 paper showed CoT improved arithmetic accuracy from 18% to 57% on GSM8K with PaLM.
```python
from openai import OpenAI
import json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract job details as JSON."},
        {"role": "user", "content": "Senior ML Engineer at Anthropic, London, £180k"},
    ],
    response_format={"type": "json_object"},  # JSON mode: output is valid JSON
)

data = json.loads(response.choices[0].message.content)
# e.g. → {"title": "Senior ML Engineer", "company": "Anthropic",
#         "location": "London", "salary": "£180k"}
```
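Few-shot prompting, described above, is mostly a matter of arranging demonstrations as alternating user and assistant turns before the real input. Here is a minimal sketch of a helper that builds such a message list; the function name and the example reviews are our own, and the resulting list is in the chat format accepted by most chat-completion APIs.

```python
def build_few_shot_messages(instruction, examples, query):
    """Build a chat message list: system prompt, demonstrations, then the real input."""
    messages = [{"role": "system", "content": instruction}]
    for text, label in examples:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": query})
    return messages

examples = [
    ("Loved it, would buy again.", "positive"),
    ("Broke after two days.", "negative"),
]
messages = build_few_shot_messages(
    "Classify the review as positive or negative.",
    examples,
    "Arrived late but works perfectly.",
)

for m in messages:
    print(f"{m['role']:9s} | {m['content']}")
```

Keeping demonstrations as data rather than hard-coding them into a prompt string makes it easy to swap, reorder, or evaluate different example sets.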
For hands-on practice with prompt engineering, our AI Interview Prep tool has 150+ practice questions specifically on prompt design, covering system prompts, CoT, structured outputs, and safety techniques used at leading AI labs.
8. RAG Pipelines: Giving LLMs Access to Fresh Knowledge
Retrieval-Augmented Generation (RAG) combines an LLM with an external knowledge store, allowing the model to answer questions about information that was not in its training data — your internal documents, last week's news, product knowledge bases. The model is not fine-tuned; instead, relevant context is retrieved and injected into the prompt at query time.
The Seven Stages of a Production RAG Pipeline
A naive RAG implementation is straightforward. A production RAG system has seven distinct stages, each with its own failure modes:
- Ingest — Load source documents (PDFs, web pages, databases, APIs).
- Parse — Extract clean text, preserving structure (tables, headings, code blocks).
- Chunk — Split documents into overlapping segments. Strategy matters: fixed-size (simple), sentence (natural), semantic (highest quality, highest cost).
- Embed — Convert each chunk to a vector using an embedding model.
- Index — Store vectors in a vector database alongside source text and metadata.
- Retrieve — At query time, embed the user question and fetch the top-K most similar chunks.
- Generate — Inject retrieved chunks into the LLM prompt as context and generate the final answer.
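The stages above can be sketched end to end in a few dozen lines. Everything here is a toy under stated assumptions: the "embedding" is a bag-of-words count vector standing in for a real embedding model, the index is a plain list rather than a vector database, and the final step only builds the prompt instead of calling an LLM. The document text is invented.

```python
import math
from collections import Counter

def chunk_words(text, size=8, overlap=2):
    """Chunk: fixed-size word windows that overlap with their neighbours."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

def embed(text):
    """Embed: stand-in for an embedding model (bag-of-words counts)."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(index, question, k=2):
    """Retrieve: embed the question and return the top-k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Ingest + parse: a hypothetical, already-clean contract excerpt.
document = (
    "The notice period for termination is thirty days. "
    "Payment is due within fourteen days of invoice. "
    "Either party may terminate the agreement for material breach."
)

index = [(c, embed(c)) for c in chunk_words(document)]  # chunk, embed, index
question = "What is the notice period for termination?"
top_chunks = retrieve(index, question)

# Generate: in a real system this prompt would be sent to the LLM.
prompt = (
    "Answer using only this context:\n"
    + "\n".join(top_chunks)
    + f"\n\nQuestion: {question}"
)
print(prompt)
```

Swapping `embed` for a real model and `index` for a vector database changes the quality and scale, but not the shape of the pipeline.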
Advanced Retrieval: Hybrid Search and Re-Ranking
Simple dense retrieval misses keyword-critical queries. Hybrid retrieval combines dense (embedding) similarity with sparse (BM25 keyword) scoring — typically weighted 70/30. After retrieval, a cross-encoder re-ranker (e.g., Cohere Rerank, BGE-Reranker) scores each retrieved chunk against the full query, reordering the results for maximum relevance before passing them to the LLM. This two-stage approach (fast ANN retrieval → accurate cross-encoder re-ranking) is the industry standard for production RAG.
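A weighted hybrid score can be sketched as follows. Because cosine similarities and BM25 scores live on different scales, this version min-max normalises each before mixing, which is one common choice among several (reciprocal rank fusion is another). The document IDs and scores are invented, and the 0.7 dense weight follows the 70/30 split mentioned above.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_rank(dense, sparse, dense_weight=0.7):
    """Fuse normalised dense and sparse scores and return doc IDs, best first."""
    d, s = min_max(dense), min_max(sparse)
    ids = set(d) | set(s)
    fused = {
        i: dense_weight * d.get(i, 0.0) + (1 - dense_weight) * s.get(i, 0.0)
        for i in ids
    }
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical scores for three candidate chunks.
dense_scores = {"doc_a": 0.82, "doc_b": 0.80, "doc_c": 0.40}  # cosine similarity
bm25_scores = {"doc_a": 1.0, "doc_b": 9.0, "doc_c": 8.5}      # keyword match

ranking = hybrid_rank(dense_scores, bm25_scores)
print(ranking)  # ['doc_b', 'doc_a', 'doc_c']
```

Here `doc_b` overtakes `doc_a` because it is nearly as close semantically and also matches the query's keywords, which is exactly the case hybrid search is meant to catch.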
9. AI Agents: LLMs That Can Take Action
An AI agent is an LLM-powered system that can perceive its environment, reason about a goal, select actions (tools), execute them, observe results, and repeat — until the goal is reached or a stopping condition is hit. The shift from an LLM answering a single question to an agent completing a multi-step task is one of the most consequential transitions in applied AI engineering. See our full Agentic AI Roadmap for a complete learning path.
The ReAct Pattern
The most widely used agent pattern is ReAct (Reason + Act). The LLM alternates between two steps: Thought (reasoning about what to do next) and Action (calling a tool). After each action, the agent receives an Observation (tool result) and decides whether to continue or finish. This loop continues until the agent produces a final answer.
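```python
goal = "What is Anthropic's latest funding round amount?"

def run_agent(goal, llm, tools, max_steps=10):
    """ReAct loop. `llm` and `tools` are hypothetical objects, not a real library API."""
    history = []
    for _ in range(max_steps):                     # stopping condition
        thought = llm.think(goal, history)         # "I need to search for this"
        action = llm.choose_tool(thought)          # web_search("Anthropic funding 2026")
        result = tools[action.name](action.args)   # Observation
        history.append((thought, action, result))  # store the step as one tuple
        if llm.is_final_answer(result):
            return llm.format_answer(result)
    return None  # no final answer within max_steps

# answer = run_agent(goal, llm, tools)  # supply your own llm wrapper and tool registry
```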
Agent Memory Types
Agents use four types of memory: Short-term (the current context window, i.e. what the agent can see right now), Long-term (a vector store of past interactions and facts), Episodic (a record of past task completions used for planning), and Semantic (general world knowledge from the base model's training). Frameworks such as LangGraph and CrewAI provide building blocks for managing these explicitly in production agent systems.
10. Mixture of Experts: How the Largest Models Scale Efficiently
Mixture of Experts (MoE) is an architecture that replaces each dense feed-forward layer in a transformer with a set of expert sub-networks, plus a router that selects which experts to activate for each token. The key insight: you can have a model with 47 billion total parameters but activate only 13 billion per token — getting the knowledge capacity of a large model at the inference cost of a smaller one.
How the Router Works
The router is a learned linear layer that takes a token's hidden state as input and outputs a probability distribution over all experts. Top-K routing (K=1 or K=2) selects the highest-scoring experts per token. Mistral's Mixtral 8x7B uses 8 experts with top-2 routing — each token activates 2 of the 8 experts, giving 13B active parameters (close to a standard 13B dense model) while the full model has 47B parameters stored in memory.
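Top-2 routing for a single token can be sketched in a few lines. The router logits below are made up; in a real model they come from the learned linear layer applied to the token's hidden state. The selected experts' probabilities are renormalised so the gate weights sum to 1.

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# Hypothetical router output for one token across 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
gates = route_top_k(logits, k=2)

for expert, weight in gates.items():
    print(f"expert {expert}: gate weight {weight:.3f}")
```

Only the two selected experts run their feed-forward computation for this token, and their outputs are combined using the gate weights. The other six experts cost memory but no compute.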
| Model | Total Params | Active Params/Token | Notes |
|---|---|---|---|
| Mixtral 8x7B | 47B | 13B | Beats LLaMA 2 70B on most benchmarks |
| Mixtral 8x22B | 141B | 39B | Near GPT-4 on coding tasks |
| GPT-4 (rumoured) | ~1.8T | ~220B | Unconfirmed; reportedly 16 experts, top-2 routing |
| DeepSeek-V3 | 671B | 37B | State-of-the-art open MoE at release (Dec 2024) |
11. RLHF & Alignment: How Models Learn to Be Helpful
Reinforcement Learning from Human Feedback (RLHF) is the training technique that turns a capable but raw pre-trained language model into a helpful, harmless, and honest AI assistant. Without RLHF, a model trained purely on internet text will confidently produce harmful content, hallucinations, and responses that don't follow user intent. RLHF is why ChatGPT, Claude, and Gemini behave the way they do.
The Three-Stage RLHF Pipeline
- Supervised Fine-Tuning (SFT) — Fine-tune the base model on high-quality human-written demonstrations of desired behaviour. This produces the SFT model — helpful but not yet aligned.
- Reward Model (RM) Training — Present human labellers with pairs of model responses to the same prompt and ask them which is better. Train a separate neural network (the reward model) to predict human preference scores. This encodes "what humans like" into a differentiable signal.
- RL Optimisation (PPO) — Use the reward model as a reward signal to fine-tune the SFT model via Proximal Policy Optimisation (PPO). The model learns to generate responses that score highly on the reward model, with a KL-divergence penalty to prevent it from drifting too far from the SFT model (which prevents reward hacking).
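Two formulas carry most of the weight in the pipeline above: the pairwise loss used to train the reward model, and the KL-penalised reward used during RL. Below is a minimal numeric sketch of both. The reward scores and log-probabilities are made-up scalars, and the KL term uses the simple per-sample estimate log π − log π_SFT; real implementations work on batches and per-token values.

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected)."""
    return math.log(1 + math.exp(-(r_chosen - r_rejected)))

def kl_penalised_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Reward seen by the RL step: RM score minus beta * (log pi - log pi_SFT)."""
    return reward - beta * (logp_policy - logp_sft)

# Stage 2: the loss is small when the reward model ranks the preferred answer higher.
good_ranking = pairwise_reward_loss(r_chosen=2.0, r_rejected=0.0)
bad_ranking = pairwise_reward_loss(r_chosen=0.0, r_rejected=2.0)
print(f"{good_ranking:.3f} vs {bad_ranking:.3f}")

# Stage 3: the same RM score is worth less if the policy drifted from the SFT model.
no_drift = kl_penalised_reward(1.0, logp_policy=-5.0, logp_sft=-5.0)
drifted = kl_penalised_reward(1.0, logp_policy=-2.0, logp_sft=-5.0)
print(no_drift, round(drifted, 3))
```

The penalty makes it unprofitable for the policy to chase reward-model quirks by moving far from the SFT distribution.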
DPO: The Simpler Alternative
Direct Preference Optimisation (DPO), introduced in 2023, achieves RLHF-style alignment without the separate reward model or RL optimisation loop. It reformulates the RLHF objective as a supervised classification problem directly on the preference pairs. DPO is simpler, more stable, and often nearly as effective. Many open-source instruction-tuned models now use DPO or its variants, for example Zephyr, and Llama 3 Instruct, which combines it with other methods.
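The DPO objective for a single preference pair fits in one function. In this sketch the inputs are sequence log-probabilities of the chosen and rejected responses under the model being trained and under the frozen reference model; the numbers are invented, and `beta` is the usual temperature-like hyperparameter.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * [(chosen log-ratio) - (rejected log-ratio)])."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return math.log(1 + math.exp(-beta * margin))

# Policy identical to the reference: no preference learned yet.
start = dpo_loss(-10.0, -12.0, ref_logp_chosen=-10.0, ref_logp_rejected=-12.0)

# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-8.0, -15.0, ref_logp_chosen=-10.0, ref_logp_rejected=-12.0)

print(f"start: {start:.4f}  improved: {improved:.4f}")
```

The loss starts at log 2 and falls as the policy shifts probability toward preferred responses relative to the reference, with no reward model or sampling loop involved.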
12. Evaluation & Hallucination: Knowing When Your AI Is Actually Working
Evaluation is widely considered the hardest unsolved problem in LLM engineering. Unlike traditional software where a test either passes or fails, LLM outputs exist on a spectrum of quality — and that quality depends on the user's intent, the task, and often the cultural context. Getting evaluation right is what separates AI products that ship reliably from those that require constant firefighting.
Reference-Based Metrics
BLEU (Bilingual Evaluation Understudy) and ROUGE measure n-gram overlap between the model output and a human reference. They're cheap and automatic, making them useful for machine translation and summarisation benchmarks. Their weakness: a response can score 0 BLEU while being semantically identical to the reference (just paraphrased), or score high BLEU while being factually wrong.
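The weakness is easy to demonstrate with clipped n-gram precision, the building block of BLEU. This sketch omits BLEU's brevity penalty and the geometric mean over n-gram orders, and the three sentences are invented for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference (clipped)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

reference = "the contract ends after thirty days"
paraphrase = "this agreement terminates in one month"   # same meaning, different words
wrong = "the contract ends after ninety days"           # similar words, wrong fact

print(ngram_precision(paraphrase, reference))       # 0.0
print(round(ngram_precision(wrong, reference), 3))  # 0.833
```

The correct paraphrase scores zero while the factually wrong answer scores highly, which is why overlap metrics alone are not enough for evaluating LLM outputs.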
LLM-as-Judge
The dominant modern approach: use a powerful LLM (GPT-4o, Claude Opus) to score model outputs on dimensions like correctness, helpfulness, conciseness, and safety. This scales to thousands of samples without human raters and correlates well with human judgement (typically 80–85% agreement). The key risk is judge bias — LLMs tend to prefer longer, verbose answers and responses from the same model family.
RAG-Specific Evaluation: RAGAs
RAGAs (Retrieval Augmented Generation Assessment) is a widely used evaluation framework for RAG systems. It measures four dimensions using an LLM as the scorer, largely without human labels: Faithfulness (does the answer contain only facts from the retrieved context?), Answer Relevancy (does the answer address the question?), Context Precision (are the relevant retrieved chunks ranked near the top?), and Context Recall (did we retrieve everything needed to answer? This metric needs a reference answer to compare against).
What to Study Next
Beyond the 12 concepts above, strong candidates in 2026 are expected to have working knowledge of context window management (KV cache, sliding window attention, positional encoding variants like RoPE and ALiBi), model serving infrastructure (vLLM, TGI, ONNX Runtime for inference optimisation), and AI safety basics (red-teaming, jailbreaking patterns, constitutional constraints). These topics appear in senior interviews at AI-first companies and frontier labs.