Interview Prep * AI * GenAI * LLMs * Agents

Ace Your GenAI Engineer Interview

From LLM fundamentals and RAG architectures to agent design and AI system thinking -- the questions that separate candidates who talk AI from engineers who build it.

LLM Fundamentals
Core concepts, embeddings, context, limitations * beginner -> intermediate
12+ questions
Q1 * What is a large language model and how is it trained?
Level: Beginner
Expected answer
An LLM is a transformer‑based neural network trained on massive text corpora to predict the next token in a sequence. Key points:
  • Uses self‑attention to model long‑range dependencies.
  • Training objective is next‑token prediction (or masked token prediction for some models).
  • Pre‑training is followed by alignment steps like instruction tuning and RLHF.
Follow‑up questions
  • How does the transformer architecture differ from RNNs/LSTMs?
  • What is RLHF and why is it used?
  • What are some trade‑offs between model size and latency?
Evaluation rubric
Strong
Mentions transformers, next‑token prediction, large‑scale data, and alignment steps.
OK
Describes "AI that predicts text" but misses architecture or training details.
Weak
Vague answer like "a chatbot" with no mention of training or modeling.
Q2 * What are tokens and how do they affect cost and context?
Level: Beginner
Expected answer
Tokens are the units of text (sub‑words, words, or characters) that the model processes. They matter because:
  • Context length is measured in tokens, not characters.
  • API pricing is per token, typically quoted per 1K or per 1M tokens, often with different rates for input and output.
  • Tokenization can split words in non‑obvious ways, affecting prompt length.
Follow‑up questions
  • What happens when a prompt exceeds the context window?
  • How would you estimate token usage for a given feature?
  • Why do different models have different tokenizers?
Evaluation rubric
Strong
Connects tokens to context, pricing, and tokenizer behavior with concrete examples.
OK
Defines tokens but doesn't connect to cost or context limits.
Weak
Confuses tokens with words or characters.
Q3 * How do embeddings differ from generative LLM calls?
Level: Intermediate
Expected answer
Embeddings map text to dense vectors capturing semantic similarity, while generative calls produce new text by predicting tokens. Differences:
  • Embeddings: representation; used for search, clustering, retrieval.
  • Generation: autoregressive decoding; used for answering, summarization, etc.
  • Often separate models/endpoints optimized for each task.
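A minimal sketch of how embeddings are compared for search, assuming tiny hand-written vectors in place of a real embedding model. The document names and 3-dimensional vectors are hypothetical; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, written by hand for illustration only.
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]

# Retrieval picks the document whose vector points in the most similar direction.
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
```

Cosine similarity ignores vector length and compares direction only, which is one reason it is a common default for text embeddings.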
Follow‑up questions
  • How are embeddings used in RAG?
  • What is cosine similarity and why is it common?
  • When would you choose a smaller embedding model?
Evaluation rubric
Strong
Clearly separates representation vs generation and mentions retrieval/search use cases.
OK
Understands embeddings as "vectors" but not concrete applications.
Weak
Treats embeddings as generic model outputs with no semantic meaning.
Q4 * What is a context window and how does it impact design?
Level: Intermediate
Expected answer
The context window is the maximum number of tokens the model can consider at once (prompt + output). It impacts:
  • How much history or retrieved context you can include.
  • Prompt engineering strategies (summarization, chunking, sliding windows).
  • Cost and latency for long prompts.
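One way to keep conversation history inside a context budget is to drop the oldest messages first. The sketch below assumes a caller-supplied token counter; `rough_token_count` is a crude word-count stand-in, not a real tokenizer, and the function names are illustrative.

```python
def trim_history(messages, max_tokens, count_tokens):
    """Keep the most recent messages whose total token count fits the budget."""
    kept, used = [], 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

def rough_token_count(text):
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())

history = [
    "hello there",
    "how can I help you today",
    "summarise my last order please",
]
recent = trim_history(history, max_tokens=11, count_tokens=rough_token_count)
```

In a real system the dropped messages would usually be summarised rather than discarded, and the budget would also reserve room for the model's output.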
Follow‑up questions
  • How would you handle very long documents?
  • What are trade‑offs of using a 200K‑token model?
  • How does context window relate to RAG design?
Evaluation rubric
Strong
Connects context window to architecture decisions (RAG, summarization, cost).
OK
Defines context window but not its practical implications.
Weak
No clear understanding of limits or design impact.
Q5 * What are common limitations of LLMs?
Level: Intermediate
Expected answer
Limitations include:
  • Hallucinations (confident but incorrect answers).
  • Lack of up‑to‑date knowledge (frozen training data).
  • Sensitivity to prompt phrasing.
  • Biases from training data.
  • Non‑determinism and reproducibility challenges.
Follow‑up questions
  • How do you mitigate hallucinations in production?
  • When is fine‑tuning appropriate vs RAG?
  • How do you handle safety and bias concerns?
Evaluation rubric
Strong
Lists multiple limitations and ties them to mitigation strategies.
OK
Mentions hallucinations but not other important limitations.
Weak
Claims LLMs are "almost perfect" or ignores limitations.
Q5 · What is tokenisation? How does it affect cost, latency, and multilingual performance?
Level: Beginner
Expected answer
Tokenisation converts raw text into integer IDs before feeding into an LLM:
  • BPE (Byte-Pair Encoding) — most common; merges frequent character pairs into subword units. GPT models use BPE.
  • WordPiece — similar to BPE; used by BERT.
  • SentencePiece — language-agnostic; handles any script without pre-tokenisation.
Practical implications:
  • Cost — charged per token; verbose prompts and long outputs increase cost directly.
  • Context window — measured in tokens, not words. 1 English word ≈ 1.3 tokens; code and non-English text can be 2–4× less efficient.
  • Multilingual — languages underrepresented in training data tokenise less efficiently (more tokens per word), costing more and degrading quality.
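To make the BPE idea concrete, here is a toy sketch of a single merge step: count adjacent symbol pairs across a corpus and fuse the most frequent pair into one symbol. The corpus and helper names are made up for illustration; production tokenizers work on bytes, apply many thousands of merges, and handle word boundaries more carefully.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency.
corpus = {
    tuple("hug"): 10,
    tuple("pug"): 5,
    tuple("pun"): 12,
    tuple("bun"): 4,
    tuple("hugs"): 5,
}
pair = most_frequent_pair(corpus)   # the pair ("u", "g") occurs 20 times
corpus = merge_pair(corpus, pair)   # "hug" is now ("h", "ug")
```

Repeating these two steps grows the vocabulary one merge at a time, which is why text that resembles the tokenizer's training data ends up with fewer tokens per word.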
Follow‑up questions
  • Why does English generally tokenise more efficiently than other languages?
  • How can you reduce token usage without changing your prompt's intent?
  • What is the relationship between token count and model latency?
Evaluation rubric
Strong
Knows BPE mechanics, understands token-based pricing, explains multilingual inefficiency and its quality implications.
OK
Knows tokenisation converts text to numbers and affects cost but can't explain BPE or multilingual issues.
Weak
Thinks tokenisation is just 'splitting text into words'.
Q6 · What causes hallucinations in LLMs and how do you mitigate them?
Level: Intermediate
Expected answer
Hallucinations are confident but factually incorrect outputs generated by LLMs:
  • Causes: LLMs generate statistically likely tokens, not verified facts. They have no access to ground truth; training data has errors; they interpolate between seen patterns.
  • Mitigation strategies:
    • RAG — ground responses in retrieved documents; most effective for factual tasks.
    • Structured output + validation — constrain model to JSON and validate against schema.
    • Self-consistency — sample multiple responses and take the majority answer; reduces but doesn’t eliminate hallucinations.
    • Citation enforcement — require model to quote sources; helps detect but not prevent.
    • LLM-as-judge — a second model checks faithfulness of the answer against retrieved context.
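The self-consistency strategy above can be sketched as a majority vote over several samples. `sample_fn` is a hypothetical callable standing in for a model call at non-zero temperature, and the 0.6 agreement threshold is an arbitrary illustration, not a recommended value.

```python
from collections import Counter

def self_consistent_answer(sample_fn, prompt, n_samples=5, min_agreement=0.6):
    """Sample several answers; return the majority one, or None if agreement is low."""
    # Light normalisation so "Paris" and "paris " count as the same answer.
    answers = [sample_fn(prompt).strip().lower() for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes / n_samples >= min_agreement else None

# Canned responses stand in for repeated model calls.
canned = iter(["Paris", "paris ", "Lyon", "Paris", "Paris"])
result = self_consistent_answer(lambda prompt: next(canned), "Capital of France?")
```

Returning `None` on low agreement gives the caller a hook for an "I don't know" response or a human escalation instead of a confident guess.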
Follow‑up questions
  • Can you completely eliminate hallucinations with RAG? Why or why not?
  • What is faithfulness vs answer relevancy in RAG evaluation?
  • How would you build a production hallucination monitoring pipeline?
Evaluation rubric
Strong
Explains root causes, knows multiple mitigation strategies, understands RAG's limitations, can describe a monitoring pipeline.
OK
Knows RAG reduces hallucinations but can't explain root causes or monitoring approaches.
Weak
Says 'hallucinations happen because the model makes things up' with no mitigation knowledge.
Q7 · Explain the difference between temperature, top-p, and top-k sampling.
Level: Intermediate
Expected answer
These are decoding parameters that control the randomness of LLM outputs:
  • Temperature — divides the logits before softmax. Temperature < 1: more deterministic (sharper distribution). Temperature > 1: more random (flatter). Temperature = 0: treated as greedy decoding (always picks the highest-probability token), since dividing by zero is undefined.
  • Top-k — restricts sampling to the k most probable tokens at each step; ignores the rest. Prevents very unlikely tokens but fixed k can be too narrow or too wide depending on context.
  • Top-p (nucleus sampling) — samples from the smallest set of tokens whose cumulative probability ≥ p. The candidate set shrinks when the model is confident and grows when it is not, which is why top-p is often preferred over a fixed k.
Practical defaults: temperature 0.0 for factual/deterministic tasks, 0.7–1.0 for creative tasks. Top-p 0.95 is a common default. These settings do not affect training — only inference.
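The three decoding parameters can be combined in one sampling function. This is a sketch in plain Python under the assumption that top-k is applied before top-p; real inference libraries differ in the order and details, and the function name is illustrative.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token index from raw logits using temperature, top-k and top-p."""
    if temperature == 0:
        # Greedy decoding: skip the division by zero and take the arg-max.
        return max(range(len(logits)), key=lambda i: logits[i])

    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract peak for numerical stability
    total = sum(exps)
    # (probability, index) pairs, most probable first.
    probs = sorted(((p / total, i) for i, p in enumerate(exps)), reverse=True)

    if top_k is not None:
        probs = probs[:top_k]  # keep only the k most probable tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for p, i in probs:
            kept.append((p, i))
            cumulative += p
            if cumulative >= top_p:  # smallest set reaching probability mass p
                break
        probs = kept

    weights = [p for p, _ in probs]
    indices = [i for _, i in probs]
    # random.choices accepts relative weights, so no renormalisation is needed.
    return rng.choices(indices, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes a sampled run reproducible, which helps when debugging non-deterministic behaviour.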
Follow‑up questions
  • When would you set temperature to 0 vs 0.7 in a production application?
  • Why is top-p generally preferred over top-k?
  • What is repetition penalty and when do you need it?
Evaluation rubric
Strong
Correctly explains all three, understands their interaction, gives practical production guidance, knows temperature = 0 means greedy decoding.
OK
Knows temperature controls randomness but confuses top-k and top-p or can't give practical guidance.
Weak
Says 'temperature makes the model more creative' without knowing the mathematical mechanism.
Q8 · What is RLHF and how does it align LLMs with human preferences?
Level: Advanced
Expected answer
RLHF (Reinforcement Learning from Human Feedback) is a widely used alignment technique for modern LLMs:
  • Step 1 — Supervised Fine-Tuning (SFT): Fine-tune the base model on high-quality human-written demonstrations of desired behaviour.
  • Step 2 — Reward Model (RM): Collect human preference data (pairs of model outputs ranked by human raters). Train a reward model to predict which output a human would prefer.
  • Step 3 — PPO optimisation: Use Proximal Policy Optimisation to fine-tune the SFT model to maximise reward model scores, with a KL-divergence penalty that keeps the policy from drifting too far from the SFT baseline and so limits reward hacking.
Alternative: DPO (Direct Preference Optimisation) — skips the explicit reward model and optimises directly on preference pairs. Simpler, and often comparably effective. Limitation of reward-model-based RLHF: reward hacking, where the model learns to game the reward model rather than truly align.
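For reference, the DPO objective for a single preference pair can be written in a few lines. The sketch assumes you already have summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model; the argument names and the `beta=0.1` default are illustrative.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more (in log space) the policy likes each response than the reference does.
    chosen_margin = policy_chosen - ref_chosen
    rejected_margin = policy_rejected - ref_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Policy identical to the reference: loss is log(2), i.e. no preference learned yet.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The reference terms play a role similar to the KL penalty in PPO-based RLHF: the loss rewards moving relative to the reference model rather than raw likelihood.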
Follow‑up questions
  • What is reward hacking and how does the KL penalty mitigate it?
  • How does DPO differ from PPO-based RLHF and what are the trade-offs?
  • What is Constitutional AI (CAI) and how does it extend RLHF?
Evaluation rubric
Strong
Correctly explains all 3 RLHF stages, understands reward hacking and KL penalty, compares RLHF vs DPO, mentions Constitutional AI.
OK
Knows RLHF uses human feedback to train a reward model and then fine-tunes with RL, but unclear on PPO details or DPO.
Weak
Says 'humans rate responses and the model learns from that' without the three-stage structure.
Prompt Engineering
Patterns, few‑shot, structured outputs, debugging * beginner -> intermediate
12 questions
Q1 * What makes a prompt production‑ready?
Level: Beginner
Expected answer
A production‑ready prompt is:
  • Clear about role, task, and constraints.
  • Explicit about output format and style.
  • Robust to minor input variations.
  • Tested against edge cases and evaluated with metrics.
Follow‑up questions
  • How do you version prompts?
  • How would you A/B test prompts?
  • How do you handle localization or multi‑language prompts?
Evaluation rubric
Strong
Mentions clarity, constraints, format, and testing/metrics.
OK
Talks about "clear instructions" but not evaluation or robustness.
Weak
Only says "ask nicely" or similar vague advice.
Q2 * Compare zero‑shot, few‑shot, and chain‑of‑thought prompting.
Level: Intermediate
Expected answer
  • Zero‑shot: only instructions; good for simple tasks.
  • Few‑shot: add examples; good for style, format, or nuanced tasks.
  • Chain‑of‑thought: encourage step‑by‑step reasoning; good for reasoning and math.
Should mention trade‑offs in context usage and latency.
Follow‑up questions
  • When would you avoid chain‑of‑thought?
  • How do you choose examples for few‑shot prompts?
  • How does this relate to evaluation?
Evaluation rubric
Strong
Clearly distinguishes all three and mentions trade‑offs and use cases.
OK
Knows definitions but not when to use each pattern.
Weak
Confuses few‑shot with training or fine‑tuning.
Q3 * How do you design prompts for structured JSON output?
Level: Intermediate
Expected answer
Strategies:
  • Specify exact JSON schema and field types.
  • Provide one or more valid examples.
  • Instruct the model to output only JSON, no extra text.
  • Use validators and repair logic for malformed JSON.
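A validation step for structured output might look like the sketch below. The schema (`name`, `years_experience`) is hypothetical, and the "repair" here is limited to stripping chatter around the JSON object; a production system would more likely use a schema library or provider-side structured output.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
REQUIRED_FIELDS = {"name": str, "years_experience": int}

def parse_model_json(raw):
    """Extract the outermost {...} span, parse it, and check required field types."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    # json.JSONDecodeError is a subclass of ValueError, so callers catch one type.
    data = json.loads(raw[start:end + 1])
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or not {expected_type.__name__}")
    return data

# Models often wrap JSON in extra text despite instructions.
raw_output = 'Here is the result: {"name": "Ada", "years_experience": 7} Hope that helps!'
parsed = parse_model_json(raw_output)
```

On a `ValueError`, a common next step is to retry the call with the error message appended to the prompt.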
Follow‑up questions
  • When would you use function/tool calling instead?
  • How do you handle optional fields?
  • How do you test robustness of structured prompts?
Evaluation rubric
Strong
Mentions schema, examples, strict instructions, and validation/repair strategies.
OK
Says "ask for JSON" but not how to enforce or validate it.
Weak
No awareness of structured output challenges.
Q4 * How do you debug a prompt that behaves inconsistently?
Level: Intermediate
Expected answer
Steps:
  • Collect failing examples and categorize failure modes.
  • Simplify the prompt to isolate the cause.
  • Add clarifications, constraints, or examples.
  • Test across a representative evaluation set.
Follow‑up questions
  • How do you know when to stop tweaking prompts and change the model instead?
  • How would you log prompt failures in production?
  • How do you avoid overfitting prompts to a small eval set?
Evaluation rubric
Strong
Treats prompt debugging like normal software debugging with data and evaluation sets.
OK
Suggests "try different wording" without a systematic approach.
Weak
No clear debugging strategy.
RAG System Design
Pipelines, chunking, retrieval, evaluation, access control * intermediate -> advanced
10+ questions
Q1 * Describe the architecture of a RAG system end‑to‑end.
Level: Intermediate
Expected answer
A typical RAG pipeline:
  • Ingestion: load docs, chunk, embed, store in vector DB with metadata.
  • Retrieval: embed query, similarity search, optional filters/reranking.
  • Augmentation: build prompt with user query + retrieved context.
  • Generation: LLM answers using augmented prompt.
  • Evaluation/monitoring: track quality, latency, and failures.
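The stages above can be sketched without any framework. Everything here is a stand-in: `embed` is a bag-of-words counter rather than a real embedding model, `TinyVectorStore` is an in-memory list rather than a vector DB, and the generation step is left as the assembled prompt.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """Ingestion + retrieval: stores (embedding, chunk, metadata) rows in memory."""

    def __init__(self):
        self.rows = []

    def add(self, chunk, metadata):
        self.rows.append((embed(chunk), chunk, metadata))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [(chunk, meta) for _, chunk, meta in ranked[:k]]

def build_prompt(question, hits):
    """Augmentation: put retrieved chunks, tagged with their source, ahead of the question."""
    context = "\n".join(f"[{meta['source']}] {chunk}" for chunk, meta in hits)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"

store = TinyVectorStore()
store.add("refunds are issued within 14 days of purchase", {"source": "policy.md"})
store.add("standard shipping takes 3 to 5 business days", {"source": "shipping.md"})

question = "how many days do refunds take"
hits = store.search(question, k=1)
prompt = build_prompt(question, hits)  # this string would be sent to the LLM
```

Keeping the source in each chunk's metadata is what later makes citation and access-control filtering possible.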
Follow‑up questions
  • How do you choose chunk size and overlap?
  • What are common failure modes of RAG?
  • How would you evaluate RAG quality?
Evaluation rubric
Strong
Covers ingestion, retrieval, augmentation, generation, and evaluation with concrete details.
OK
Describes retrieval + generation but misses ingestion or evaluation.
Weak
Only says "search + LLM" with no structure.
Q2 * How do you choose chunk size and overlap for documents?
Level: Intermediate
Expected answer
Consider:
  • Semantic coherence (don't split mid‑sentence or mid‑concept).
  • Model context window and cost.
  • Task type (FAQ vs long‑form reasoning).
  • Typical ranges: 200-800 tokens with 10-20% overlap.
Follow‑up questions
  • How would you empirically tune chunk size?
  • What happens if chunks are too small or too large?
  • How does chunking interact with reranking?
Evaluation rubric
Strong
Balances semantic coherence, context limits, and cost; suggests empirical tuning.
OK
Suggests a fixed size without reasoning or tuning strategy.
Weak
No understanding of why chunking matters.
Q3 * How do you handle hallucinations in a RAG system?
Level: Intermediate
Expected answer
Strategies:
  • Improve retrieval quality (better embeddings, chunking, filters, reranking).
  • Constrain the model to answer only from provided context.
  • Ask the model to cite sources or say "I don't know".
  • Use secondary verification for critical domains.
Follow‑up questions
  • How would you detect hallucinations automatically?
  • What metrics would you track in production?
  • When is hallucination acceptable vs unacceptable?
Evaluation rubric
Strong
Mentions retrieval quality, prompt constraints, and explicit "I don't know" behavior.
OK
Talks about "improving the model" but not retrieval or constraints.
Weak
No concrete mitigation strategies.
Q5 · What chunking strategies work best for RAG? How do you choose chunk size?
Level: Intermediate
Expected answer
Chunking splits documents into pieces for embedding and retrieval:
  • Fixed-size chunking — simple; chunk by token count (e.g., 512 tokens with 50-token overlap). Risk: splits mid-sentence.
  • Sentence/paragraph chunking — respects natural boundaries; better coherence. Use when documents have clear sentence structure.
  • Semantic chunking — embed sentences; split when cosine distance between adjacent sentences spikes. Computationally expensive but produces semantically coherent chunks.
  • Hierarchical chunking (parent-child) — embed small chunks for retrieval precision, return larger parent chunks for context. Best of both worlds.
Chunk size rule of thumb: smaller chunks = better precision (retrieval); larger chunks = better context (generation). Test empirically with your data — 256–1024 tokens is a typical starting range.
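Fixed-size chunking with overlap is simple enough to sketch directly. The function assumes the text has already been tokenised into a list; the defaults mirror the 512/50 example above, and the function name is illustrative.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Small numbers make the overlap visible: each chunk starts with
# the last token of the previous one.
tokens = list(range(10))
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
```

Sentence-aware or semantic chunking would replace the fixed `step` with boundaries chosen from the text itself.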
Follow‑up questions
  • What is the impact of chunk overlap on retrieval quality?
  • How do you handle structured data (tables, code) in chunking?
  • What is the parent-document retriever pattern?
Evaluation rubric
Strong
Knows all chunking strategies, explains precision vs context trade-off, discusses parent-child pattern, knows to test empirically.
OK
Knows fixed-size chunking and the need for overlap but unaware of semantic or hierarchical approaches.
Weak
Says 'split by paragraphs' without knowing the retrieval precision vs context trade-off.
Q6 · How does hybrid search improve RAG retrieval quality?
Level: Intermediate
Expected answer
Hybrid search combines dense (semantic) and sparse (keyword) retrieval:
  • Dense retrieval — embedding-based; captures semantic meaning. Excels at paraphrases and concept queries. Struggles with exact terminology, codes, or rare proper nouns.
  • Sparse retrieval (BM25) — keyword-based TF-IDF variant. Excels at exact terms, product codes, names. Fails at semantic similarity.
  • Hybrid = dense + sparse: use Reciprocal Rank Fusion (RRF) to merge ranked results from both. In practice this often outperforms either method alone, because each covers the other's blind spots.
  • Re-ranking: after hybrid retrieval, a cross-encoder model (e.g., Cohere Rerank, BGE-reranker) re-scores top-N results for higher precision before sending to the LLM.
Typical pipeline: Query → Dense + BM25 → RRF merge → Cross-encoder re-rank → Top-k to LLM.
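Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores. The sketch below uses k=60, a commonly used constant; the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense = ["d2", "d1", "d3"]    # hypothetical embedding-search results
sparse = ["d3", "d2", "d4"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense, sparse])
```

Because only ranks are used, RRF sidesteps the problem that cosine similarities and BM25 scores live on incompatible scales.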
Follow‑up questions
  • What is Reciprocal Rank Fusion and how does it combine results?
  • When would dense retrieval alone fail and BM25 be critical?
  • What is the computational cost trade-off of adding a cross-encoder re-ranker?
Evaluation rubric
Strong
Explains both retrieval types, knows RRF, explains cross-encoder re-ranking and its latency cost, gives full pipeline.
OK
Knows dense and sparse retrieval exist but can't explain RRF or re-ranking.
Weak
Thinks hybrid just means 'vector database + keyword search' without knowing how results are merged.
Q7 · How do you evaluate a RAG pipeline? What metrics matter?
Level: Advanced
Expected answer
RAG evaluation covers both retrieval and generation quality:
  • Retrieval metrics:
    • Context Precision — fraction of retrieved chunks that are relevant.
    • Context Recall — fraction of relevant content that was retrieved.
  • Generation metrics:
    • Faithfulness — does the answer only use information from retrieved context? (Detects hallucination)
    • Answer Relevancy — does the answer actually address the question?
  • RAGAs framework — open-source; scores these four metrics with an LLM-as-judge approach, which reduces the need for human labels (context recall still needs a reference answer to compare against).
  • End-to-end — create a golden question-answer dataset; measure exact match, BLEU, or LLM-judged correctness.
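When you do have labelled relevant chunks, the two retrieval metrics reduce to set arithmetic. This is a simplified, set-based sketch and not how RAGAs computes its LLM-judged versions; the chunk IDs are hypothetical.

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """Set-based precision and recall over chunk ids."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Retriever returned 4 chunks; 2 of the 3 truly relevant chunks are among them.
precision, recall = context_precision_recall(
    retrieved_ids=["c1", "c4", "c7", "c9"],
    relevant_ids=["c1", "c2", "c7"],
)
```

Low precision means the prompt is padded with noise; low recall means the answer may be missing from the context entirely, which no amount of prompt tuning can fix.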
Follow‑up questions
  • What is the difference between faithfulness and answer relevancy?
  • How do you create a golden QA test set for RAG evaluation?
  • What is LLM-as-judge and what are its limitations?
Evaluation rubric
Strong
Knows all four RAGAs metrics, explains LLM-as-judge approach, can describe golden test set creation, understands limitations.
OK
Knows RAGAs exists and has some metrics but can't distinguish precision/recall from faithfulness/relevancy.
Weak
Says 'test it manually' without any systematic evaluation approach.
Agents
Tool use, planning, loops, safety * intermediate -> advanced
10 questions
Q1 * What is an LLM agent and how is it different from a plain LLM call?
Level: Intermediate
Expected answer
An agent is a system where an LLM:
  • Plans steps toward a goal.
  • Selects and calls tools (APIs, DBs, services).
  • Iterates based on intermediate results and state.
It differs from a plain call by adding tool use, control flow, and memory around the model.
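The control flow around the model can be sketched in a few lines. `scripted_llm` is a stand-in that returns canned decisions instead of calling a real model, and the decision format (`tool`/`input`/`final` keys) is an assumption made for this example.

```python
def run_agent(llm, tools, goal, max_steps=5):
    """Loop: ask the model for the next step, run the chosen tool, feed the result back."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        # Expected shape: {"tool": name, "input": value} or {"final": text}.
        decision = llm(history)
        if "final" in decision:
            return decision["final"]
        observation = tools[decision["tool"]](decision["input"])
        history.append((decision["tool"], observation))
    return "stopped: step limit reached"

def scripted_llm(history):
    # Stand-in for a real model call: look up the price once, then answer.
    if len(history) == 1:
        return {"tool": "lookup_price", "input": "widget"}
    return {"final": f"The widget costs {history[-1][1]}."}

tools = {"lookup_price": lambda item: {"widget": "$4"}[item]}
answer = run_agent(scripted_llm, tools, "How much is a widget?")
```

A plain LLM call is the single `llm(...)` line; the loop, the tool dispatch, and the accumulated `history` are what make it an agent.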
Follow‑up questions
  • What are risks of unconstrained agents?
  • How would you debug an agent that loops?
  • When would you avoid using agents?
Evaluation rubric
Strong
Mentions planning, tools, iteration, and control flow vs single‑shot prompts.
OK
Describes agents as "smart prompts" without tool use or planning details.
Weak
No distinction from basic chatbots.
Q5 · Compare ReAct and Plan-and-Execute agent architectures.
Level: Intermediate
Expected answer
Two dominant patterns for LLM-powered agents:
  • ReAct (Reason + Act) — alternates between generating a reasoning step (Thought) and taking an action (Action/Observation) in a tight loop. Single agent; reactive; adapts in real time. Good for: short tasks, unpredictable environments, debugging (transparent chain of thought). Weakness: can loop indefinitely; no upfront plan.
  • Plan-and-Execute — a planner LLM first creates a full task decomposition; executor agents carry out sub-tasks in sequence or parallel. Good for: long multi-step tasks, when a structured workflow is known upfront. Weakness: brittle if environment changes mid-execution; replanning is expensive.
Hybrid: plan first, but allow replanning if an executor step fails (LangGraph, CrewAI support this). When to use: start with ReAct for simplicity; use Plan-and-Execute when tasks have clear decomposable sub-goals.
Follow‑up questions
  • How does LangGraph implement the Plan-and-Execute pattern?
  • What triggers replanning in a Plan-and-Execute agent?
  • How do you prevent an agent from entering an infinite reasoning loop?
Evaluation rubric
Strong
Clearly explains both patterns with trade-offs, knows when to use each, mentions hybrid/replanning, references LangGraph or similar.
OK
Knows ReAct and Plan-and-Execute exist and are 'different ways of running an agent' but can't explain the architectural difference.
Weak
Only knows 'agents use tools to complete tasks' without the planning patterns.
Q6 · How do you prevent runaway costs and infinite loops in production agents?
Level: Advanced
Expected answer
Agents can consume unbounded tokens/API calls without guardrails:
  • Maximum iteration limits — hard stop after N reasoning steps (e.g., max_iterations on LangChain's AgentExecutor). Handle the stop condition and return a graceful partial result rather than failing.
  • Token budget tracking — count tokens across all calls in a session; abort when budget exceeded.
  • Tool call rate limiting — limit calls per tool per session; prevent accidental recursive tool use.
  • Timeout per step — each tool invocation has a max execution time; agent gets an error message instead of hanging.
  • Cost alerts — real-time token cost tracking; trigger alerts or circuit breakers at thresholds.
  • Loop detection — hash recent (thought, action) pairs; if same state repeats 3 times, break the loop.
  • Human-in-the-loop checkpoints — for high-risk actions (write access, emails), require explicit approval.
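Several of these guardrails fit into one small object that the agent loop consults on every step. The class and exception names are hypothetical, and for brevity the loop detector counts repeats across the whole session rather than over a recent window.

```python
import hashlib

class BudgetExceeded(Exception):
    """Raised when the agent should stop: too many steps, tokens, or repeats."""

class AgentGuard:
    def __init__(self, max_steps=10, max_tokens=20_000, max_repeats=3):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_repeats = max_repeats
        self.steps = 0
        self.tokens = 0
        self.seen = {}  # hash of (thought, action) -> times observed

    def check(self, thought, action, tokens_used):
        """Call once per agent step; raises BudgetExceeded when a limit is hit."""
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise BudgetExceeded("iteration limit reached")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget exceeded")
        key = hashlib.sha256(f"{thought}\x00{action}".encode()).hexdigest()
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] >= self.max_repeats:
            raise BudgetExceeded("loop detected: same (thought, action) repeated")
```

The caller catches `BudgetExceeded`, returns whatever partial result exists, and logs the reason so that limits can be tuned from production data.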
Follow‑up questions
  • How would you implement loop detection in a ReAct agent?
  • What is a circuit breaker pattern and how does it apply to agents?
  • How do you handle partial task completion when an agent hits a cost limit?
Evaluation rubric
Strong
Covers iteration limits, token budget, loop detection, circuit breakers, and human-in-the-loop; gives concrete implementation approaches.
OK
Knows max_iterations exists and mentions cost concerns but doesn't have a systematic approach.
Weak
Says 'set a timeout' as the only safeguard.
Q7 · What are the four types of agent memory and when do you use each?
Level: Intermediate
Expected answer
Agents use different memory types to manage context and knowledge:
  • Short-term (In-context) — the active context window. Fast but limited by token window size. Cleared between sessions unless persisted.
  • Long-term (External/Vector store) — user facts, learned preferences stored in a vector DB; retrieved via similarity search. Persists across sessions. Use for: personalisation, domain knowledge.
  • Episodic — past conversation summaries or completed task records. Lets the agent recall “last time I helped with X, Y happened”. Use for: continuity, avoiding repeated mistakes.
  • Semantic (Parametric) — knowledge baked into model weights via fine-tuning or pre-training. Doesn’t require retrieval. Use for: stable, universal knowledge.
Tools: Mem0 (managed long-term memory), LangGraph (state management), Redis (session cache).
Follow‑up questions
  • How does Mem0 implement long-term memory for agents?
  • What is the difference between episodic and semantic memory in this context?
  • How do you prevent stale facts in an agent's long-term memory?
Evaluation rubric
Strong
Correctly explains all four types with use cases, names implementation tools, discusses staleness and retrieval strategies.
OK
Knows short-term = context window and long-term = vector DB but unfamiliar with episodic or semantic memory.
Weak
Only knows 'agents remember things in the prompt'.
GenAI System Design
APIs, scaling, evaluation, observability * intermediate -> advanced
8 questions
Q1 * Design a scalable GenAI API for a resume‑review product.
Level: Intermediate
Expected answer
Components:
  • API gateway + auth.
  • Storage for resumes and metadata.
  • Queue/worker layer for LLM calls.
  • LLM provider(s) with fallback and retries.
  • Logging, metrics, prompt/version management.
Should mention latency vs cost trade‑offs and caching.
Follow‑up questions
  • How would you handle rate limits?
  • What data would you log?
  • How do you roll out new prompt versions safely?
Evaluation rubric
Strong
Describes end‑to‑end architecture, scaling, and observability considerations.
OK
Mentions API + LLM but not queues, logging, or versioning.
Weak
Only says "call the LLM from the backend".
Q5 · Design a production-grade customer support chatbot powered by an LLM.
Level: Advanced
Expected answer
A full production design covering all layers:
  • Retrieval layer — hybrid search (dense + BM25) over knowledge base; re-rank with cross-encoder. Chunk size 512 tokens with parent-document retrieval.
  • LLM layer — smaller model (Claude Haiku / GPT-4o Mini) for simple queries, route complex queries to larger model. System prompt enforces persona and safety.
  • Guardrails — input classification (detect jailbreaks, off-topic); output validation (no PII in response, factual grounding check).
  • Caching — exact-match cache for FAQs; semantic cache for near-duplicate questions (can save an estimated 30–60% of API calls when traffic is repetitive; measure the hit rate on your own logs).
  • Latency — streaming responses; target P95 < 2s first token. Async retrieval in parallel with prompt assembly.
  • Evaluation — RAGAs for faithfulness/relevancy; human review sample; A/B test new model versions on CSAT.
  • Escalation — detect low-confidence responses and route to human agent.
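The exact-match layer of the cache is the simplest piece to sketch. The class name is hypothetical and the store is an in-memory dict; a deployment would more likely use Redis or similar with an expiry, and a semantic cache would compare embeddings instead of hashes.

```python
import hashlib

class ExactMatchCache:
    """Caches answers keyed by a normalised form of the question."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(question):
        # Lower-case and collapse whitespace so trivial variations share a key.
        normalised = " ".join(question.lower().split())
        return hashlib.sha256(normalised.encode()).hexdigest()

    def get_or_compute(self, question, answer_fn):
        key = self._key(question)
        if key not in self._store:
            self._store[key] = answer_fn(question)  # only call the LLM on a miss
        return self._store[key]

calls = []

def fake_llm(question):
    # Stand-in for the retrieval + LLM pipeline; records each invocation.
    calls.append(question)
    return "You can reset your password from the login page."

cache = ExactMatchCache()
first = cache.get_or_compute("How do I reset my password?", fake_llm)
second = cache.get_or_compute("  how do I reset my   PASSWORD? ", fake_llm)
```

Cached answers should be invalidated when the underlying knowledge base changes, otherwise the cache becomes a source of stale responses.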
Follow‑up questions
  • How would you handle multi-turn conversation history efficiently?
  • How do you measure and improve CSAT for an LLM chatbot?
  • What is the escalation strategy when the LLM can't answer confidently?
Evaluation rubric
Strong
Covers retrieval, model routing, guardrails, caching, latency, evaluation, and human escalation — all with specific implementation choices.
OK
Covers retrieval and LLM selection but misses caching, guardrails, or evaluation strategy.
Weak
Says 'connect an LLM to a knowledge base' without addressing production concerns.
Q6 · How do you design an LLM system with low latency and high reliability at scale?
Level: Advanced
Expected answer
Production LLM systems must balance speed, cost, and reliability:
  • Latency optimisations:
    • Streaming: return first token immediately rather than waiting for full response.
    • Model right-sizing: use smallest model that meets quality bar; route simple queries to fast/cheap models.
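    • Prompt caching: Anthropic and OpenAI support caching prompt prefixes; reusing a static system prompt means the cached prefix is billed at a discounted rate (up to roughly 90% off on cache hits, depending on provider) and is processed faster.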
    • Parallel calls: run retrieval, tool calls, and sub-tasks in parallel where dependencies allow.
  • Reliability:
    • Multi-provider fallback: primary (Anthropic), fallback (OpenAI), emergency (self-hosted OSS).
    • Retry with exponential backoff for rate limit errors.
    • Circuit breaker: stop calling a provider if error rate > threshold for N seconds.
    • Queue-based architecture: async tasks use message queues; decouple request acceptance from LLM processing.
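Retry-with-backoff and multi-provider fallback compose naturally. In this sketch each provider is just a callable taking a prompt, and `RateLimitError` is a hypothetical exception standing in for whatever a real SDK raises on a 429; the `sleep` parameter exists so tests can skip the waiting.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical error type standing in for a provider's 429 response."""

def call_with_fallback(providers, prompt, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Try each provider in order; retry rate-limit errors with exponential backoff."""
    last_error = None
    for provider in providers:
        for attempt in range(max_retries):
            try:
                return provider(prompt)
            except RateLimitError as error:
                last_error = error
                if attempt < max_retries - 1:
                    # Delay doubles each attempt; jitter avoids synchronised retries.
                    sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
            except Exception as error:
                last_error = error
                break  # treated as non-retryable: move on to the next provider
    raise RuntimeError("all providers failed") from last_error
```

A circuit breaker would sit one level above this: after repeated failures it would remove a provider from the `providers` list for a cooling-off period instead of retrying it on every request.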
Follow‑up questions
  • What is prompt caching and which providers support it?
  • How do you implement a multi-provider fallback strategy?
  • What observability metrics would you track for an LLM API?
Evaluation rubric
Strong
Covers streaming, model routing, prompt caching, parallel calls, multi-provider fallback, circuit breakers, and queue architecture.
OK
Knows streaming and model right-sizing but misses prompt caching, fallback strategy, or circuit breakers.
Weak
Says 'use a faster model' as the main latency solution.
Coding Tasks (Python * LangChain * LlamaIndex)
Hands‑on RAG, tools, and pipelines * intermediate -> advanced
8 questions
Q1 * Implement a simple RAG pipeline in Python using LangChain.
Level: Intermediate
Expected answer
Candidate should outline:
  • Load documents and split into chunks.
  • Create embeddings and store in a vector store (FAISS/Chroma).
  • Define a retriever from the vector store.
  • Use a RetrievalQA or similar chain with an LLM.
Exact syntax isn't required, but the flow should be correct.
Follow‑up questions
  • How would you swap the vector store or embedding model?
  • How do you log retrieved documents?
  • How would you add metadata filters?
Evaluation rubric
Strong
Knows LangChain primitives (loaders, splitters, embeddings, vector store, retriever, chain).
OK
Understands conceptually but not specific components or flow.
Weak
Cannot describe a working pipeline even at a high level.