Interview Prep * AI * GenAI * LLMs * Agents
Own Your AI Interview
From LLM fundamentals and RAG architectures to agent design and AI system thinking -- the questions that separate candidates who talk AI from engineers who build it.
Beginner → Advanced
LLMs · RAG · Agents · System Design · Python · LangChain · LlamaIndex
LLM Fundamentals
12+ questions
Q1 * What is a large language model and how is it trained?
Level: Beginner
Expected answer
An LLM is a transformer‑based neural network trained on massive text corpora to predict the next
token in a sequence. Key points:
- Uses self‑attention to model long‑range dependencies.
- Training objective is next‑token prediction (or masked token prediction for some models).
- Pre‑training is followed by alignment steps like instruction tuning and RLHF.
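The next-token objective in the points above can be seen directly with an off-the-shelf model. A minimal sketch using Hugging Face transformers and GPT-2 (library and model choice are illustrative, not part of the question):

```python
# Minimal next-token prediction sketch; GPT-2 is chosen only because it is small.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (batch, seq_len, vocab_size)

next_token_id = logits[0, -1].argmax().item()  # most likely next token
print(tokenizer.decode(next_token_id))         # likely " Paris"

# Pre-training optimizes exactly this: cross-entropy between the predicted
# distribution at every position and the token that actually comes next.
```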
Follow‑up questions
- How does the transformer architecture differ from RNNs/LSTMs?
- What is RLHF and why is it used?
- What are some trade‑offs between model size and latency?
Evaluation rubric
Strong
Mentions transformers, next‑token prediction, large‑scale data, and alignment steps.
OK
Describes "AI that predicts text" but misses architecture or training details.
Weak
Vague answer like "a chatbot" with no mention of training or modeling.
Q2 * What are tokens and how do they affect cost and context?
Level: Beginner
Expected answer
Tokens are the units of text (sub‑words, words, or characters) that the model processes. They matter
because:
- Context length is measured in tokens, not characters.
- API pricing is usually per 1K tokens.
- Tokenization can split words in non‑obvious ways, affecting prompt length.
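A quick way to make these points concrete is to count tokens with OpenAI's tiktoken. The price below is a placeholder, not a real rate:

```python
# Token counting and rough cost estimation; the pricing figure is a made-up placeholder.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by many recent OpenAI models

prompt = "Summarize the attached resume in three bullet points."
tokens = enc.encode(prompt)
print(len(tokens), tokens[:5])               # token count and a few token ids

PRICE_PER_1K_INPUT_TOKENS = 0.0005           # placeholder: check your provider's pricing page
estimated_cost = len(tokens) / 1000 * PRICE_PER_1K_INPUT_TOKENS
print(f"~${estimated_cost:.6f} for the prompt alone (output tokens are billed separately)")
```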
Follow‑up questions
- What happens when a prompt exceeds the context window?
- How would you estimate token usage for a given feature?
- Why do different models have different tokenizers?
Evaluation rubric
Strong
Connects tokens to context, pricing, and tokenizer behavior with concrete examples.
OK
Defines tokens but doesn't connect to cost or context limits.
Weak
Confuses tokens with words or characters.
Q3 * How do embeddings differ from generative LLM calls?
Level: Intermediate
Expected answer
Embeddings map text to dense vectors capturing semantic similarity, while generative calls produce
new text by predicting tokens. Differences:
- Embeddings: representation; used for search, clustering, retrieval.
- Generation: autoregressive decoding; used for answering, summarization, etc.
- Often separate models/endpoints optimized for each task.
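A small sketch of the representation side, assuming sentence-transformers is installed (the model name is just a common small choice):

```python
# Embeddings give vectors you compare, not new text.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode([
    "How do I reset my password?",
    "I forgot my login credentials.",
    "What is the refund policy?",
])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # high: semantically similar questions
print(cosine(vectors[0], vectors[2]))  # lower: different topic
```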
Follow‑up questions
- How are embeddings used in RAG?
- What is cosine similarity and why is it common?
- When would you choose a smaller embedding model?
Evaluation rubric
Strong
Clearly separates representation vs generation and mentions retrieval/search use cases.
OK
Understands embeddings as "vectors" but not concrete applications.
Weak
Treats embeddings as generic model outputs with no semantic meaning.
Q4 * What is a context window and how does it impact design?
Level: Intermediate
Expected answer
The context window is the maximum number of tokens the model can consider at once (prompt + output).
It impacts:
- How much history or retrieved context you can include.
- Prompt engineering strategies (summarization, chunking, sliding windows).
- Cost and latency for long prompts.
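One practical consequence is having to budget tokens explicitly. A sketch of packing ranked context into a fixed budget, using tiktoken for counting (the budget and chunk text are illustrative):

```python
# Fit as many top-ranked chunks as possible into a fixed token budget.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def pack_context(chunks, budget_tokens):
    """Keep the highest-ranked chunks that fit within budget_tokens."""
    packed, used = [], 0
    for chunk in chunks:                      # assumes chunks are already ranked
        n = len(enc.encode(chunk))
        if used + n > budget_tokens:
            break
        packed.append(chunk)
        used += n
    return packed, used

chunks = ["chunk ranked #1 ...", "chunk ranked #2 ...", "chunk ranked #3 ..."]
context, used = pack_context(chunks, budget_tokens=3000)
print(len(context), "chunks,", used, "tokens used")
```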
Follow‑up questions
- How would you handle very long documents?
- What are trade‑offs of using a 200K‑token model?
- How does context window relate to RAG design?
Evaluation rubric
Strong
Connects context window to architecture decisions (RAG, summarization, cost).
OK
Defines context window but not its practical implications.
Weak
No clear understanding of limits or design impact.
Q5 * What are common limitations of LLMs?
Level: Intermediate
Expected answer
Limitations include:
- Hallucinations (confident but incorrect answers).
- Lack of up‑to‑date knowledge (frozen training data).
- Sensitivity to prompt phrasing.
- Biases from training data.
- Non‑determinism and reproducibility challenges.
Follow‑up questions
- How do you mitigate hallucinations in production?
- When is fine‑tuning appropriate vs RAG?
- How do you handle safety and bias concerns?
Evaluation rubric
Strong
Lists multiple limitations and ties them to mitigation strategies.
OK
Mentions hallucinations but not other important limitations.
Weak
Claims LLMs are "almost perfect" or ignores limitations.

Prompt Engineering
10+ questions
Q1 * What makes a prompt production‑ready?
Level: Beginner
Expected answer
A production‑ready prompt is:
- Clear about role, task, and constraints.
- Explicit about output format and style.
- Robust to minor input variations.
- Tested against edge cases and evaluated with metrics.
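A minimal sketch of what such a prompt might look like in code; the resume-review scenario and version label are illustrative:

```python
# A versioned prompt template with explicit role, task, constraints, and output format.
PROMPT_V2 = """You are a resume-review assistant for software engineering roles.

Task: review the resume below and return feedback.

Constraints:
- Base feedback only on the resume text; do not invent experience.
- Keep each point under 25 words.

Output format: JSON with keys "strengths" (list of strings) and "improvements" (list of strings).

Resume:
{resume_text}
"""

prompt = PROMPT_V2.format(resume_text="...candidate resume here...")
```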
Follow‑up questions
- How do you version prompts?
- How would you A/B test prompts?
- How do you handle localization or multi‑language prompts?
Evaluation rubric
Strong
Mentions clarity, constraints, format, and testing/metrics.
OK
Talks about "clear instructions" but not evaluation or robustness.
Weak
Only says "ask nicely" or similar vague advice.
Q2 * Compare zero‑shot, few‑shot, and chain‑of‑thought prompting.
Level: Intermediate
Expected answer
- Zero‑shot: only instructions; good for simple tasks.
- Few‑shot: add examples; good for style, format, or nuanced tasks.
- Chain‑of‑thought: encourage step‑by‑step reasoning; good for reasoning and math.
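To contrast the three patterns, here is one task written three ways (wording is illustrative, not canonical):

```python
# Three prompting styles for the same classification task.
task = "Classify the sentiment of this review as positive, negative, or neutral."
review = "The battery died after two days, but support replaced it quickly."

zero_shot = f"{task}\n\nReview: {review}\nSentiment:"

few_shot = f"""{task}

Review: "Loved it, works perfectly." -> positive
Review: "Arrived broken and support never replied." -> negative

Review: "{review}" ->"""

chain_of_thought = f"""{task}
Think step by step about the positive and negative aspects, then give a final label.

Review: {review}"""
```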
Follow‑up questions
- When would you avoid chain‑of‑thought?
- How do you choose examples for few‑shot prompts?
- How does this relate to evaluation?
Evaluation rubric
Strong
Clearly distinguishes all three and mentions trade‑offs and use cases.
OK
Knows definitions but not when to use each pattern.
Weak
Confuses few‑shot with training or fine‑tuning.
Q3 * How do you design prompts for structured JSON output?
Level: Intermediate
Expected answer
Strategies:
- Specify exact JSON schema and field types.
- Provide one or more valid examples.
- Instruct the model to output only JSON, no extra text.
- Use validators and repair logic for malformed JSON.
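A sketch of the validation and repair step, assuming pydantic v2 (the schema and repair heuristic are illustrative):

```python
# Validate model output against a schema and attempt one simple repair pass.
from pydantic import BaseModel, ValidationError

class ResumeFeedback(BaseModel):
    strengths: list[str]
    improvements: list[str]

def parse_or_repair(raw: str) -> ResumeFeedback:
    try:
        return ResumeFeedback.model_validate_json(raw)
    except (ValidationError, ValueError):
        # Common repair: strip any text around the first {...} block and retry.
        start, end = raw.find("{"), raw.rfind("}") + 1
        return ResumeFeedback.model_validate_json(raw[start:end])

print(parse_or_repair('Sure! {"strengths": ["clear impact"], "improvements": ["add dates"]}'))
```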
Follow‑up questions
- When would you use function/tool calling instead?
- How do you handle optional fields?
- How do you test robustness of structured prompts?
Evaluation rubric
Strong
Mentions schema, examples, strict instructions, and validation/repair strategies.
OK
Says "ask for JSON" but not how to enforce or validate it.
Weak
No awareness of structured output challenges.
Q4 * How do you debug a prompt that behaves inconsistently?
Level: Intermediate
Expected answer
Steps:
- Collect failing examples and categorize failure modes.
- Simplify the prompt to isolate the cause.
- Add clarifications, constraints, or examples.
- Test across a representative evaluation set.
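A sketch of treating this as an evaluation loop; call_llm() and check() are hypothetical stubs to replace with a real client and real assertions:

```python
# Run a prompt over collected failure cases and measure the pass rate.
def call_llm(prompt: str) -> str:
    """Hypothetical provider call; replace with your real client."""
    return "stubbed model output"

def check(output: str, expectation: str) -> bool:
    """Hypothetical assertion; replace with a regex, exact match, or judge model."""
    return expectation.lower() in output.lower()

failing_cases = [
    {"input": "resume with no dates", "expect": "asks for missing dates"},
    {"input": "resume in French", "expect": "responds in French"},
]

def evaluate(prompt_template: str, cases: list[dict]):
    results = []
    for case in cases:
        output = call_llm(prompt_template.format(input=case["input"]))
        results.append({"case": case, "output": output, "passed": check(output, case["expect"])})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

pass_rate, results = evaluate("Review this input and respond appropriately:\n{input}", failing_cases)
print(f"pass rate: {pass_rate:.0%}")
```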
Follow‑up questions
- How do you know when to stop tweaking prompts and change the model instead?
- How would you log prompt failures in production?
- How do you avoid overfitting prompts to a small eval set?
Evaluation rubric
Strong
Treats prompt debugging like normal software debugging with data and evaluation sets.
OK
Suggests "try different wording" without a systematic approach.
Weak
No clear debugging strategy.
RAG System Design
10+ questions
Q1 * Describe the architecture of a RAG system end‑to‑end.
Level: Intermediate
Expected answer
A typical RAG pipeline:
- Ingestion: load docs, chunk, embed, store in vector DB with metadata.
- Retrieval: embed query, similarity search, optional filters/reranking.
- Augmentation: build prompt with user query + retrieved context.
- Generation: LLM answers using augmented prompt.
- Evaluation/monitoring: track quality, latency, and failures.
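A framework-free sketch of the middle of this pipeline (ingestion, retrieval, augmentation), assuming sentence-transformers is installed; the chunks, model, and query are illustrative and generation is omitted:

```python
# Minimal RAG sketch: embed chunks, retrieve by cosine similarity, build the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Ingestion: documents are pre-chunked here; real pipelines chunk and attach metadata.
chunks = [
    "Refunds are available within 30 days of purchase.",
    "Support is reachable via chat from 9am to 5pm CET.",
    "Enterprise plans include a dedicated account manager.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

# Retrieval: embed the query and take the top-k most similar chunks.
query = "How long do I have to ask for my money back?"
query_vec = model.encode([query], normalize_embeddings=True)[0]
scores = chunk_vecs @ query_vec                     # cosine similarity (vectors are normalized)
top_k = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

# Augmentation: the prompt the LLM would receive.
prompt = "Answer using only this context:\n" + "\n".join(top_k) + f"\n\nQuestion: {query}"
print(prompt)
```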
Follow‑up questions
- How do you choose chunk size and overlap?
- What are common failure modes of RAG?
- How would you evaluate RAG quality?
Evaluation rubric
Strong
Covers ingestion, retrieval, augmentation, generation, and evaluation with concrete details.
OK
Describes retrieval + generation but misses ingestion or evaluation.
Weak
Only says "search + LLM" with no structure.
Q2 * How do you choose chunk size and overlap for documents?
Level: Intermediate
Expected answer
Consider:
- Semantic coherence (don't split mid‑sentence or mid‑concept).
- Model context window and cost.
- Task type (FAQ vs long‑form reasoning).
- Typical ranges: 200-800 tokens with 10-20% overlap.
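A sketch of a token-based chunker with overlap, using tiktoken for counting (sizes are illustrative defaults; production chunkers usually also respect sentence or section boundaries):

```python
# Token-window chunking with overlap; requires chunk_tokens > overlap_tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, chunk_tokens: int = 400, overlap_tokens: int = 60) -> list[str]:
    tokens = enc.encode(text)
    chunks, step = [], chunk_tokens - overlap_tokens
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):   # stop once the window reaches the end
            break
    return chunks

print(len(chunk_text("Some long policy document ..." * 200)))
```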
Follow‑up questions
- How would you empirically tune chunk size?
- What happens if chunks are too small or too large?
- How does chunking interact with reranking?
Evaluation rubric
Strong
Balances semantic coherence, context limits, and cost; suggests empirical tuning.
OK
Suggests a fixed size without reasoning or tuning strategy.
Weak
No understanding of why chunking matters.
Q3 * How do you handle hallucinations in a RAG system?
Level: Intermediate
Expected answer
Strategies:
- Improve retrieval quality (better embeddings, chunking, filters, reranking).
- Constrain the model to answer only from provided context.
- Ask the model to cite sources or say "I don't know".
- Use secondary verification for critical domains.
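A sketch of the prompt-constraint strategy; the exact wording and citation format are illustrative:

```python
# Grounding prompt that constrains answers to retrieved context and allows "I don't know".
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the answer is not in the context, reply exactly: "I don't know."
Cite the id of each context passage you used, like [doc-3].

Context:
{context}

Question: {question}
"""

prompt = GROUNDED_PROMPT.format(context="[doc-1] Refunds within 30 days ...",
                                question="Can I get a refund after 60 days?")
```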
Follow‑up questions
- How would you detect hallucinations automatically?
- What metrics would you track in production?
- When is hallucination acceptable vs unacceptable?
Evaluation rubric
Strong
Mentions retrieval quality, prompt constraints, and explicit "I don't know" behavior.
OK
Talks about "improving the model" but not retrieval or constraints.
Weak
No concrete mitigation strategies.
Agents
8+ questions
Q1 * What is an LLM agent and how is it different from a plain LLM call?
Level: Intermediate
Expected answer
An agent is a system where an LLM:
- Plans steps toward a goal.
- Selects and calls tools (APIs, DBs, services).
- Iterates based on intermediate results and state.
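A minimal sketch of that loop. call_llm() is a hypothetical provider wrapper and search_docs is a toy tool; real implementations usually use the provider's function/tool-calling API instead of parsing JSON by hand:

```python
# Plan -> act (tool call) -> observe -> repeat, with a hard step limit as a guard.
import json

def search_docs(query: str) -> str:
    """Toy tool; a real agent would call an API, DB, or service here."""
    return f"3 documents found for '{query}'"

TOOLS = {"search_docs": search_docs}

def call_llm(messages: list[dict]) -> str:
    """Hypothetical: returns either a JSON tool call or plain-text final answer."""
    return json.dumps({"tool": "search_docs", "args": {"query": "vacation policy"}})

def run_agent(goal: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):                      # step limit guards against infinite loops
        reply = call_llm(messages)
        try:
            action = json.loads(reply)
        except ValueError:
            return reply                            # plain text means the agent is done
        observation = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": observation})
    return "Stopped: step limit reached"

print(run_agent("How many vacation days do new hires get?"))
```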
Follow‑up questions
- What are risks of unconstrained agents?
- How would you debug an agent that loops?
- When would you avoid using agents?
Evaluation rubric
Strong
Mentions planning, tools, iteration, and control flow vs single‑shot prompts.
OK
Describes agents as "smart prompts" without tool use or planning details.
Weak
No distinction from basic chatbots.
GenAI System Design
8+ questions
Q1 * Design a scalable GenAI API for a resume‑review product.
Level: Intermediate
Expected answer
Components:
- API gateway + auth.
- Storage for resumes and metadata.
- Queue/worker layer for LLM calls.
- LLM provider(s) with fallback and retries.
- Logging, metrics, prompt/version management.
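One slice of this design is the worker's retry-and-fallback behavior. A sketch with hypothetical primary() and fallback() provider wrappers (limits and backoff are illustrative):

```python
# Worker-side retries with exponential backoff, degrading to a fallback provider.
import time

def primary(prompt: str) -> str:
    raise TimeoutError("simulated provider timeout")   # stand-in for the real primary API call

def fallback(prompt: str) -> str:
    return "response from fallback provider"           # stand-in for a secondary provider

def review_resume(prompt: str, retries: int = 2) -> str:
    for attempt in range(retries):
        try:
            return primary(prompt)
        except Exception:
            time.sleep(2 ** attempt)                    # exponential backoff between attempts
    return fallback(prompt)                             # degrade gracefully instead of failing

print(review_resume("Review this resume: ..."))
```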
Follow‑up questions
- How would you handle rate limits?
- What data would you log?
- How do you roll out new prompt versions safely?
Evaluation rubric
Strong
Describes end‑to‑end architecture, scaling, and observability considerations.
OK
Mentions API + LLM but not queues, logging, or versioning.
Weak
Only says "call the LLM from the backend".
Coding Tasks (Python * LangChain * LlamaIndex)
6+ tasks
Q1 * Implement a simple RAG pipeline in Python using LangChain.
Level: Intermediate
Expected answer
Candidate should outline:
- Load documents and split into chunks.
- Create embeddings and store in a vector store (FAISS/Chroma).
- Define a retriever from the vector store.
- Use a RetrievalQA or similar chain with an LLM.
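A sketch of the classic RetrievalQA pattern. LangChain's APIs and import paths change frequently, and OpenAIEmbeddings/ChatOpenAI and the file path are assumptions, so treat this as the shape of the pipeline rather than exact code for any specific version:

```python
# Classic LangChain RAG pipeline: load -> split -> embed -> store -> retrieve -> answer.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# 1. Load documents and split into chunks.
docs = TextLoader("handbook.txt").load()                      # path is illustrative
chunks = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100).split_documents(docs)

# 2. Embed chunks and store them in a vector store (FAISS here; Chroma works the same way).
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 3. Expose the store as a retriever.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 4. Wire the retriever and an LLM into a retrieval QA chain.
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(), retriever=retriever,
                                 return_source_documents=True)
result = qa.invoke({"query": "How many vacation days do new hires get?"})
print(result["result"])
```

Swapping the vector store or embedding model only changes steps 2-3, which is a point worth making explicitly in the interview.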
Follow‑up questions
- How would you swap the vector store or embedding model?
- How do you log retrieved documents?
- How would you add metadata filters?
Evaluation rubric
Strong
Knows LangChain primitives (loaders, splitters, embeddings, vector store, retriever, chain).
OK
Understands conceptually but not specific components or flow.
Weak
Cannot describe a working pipeline even at a high level.