AI Interview Prep | AI CareerStack

LLM Fundamentals

Core concepts, embeddings, context, limitations * beginner -> intermediate

12+ questions

Q1 * What is a large language model and how is it trained?

Level: Beginner

Expected answer

An LLM is a transformer‑based neural network trained on massive text corpora to predict the next token in a sequence. Key points:

Uses self‑attention to model long‑range dependencies.
Training objective is next‑token prediction (or masked token prediction for some models).
Pre‑training is followed by alignment steps like instruction tuning and RLHF.
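To make the next‑token objective concrete, here is a minimal sketch of the per‑position loss: the model emits a score for every vocabulary entry, and training minimizes the negative log‑probability of the token that actually came next. The four‑word vocabulary and the scores are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: -log p(correct next token)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical vocabulary and made-up model scores for a single position.
vocab = ["the", "cat", "sat", "mat"]
logits = [0.5, 2.0, 0.1, -1.0]

# The loss is low when the true next token got a high score, high otherwise.
loss_if_cat = next_token_loss(logits, vocab.index("cat"))
loss_if_mat = next_token_loss(logits, vocab.index("mat"))
```

Pre‑training averages this loss over every position in the corpus; the sketch covers only one position.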

Follow‑up questions

How does the transformer architecture differ from RNNs/LSTMs?
What is RLHF and why is it used?
What are some trade‑offs between model size and latency?

Evaluation rubric

Strong

Mentions transformers, next‑token prediction, large‑scale data, and alignment steps.

OK

Describes "AI that predicts text" but misses architecture or training details.

Weak

Vague answer like "a chatbot" with no mention of training or modeling.

Q2 * What are tokens and how do they affect cost and context?

Level: Beginner

Expected answer

Tokens are the units of text (sub‑words, words, or characters) that the model processes. They matter because:

Context length is measured in tokens, not characters.
API pricing is usually per token (commonly quoted per 1K or 1M tokens), often with separate rates for input and output.
Tokenization can split words in non‑obvious ways, affecting prompt length.
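A rough way to reason about the cost point is a back‑of‑the‑envelope estimator. The sketch below assumes about four characters per token, which is only a heuristic for English text; a real estimate should use the model's own tokenizer. The prices are placeholder numbers, not any provider's actual rates.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough heuristic, real tokenizers vary

def estimate_tokens(text):
    """Approximate token count from character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def estimate_cost(prompt, expected_output_tokens,
                  input_price_per_1k, output_price_per_1k):
    """Estimate the cost of one call given per-1K-token prices."""
    input_tokens = estimate_tokens(prompt)
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = expected_output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost

# Placeholder prices, chosen only to make the arithmetic easy to follow.
prompt = "x" * 4000  # about 1000 tokens under the heuristic
cost = estimate_cost(prompt, expected_output_tokens=500,
                     input_price_per_1k=0.01, output_price_per_1k=0.03)
```

Multiplying the per‑call figure by expected daily request volume gives a first estimate for a feature.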

Follow‑up questions

What happens when a prompt exceeds the context window?
How would you estimate token usage for a given feature?
Why do different models have different tokenizers?

Evaluation rubric

Strong

Connects tokens to context, pricing, and tokenizer behavior with concrete examples.

OK

Defines tokens but doesn't connect to cost or context limits.

Weak

Treats tokens as identical to words or characters.

Q3 * How do embeddings differ from generative LLM calls?

Level: Intermediate

Expected answer

Embeddings map text to dense vectors capturing semantic similarity, while generative calls produce new text by predicting tokens. Differences:

Embeddings: representation; used for search, clustering, retrieval.
Generation: autoregressive decoding; used for answering, summarization, etc.
Often separate models/endpoints optimized for each task.
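The retrieval use case can be illustrated with cosine similarity over embedding vectors. The three‑dimensional vectors below are invented for the example; a real embedding model returns vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up embeddings standing in for the output of an embedding model.
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "office dog": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]  # pretend embedding of "how do I get my money back?"

# Semantic search: pick the document whose vector points in the closest direction.
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
```

Cosine similarity ignores vector length and compares direction only, which is one reason it is a common default for embedding search.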

Follow‑up questions

How are embeddings used in RAG?
What is cosine similarity and why is it common?
When would you choose a smaller embedding model?

Evaluation rubric

Strong

Clearly separates representation vs generation and mentions retrieval/search use cases.

OK

Understands embeddings as "vectors" but cannot name concrete applications.

Weak

Treats embeddings as generic model outputs with no semantic meaning.

Q4 * What is a context window and how does it impact design?

Level: Intermediate

Expected answer

The context window is the maximum number of tokens the model can consider at once (prompt + output). It impacts:

How much history or retrieved context you can include.
Prompt engineering strategies (summarization, chunking, sliding windows).
Cost and latency for long prompts.
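One simple strategy for staying inside the window is to keep only the most recent messages that fit a token budget. The sketch below is illustrative: `count_tokens` counts whitespace‑separated words as a stand‑in for a real tokenizer, and the function and parameter names are hypothetical.

```python
def count_tokens(text):
    """Stand-in tokenizer: whitespace words. A real system would use the model's tokenizer."""
    return len(text.split())

def fit_history(system_prompt, history, max_context_tokens, reserved_for_output):
    """Keep the newest messages that fit after reserving room for the reply."""
    budget = max_context_tokens - reserved_for_output - count_tokens(system_prompt)
    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break  # older messages are dropped once the budget is used up
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept

history = ["one two three", "four five", "six"]
trimmed = fit_history("be brief", history,
                      max_context_tokens=10, reserved_for_output=4)
```

Dropping old turns loses information, so production systems often summarize the dropped part instead of discarding it.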

Follow‑up questions

How would you handle very long documents?
What are the trade‑offs of using a model with a 200K‑token context window?
How does context window relate to RAG design?

Evaluation rubric

Strong

Connects context window to architecture decisions (RAG, summarization, cost).

OK

Defines context window but not its practical implications.

Weak

No clear understanding of limits or design impact.

Q5 * What are common limitations of LLMs?

Level: Intermediate

Expected answer

Limitations include:

Hallucinations (confident but incorrect answers).
Lack of up‑to‑date knowledge (training data is frozen at a cutoff date).
Sensitivity to prompt phrasing.
Biases from training data.
Non‑determinism and reproducibility challenges.

Follow‑up questions

How do you mitigate hallucinations in production?
When is fine‑tuning appropriate vs RAG?
How do you handle safety and bias concerns?

Evaluation rubric

Strong

Lists multiple limitations and ties them to mitigation strategies.

OK

Mentions hallucinations but not other important limitations.

Weak

Claims LLMs are "almost perfect" or ignores limitations.

Prompt Engineering

Patterns, few‑shot, structured outputs, debugging * beginner -> intermediate

10+ questions

Q1 * What makes a prompt production‑ready?

Level: Beginner

Expected answer

A production‑ready prompt is:

Clear about role, task, and constraints.
Explicit about output format and style.
Robust to minor input variations.
Tested against edge cases and evaluated with metrics.

Follow‑up questions

How do you version prompts?
How would you A/B test prompts?
How do you handle localization or multi‑language prompts?

Evaluation rubric

Strong

Mentions clarity, constraints, format, and testing/metrics.

OK

Talks about "clear instructions" but not evaluation or robustness.

Weak

Only says "ask nicely" or similar vague advice.

Q2 * Compare zero‑shot, few‑shot, and chain‑of‑thought prompting.

Level: Intermediate

Expected answer

Zero‑shot: only instructions; good for simple tasks.
Few‑shot: add examples; good for style, format, or nuanced tasks.
Chain‑of‑thought: encourage step‑by‑step reasoning; good for reasoning and math.

A strong answer should also mention trade‑offs in context usage and latency.
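The three patterns differ mainly in what surrounds the input. The helper functions below are a hypothetical sketch of how each prompt might be assembled; the exact wording is an assumption, not a recommended template.

```python
def zero_shot(task, text):
    """Instructions only."""
    return f"{task}\n\nInput: {text}\nOutput:"

def few_shot(task, examples, text):
    """Instructions plus worked input/output pairs."""
    shots = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    return f"{task}\n\n{shots}\n\nInput: {text}\nOutput:"

def chain_of_thought(task, text):
    """Instructions plus a request to reason before answering."""
    return (
        f"{task}\n\nInput: {text}\n"
        "Think through the problem step by step, "
        "then give the final answer on a line starting with 'Answer:'."
    )

task = "Classify the sentiment of the review as positive or negative."
examples = [("Loved it", "positive"), ("Broke in a day", "negative")]
review = "Works fine but arrived late"

zs = zero_shot(task, review)
fs = few_shot(task, examples, review)
cot = chain_of_thought(task, review)
```

The few‑shot prompt is longer than the zero‑shot one, and chain‑of‑thought lengthens the output, which is where the context and latency trade‑offs come from.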

Follow‑up questions

When would you avoid chain‑of‑thought?
How do you choose examples for few‑shot prompts?
How does this relate to evaluation?

Evaluation rubric

Strong

Clearly distinguishes all three and mentions trade‑offs and use cases.

OK

Knows definitions but not when to use each pattern.

Weak

Confuses few‑shot with training or fine‑tuning.

Q3 * How do you design prompts for structured JSON output?

Level: Intermediate

Expected answer

Strategies:

Specify exact JSON schema and field types.
Provide one or more valid examples.
Instruct the model to output only JSON, no extra text.
Use validators and repair logic for malformed JSON.
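The validation and repair step might look like the sketch below. It assumes a hypothetical two‑field schema and a single repair tactic (stripping any text around the outermost braces); real systems often add a retry that feeds the validation errors back to the model.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"name": str, "score": int}

def extract_json(raw):
    """Repair step: drop any chatter before the first '{' or after the last '}'."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    return json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError

def validate(data):
    """Return a list of schema problems; an empty list means the data is valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    return errors

def parse_model_output(raw):
    """Extract, parse, and validate; raise ValueError so the caller can retry."""
    data = extract_json(raw)
    errors = validate(data)
    if errors:
        raise ValueError("; ".join(errors))
    return data

parsed = parse_model_output('Sure! {"name": "Ada", "score": 7} Hope that helps.')
```

Raising one exception type for both parse and schema failures keeps the caller's retry logic simple.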

Follow‑up questions

When would you use function/tool calling instead?
How do you handle optional fields?
How do you test robustness of structured prompts?

Evaluation rubric

Strong

Mentions schema, examples, strict instructions, and validation/repair strategies.

OK

Says "ask for JSON" but not how to enforce or validate it.

Weak

No awareness of structured output challenges.

Q4 * How do you debug a prompt that behaves inconsistently?

Level: Intermediate

Expected answer

Steps:

Collect failing examples and categorize failure modes.
Simplify the prompt to isolate the cause.
Add clarifications, constraints, or examples.
Test across a representative evaluation set.
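The steps above can be sketched as a tiny evaluation harness that runs every case and counts failures by category. The failure labels and the `generate` callable are hypothetical; in practice `generate` would wrap the prompt plus a model call.

```python
from collections import Counter

def classify_failure(output, expected):
    """Return a failure label, or None if the output passes. Labels are illustrative."""
    if not output.strip():
        return "empty"
    if expected.lower() not in output.lower():
        return "wrong_answer"
    if len(output) > 200:
        return "too_long"
    return None

def run_eval(generate, cases):
    """Run all cases and return (pass rate, failure counts by category)."""
    failures = Counter()
    for prompt_input, expected in cases:
        label = classify_failure(generate(prompt_input), expected)
        if label is not None:
            failures[label] += 1
    passed = len(cases) - sum(failures.values())
    return passed / len(cases), failures

# Canned outputs standing in for real model responses.
canned = {"2+2": "4", "capital of France": "Lyon", "color of the sky": ""}
cases = [("2+2", "4"), ("capital of France", "Paris"), ("color of the sky", "blue")]

pass_rate, failures = run_eval(lambda x: canned[x], cases)
```

Re‑running the same harness after each prompt change shows whether a fix for one failure category introduced regressions in another.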

Follow‑up questions

How do you know when to stop tweaking prompts and change the model instead?
How would you log prompt failures in production?
How do you avoid overfitting prompts to a small eval set?

Evaluation rubric

Strong

Treats prompt debugging like normal software debugging with data and evaluation sets.

OK

Suggests "try different wording" without a systematic approach.

Weak

No clear debugging strategy.

RAG System Design

Pipelines, chunking, retrieval, evaluation, access control * intermediate -> advanced

10+ questions

Q1 * Describe the architecture of a RAG system end‑to‑end.

Level: Intermediate

Expected answer

A typical RAG pipeline:

Ingestion: load docs, chunk, embed, store in vector DB with metadata.
Retrieval: embed query, similarity search, optional filters/reranking.
Augmentation: build prompt with user query + retrieved context.
Generation: LLM answers using augmented prompt.
Evaluation/monitoring: track quality, latency, and failures.
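The stages above can be sketched end to end without any external services. Everything here is a toy stand‑in: `embed` uses word counts instead of an embedding model, the index is a plain list instead of a vector DB, each document is treated as one chunk, and `fake_llm` replaces the model call.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def ingest(docs):
    """Ingestion: embed each document and store it with its source metadata."""
    return [{"source": s, "text": t, "vector": embed(t)} for s, t in docs.items()]

def retrieve(index, query, k=1):
    """Retrieval: embed the query and return the k most similar entries."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: similarity(query_vector, item["vector"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, chunks):
    """Augmentation: combine the user query with the retrieved context."""
    context = "\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

def fake_llm(prompt):
    """Generation: placeholder for a real model call."""
    return f"(model answer based on {len(prompt)} prompt characters)"

index = ingest({
    "refunds.md": "refunds are issued within 14 days",
    "shipping.md": "orders ship within 2 business days",
})
question = "when are refunds issued"
top = retrieve(index, question)
prompt = build_prompt(question, top)
answer = fake_llm(prompt)
```

Logging `top` alongside `answer` for each request is a simple starting point for the evaluation and monitoring stage.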

Follow‑up questions

How do you choose chunk size and overlap?
What are common failure modes of RAG?
How would you evaluate RAG quality?

Evaluation rubric

Strong

Covers ingestion, retrieval, augmentation, generation, and evaluation with concrete details.

OK

Describes retrieval + generation but misses ingestion or evaluation.

Weak

Only says "search + LLM" with no structure.

Q2 * How do you choose chunk size and overlap for documents?

Level: Intermediate

Expected answer

Consider:

Semantic coherence (don't split mid‑sentence or mid‑concept).
Model context window and cost.
Task type (FAQ vs long‑form reasoning).
Typical starting range: 200-800 tokens per chunk with 10-20% overlap, then tuned empirically.
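Fixed‑size chunking with overlap can be sketched as a sliding window over a token list. This minimal version works on any list of tokens and ignores sentence boundaries, so it illustrates the size and overlap mechanics only, not semantic coherence.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

# Ten placeholder tokens, windows of 4 with 1 token shared between neighbors.
chunks = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
```

The overlap repeats the tail of each chunk at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.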

Follow‑up questions

How would you empirically tune chunk size?
What happens if chunks are too small or too large?
How does chunking interact with reranking?

Evaluation rubric

Strong

Balances semantic coherence, context limits, and cost; suggests empirical tuning.

OK

Suggests a fixed size without reasoning or tuning strategy.

Weak

No understanding of why chunking matters.

Q3 * How do you handle hallucinations in a RAG system?

Level: Intermediate

Expected answer

Strategies:

Improve retrieval quality (better embeddings, chunking, filters, reranking).
Constrain the model to answer only from provided context.
Ask the model to cite sources or say "I don't know".
Use secondary verification for critical domains.
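Two of these strategies, constraining the model to the context and requiring citations, can be combined with a cheap post‑check. The sketch below is an assumption about how that might look: the prompt wording, the `[chunk id]` citation format, and the fallback string are all hypothetical. The check only confirms that cited ids exist in the retrieved context; it does not verify that the cited text supports the claim.

```python
import re

FALLBACK = "I don't know based on the provided documents."

def grounded_prompt(question, chunks):
    """Build a prompt that restricts the model to the context and asks for citations."""
    context = "\n".join(f"[{cid}] {text}" for cid, text in chunks.items())
    return (
        "Answer using only the context. Cite chunk ids in square brackets. "
        f"If the context does not contain the answer, reply exactly: {FALLBACK}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def check_citations(answer, chunks):
    """Replace answers with missing or unknown citations by the fallback."""
    if answer.strip() == FALLBACK:
        return answer
    cited = set(re.findall(r"\[([^\]]+)\]", answer))
    if not cited or not cited <= set(chunks):
        return FALLBACK
    return answer

chunks = {"doc1": "Refunds take 14 days."}
ok = check_citations("Refunds take 14 days [doc1].", chunks)
bad_source = check_citations("Refunds take 3 days [doc9].", chunks)
no_source = check_citations("Probably a week.", chunks)
```

For critical domains this check would sit in front of a stronger verifier, such as a second model pass or human review.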

Follow‑up questions

How would you detect hallucinations automatically?
What metrics would you track in production?
When is hallucination acceptable vs unacceptable?

Evaluation rubric

Strong

Mentions retrieval quality, prompt constraints, and explicit "I don't know" behavior.

OK

Talks about "improving the model" but not retrieval or constraints.

Weak

No concrete mitigation strategies.

Agents

Tool use, planning, loops, safety * intermediate -> advanced

8+ questions

Q1 * What is an LLM agent and how is it different from a plain LLM call?

Level: Intermediate

Expected answer

An agent is a system where an LLM:

Plans steps toward a goal.
Selects and calls tools (APIs, DBs, services).
Iterates based on intermediate results and state.

It differs from a plain call by adding tool use, control flow, and memory around the model.
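The control flow around the model can be sketched as a loop with a step limit. In this illustration `scripted_model` is a hard‑coded stand‑in for an LLM that returns the next action, and the tool registry and action format are hypothetical.

```python
def calculator(expression):
    """Tiny tool: evaluates 'a op b' for +, -, * without using eval."""
    left, op, right = expression.split()
    a, b = float(left), float(right)
    return str({"+": a + b, "-": a - b, "*": a * b}[op])

TOOLS = {"calculator": calculator}

def scripted_model(goal, observations):
    """Stand-in for an LLM: picks the next action from the current state."""
    if not observations:
        return {"action": "calculator", "input": "6 * 7"}
    return {"action": "finish", "input": f"The result is {observations[-1]}"}

def run_agent(goal, model, max_steps=5):
    """Plan, act, observe, repeat; stop on 'finish' or when the step limit is hit."""
    observations = []  # memory of intermediate tool results
    for _ in range(max_steps):
        decision = model(goal, observations)
        if decision["action"] == "finish":
            return decision["input"]
        tool = TOOLS.get(decision["action"])
        if tool is None:
            observations.append(f"unknown tool: {decision['action']}")
            continue
        observations.append(tool(decision["input"]))
    return "stopped: step limit reached"

result = run_agent("multiply 6 by 7", scripted_model)
```

The `max_steps` cap is the simplest guard against an agent that loops; a plain LLM call has no such loop to guard.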

Follow‑up questions

What are the risks of unconstrained agents?
How would you debug an agent that loops?
When would you avoid using agents?

Evaluation rubric

Strong

Mentions planning, tools, iteration, and control flow vs single‑shot prompts.

OK

Describes agents as "smart prompts" without tool use or planning details.

Weak

No distinction from basic chatbots.

GenAI System Design

APIs, scaling, evaluation, observability * intermediate -> advanced

8+ questions

Q1 * Design a scalable GenAI API for a resume‑review product.

Level: Intermediate

Expected answer

Components:

API gateway + auth.
Storage for resumes and metadata.
Queue/worker layer for LLM calls.
LLM provider(s) with fallback and retries.
Logging, metrics, prompt/version management.

A strong answer should also mention latency vs cost trade‑offs and caching.
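The provider layer with fallback, retries, and caching might be sketched as follows. The providers here are plain functions standing in for real SDK clients, `ProviderError` is a hypothetical exception type, and the cache is an in‑memory dict keyed by the prompt text; a real service would likely key on prompt version and model as well and use a shared cache.

```python
import time

class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails (rate limit, timeout)."""

def call_with_fallback(prompt, providers, cache, max_retries=2, base_delay=0.0):
    """Try each provider in order, retrying with exponential backoff, and cache successes."""
    if prompt in cache:
        return cache[prompt]
    last_error = None
    for provider in providers:
        for attempt in range(max_retries):
            try:
                result = provider(prompt)
            except ProviderError as error:
                last_error = error
                time.sleep(base_delay * (2 ** attempt))  # back off before retrying
            else:
                cache[prompt] = result
                return result
    raise ProviderError(f"all providers failed: {last_error}")

# Demo with fake providers that count how often they are called.
calls = {"primary": 0, "backup": 0}

def primary(prompt):
    calls["primary"] += 1
    raise ProviderError("rate limited")

def backup(prompt):
    calls["backup"] += 1
    return f"review for: {prompt}"

cache = {}
first = call_with_fallback("resume text", [primary, backup], cache)
second = call_with_fallback("resume text", [primary, backup], cache)  # served from cache
```

The demo uses a zero delay so it runs instantly; a production setting would use a non‑zero `base_delay`, usually with jitter, and would run this inside the queue/worker layer rather than the request path.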

Follow‑up questions

How would you handle rate limits?
What data would you log?
How do you roll out new prompt versions safely?

Evaluation rubric

Strong

Describes end‑to‑end architecture, scaling, and observability considerations.

OK

Mentions API + LLM but not queues, logging, or versioning.

Weak

Only says "call the LLM from the backend".

Coding Tasks (Python * LangChain * LlamaIndex)

Hands‑on RAG, tools, and pipelines * intermediate -> advanced

6+ tasks

Q1 * Implement a simple RAG pipeline in Python using LangChain.

Level: Intermediate

Expected answer

Candidate should outline:

Load documents and split into chunks.
Create embeddings and store in a vector store (FAISS/Chroma).
Define a retriever from the vector store.
Combine the retriever with an LLM using RetrievalQA or a similar retrieval chain.

Exact syntax isn't required, but the flow should be correct.

Follow‑up questions

How would you swap the vector store or embedding model?
How do you log retrieved documents?
How would you add metadata filters?

Evaluation rubric

Strong

Knows LangChain primitives (loaders, splitters, embeddings, vector store, retriever, chain).

OK

Understands conceptually but not specific components or flow.

Weak

Cannot describe a working pipeline even at a high level.
