Role Roadmap * 8 Stages * 4-9 Months to Job-Ready
Your path to becoming a GenAI Engineer
From Python basics to deploying production LLM systems -- covering RAG, agents, fine-tuning, and system design. Built around what top GenAI teams actually expect on day one.
GenAI Engineer
Design, build, and ship production LLM systems -- from prompt to deployment
$140k
Avg US Salary
Explosive
Job Demand
4-9mo
Time to Job-Ready
Python
Primary Language
Skills You'll Build
✓ Python
✓ OpenAI / Anthropic API
✓ LangChain / LlamaIndex
✓ Vector DBs
✓ RAG Pipelines
+ HuggingFace
+ FastAPI
+ Docker
+ AWS Bedrock
+ Weights & Biases
LoRA / QLoRA
vLLM
Kubernetes
Guardrails AI
Essential
Strongly Recommended
Nice to Have
Salary Range (US)
$95-120k
Junior
0-2 years
$125-155k
Mid-level
2-4 years
$160-220k+
Senior
4+ years
Which Roles Does This Roadmap Prepare You For?
See all 15 AI roles →
Directly prepares you
Strong overlap -- skills transfer
Not covered here
Generative AI Engineer
✓ Primary target role
AI Engineer
✓ Strong fit — ~80% overlap
Prompt Engineer
↗ Stages 2–4 directly apply
AI Agent Engineer
↗ See Agentic AI Roadmap
AI Safety Engineer
↗ Stage 7 (Eval & Safety) applies
ML / DL Engineer
→ See ML Engineer Roadmap
MLOps Engineer
→ Not covered here
AI Research Scientist
→ Needs separate PhD-track path
8 Stages * ~26h Total
01
Foundations
Solid foundations before you touch any LLM. You'll cover modern Python patterns, REST API design, async programming, and just enough cloud to start shipping. Skip if you're already comfortable with these.
Python 3.10+ features (dataclasses, typing, walrus)
REST API consumption with httpx / requests
Async Python -- asyncio, aiohttp
JSON, environment variables & secrets management
Cloud basics -- S3, Lambda, IAM roles
Docker fundamentals -- images, containers, Compose
Git workflows for AI projects
Learning Resources
Mini Project: Async API Aggregator
Build an async Python script that fetches data from 3 public APIs concurrently, transforms the JSON, and writes results to S3. Containerise it with Docker. Estimated: 4-6 hours.
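The core pattern behind this mini project -- launching several fetches concurrently and collecting the results -- can be sketched with nothing but the standard library. The `fetch` coroutine below is a hypothetical stand-in for real API calls; a production version would use `httpx.AsyncClient` or `aiohttp` instead of `asyncio.sleep`:

```python
import asyncio

# Hypothetical stand-in for a real API call -- swap asyncio.sleep for an
# httpx.AsyncClient request in the actual project.
async def fetch(source: str, delay: float) -> dict:
    await asyncio.sleep(delay)          # simulate network latency
    return {"source": source, "items": len(source)}

async def aggregate() -> list[dict]:
    # gather() runs all three coroutines concurrently, so total wall time
    # is roughly max(delay), not the sum of the delays.
    return await asyncio.gather(
        fetch("github", 0.10),
        fetch("hackernews", 0.20),
        fetch("weather", 0.15),
    )

results = asyncio.run(aggregate())
print([r["source"] for r in results])   # gather preserves call order
```

Note that `asyncio.gather` returns results in the order the coroutines were passed in, regardless of which finished first -- useful when you need to match responses back to their sources.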
02
LLM Fundamentals
Understand what LLMs actually are -- not just how to call them. You'll learn how tokenisation works, what embeddings represent geometrically, how autoregressive inference happens, and the basics of fine-tuning. This mental model separates strong from average engineers.
Transformer architecture -- attention, keys, queries, values
Tokenisation -- BPE, SentencePiece, token counting
Embeddings -- semantic space, cosine similarity
Autoregressive inference & sampling strategies (temp, top-p, top-k)
Context window mechanics & KV cache
Pre-training vs. instruction tuning vs. RLHF
LoRA & QLoRA fine-tuning intuition
Model families -- GPT-4o, Claude, Gemini, LLaMA, Mistral
Calling APIs -- OpenAI, Anthropic, HuggingFace Inference
Learning Resources
Project: Token Counter & Embedding Explorer
Build a CLI tool that tokenises any text, shows token IDs, counts cost, and visualises embedding similarity between sentence pairs using OpenAI's text-embedding-3-small. Plot a UMAP cluster of 50 sentences. Estimated: 5-8 hours.
03
Prompt Engineering
Prompt engineering is not just "write clear instructions." It's a systematic discipline with measurable outputs. Learn the techniques used by production teams at Anthropic, OpenAI, and Google -- and how to test them rigorously.
System vs. user vs. assistant roles
Zero-shot, one-shot, few-shot prompting
Chain-of-thought (CoT) & step-back prompting
Structured output -- JSON mode, function calling
Prompt chaining & decomposition
Meta-prompting & self-critique loops
Prompt injection attacks & defence
Prompt versioning with LangSmith / PromptLayer
Learning Resources
Project: Structured Data Extractor
Build an extraction pipeline that takes unstructured job descriptions and outputs clean JSON (role, skills, salary, location) using function calling / JSON mode. Add a test harness that scores accuracy against 50 golden examples. Estimated: 6-10 hours.
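The test harness in this project can start very small: compare each extracted field against a golden example and report exact-match accuracy. `score_extraction` below is a hypothetical helper, not a library function -- a sketch of the scoring step only, with the LLM extraction call left out:

```python
def score_extraction(predicted: dict, golden: dict) -> float:
    """Fraction of golden fields the extractor got exactly right."""
    if not golden:
        return 1.0
    correct = sum(1 for k, v in golden.items() if predicted.get(k) == v)
    return correct / len(golden)

# One golden example; a real harness loops over all 50 and averages.
golden = {"role": "GenAI Engineer", "location": "Remote", "salary": "$140k"}
predicted = {"role": "GenAI Engineer", "location": "Remote", "salary": "$150k"}

print(score_extraction(predicted, golden))  # 2 of 3 fields match
```

Exact match is deliberately strict; once it works, you can relax individual fields (e.g. normalised salary ranges, case-insensitive locations) and watch how each change moves the aggregate score.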
04
RAG Systems
Retrieval-Augmented Generation is the most deployed GenAI pattern in production. Build RAG from scratch, understand every failure mode, and learn to evaluate pipelines rigorously. This stage alone can get you hired.
RAG architecture -- naive, advanced, modular
Document loaders -- PDF, HTML, Notion, Confluence
Chunking strategies -- fixed, recursive, semantic, RAPTOR
Embedding models -- choice, dimensions, speed
Vector databases -- Pinecone, Weaviate, Chroma, pgvector
Retrieval -- dense, sparse (BM25), hybrid
Re-ranking with cross-encoders (Cohere, FlashRank)
Query rewriting, HyDE, step-back
RAG evaluation -- RAGAS, faithfulness, relevance, answer correctness
Common failures -- hallucination, retrieval drift, context bleed
Metadata filtering & multi-index routing
Learning Resources
Project: PDF Q&A with RAG Evaluation
Build a production-quality PDF chatbot using LangChain + Chroma. Implement chunking experiments (fixed vs. semantic), hybrid retrieval, and re-ranking. Score your pipeline with RAGAS across 3 chunking strategies. Deploy as a FastAPI endpoint. Estimated: 10-15 hours.
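The simplest chunking strategy in the experiments above -- fixed-size with overlap -- fits in a few lines. A minimal sketch (character-based for clarity; real pipelines usually chunk by tokens or sentences):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Each chunk shares `overlap` characters with the previous one, so a
    # sentence split at a chunk boundary still appears whole in at least
    # one chunk -- the main reason overlap exists.
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(500))   # toy 500-char document
chunks = chunk_text(doc, size=200, overlap=50)
print(len(chunks))                               # last chunk is shorter
```

Larger overlap improves recall at a boundary but inflates index size and embedding cost -- one of the trade-offs the chunking experiments in this project are meant to surface.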
05
Agents
Agents are LLMs that can take actions in the world. Learn to design reliable agentic systems -- from simple tool use to multi-agent workflows with memory. The hardest part isn't making them work; it's making them work reliably.
ReAct pattern -- Reasoning + Acting loops
Tool / function calling -- design & schema
Agent frameworks -- LangGraph, AutoGen, CrewAI
Planning strategies -- linear, DAG, tree-of-thought
Memory systems -- episodic, semantic, procedural
Long-term memory with vector stores & graph DBs
Multi-agent orchestration & delegation
Human-in-the-loop checkpointing
Debugging agent failures -- tracing, replay
Learning Resources
Project: Research Agent with Memory
Build a research agent using LangGraph that searches the web, reads articles, deduplicates findings, and writes a structured report. Add episode memory so it recalls previous research sessions. Implement human-in-the-loop for approval before writing output. Estimated: 12-16 hours.
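Stripped of frameworks, the ReAct loop above is just: ask the model for an action, run the matching tool, append the observation, repeat. The sketch below uses a hard-coded `fake_llm` stand-in so the control flow is visible -- a real agent would send the scratchpad to an actual model:

```python
def calculator(expr: str) -> str:
    return str(eval(expr))  # demo only -- never eval untrusted input

TOOLS = {"calculator": calculator}

def fake_llm(scratchpad: list[str]) -> dict:
    # Scripted stand-in: call the tool once, then finish.
    if not any(s.startswith("Observation:") for s in scratchpad):
        return {"action": "calculator", "input": "6 * 7"}
    return {"action": "finish", "input": "The answer is 42."}

def run_agent(question: str, max_steps: int = 5) -> str:
    scratchpad = [f"Question: {question}"]
    for _ in range(max_steps):      # hard step limit guards against loops
        step = fake_llm(scratchpad)
        if step["action"] == "finish":
            return step["input"]
        observation = TOOLS[step["action"]](step["input"])
        scratchpad.append(f"Observation: {observation}")
    return "Gave up after max_steps."

print(run_agent("What is 6 * 7?"))
```

The step limit is the first reliability lever you reach for: without it, a confused model can loop on the same tool call forever. Frameworks like LangGraph give you the same loop plus checkpointing, tracing, and replay.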
06
GenAI System Design
Designing GenAI systems at scale requires different thinking from traditional software. Learn how to trade off latency, cost, and quality; architect for LLM fallbacks; and design systems that stay reliable when the model surprises you.
GenAI architecture patterns -- gateway, router, fallback
Latency optimisation -- streaming, caching, batching
Prompt caching & semantic caching (GPTCache)
Model routing -- cost vs. quality trade-off
LLM observability -- tokens, latency, cost dashboards
Guardrail layers -- input sanitisation, output validation
Multi-tenant LLM architecture & rate limit management
Learning Resources
Design Challenge: Multi-tenant LLM Gateway
Design (and partially implement) a multi-tenant LLM gateway that routes requests between GPT-4o and Claude based on cost budget, caches identical prompts semantically, and emits per-tenant cost dashboards. Write an architectural decision record (ADR). Estimated: 8-12 hours.
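The cost-vs-quality routing decision at the heart of this challenge can be prototyped as a pure function. The prices and quality scores below are hypothetical placeholders -- real per-token prices change and belong in config, not code:

```python
# Hypothetical per-1k-token prices and quality scores -- illustration only.
MODELS = {
    "gpt-4o":       {"cost_per_1k": 0.005, "quality": 0.95},
    "claude-haiku": {"cost_per_1k": 0.001, "quality": 0.80},
}

def route(prompt_tokens: int, budget_usd: float) -> str:
    # Pick the highest-quality model whose estimated cost fits the budget;
    # if nothing fits, degrade to the cheapest model rather than fail.
    affordable = [
        (name, spec["quality"]) for name, spec in MODELS.items()
        if prompt_tokens / 1000 * spec["cost_per_1k"] <= budget_usd
    ]
    if affordable:
        return max(affordable, key=lambda pair: pair[1])[0]
    return min(MODELS, key=lambda name: MODELS[name]["cost_per_1k"])

print(route(2000, budget_usd=0.05))    # budget covers the stronger model
print(route(2000, budget_usd=0.001))   # over budget -- degrade to cheapest
```

In the full gateway this function sits behind the semantic cache (only uncached requests get routed) and its decisions feed the per-tenant cost dashboard.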
07
Deployment
Shipping GenAI to production has unique constraints: model size, GPU availability, cold starts, and inference cost. Learn to deploy across managed APIs, serverless containers, and self-hosted GPU infrastructure.
Managed inference APIs -- OpenAI, Bedrock, Vertex AI
FastAPI + streaming responses (SSE)
Serverless containers -- AWS Lambda, Cloud Run
GPU inference -- Modal, RunPod, Replicate
Self-hosted models -- vLLM, Ollama, llama.cpp
CI/CD for GenAI -- GitHub Actions + model registry
Scaling with Kubernetes & horizontal pod autoscaling
Cost monitoring & budget alerts
Learning Resources
Project: Production LLM API with CI/CD
Deploy a FastAPI app that streams responses from Claude/GPT-4o, with a fallback to a self-hosted Mistral on Modal. Add GitHub Actions CI that runs integration tests and deploys on merge. Measure p50/p99 latency and track cost per request. Estimated: 10-14 hours.
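The fallback logic in this project is a try/except around the primary provider call. A minimal sketch -- `call_primary` and `call_fallback` are hypothetical stand-ins for real SDK calls (Anthropic/OpenAI clients, or a self-hosted vLLM endpoint), with the primary scripted to fail so the fallback path runs:

```python
def call_primary(prompt: str) -> str:
    # Stand-in for the managed-API call; simulates a provider outage.
    raise TimeoutError("primary provider timed out")

def call_fallback(prompt: str) -> str:
    # Stand-in for the self-hosted model endpoint.
    return f"[fallback] echo: {prompt}"

def generate(prompt: str) -> str:
    try:
        return call_primary(prompt)
    except (TimeoutError, ConnectionError):
        # In production: log the failure, increment a metric, then degrade
        # gracefully to the cheaper / self-hosted model.
        return call_fallback(prompt)

print(generate("hello"))
```

Catching only specific exception types matters: a bare `except` would also swallow bugs in your own code and silently reroute them to the fallback model.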
08
Evaluation & Safety
You can't improve what you can't measure. Evaluation and safety are production requirements, not afterthoughts. Learn to build robust eval suites, run red-teaming exercises, and implement output guardrails that don't destroy UX.
LLM evaluation frameworks -- Evals (OpenAI), HELM, DeepEval
Reference-based vs. LLM-as-judge evaluation
Hallucination detection -- faithfulness, groundedness
Bias & toxicity measurement
Red-teaming techniques -- jailbreaks, prompt injection
Input guardrails -- intent classification, PII detection
Output guardrails -- Guardrails AI, NeMo Guardrails
Responsible AI frameworks -- EU AI Act, NIST RMF basics
Learning Resources
Capstone: GenAI Eval & Safety Suite
Build an automated evaluation suite for your RAG chatbot from Stage 4. Implement LLM-as-judge scoring, a red-team test set with 30 adversarial prompts, PII detection guardrail, and a Streamlit dashboard showing pass/fail trends over time. Estimated: 12-18 hours.
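The PII guardrail in this capstone can start as a few regexes. This is a deliberately minimal sketch, not a complete detector -- production systems pair patterns like these with dedicated tooling and ML-based entity recognition:

```python
import re

# Two illustrative patterns; real deployments need many more (names,
# addresses, card numbers, national IDs) and locale-aware variants.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    # Replace each detected entity with a typed placeholder and report
    # which categories fired -- the hits feed the pass/fail dashboard.
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label.upper()}]", text)
    return text, found

clean, hits = redact("Contact jane@example.com or 555-867-5309.")
print(clean)
print(hits)
```

Running this as an output guardrail (after generation, before the response reaches the user) catches the common failure where a RAG pipeline faithfully quotes PII straight out of a retrieved document.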
🚀
Ready to Start?
Generate a personalised GenAI roadmap based on your current skills and target role. Takes 2 minutes.
Career Planning
Ready to build your personalized AI career plan?