Generative AI  ·  Beginner to Advanced

Generative AI — Complete Guide

From what GenAI is and how LLMs work, through RAG systems, fine-tuning, agents, and production deployment — everything you need to build real GenAI applications.

8 sections · ~2–4 hrs read · Beginner → Advanced
Topics: LLMs · Tokens & Embeddings · Prompt Engineering · RAG Systems · Fine-Tuning · AI Agents · Production Deployment
1 What is Generative AI?
🤖 Generative AI = Models that create new content — text, images, code, audio, video — rather than just classify or predict.

Traditional AI classified inputs into categories or predicted a value. Generative AI produces new outputs that didn’t exist before: write an essay, generate a photo, complete code, synthesise a voice. The outputs look and feel human-created because these models have learned the statistical patterns of human-generated content at internet scale.

Traditional AI vs. Generative AI
Traditional AI / ML
  • Classifies or predicts from a fixed set of outputs
  • “Is this email spam?” → Yes / No
  • “What is the house price?” → $450,000
  • Deterministic, bounded output space
  • Easy to evaluate (accuracy, RMSE)
Generative AI
  • Generates novel outputs from open-ended inputs
  • “Explain this concept simply” → full paragraph
  • “Draw a sunset over mountains” → new image
  • Probabilistic, unbounded output space
  • Harder to evaluate (human feedback, LLM judges)
The four modalities
💬 Text: ChatGPT, Claude, Gemini — reasoning, Q&A, summarisation, translation, code.
🎨 Image: Midjourney, DALL·E 3, Stable Diffusion — photo-realistic & artistic generation.
💻 Code: GitHub Copilot, Cursor, Claude — autocomplete, refactoring, test generation.
🎧 Audio / Video: ElevenLabs (voice), Sora (video), Udio (music) — the fastest-moving frontier.
2 LLMs, Tokens & Embeddings

Large Language Models (LLMs) are neural networks trained on massive text datasets to predict the next token. That simple objective — given this text, what comes next? — gives rise to emergent abilities: reasoning, translation, coding, and more.

How LLMs are trained
LLM training objective (simplified)
# Pre-training: predict next token on internet-scale text
# "The cat sat on the ___" -> model predicts "mat"

# Training data: CommonCrawl, books, GitHub, Wikipedia
# Scale: Llama 3 ~15 trillion tokens; GPT-4's exact count is undisclosed

# After pre-training: instruction tuning + RLHF
# Makes model helpful, harmless, and honest
What are tokens?

LLMs don’t read characters or words — they read tokens. A token is roughly 3-4 characters or 0.75 words in English. Tokenisation splits text into these sub-word units using algorithms like Byte-Pair Encoding (BPE).

Tokenisation example (tiktoken / OpenAI)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")

text = "Generative AI is transforming software."
tokens = enc.encode(text)
print(tokens)        # list of integer token IDs
print(len(tokens))   # token count (a few more tokens than the 5 words above)

# Cost implication: GPT-4 charges per token
# 1M tokens ~ $30 input / $60 output (approx)
What are embeddings?

Embeddings are dense vector representations of tokens, words, or entire sentences in a high-dimensional space. Semantically similar concepts have nearby embeddings. This is how LLMs “understand” meaning and how RAG systems find relevant documents.

Creating embeddings with OpenAI
from openai import OpenAI
client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do neural networks learn?"
)
embedding = response.data[0].embedding
# Returns a list of 1536 floats
# Cosine similarity between two embeddings = semantic similarity
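The last comment mentions cosine similarity. As a minimal sketch of what that looks like in practice (the two sentences below are arbitrary examples), you might compute it with NumPy:

Cosine similarity between embeddings (NumPy)
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

a = embed("How do neural networks learn?")
b = embed("What is backpropagation?")

# Cosine similarity = dot product of the vectors divided by the product of their norms
similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(round(similarity, 3))   # closer to 1.0 = more semantically similar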
3 Transformers & Attention
⚡ The Transformer architecture (2017) is the foundation of every modern LLM. Understanding it gives you intuition for why these models behave the way they do.

Before Transformers, NLP relied on RNNs that processed text token-by-token, making them slow and bad at long-range dependencies. Self-attention lets every token attend to every other token simultaneously — in one parallel operation.

How self-attention works (intuition)
Self-attention — simplified
# Each token is projected into Q (query), K (key), V (value)
Q = X @ W_q   # "What am I looking for?"
K = X @ W_k   # "What do I contain?"
V = X @ W_v   # "What information do I contribute?"

# Attention scores: how much each token should attend to each other
scores  = Q @ K.T / sqrt(d_k)   # scaled dot-product
weights = softmax(scores)        # probabilities

# Attended representation = weighted sum of Values
output = weights @ V

# Multi-head: run H parallel heads, concatenate
# Each head learns different relationship types
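To make the pseudocode above concrete, here is a small runnable NumPy version with random weights and toy dimensions; the causal mask is an extra illustration of how decoder-only models (see the next subsection) restrict each token to attend only to earlier positions:

Self-attention in NumPy (toy example)
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8              # 4 tokens, toy dimensions

X   = rng.normal(size=(seq_len, d_model))    # token embeddings
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_k)              # (seq_len, seq_len) attention scores

# Causal mask (decoder-only): each token may only attend to itself and earlier tokens
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax

output = weights @ V                          # attended representations
print(weights.round(2))                       # lower-triangular: no peeking ahead
print(output.shape)                           # (4, 8)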
Key model architectures
Encoder-only (BERT family)
  • Sees full input in both directions
  • Great for understanding tasks
  • Classification, NER, embeddings
  • Examples: BERT, RoBERTa, sentence-transformers
Decoder-only (GPT family)
  • Sees only tokens to the left (causal)
  • Great for generation tasks
  • Text generation, code, chat
  • Examples: GPT-4, Claude, Llama, Gemini
Using AI APIs
OpenAI Chat Completions API
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": "Explain RAG in 2 sentences."}
    ],
    temperature=0.7,   # 0 = deterministic, 2 = very random
    max_tokens=200
)
print(response.choices[0].message.content)
4 Build Your First AI App
⭐ Most popular section — get a working app in under 50 lines of Python.
Project 1: Chatbot with memory
Python — chatbot with conversation history
from openai import OpenAI

client   = OpenAI()
history  = [{"role": "system", "content": "You are a helpful AI assistant."}]

def chat(user_input: str) -> str:
    history.append({"role": "user", "content": user_input})
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=history
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

# REPL
while True:
    user = input("You: ")
    if user.lower() in ("exit", "quit"): break
    print(f"AI: {chat(user)}")
Streaming responses

Streaming makes your app feel instant by printing tokens as they arrive, rather than waiting for the full response.

Python — streaming with Server-Sent Events
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain neural networks."}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
Prompt engineering basics
Weak prompt
  • No role or context
  • Vague instruction
  • "Tell me about ML"
  • Output format undefined
  • Gets rambling, generic answers
Strong prompt
  • System role: "You are a senior ML engineer"
  • Specific task + audience + format
  • Constraints: "in 3 bullet points, beginner level"
  • Examples (few-shot) when needed
  • Gets precise, structured, actionable answers
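To make the contrast concrete, here is one possible version of a strong prompt written as a Chat Completions messages list; the role, task, and constraints are illustrative choices, not a fixed recipe:

A strong prompt as a messages list
from openai import OpenAI
client = OpenAI()

messages = [
    {"role": "system",
     "content": "You are a senior ML engineer explaining concepts to beginners."},
    {"role": "user",
     "content": ("Explain what overfitting is to a junior developer.\n"
                 "Format: exactly 3 bullet points, each under 20 words.\n"
                 "Include one everyday analogy.")}
]
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)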
5 RAG — Retrieval-Augmented Generation
🔥 RAG is the most in-demand GenAI engineering skill right now. It reduces LLM hallucination by grounding answers in real documents.

LLMs have a training cutoff and no access to your private data. RAG fixes this: at query time, retrieve the most relevant documents from your knowledge base and inject them into the prompt. The model then reasons over real, up-to-date information.

The RAG pipeline
RAG architecture — full flow
# INDEXING (run once)
docs   = load_documents("./docs/")          # PDF, web, DB
chunks = chunk_text(docs, size=500)         # split into segments
embeds = embed(chunks)                      # vectors
store  = VectorDB.upsert(chunks, embeds)    # persist

# QUERYING (every request)
query       = "What is our refund policy?"
query_embed = embed(query)
results     = store.similarity_search(query_embed, top_k=5)
context     = "\n\n".join(r.text for r in results)

prompt = f"""Answer based only on the context below.
Context: {context}
Question: {query}"""

answer = llm.complete(prompt)
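The block above is pseudocode. A minimal concrete version might look like the sketch below, using Chroma's in-memory client with its default embedding function; the document snippets, collection name, and model choice are placeholders:

Minimal RAG with Chroma + OpenAI (sketch)
import chromadb
from openai import OpenAI

client     = OpenAI()
collection = chromadb.Client().create_collection("docs")   # in-memory, default embeddings

# INDEXING
chunks = ["Refunds are issued within 14 days of purchase.",
          "Support is available Monday to Friday, 9am-6pm."]
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

# QUERYING
query   = "What is our refund policy?"
results = collection.query(query_texts=[query], n_results=2)
context = "\n\n".join(results["documents"][0])

prompt = f"""Answer based only on the context below.
Context: {context}
Question: {query}"""

answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)
print(answer.choices[0].message.content)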
Chunking strategies
📋 Fixed-size: Split every N characters or tokens. Fast but may cut sentences mid-thought. Good baseline.
Semantic: Split at sentence/paragraph boundaries. Preserves meaning. Better retrieval quality.
📄 Recursive: LangChain's RecursiveCharacterTextSplitter. Tries paragraphs, then sentences, then words. Most robust (see the sketch after this list).
📝 Document-aware: Respects structure (Markdown headers, HTML tags, PDF sections). Requires parsing.
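A small sketch of the recursive strategy using LangChain's splitter; the chunk size and overlap are common starting values rather than recommendations, and the sample text is a placeholder:

Recursive chunking with LangChain
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "First paragraph of a long document.\n\nSecond paragraph. It keeps going..."  # placeholder

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # max characters per chunk
    chunk_overlap=50     # overlap preserves context across chunk boundaries
)
chunks = splitter.split_text(text)
print(len(chunks), "chunks")   # splits on paragraphs first, then lines, then spaces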
Vector databases comparison
Database  | Best For               | Notes
Chroma    | Local dev, prototyping | Runs in-process (no server). Free, open-source.
Pinecone  | Production, scale      | Fully managed SaaS. Best performance at scale.
Weaviate  | Hybrid search          | Combines vector + keyword search. Self-hosted or cloud.
pgvector  | Existing PostgreSQL    | Adds a vector column to Postgres. No extra infra.
Qdrant    | Open-source production | High performance, Rust-based, good Rust/Python clients.
6 Advanced — Fine-Tuning & Agents
When to fine-tune vs. prompt engineer
Use Prompt Engineering When…
  • You need fast iteration (no training time)
  • The base model is capable enough
  • Task variety is high
  • Data is limited (<1,000 examples)
  • Cost of fine-tuning isn’t justified
Fine-Tune When…
  • Consistent style/format is critical
  • Domain-specific knowledge is needed
  • Prompt alone can’t solve it reliably
  • You have 1,000+ quality examples
  • Lower latency / cost per call matters
LoRA fine-tuning (parameter-efficient)
Python — LoRA fine-tuning with PEFT + HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    load_in_4bit=True   # QLoRA: 4-bit quantisation
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # rank — smaller = fewer params to train
    lora_alpha=32,      # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 8,388,608 / 8,030,261,248 = 0.10%
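The block above only prepares the adapter; a heavily simplified sketch of the training step follows. Argument names differ across trl releases, and the dataset file, hyperparameters, and output directory are placeholders, so treat this as a shape rather than a recipe:

Training the LoRA adapter (sketch)
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # hypothetical file

trainer = SFTTrainer(
    model=model,                      # the PEFT-wrapped model from the previous block
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./llama3-lora",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
)
trainer.train()
model.save_pretrained("./llama3-lora")  # saves only the small adapter weights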
Building AI agents
Python — LangChain ReAct agent with tools
from langchain import hub
from langchain.agents import create_react_agent, AgentExecutor
from langchain_openai import ChatOpenAI
from langchain_community.tools import DuckDuckGoSearchRun, WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper

llm   = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [DuckDuckGoSearchRun(), WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())]

prompt   = hub.pull("hwchase17/react")   # standard ReAct prompt template from LangChain Hub
agent    = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

result = executor.invoke({"input": "What AI breakthroughs happened in 2025?"})
print(result["output"])
Advanced prompting patterns
🧐 Chain-of-Thought: Add “Think step by step” — dramatically improves reasoning on math and logic problems.
📸 Few-Shot: Include 2-5 input/output examples in your prompt — teaches the model exactly what format you want.
🏅 ReAct: Reason + Act; the model interleaves thinking and tool use. Foundation of most production agents.
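As one illustration of combining these patterns, the few-shot examples and the "think step by step" instruction below are invented for demonstration; the point is the structure, not the wording:

Few-shot + chain-of-thought in a messages list
from openai import OpenAI
client = OpenAI()

messages = [
    {"role": "system", "content": "You are a careful assistant. Think step by step."},
    # Few-shot: a couple of examples teach the exact output format
    {"role": "user", "content": "Classify sentiment: 'The update broke everything.'"},
    {"role": "assistant", "content": "Sentiment: negative"},
    {"role": "user", "content": "Classify sentiment: 'Setup took two minutes, flawless.'"},
    {"role": "assistant", "content": "Sentiment: positive"},
    # The real query follows the same pattern
    {"role": "user", "content": "Classify sentiment: 'It works, but the docs are thin.'"},
]
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages, temperature=0)
print(response.choices[0].message.content)   # replies in the same "Sentiment: ..." format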
7 Production & Deployment
🏭 Getting to production requires evaluation, guardrails, cost management, and observability — not just a working prototype.
Evaluation framework
Technique    | How It Works                                                                | When to Use
Human eval   | Raters score outputs on quality dimensions                                  | Ground truth for preference; expensive
LLM-as-judge | GPT-4 / Claude scores your model’s outputs                                  | Scalable; good correlation with human eval
RAGAS        | Automated RAG pipeline evaluation (faithfulness, relevance, context recall) | RAG systems specifically
Unit tests   | Assert specific outputs for known inputs                                    | Regression testing, critical paths
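A minimal LLM-as-judge sketch is shown below; the rubric, 1-5 scale, and JSON shape are arbitrary choices for illustration, and the question/answer pair is a made-up example:

LLM-as-judge scoring (sketch)
import json
from openai import OpenAI
client = OpenAI()

question         = "What is our refund window?"           # example inputs
candidate_answer = "Refunds are available for 14 days."

judge_prompt = f"""Rate the ANSWER to the QUESTION on a 1-5 scale for factual accuracy
and helpfulness. Respond with JSON only: {{"accuracy": int, "helpfulness": int, "reason": str}}

QUESTION: {question}
ANSWER: {candidate_answer}"""

result = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": judge_prompt}],
    response_format={"type": "json_object"},   # request valid JSON back
    temperature=0
)
scores = json.loads(result.choices[0].message.content)
print(scores["accuracy"], scores["helpfulness"])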
Cost optimisation tactics
  • Model selection — use GPT-4o-mini or Claude Haiku for simple tasks; reserve larger models for complex reasoning
  • Prompt compression — remove redundant instructions; use LLMLingua to compress prompts by 3-20×
  • Semantic caching — cache responses for semantically similar queries (GPTCache, Redis)
  • Batching — batch API calls where latency allows; 50% cheaper on OpenAI
  • Streaming — reduces perceived latency without reducing actual cost
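Semantic caching can be as simple as comparing the embedding of a new query against embeddings of queries already answered. A minimal in-memory sketch follows; the 0.95 threshold and model choices are arbitrary, and a production setup would use a proper store such as Redis or GPTCache:

Semantic cache (in-memory sketch)
import numpy as np
from openai import OpenAI
client = OpenAI()

cache = []   # list of (query_embedding, answer) pairs

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cached_answer(query: str, threshold: float = 0.95) -> str:
    q = embed(query)
    for vec, answer in cache:
        sim = np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec))
        if sim >= threshold:
            return answer            # cache hit: skip the LLM call entirely
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content
    cache.append((q, reply))
    return reply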
FastAPI deployment
Python — Production LLM API with streaming
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()   # async client so streaming doesn't block the event loop

@app.post("/chat")
async def chat(message: str):
    async def generate():
        stream = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": message}],
            stream=True
        )
        async for chunk in stream:
            delta = chunk.choices[0].delta.content or ""
            yield f"data: {delta}\n\n"   # Server-Sent Events framing

    return StreamingResponse(generate(), media_type="text/event-stream")
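To consume the stream from a client, assuming the app above is running locally with uvicorn on port 8000:

Calling the streaming endpoint
import requests

with requests.post("http://localhost:8000/chat",
                   params={"message": "Explain RAG briefly."}, stream=True) as r:
    for line in r.iter_lines(decode_unicode=True):
        if line and line.startswith("data: "):
            print(line[len("data: "):], end="", flush=True)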
8 Hands-On Projects

Build real things. Each project below is scoped to take you from zero to working code. Start with Beginner and work up.

💬 🟢 Simple Chatbot: Beginner. A chatbot with conversation memory. OpenAI API. Under 50 lines of Python. Start here.
📄 🟡 PDF Q&A System: Intermediate. Upload any PDF, ask questions about it using RAG. Full end-to-end with LangChain + Chroma.
🟢 Streaming Chat UI: Beginner. Real-time streaming responses with a web interface. Feels like ChatGPT.
🤝 🔴 Research Agent: Advanced. An autonomous agent that searches the web, reads papers, and writes reports.
🔧 🔴 Fine-Tuned Model: Advanced. Fine-tune Llama 3 on your own dataset using LoRA. Full training pipeline with PEFT.
🏭 🟡 Production LLM API: Intermediate. Deploy your LLM app as a production API with auth, rate-limiting, and monitoring.