Generative AI — Complete Guide
From what GenAI is and how LLMs work, through RAG systems, fine-tuning, agents, and production deployment — everything you need to build real GenAI applications.
Traditional AI classified inputs into categories or predicted a value. Generative AI produces new outputs that didn’t exist before: write an essay, generate a photo, complete code, synthesise a voice. The outputs look and feel human-created because these models have learned the statistical patterns of human-generated content at internet scale.
Traditional AI:
- Classifies or predicts from a fixed set of outputs
- “Is this email spam?” → Yes / No
- “What is the house price?” → $450,000
- Deterministic, bounded output space
- Easy to evaluate (accuracy, RMSE)

Generative AI:
- Generates novel outputs from open-ended inputs
- “Explain this concept simply” → full paragraph
- “Draw a sunset over mountains” → new image
- Probabilistic, unbounded output space
- Harder to evaluate (human feedback, LLM judges)
Large Language Models (LLMs) are neural networks trained on massive text datasets to predict the next token. That simple objective — given this text, what comes next? — gives rise to emergent abilities: reasoning, translation, coding, and more.
# Pre-training: predict next token on internet-scale text
# "The cat sat on the ___" -> model predicts "mat"
# Training data: CommonCrawl, books, GitHub, Wikipedia
# Scale: GPT-4 ~1 trillion tokens, Llama 3 ~15 trillion
# After pre-training: instruction tuning + RLHF
# Makes model helpful, harmless, and honest
LLMs don’t read characters or words — they read tokens. A token is roughly 3-4 characters or 0.75 words in English. Tokenisation splits text into these sub-word units using algorithms like Byte-Pair Encoding (BPE).
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
text = "Generative AI is transforming software."
tokens = enc.encode(text)
print(tokens) # [86230, 15592, 15592, ...]
print(len(tokens)) # 8 tokens (~6 words)
# Cost implication: GPT-4 charges per token
# 1M tokens ~ $30 input / $60 output (approx)
Embeddings are dense vector representations of tokens, words, or entire sentences in a high-dimensional space. Semantically similar concepts have nearby embeddings. This is how LLMs “understand” meaning and how RAG systems find relevant documents.
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
model="text-embedding-3-small",
input="How do neural networks learn?"
)
embedding = response.data[0].embedding
# Returns a list of 1536 floats
# Cosine similarity between two embeddings = semantic similarity
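As a quick illustration of that last comment, here is a minimal sketch that compares two embeddings with cosine similarity. The second sentence and the plain-Python helper are illustrative, and the snippet reuses the client created above.

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

e1 = client.embeddings.create(model="text-embedding-3-small",
                              input="How do neural networks learn?").data[0].embedding
e2 = client.embeddings.create(model="text-embedding-3-small",
                              input="What is backpropagation?").data[0].embedding
print(round(cosine_similarity(e1, e2), 3))  # closer to 1.0 = more similar meaning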
Before Transformers, NLP relied on RNNs that processed text token-by-token, making them slow and bad at long-range dependencies. Self-attention lets every token attend to every other token simultaneously — in one parallel operation.
# Each token is projected into Q (query), K (key), V (value)
Q = X @ W_q   # "What am I looking for?"
K = X @ W_k   # "What do I contain?"
V = X @ W_v   # "What information do I contribute?"

# Attention scores: how much each token should attend to each other
scores = Q @ K.T / sqrt(d_k)   # scaled dot-product
weights = softmax(scores)      # probabilities

# Attended representation = weighted sum of Values
output = weights @ V

# Multi-head: run H parallel heads, concatenate
# Each head learns different relationship types
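The same computation as a runnable sketch, using NumPy with random weights purely for illustration (in a real model, W_q, W_k, and W_v are learned during training):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq, seq) attention scores
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))        # 4 toy token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)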
Encoder-only models:
- Sees full input in both directions
- Great for understanding tasks
- Classification, NER, embeddings
- Examples: BERT, RoBERTa, sentence-transformers

Decoder-only models:
- Sees only tokens to the left (causal)
- Great for generation tasks
- Text generation, code, chat
- Examples: GPT-4, Claude, Llama, Gemini
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain RAG in 2 sentences."}
],
temperature=0.7, # 0 = deterministic, 2 = very random
max_tokens=200
)
print(response.choices[0].message.content)
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a helpful AI assistant."}]

def chat(user_input: str) -> str:
    history.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=history
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

# REPL
while True:
    user = input("You: ")
    if user.lower() in ("exit", "quit"):
        break
    print(f"AI: {chat(user)}")
Streaming makes your app feel instant by printing tokens as they arrive, rather than waiting for the full response.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain neural networks."}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""   # final chunk has no content
    print(delta, end="", flush=True)
A weak prompt:
- No role or context
- Vague instruction
- "Tell me about ML"
- Output format undefined
- Gets rambling, generic answers

A strong prompt (see the sketch after this list):
- System role: "You are a senior ML engineer"
- Specific task + audience + format
- Constraints: "in 3 bullet points, beginner level"
- Examples (few-shot) when needed
- Gets precise, structured, actionable answers
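Here is what the stronger version looks like in code; the task, constraints, and temperature are illustrative rather than prescriptive:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.3,
    messages=[
        {"role": "system", "content": "You are a senior ML engineer explaining concepts to beginners."},
        {"role": "user", "content": (
            "Explain gradient descent to a junior developer. "
            "Answer in exactly 3 bullet points, each under 20 words, no jargon."
        )},
    ],
)
print(response.choices[0].message.content)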
LLMs have a training cutoff and no access to your private data. RAG fixes this: at query time, retrieve the most relevant documents from your knowledge base and inject them into the prompt. The model then reasons over real, up-to-date information.
# INDEXING (run once)
docs = load_documents("./docs/") # PDF, web, DB
chunks = chunk_text(docs, size=500) # split into segments
embeds = embed(chunks) # vectors
store = VectorDB.upsert(chunks, embeds) # persist
# QUERYING (every request)
query = "What is our refund policy?"
query_embed = embed(query)
results = store.similarity_search(query_embed, top_k=5)
context = "
".join(r.text for r in results)
prompt = f"""Answer based only on the context below.
Context: {context}
Question: {query}"""
answer = llm.complete(prompt)
| Database | Best For | Notes |
|---|---|---|
| Chroma | Local dev, prototyping | Runs in-process (no server). Free, open-source. |
| Pinecone | Production, scale | Fully managed SaaS. Best performance at scale. |
| Weaviate | Hybrid search | Combines vector + keyword search. Self-hosted or cloud. |
| pgvector | Existing PostgreSQL | Adds vector column to Postgres. No extra infra. |
| Qdrant | Open-source production | High performance, Rust-based, good Rust/Python clients. |
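To make the pipeline above concrete, here is a minimal sketch using Chroma, the local-dev option from the table. It runs in-process, relies on Chroma's built-in default embedding function, and the two documents and the query are illustrative:

import chromadb

client = chromadb.Client()                 # in-memory; use PersistentClient(path=...) to persist
collection = client.get_or_create_collection("docs")

collection.add(
    ids=["1", "2"],
    documents=[
        "Refunds are issued within 14 days of purchase.",
        "Shipping takes 3-5 business days.",
    ],
)

results = collection.query(query_texts=["What is our refund policy?"], n_results=1)
print(results["documents"][0][0])          # most similar chunk for the first query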
Prefer prompting (or RAG) when:
- You need fast iteration (no training time)
- The base model is capable enough
- Task variety is high
- Data is limited (<1,000 examples)
- Cost of fine-tuning isn’t justified

Fine-tune when:
- Consistent style/format is critical
- Domain-specific knowledge is needed
- Prompt alone can’t solve it reliably
- You have 1,000+ quality examples
- Lower latency / cost per call matters
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3-8B",
load_in_4bit=True # QLoRA: 4-bit quantisation
)
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16, # rank — smaller = fewer params to train
lora_alpha=32, # scaling factor
target_modules=["q_proj", "v_proj"],
lora_dropout=0.1
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 8,388,608 / 8,030,261,248 = 0.10%
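From here, the LoRA-wrapped model can be trained with the SFTTrainer imported above. Treat the following as a rough sketch: the dataset path and hyperparameters are made up, the file is assumed to contain records with a "text" field, and the exact SFTTrainer keyword arguments vary across trl versions.

from datasets import load_dataset
from transformers import TrainingArguments

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # records like {"text": "..."}

trainer = SFTTrainer(
    model=model,                        # LoRA-wrapped model from above
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./llama3-lora",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
)
trainer.train()
model.save_pretrained("./llama3-lora")  # saves only the small LoRA adapter weights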
from langchain import hub
from langchain.agents import create_react_agent, AgentExecutor
from langchain_community.tools import DuckDuckGoSearchRun, WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [DuckDuckGoSearchRun(), WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())]

prompt = hub.pull("hwchase17/react")  # standard ReAct prompt template from LangChain Hub

agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

result = executor.invoke({"input": "What AI breakthroughs happened in 2025?"})
print(result["output"])
| Technique | How It Works | When to Use |
|---|---|---|
| Human eval | Raters score outputs on quality dimensions | Ground truth for preference; expensive |
| LLM-as-judge | GPT-4 / Claude scores your model’s outputs | Scalable; good correlation with human eval |
| RAGAS | Automated RAG pipeline evaluation (faithfulness, relevance, context recall) | RAG systems specifically |
| Unit tests | Assert specific outputs for known inputs | Regression testing, critical paths |
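As a sketch of the LLM-as-judge row, one model can grade another's answers against a rubric; the rubric wording and the 1-5 scale here are illustrative:

from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> str:
    rubric = (
        "Score the ANSWER to the QUESTION from 1-5 for factual accuracy and clarity. "
        "Reply with a single digit followed by a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"QUESTION: {question}\nANSWER: {answer}"},
        ],
    )
    return response.choices[0].message.content

print(judge("What does RAG stand for?", "Retrieval-Augmented Generation."))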
- Model selection — use GPT-4o-mini or Claude Haiku for simple tasks; reserve larger models for complex reasoning
- Prompt compression — remove redundant instructions; use LLMLingua to compress prompts by 3-20×
- Semantic caching — cache responses for semantically similar queries (GPTCache, Redis); a toy sketch follows this list
- Batching — batch API calls where latency allows; 50% cheaper on OpenAI
- Streaming — reduces perceived latency without reducing actual cost
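A toy sketch of the semantic-caching idea, using embeddings plus a similarity threshold. The 0.9 threshold and the helper functions are illustrative rather than taken from GPTCache or Redis:

import math
from openai import OpenAI

client = OpenAI()
cache: list[tuple[list[float], str]] = []   # (query embedding, cached answer)

def embed(text: str) -> list[float]:
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def cached_answer(query: str, threshold: float = 0.9) -> str:
    q = embed(query)
    for emb, answer in cache:
        if cosine(q, emb) >= threshold:
            return answer                   # cache hit: skip the LLM call entirely
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content
    cache.append((q, reply))
    return reply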
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.post("/chat")
async def chat(message: str):
    async def generate():
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": message}],
            stream=True
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content or ""
            yield f"data: {delta}\n\n"   # SSE events are terminated by a blank line
    return StreamingResponse(generate(), media_type="text/event-stream")
Build real things. Each project below is scoped to take you from zero to working code. Start with Beginner and work up.