Generative AI — Complete Guide
From what GenAI is and how LLMs work, through RAG systems, fine-tuning, agents, and production deployment — everything you need to build real GenAI applications.
Traditional AI classified inputs into categories or predicted a value. Generative AI produces new outputs that didn’t exist before: write an essay, generate a photo, complete code, synthesise a voice. The outputs look and feel human-created because these models have learned the statistical patterns of human-generated content at internet scale.
Traditional AI:
- Classifies or predicts from a fixed set of outputs
- “Is this email spam?” → Yes / No
- “What is the house price?” → $450,000
- Deterministic, bounded output space
- Easy to evaluate (accuracy, RMSE)

Generative AI:
- Generates novel outputs from open-ended inputs
- “Explain this concept simply” → full paragraph
- “Draw a sunset over mountains” → new image
- Probabilistic, unbounded output space
- Harder to evaluate (human feedback, LLM judges)
Large Language Models (LLMs) are neural networks trained on massive text datasets to predict the next token. That simple objective — given this text, what comes next? — gives rise to emergent abilities: reasoning, translation, coding, and more.
# Pre-training: predict next token on internet-scale text
# "The cat sat on the ___" -> model predicts "mat"
# Training data: CommonCrawl, books, GitHub, Wikipedia
# Scale: GPT-4 ~1 trillion tokens, Llama 3 ~15 trillion
# After pre-training: instruction tuning + RLHF
# Makes model helpful, harmless, and honest
LLMs don’t read characters or words — they read tokens. A token is roughly 3-4 characters or 0.75 words in English. Tokenisation splits text into these sub-word units using algorithms like Byte-Pair Encoding (BPE).
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
text = "Generative AI is transforming software."
tokens = enc.encode(text)
print(tokens) # [86230, 15592, 15592, ...]
print(len(tokens)) # 8 tokens (~6 words)
# Cost implication: GPT-4 charges per token
# 1M tokens ~ $30 input / $60 output (approx)
Embeddings are dense vector representations of tokens, words, or entire sentences in a high-dimensional space. Semantically similar concepts have nearby embeddings. This is how LLMs “understand” meaning and how RAG systems find relevant documents.
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
model="text-embedding-3-small",
input="How do neural networks learn?"
)
embedding = response.data[0].embedding
# Returns a list of 1536 floats
# Cosine similarity between two embeddings = semantic similarity
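As a quick illustration of that last comment, here is a minimal sketch that compares two embeddings with cosine similarity. The second sentence and the plain-Python helper are illustrative, and the snippet reuses the client created above.

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

e1 = client.embeddings.create(model="text-embedding-3-small",
                              input="How do neural networks learn?").data[0].embedding
e2 = client.embeddings.create(model="text-embedding-3-small",
                              input="What is backpropagation?").data[0].embedding
print(round(cosine_similarity(e1, e2), 3))  # closer to 1.0 = more similar meaning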
Before Transformers, NLP relied on RNNs that processed text token-by-token, making them slow and bad at long-range dependencies. Self-attention lets every token attend to every other token simultaneously — in one parallel operation.
# Each token is projected into Q (query), K (key), V (value)
Q = X @ W_q   # "What am I looking for?"
K = X @ W_k   # "What do I contain?"
V = X @ W_v   # "What information do I contribute?"

# Attention scores: how much each token should attend to each other
scores = Q @ K.T / sqrt(d_k)   # scaled dot-product
weights = softmax(scores)      # probabilities

# Attended representation = weighted sum of Values
output = weights @ V

# Multi-head: run H parallel heads, concatenate
# Each head learns different relationship types
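The same computation as a runnable sketch, using NumPy with random weights purely for illustration (in a real model, W_q, W_k, and W_v are learned during training):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq, seq) attention scores
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))        # 4 toy token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)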
Encoder-only models:
- Sees full input in both directions
- Great for understanding tasks
- Classification, NER, embeddings
- Examples: BERT, RoBERTa, sentence-transformers

Decoder-only models:
- Sees only tokens to the left (causal)
- Great for generation tasks
- Text generation, code, chat
- Examples: GPT-4, Claude, Llama, Gemini
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain RAG in 2 sentences."}
],
temperature=0.7, # 0 = deterministic, 2 = very random
max_tokens=200
)
print(response.choices[0].message.content)
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a helpful AI assistant."}]

def chat(user_input: str) -> str:
    history.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=history
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

# REPL
while True:
    user = input("You: ")
    if user.lower() in ("exit", "quit"):
        break
    print(f"AI: {chat(user)}")
Streaming makes your app feel instant by printing tokens as they arrive, rather than waiting for the full response.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain neural networks."}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""   # final chunk has no content
    print(delta, end="", flush=True)
A weak prompt:
- No role or context
- Vague instruction
- "Tell me about ML"
- Output format undefined
- Gets rambling, generic answers

A strong prompt (see the sketch after this list):
- System role: "You are a senior ML engineer"
- Specific task + audience + format
- Constraints: "in 3 bullet points, beginner level"
- Examples (few-shot) when needed
- Gets precise, structured, actionable answers
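Here is what the stronger version looks like in code; the task, constraints, and temperature are illustrative rather than prescriptive:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.3,
    messages=[
        {"role": "system", "content": "You are a senior ML engineer explaining concepts to beginners."},
        {"role": "user", "content": (
            "Explain gradient descent to a junior developer. "
            "Answer in exactly 3 bullet points, each under 20 words, no jargon."
        )},
    ],
)
print(response.choices[0].message.content)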
LLMs have a training cutoff and no access to your private data. RAG fixes this: at query time, retrieve the most relevant documents from your knowledge base and inject them into the prompt. The model then reasons over real, up-to-date information.
# INDEXING (run once)
docs = load_documents("./docs/") # PDF, web, DB
chunks = chunk_text(docs, size=500) # split into segments
embeds = embed(chunks) # vectors
store = VectorDB.upsert(chunks, embeds) # persist
# QUERYING (every request)
query = "What is our refund policy?"
query_embed = embed(query)
results = store.similarity_search(query_embed, top_k=5)
context = "
".join(r.text for r in results)
prompt = f"""Answer based only on the context below.
Context: {context}
Question: {query}"""
answer = llm.complete(prompt)
| Database | Best For | Notes |
|---|---|---|
| Chroma | Local dev, prototyping | Runs in-process (no server). Free, open-source. |
| Pinecone | Production, scale | Fully managed SaaS. Best performance at scale. |
| Weaviate | Hybrid search | Combines vector + keyword search. Self-hosted or cloud. |
| pgvector | Existing PostgreSQL | Adds vector column to Postgres. No extra infra. |
| Qdrant | Open-source production | High performance, Rust-based, good Rust/Python clients. |
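To make the pipeline above concrete, here is a minimal sketch using Chroma, the local-dev option from the table. It runs in-process, relies on Chroma's built-in default embedding function, and the two documents and the query are illustrative:

import chromadb

client = chromadb.Client()                 # in-memory; use PersistentClient(path=...) to persist
collection = client.get_or_create_collection("docs")

collection.add(
    ids=["1", "2"],
    documents=[
        "Refunds are issued within 14 days of purchase.",
        "Shipping takes 3-5 business days.",
    ],
)

results = collection.query(query_texts=["What is our refund policy?"], n_results=1)
print(results["documents"][0][0])          # most similar chunk for the first query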
Prefer prompting (or RAG) when:
- You need fast iteration (no training time)
- The base model is capable enough
- Task variety is high
- Data is limited (<1,000 examples)
- Cost of fine-tuning isn’t justified

Fine-tune when:
- Consistent style/format is critical
- Domain-specific knowledge is needed
- Prompt alone can’t solve it reliably
- You have 1,000+ quality examples
- Lower latency / cost per call matters
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3-8B",
load_in_4bit=True # QLoRA: 4-bit quantisation
)
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16, # rank — smaller = fewer params to train
lora_alpha=32, # scaling factor
target_modules=["q_proj", "v_proj"],
lora_dropout=0.1
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 8,388,608 / 8,030,261,248 = 0.10%
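From here, the LoRA-wrapped model can be trained with the SFTTrainer imported above. Treat the following as a rough sketch: the dataset path and hyperparameters are made up, the file is assumed to contain records with a "text" field, and the exact SFTTrainer keyword arguments vary across trl versions.

from datasets import load_dataset
from transformers import TrainingArguments

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # records like {"text": "..."}

trainer = SFTTrainer(
    model=model,                        # LoRA-wrapped model from above
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./llama3-lora",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
)
trainer.train()
model.save_pretrained("./llama3-lora")  # saves only the small LoRA adapter weights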
from langchain import hub
from langchain.agents import create_react_agent, AgentExecutor
from langchain_community.tools import DuckDuckGoSearchRun, WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [DuckDuckGoSearchRun(), WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())]

prompt = hub.pull("hwchase17/react")  # standard ReAct prompt template from LangChain Hub

agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

result = executor.invoke({"input": "What AI breakthroughs happened in 2025?"})
print(result["output"])
| Technique | How It Works | When to Use |
|---|---|---|
| Human eval | Raters score outputs on quality dimensions | Ground truth for preference; expensive |
| LLM-as-judge | GPT-4 / Claude scores your model’s outputs | Scalable; good correlation with human eval |
| RAGAS | Automated RAG pipeline evaluation (faithfulness, relevance, context recall) | RAG systems specifically |
| Unit tests | Assert specific outputs for known inputs | Regression testing, critical paths |
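As a sketch of the LLM-as-judge row, one model can grade another's answers against a rubric; the rubric wording and the 1-5 scale here are illustrative:

from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> str:
    rubric = (
        "Score the ANSWER to the QUESTION from 1-5 for factual accuracy and clarity. "
        "Reply with a single digit followed by a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"QUESTION: {question}\nANSWER: {answer}"},
        ],
    )
    return response.choices[0].message.content

print(judge("What does RAG stand for?", "Retrieval-Augmented Generation."))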
- Model selection — use GPT-4o-mini or Claude Haiku for simple tasks; reserve larger models for complex reasoning
- Prompt compression — remove redundant instructions; use LLMLingua to compress prompts by 3-20×
- Semantic caching — cache responses for semantically similar queries (GPTCache, Redis); a toy sketch follows this list
- Batching — batch API calls where latency allows; 50% cheaper on OpenAI
- Streaming — reduces perceived latency without reducing actual cost
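A toy sketch of the semantic-caching idea, using embeddings plus a similarity threshold. The 0.9 threshold and the helper functions are illustrative rather than taken from GPTCache or Redis:

import math
from openai import OpenAI

client = OpenAI()
cache: list[tuple[list[float], str]] = []   # (query embedding, cached answer)

def embed(text: str) -> list[float]:
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def cached_answer(query: str, threshold: float = 0.9) -> str:
    q = embed(query)
    for emb, answer in cache:
        if cosine(q, emb) >= threshold:
            return answer                   # cache hit: skip the LLM call entirely
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content
    cache.append((q, reply))
    return reply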
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.post("/chat")
async def chat(message: str):
    async def generate():
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": message}],
            stream=True
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content or ""
            yield f"data: {delta}\n\n"   # SSE events are terminated by a blank line
    return StreamingResponse(generate(), media_type="text/event-stream")
Build real things. Each project below is scoped to take you from zero to working code. Start with Beginner and work up.