LLM Foundations  ·  Intermediate

Large Language Models

From the Transformer architecture and training pipelines to model types, major providers, and production deployment. The complete LLM guide for engineers building in the AI era.

8 sections · ~40 min read · Intermediate
What you’ll learn
  1. What Is a Large Language Model?
  2. The Transformer Architecture
  3. How LLMs Are Trained
  4. Types of LLMs — Text, Code, Vision, Audio
  5. Major Providers & Models
  6. Key Concepts Every Engineer Must Know
  7. What LLMs Can Actually Do
  8. LLMs in Production — RAG, Fine-Tuning & Prompting
1 What Is a Large Language Model?

A Large Language Model (LLM) is a deep neural network trained on massive amounts of text to understand and generate human language. The "large" refers to both the volume of training data (trillions of tokens) and the number of model parameters (billions to trillions of weights).

LLMs learn by solving one deceptively simple task: predict the next token. Given "The capital of France is", the model learns to assign high probability to "Paris". Do this across trillions of examples and the model develops rich internal representations of language, facts, reasoning, and code.
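To make the objective concrete, here is a minimal sketch of next-token prediction using simple bigram counts. This is a deliberately tiny stand-in, not how an LLM works internally: a real model conditions on the whole preceding context with a neural network, while this toy only looks at the previous word. The corpus and function names are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the follow-up counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

# Toy corpus (hypothetical); whitespace splitting stands in for a real tokeniser.
corpus = [
    "the capital of France is Paris",
    "the capital of France is Paris",
    "the capital of Italy is Rome",
]
counts = train_bigram(corpus)
probs = next_token_probs(counts, "is")
print(probs)  # "Paris" gets 2/3 of the probability mass, "Rome" 1/3
```

Because the toy sees only one previous word, it cannot tell France from Italy. Conditioning on the full context is what lets a Transformer assign high probability to "Paris" specifically after "The capital of France is".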

💡 Key insight: LLMs don’t understand language the way humans do. They learn statistical patterns at massive scale — and those patterns capture an extraordinary amount of world knowledge.
Three things that make an LLM "large"
  • Parameters: The weights of the neural network. GPT-3 has 175B; modern frontier models are believed to range from hundreds of billions to trillions, though most labs no longer publish exact counts.
  • Training data: Trillions of tokens from web pages, books, code, and papers. Llama 3 was trained on 15 trillion tokens.
  • Compute: Thousands of GPUs running for weeks or months. Training GPT-4 is estimated to have cost $100M+.
What is a token?

Tokens are the basic unit the model processes: not characters, not words, but sub-word chunks. For example, "unhappiness" might be split into ["un", "happiness"], though the exact split depends on the tokeniser. Rough rule of thumb: 1 token ≈ 0.75 words in English. Each model has a fixed vocabulary, typically around 32K–200K tokens.
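As a rough illustration of sub-word splitting, the sketch below uses greedy longest-match against a tiny hand-written vocabulary. This is an assumption-laden toy: production tokenisers learn their vocabulary from data (for example with BPE merges), and the vocabulary and function names here are hypothetical.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of `word` into pieces from `vocab`."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first; fall back to a single character
        # so that unknown text can always be encoded.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def estimate_tokens(word_count):
    """Rule of thumb from the text: 1 token is about 0.75 English words."""
    return round(word_count / 0.75)

TOY_VOCAB = {"un", "happiness", "happy", "ness"}  # hypothetical vocabulary

print(tokenize("unhappiness", TOY_VOCAB))  # ['un', 'happiness']
print(estimate_tokens(750))                # 1000
```

The estimate is only a budgeting aid. Code, non-English text, and unusual words usually need more tokens per word.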

2 The Transformer Architecture

Every major LLM today is built on the Transformer, introduced in the landmark 2017 paper "Attention Is All You Need." Before Transformers, sequence models (RNNs, LSTMs) processed tokens one at a time, which made training hard to parallelise and slow to scale. Transformers process all tokens of a sequence in parallel during training using self-attention.

Decoder-only pipeline (GPT-style)
Input Text (raw string) → Tokeniser (BPE / WordPiece) → Embeddings (+ position encoding) → N × Transformer Blocks (attention + FFN) → Next Token (softmax)
Self-attention: the core innovation

Self-attention lets every token "look at" every other token simultaneously. For each token, the model computes Query (Q), Key (K), and Value (V) vectors, then calculates attention scores:

Self-Attention formula
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

# High attention score = "pay a lot of attention to this token"
# Captures long-range dependencies that RNNs could not handle
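The formula above can be sketched in plain Python for a single attention head. This is an illustrative sketch under simplifying assumptions: Q, K, and V are passed in directly as lists of vectors (a real model produces them with learned projection matrices), and the `causal` flag is an added option showing how decoder-only models hide future tokens.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    output = []
    for i, q in enumerate(Q):
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Mask out positions after i so a token cannot see the future.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output

Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V, causal=True)[0])  # first token only sees itself: [1.0, 2.0]
```

With `causal=True`, the first token can attend only to itself, so its output equals its own value vector. Later tokens mix in everything that came before them.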
The three Transformer variants
📋 Encoder-only (BERT-style)
  • Bidirectional — sees full context
  • Best for: classification, embeddings, NER
  • Examples: BERT, RoBERTa, DeBERTa
  • Not generative — outputs representations
📄 Decoder-only (GPT-style)
  • Causal — only sees past tokens
  • Best for: text generation, chat, code
  • Examples: GPT-4, Llama, Claude, Gemini
  • The dominant architecture today
🔁 Encoder-decoder (T5-style)
  • Encoder reads the full input; decoder generates output while attending to it
  • Best for: translation, summarisation
  • Examples: T5, BART, the original 2017 Transformer
  • Largely displaced by decoder-only models for general-purpose LLMs
🔭 Feed-Forward Network (FFN): After attention, each token passes through a two-layer feed-forward network. Interpretability research suggests much of the model's factual knowledge is stored here, and in a standard block the FFN holds roughly 2/3 of the parameters. Think of attention as "routing" and FFN as "memory."
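The "roughly 2/3" figure can be checked with a quick parameter count. The sketch assumes a standard block: four d×d attention matrices (query, key, value, output) and an FFN whose hidden layer is 4× the model width. Biases, layer norms, and embeddings are ignored, and variants such as gated FFNs or grouped-query attention shift the ratio somewhat.

```python
def block_params(d_model, ffn_mult=4):
    """Approximate weight counts for one Transformer block (assumed layout)."""
    attn = 4 * d_model * d_model               # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * (ffn_mult * d_model)   # up-projection + down-projection
    return attn, ffn

attn, ffn = block_params(4096)
share = ffn / (attn + ffn)
print(f"attention: {attn:,}  ffn: {ffn:,}  ffn share: {share:.3f}")
# ffn share is 8d^2 / (4d^2 + 8d^2) = 2/3, independent of d_model
```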
3 How LLMs Are Trained

Modern LLMs go through a multi-stage pipeline — each stage shaping different capabilities.

Stage 1 — Pre-training (weeks on thousands of GPUs)

Train on trillions of tokens (web, books, code, papers) using next-token prediction. The model learns language, facts, and reasoning. GPT-3 used 300B tokens; Llama 3 used 15T. This is by far the most expensive stage.

Stage 2 — Supervised Fine-Tuning (SFT)

Fine-tune on thousands of human-written <instruction, ideal response> pairs. Teaches the model to follow instructions and be helpful rather than just continuing text.

Stage 3 — RLHF / DPO

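Human raters compare pairs of model outputs and rank them. In classic RLHF, a Reward Model (RM) is trained on these preferences and the LLM is then optimised against it with RL (typically PPO). Direct Preference Optimisation (DPO) skips the separate reward model and trains the LLM on the preference pairs directly. Either way, the goal is responses that humans rate as more helpful, harmless, and honest.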

RLHF pipeline
1. Collect preference data: human picks A vs B response
2. Train Reward Model on rankings
3. Fine-tune LLM with PPO to maximise reward score
4. Repeat with updated preferences
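Step 2 of the pipeline above is often implemented with a pairwise (Bradley-Terry style) loss that pushes the reward of the preferred response above the rejected one. The text does not specify a loss, so treat this as one commonly used formulation rather than the method of any particular lab. The reward values are made-up scalars standing in for a reward model's outputs.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Written as softplus(-margin) in a numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical reward-model scores for (chosen, rejected) response pairs.
print(preference_loss(2.0, 0.0))  # small loss: the ranking is already correct
print(preference_loss(0.0, 0.0))  # log(2): the model cannot tell them apart
print(preference_loss(0.0, 2.0))  # large loss: the ranking is inverted
```

Minimising this loss over many human comparisons teaches the reward model to score responses the way raters would. The PPO step then uses those scores as its reward signal.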
Stage 4 — Constitutional AI / RLAIF

AI feedback replaces some human annotation. The model critiques and revises its own outputs against a set of principles (the "constitution"). Used in Claude. More scalable than pure human labelling at frontier model sizes.

Advanced training concepts
  • Chinchilla scaling laws: Optimal training uses ~20 tokens per parameter. A 7B model → ~140B tokens.
  • Mixture of Experts (MoE): Each token routes to a subset of "expert" FFN layers rather than all of them. Scales parameters without proportional compute increase. Used in Mixtral and DeepSeek-V3, and reportedly in GPT-4.
  • Mixed Precision (BF16): Compute in lower precision for speed; master weights stay in FP32 for stability. Roughly halves activation memory.
  • Gradient Checkpointing: Recompute activations during backward pass instead of storing them — trades compute for memory.
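The Chinchilla rule of thumb in the list above is simple enough to compute directly. The sketch also estimates training compute with the commonly quoted approximation of about 6 FLOPs per parameter per token. That approximation is an assumption added here, not something stated in the text.

```python
def chinchilla_tokens(params, tokens_per_param=20):
    """Compute-optimal token budget under the ~20 tokens/parameter rule."""
    return params * tokens_per_param

def training_flops(params, tokens):
    """Rough training compute, assuming ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

params = 7_000_000_000
tokens = chinchilla_tokens(params)
print(f"{tokens:,} tokens")                         # 140,000,000,000 tokens
print(f"{training_flops(params, tokens):.2e} FLOPs")  # 5.88e+21 FLOPs
```

In practice many models are trained well past this budget, because a smaller model trained on more tokens is cheaper to serve at inference time.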
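📈 Cost reality: GPT-4 training is estimated at $100M+. Even Llama 3 70B took about 6.4M H100 GPU-hours according to Meta's model card, which works out to millions of dollars at typical cloud prices. Most engineers instead apply fine-tuning (often under $1,000) or prompting (no training cost) to existing models rather than training from scratch.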
4 Types of LLMs — Text, Code, Vision, Audio & More

LLMs have evolved well beyond text generation. Modern models are categorised by their input/output modalities and primary purpose.

📄 Text LLMs

The foundation. Input and output is text. Handle conversation, summarisation, Q&A, translation, writing, and analysis. All other types build on this. Examples: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.

💻 Code LLMs

Trained heavily on code repositories and developer Q&A (GitHub, Stack Overflow). Understand syntax, APIs, logic, and debugging patterns. Code-capable models power tools such as GitHub Copilot, Cursor, and Replit. Examples: DeepSeek-Coder-V2, Code Llama, Codestral.

📷 Vision-Language Models (VLMs)

Accept images and text together as input. Describe images, answer visual questions, read charts, understand screenshots and documents. A visual encoder maps images into the same embedding space as text tokens. Examples: GPT-4V, Claude 3.5, Gemini 1.5, LLaVA, Qwen-VL.

🎤 Audio / Speech LLMs

Process or generate audio natively, not just text transcripts. Whisper does speech-to-text. GPT-4o handles audio end-to-end, with response latency averaging about 320 ms (as low as 232 ms) according to OpenAI's announcement. AudioCraft and MusicLM generate music. Real-time voice AI agents are becoming mainstream.

🌐 Multimodal LLMs

Handle combinations of text, images, audio, video, and documents in a single unified model. GPT-4o and Gemini 1.5 Pro accept text, images, and audio natively (Gemini also accepts video); Claude 3.5 accepts text and images. This is where all frontier models are heading.

🧠 Reasoning LLMs

Trained with extended chain-of-thought to solve hard maths, logic, and science problems. Models "think before they answer" — generating an internal scratchpad before responding. Trade speed for accuracy. Examples: OpenAI o1/o3, DeepSeek-R1, Gemini 2.0 Flash Thinking.

🔍 Embedding Models

Convert text into dense vector representations for semantic search, RAG pipelines, clustering, and classification. Not generative — they encode meaning into numbers. Essential for any RAG application. Examples: text-embedding-3, Cohere Embed v3, BGE-M3, E5-large.
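Semantic search over embeddings usually comes down to comparing vectors with cosine similarity. The sketch below uses hand-written 3-dimensional vectors purely for illustration. Real embeddings come from a model and typically have hundreds or thousands of dimensions, and the document names and numbers here are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=2):
    """Return the `top_k` document ids most similar to the query vector."""
    ranked = sorted(index, key=lambda doc: cosine(query_vec, index[doc]), reverse=True)
    return ranked[:top_k]

# Hypothetical pre-computed embeddings (toy values, not from a real model).
index = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "api rate limits": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "how do I get my money back?"
print(search(query, index))  # ['refund policy', 'shipping times']
```

A vector database does the same comparison at scale, using approximate nearest-neighbour indexes instead of sorting every document.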

💡 The convergence trend: Frontier models are converging on unified multimodal architectures. GPT-4o processes text, images, and audio in one model. Expect video understanding to be fully integrated by 2026–2027.
5 Major Providers & Models

The LLM landscape has consolidated around a handful of frontier labs, plus a vibrant open-source ecosystem.

Commercial providers
  • OpenAI — GPT-4o, o1, o3. Widest API ecosystem, best function calling, DALL-E 3, Whisper, Sora. Most mature production tooling.
  • Anthropic — Claude 3.5 Sonnet, Claude 3 Opus. 200K context window, exceptional coding and writing, Constitutional AI safety approach.
  • Google DeepMind — Gemini 1.5 Pro (1M+ context), Gemini 2.0 Flash. Best-in-class for very long context, video understanding, and Google Search integration.
  • Cohere — Command R+, Embed v3, Rerank 3. Enterprise RAG specialist — best-in-class embeddings and reranking for production search.
  • xAI — Grok-2, Grok-3. Real-time X/Twitter data access, strong reasoning capabilities.
Open-weight models (free to download and run)
  • Meta Llama 3.1 (8B / 70B / 405B) — Most widely used open model. 128K context. Free to fine-tune and deploy commercially under Meta's community licence, which adds conditions for very large services. The base for much of the open-weight ecosystem.
  • Mistral / Mixtral — Mistral Large 2, Mixtral 8x22B (MoE). Strong European alternative, efficient inference, partially open.
  • DeepSeek — DeepSeek-V3, DeepSeek-R1. Trained at a fraction of frontier model cost. R1 rivals OpenAI o1 on reasoning benchmarks. Fully open weights.
  • Qwen (Alibaba) — Qwen2.5 (0.5B–72B). Strong multilingual and coding performance. Excellent for Asian language tasks.
  • Gemma (Google) — Gemma 2 (2B–27B). Lightweight, safety-tuned, great for on-device and edge use cases.
Model comparison at a glance
Provider | Best model | Context | Open? | Best for
OpenAI | GPT-4o | 128K | No | All-round, tools, vision
Anthropic | Claude 3.5 Sonnet | 200K | No | Long docs, coding, safety
Google | Gemini 1.5 Pro | 1M+ | Partial | Very long context, video
Meta | Llama 3.1 405B | 128K | Yes | Self-hosting, fine-tuning
Mistral | Mistral Large 2 | 128K | Partial | EU compliance, efficient
DeepSeek | DeepSeek-R1 | 64K | Yes | Hard reasoning, maths
6 Key Concepts Every Engineer Must Know
Context window

The maximum tokens the model can process at once (input + output combined). GPT-4o: 128K tokens. Claude 3.5: 200K. Gemini 1.5 Pro: 1M+. Larger context = process entire codebases, legal documents, or long conversations without losing earlier information.

Temperature & sampling

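Temperature controls randomness: 0 = greedy decoding (effectively deterministic, though hosted APIs can still show small variations), 1 = sampling from the model's unmodified distribution, >1 = more diverse but less predictable. Top-p (nucleus sampling) samples only from the smallest set of most-likely tokens whose cumulative probability reaches p. Use 0–0.2 for code or factual tasks; 0.7–1.0 for creative writing.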
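Both knobs are small transformations applied to the model's output scores before a token is drawn. The following is a minimal sketch, assuming raw logits are available as a list of floats. Function names are illustrative, and real inference stacks apply the same ideas on tensors.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature. Lower values sharpen the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalised

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick a token id. Temperature 0 is treated as greedy (argmax) decoding."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    dist = top_p_filter(apply_temperature(logits, temperature), top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids])[0]

logits = [1.0, 3.0, 2.0]
print(sample(logits, temperature=0))        # 1 (index of the largest logit)
print(top_p_filter([0.5, 0.3, 0.2], 0.7))   # keeps token ids 0 and 1 only
```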

Hallucination

When the model generates confident-sounding but false information. Root cause: next-token prediction optimises for fluency, not factual accuracy. The model always outputs something plausible-looking — even when it doesn’t know the answer. Mitigated with RAG (ground the answer in retrieved documents) and citations.

Quantisation

Reduce weight precision (FP32 → FP16 → INT4) to shrink memory footprint. A 70B model in FP16 needs about 140GB of memory for the weights alone; in 4-bit (GGUF/AWQ) only ~35GB, which is within reach of a pair of 24GB consumer GPUs or a high-memory workstation. Slight quality loss, major accessibility gain for self-hosted deployments.
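The memory figures follow from simple arithmetic: parameters × bits per weight ÷ 8 gives bytes. This sketch counts weights only. It ignores the KV cache, activations, and the small overhead of quantisation scales, so real usage is somewhat higher.

```python
def weight_memory_gb(params, bits_per_weight):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

params_70b = 70_000_000_000
for name, bits in [("FP32", 32), ("FP16", 16), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(params_70b, bits):.0f} GB")
# FP32: 280 GB
# FP16: 140 GB
# INT4: 35 GB
```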

KV Cache

During autoregressive generation, stores computed key/value attention pairs for past tokens so they don’t need recomputing. Dramatically reduces inference latency for long contexts. Essential for any production LLM system.
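A quick count shows why the cache matters, and what it costs in memory. The sketch tallies how many per-layer key/value projections are needed with and without a cache, then sizes the cache with the standard formula (keys plus values, per layer, per head, per position). The model dimensions in the example are hypothetical.

```python
def kv_projections(prompt_len, new_tokens, use_cache):
    """Count per-layer key/value projections needed to generate `new_tokens` tokens."""
    total = 0
    for step in range(new_tokens):
        seq_len = prompt_len + step  # tokens visible at this decoding step
        if use_cache:
            # First step projects the whole prompt; later steps only the newest token.
            total += prompt_len if step == 0 else 1
        else:
            # Without a cache every visible token is re-projected at every step.
            total += seq_len
    return total

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Cache size: 2 tensors (K and V) per layer, one vector per head per position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

print(kv_projections(1000, 100, use_cache=False))  # 104950
print(kv_projections(1000, 100, use_cache=True))   # 1099

# Hypothetical config: 32 layers, 8 KV heads of size 128, 8192-token context, FP16.
print(kv_cache_bytes(32, 8, 128, 8192) / 2**30, "GiB")  # 1.0 GiB
```

The trade-off is time for memory: the cache removes redundant work but grows linearly with context length and with the number of concurrent requests.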

Tool / Function Calling

The model emits structured JSON to trigger external tools (web search, calculator, database query, API call). The application runs the tool and returns results to the model, which incorporates them into its response. This is the foundation of all agentic AI — the model gains the ability to act, not just generate text.
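On the application side, the loop described above amounts to parsing the model's JSON, dispatching to a registered function, and sending the result back. The sketch below assumes a simple `{"name": ..., "arguments": {...}}` shape. Each provider defines its own schema, so the field names, the tool registry, and the stub tools here are all hypothetical.

```python
import json

def add(a, b):
    """Example tool: add two numbers."""
    return a + b

def get_weather(city):
    """Example tool: stubbed lookup (a real one would call a weather API)."""
    return {"city": city, "forecast": "sunny"}

# Registry of tools the application is willing to run on the model's behalf.
TOOLS = {"add": add, "get_weather": get_weather}

def run_tool_call(model_output):
    """Parse a JSON tool call emitted by the model and return a JSON result string."""
    call = json.loads(model_output)
    name = call["name"]
    args = call.get("arguments", {})
    if name not in TOOLS:
        # Never execute arbitrary names: only registered tools are allowed.
        return json.dumps({"error": f"unknown tool: {name}"})
    result = TOOLS[name](**args)
    return json.dumps({"tool": name, "result": result})

# Pretend the model emitted this string in response to "What is 2 + 3?"
model_output = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
print(run_tool_call(model_output))  # {"tool": "add", "result": 5}
```

The returned string would be appended to the conversation as a tool result, and the model would then write its final answer using it. Validating arguments before execution is advisable, since the model's JSON is untrusted input.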

Common trap: Setting temperature too high. Beyond ~1.2, outputs become incoherent. For most production use: temperature 0.7 as a sensible default; 0 for tasks requiring strict consistency like JSON extraction or code generation.
7 What LLMs Can Actually Do

LLMs are general-purpose reasoning engines. Their applications span nearly every industry and function.

Developer & engineering
  • Code generation & review: Write, debug, refactor, explain, and document code. GitHub Copilot, Cursor, Replit AI.
  • Test generation: Automatically write unit tests from function signatures and docstrings.
  • AI agents: Autonomous task execution — browse the web, run code, call APIs, manage files, book meetings.
  • Documentation: Generate API docs, README files, and technical specs directly from code.
Enterprise & business
  • Enterprise search (RAG): Q&A over internal documents, PDFs, and knowledge bases. Answer questions from company data, not just training data.
  • Document processing: Extract, classify, summarise contracts, invoices, medical notes, and legal filings at scale.
  • Customer support: 24/7 AI agents handling tier-1 support, routing and escalating complex cases to humans.
  • Data analysis: Interpret data, generate SQL queries, explain trends, and build dashboard narratives.
Healthcare, legal & finance
  • Clinical notes: Transcribe and structure physician notes, reducing documentation time by 50%+.
  • Contract review: Flag risk clauses, compare against standard terms, summarise key provisions in seconds.
  • Compliance: Screen communications, generate regulatory reports, monitor for policy violations.
  • Financial analysis: Summarise earnings calls, extract KPIs from reports, generate market commentary.
8 LLMs in Production — RAG, Fine-Tuning & Prompting

Most engineers don’t train LLMs from scratch. They adapt existing models using one of three strategies. The right choice depends on your specific problem.

Prompt Engineering (zero cost — always try first)

System prompts, few-shot examples, chain-of-thought instructions, output format schemas. No training, no infrastructure. Often solves 80% of problems.

Few-shot prompting example
System: You are a JSON extractor. Output only valid JSON.

User: Extract name and email.
Text: "Hi, I'm Sarah (sarah@company.com), please follow up."
Assistant: {"name": "Sarah", "email": "sarah@company.com"}

# No training needed. No infrastructure. Solves format/tone/structure problems instantly.
RAG — Retrieval-Augmented Generation (low cost)

Inject relevant context from a vector database into the prompt at inference time. Keeps the model knowledge current without retraining. Best for knowledge-base Q&A, private document search, and factual grounding.
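The retrieve-then-prompt flow can be sketched without any infrastructure. To stay self-contained, this example ranks documents by word overlap with the question. That is a stand-in assumption: a production system would embed the text and query a vector database. The documents, prompt wording, and function names are illustrative.

```python
import re

def words(text):
    """Lowercased set of alphanumeric words in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=2):
    """Rank documents by word overlap with the question (stand-in for vector search)."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents, top_k=2):
    """Inject the retrieved passages into the prompt, numbered for citation."""
    passages = retrieve(question, documents, top_k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below. Cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

DOCS = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
question = "How many days do refunds take?"
print(build_prompt(question, DOCS))
```

The resulting string is what gets sent to the LLM. Because the knowledge lives in `DOCS` rather than in the weights, updating an answer means updating a document, not retraining.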

Fine-Tuning with LoRA / QLoRA (medium cost)

Train the model on domain-specific data to change its behaviour, style, or domain knowledge. LoRA fine-tunes only small adapter layers — practical on a single consumer GPU. Best for consistent output format, domain adaptation, or task specialisation.
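The reason LoRA is cheap is visible in a parameter count. Instead of updating a full d_out × d_in weight matrix, LoRA trains two low-rank factors of shapes d_out × r and r × d_in. The layer size and rank below are example values chosen for illustration.

```python
def lora_params(d_out, d_in, rank):
    """Trainable weights: full fine-tuning vs. a rank-`rank` LoRA adapter."""
    full = d_out * d_in              # every entry of W is updated
    adapter = rank * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return full, adapter

full, adapter = lora_params(4096, 4096, rank=8)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
# full: 16,777,216  adapter: 65,536  ratio: 0.39%
```

Fewer trainable weights means far less optimiser state and gradient memory. QLoRA goes further by keeping the frozen base weights in 4-bit precision.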

When to use RAG
  • Private docs that change frequently
  • Citations and source attribution needed
  • Knowledge cutoff is a problem
  • No GPU, no training budget
When to fine-tune
  • Consistent output format or style
  • Niche domain with specific vocabulary
  • Latency matters (no retrieval step)
  • Prompting alone cannot solve the task
📚 The decision hierarchy: Start with prompting. If that fails, add RAG. If RAG is not enough, add fine-tuning. Only train from scratch if no existing base model covers your domain.