Large Language Models
From the Transformer architecture and training pipelines to model types, major providers, and production deployment. The complete LLM guide for engineers building in the AI era.
A Large Language Model (LLM) is a deep neural network trained on massive amounts of text to understand and generate human language. The "large" refers to both the volume of training data (trillions of tokens) and the number of model parameters (billions to trillions of weights).
LLMs learn by solving one deceptively simple task: predict the next token. Given "The capital of France is", the model learns to assign high probability to "Paris". Do this across trillions of examples and the model develops rich internal representations of language, facts, reasoning, and code.
- Parameters: The weights of the neural network. GPT-3 has 175B; modern frontier models range from hundreds of billions to trillions.
- Training data: Trillions of tokens from web pages, books, code, and papers. Llama 3 was trained on 15 trillion tokens.
- Compute: Thousands of GPUs running for weeks or months. Training GPT-4 is estimated to have cost $100M+.
Tokens are the basic unit the model processes — not characters, not words, but sub-word chunks. "unhappiness" typically becomes ["un", "happiness"]. Rough rule of thumb: 1 token ≈ 0.75 words in English. Every LLM has a fixed vocabulary of 32K–200K tokens.
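To make the sub-word idea concrete, here is a minimal sketch of greedy longest-match splitting against a tiny, hypothetical vocabulary. Real tokenizers such as BPE and WordPiece learn their vocabulary from data and apply their own merge rules, so the `VOCAB` set and the `tokenize` and `estimate_tokens` helpers below are illustrative assumptions, not any library's API.

```python
# Toy vocabulary (hypothetical); real vocabularies hold 32K-200K entries.
VOCAB = {"un", "happiness", "happy", "ness"}


def tokenize(word: str, vocab: set = VOCAB) -> list:
    """Split a word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[start:]!r}")
    return tokens


def estimate_tokens(word_count: int) -> int:
    """Rule of thumb from the text: 1 token is roughly 0.75 English words."""
    return round(word_count / 0.75)


pieces = tokenize("unhappiness")
budget = estimate_tokens(750)
```

With this vocabulary, `pieces` holds the two sub-word chunks from the example above, and a 750-word document is estimated at about 1,000 tokens.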
Every major LLM today is built on the Transformer, introduced in the landmark 2017 paper "Attention Is All You Need." Before Transformers, sequence models (RNNs, LSTMs) processed tokens one at a time — too slow to scale. Transformers process all tokens in parallel using self-attention.
Pipeline stages: raw string → tokenization (BPE / WordPiece) → embedding + position encoding → attention + FFN layers → softmax.
Self-attention lets every token "look at" every other token simultaneously. For each token, the model computes Query (Q), Key (K), and Value (V) vectors, then calculates attention scores:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
# High attention score = "pay a lot of attention to this token"
# Captures long-range dependencies that RNNs could not handle
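The formula above can be sketched for a single attention head in plain Python. This is a minimal illustration using nested lists rather than a tensor library; the three-token Q, K, and V matrices are made-up numbers, and the function names are assumptions chosen for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q and K are lists of n vectors of length d_k; V is a list of n vectors.
    Returns (output, weights); weights[i][j] is how much token i attends to token j.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    scores = [
        [sum(q * k for q, k in zip(q_vec, k_vec)) / scale for k_vec in K]
        for q_vec in Q
    ]
    weights = [softmax(row) for row in scores]
    d_v = len(V[0])
    output = [
        [sum(w * v_vec[c] for w, v_vec in zip(w_row, V)) for c in range(d_v)]
        for w_row in weights
    ]
    return output, weights


# Three tokens with 2-dimensional Q/K/V vectors (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, and each row of `output` is the corresponding weighted average of the value vectors.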
Encoder-only models:
- Bidirectional: sees full context
- Best for: classification, embeddings, NER
- Examples: BERT, RoBERTa, DeBERTa
- Not generative: outputs representations

Decoder-only models:
- Causal: only sees past tokens
- Best for: text generation, chat, code
- Examples: GPT-4, Llama, Claude, Gemini
- The dominant architecture today
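The difference between the bidirectional and causal styles listed above comes down to which positions each token is allowed to attend to. A minimal sketch, where `visible_positions` is a hypothetical helper written for illustration:

```python
def visible_positions(seq_len: int, causal: bool) -> list:
    """For each token position, list the positions it may attend to."""
    if causal:
        # Decoder-style: token i sees positions 0..i (itself and the past).
        return [list(range(i + 1)) for i in range(seq_len)]
    # Encoder-style: every token sees the whole sequence.
    return [list(range(seq_len)) for _ in range(seq_len)]


causal_view = visible_positions(3, causal=True)
bidirectional_view = visible_positions(3, causal=False)
```

In a real model the same rule is usually applied as a mask on the attention scores before the softmax, so that disallowed positions receive zero weight.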
Modern LLMs go through a multi-stage pipeline — each stage shaping different capabilities.
Pre-training: Train on trillions of tokens (web, books, code, papers) using next-token prediction. The model learns language, facts, and reasoning. GPT-3 used 300B tokens; Llama 3 used 15T. This is by far the most expensive stage.
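Next-token prediction is typically trained with a cross-entropy loss: the model is penalised by the negative log-probability it assigned to the token that actually came next. The sketch below computes that loss for a single position, using a hypothetical four-token vocabulary and made-up logits.

```python
import math


def next_token_loss(logits: list, target_index: int) -> float:
    """Cross-entropy for one prediction: -log softmax(logits)[target_index]."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


# Hypothetical vocabulary and logits for the prompt "The capital of France is".
vocab = ["Paris", "London", "banana", "the"]
target = vocab.index("Paris")

confident = next_token_loss([5.0, 1.0, -2.0, 0.0], target)  # model favours "Paris"
uncertain = next_token_loss([0.0, 0.0, 0.0, 0.0], target)   # uniform guess, loss = ln(4)
```

The loss shrinks as the model puts more probability on the correct token, which is the signal that gradient descent follows over the whole training corpus.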
Supervised fine-tuning (SFT): Fine-tune on thousands of human-written <instruction, ideal response> pairs. This teaches the model to follow instructions and be helpful rather than just continuing text.
Reinforcement learning from human feedback (RLHF): Human raters compare pairs of model outputs and rank them. A Reward Model (RM) is trained on these preferences, and the LLM is then optimised with RL (PPO) to produce responses humans rate as more helpful, harmless, and honest. Direct Preference Optimisation (DPO) is an alternative that optimises the LLM on the preference pairs directly, without a separate reward model.
1. Collect preference data: a human picks response A or B
2. Train the Reward Model on the rankings
3. Fine-tune the LLM with PPO to maximise the reward score
4. Repeat with updated preferences
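Step 2 above is commonly implemented with a pairwise (Bradley-Terry style) loss that pushes the reward of the preferred response above the reward of the rejected one. The sketch below shows that loss for one comparison; the scalar rewards are made-up numbers, and the source does not say which exact loss any particular lab uses.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


agrees_with_human = preference_loss(2.0, 0.0)     # reward model ranks the pair correctly
disagrees_with_human = preference_loss(0.0, 2.0)  # reward model ranks the pair wrongly
```

The loss is small when the reward model already agrees with the human ranking and grows as it disagrees.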
Constitutional AI: AI feedback replaces some human annotation. The model critiques and revises its own outputs against a set of principles (the "constitution"). Used in Claude. More scalable than pure human labelling at frontier model sizes.
- Chinchilla scaling laws: Optimal training uses ~20 tokens per parameter. A 7B model → ~140B tokens.
- Mixture of Experts (MoE): Each token routes to a subset of "expert" FFN layers rather than all of them, which scales parameter count without a proportional increase in compute. Used in Mixtral and DeepSeek-V3, and reportedly in GPT-4 (OpenAI has not confirmed its architecture).
- Mixed Precision (BF16): Train in lower precision for speed; master weights stay in FP32 for stability. Reduces memory ~2x.
- Gradient Checkpointing: Recompute activations during backward pass instead of storing them — trades compute for memory.
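Two items from the list above lend themselves to a quick sketch: the Chinchilla token budget and MoE routing. The routing function shows one common variant (pick the top-k experts by router score, then renormalise with a softmax over just those k); real implementations differ in detail, and the function names and router logits here are assumptions for illustration.

```python
import math


def chinchilla_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token budget (~20 tokens per parameter)."""
    return params * tokens_per_param


def route_top_k(router_logits: list, k: int = 2) -> dict:
    """Return {expert_index: weight} for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


budget_7b = chinchilla_tokens(7e9)                # about 140B tokens for a 7B model
experts = route_top_k([0.1, 2.0, -1.0, 2.0], k=2)  # only 2 of 4 experts run for this token
```

Because only the selected experts execute for a given token, total parameter count can grow with the number of experts while per-token compute stays roughly constant.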
LLMs have evolved well beyond text generation. Modern models are categorised by their input/output modalities and primary purpose.
Text LLMs: The foundation. Input and output are both text. They handle conversation, summarisation, Q&A, translation, writing, and analysis. All other types build on this. Examples: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.
Code LLMs: Trained heavily on code repositories (GitHub, StackOverflow). They understand syntax, APIs, logic, and debugging patterns, and power tools such as GitHub Copilot, Cursor, and Replit. Examples: DeepSeek-Coder-V2, Code Llama, Codestral.
Vision-language models: Accept images and text together as input. They describe images, answer visual questions, read charts, and understand screenshots and documents. A visual encoder maps images into the same embedding space as text tokens. Examples: GPT-4V, Claude 3.5, Gemini 1.5, LLaVA, Qwen-VL.
Audio models: Process or generate audio natively, not just text transcripts. Whisper does speech-to-text. GPT-4o handles audio end-to-end with response latency in the low hundreds of milliseconds. AudioCraft and MusicLM generate music. Real-time voice AI agents are becoming mainstream.
Multimodal models: Handle combinations of text, images, audio, video, and documents in a single unified model. GPT-4o, Gemini 1.5 Pro, and Claude 3.5 process mixed inputs natively. This is where all frontier models are heading.
Reasoning models: Trained with extended chain-of-thought to solve hard maths, logic, and science problems. These models "think before they answer", generating an internal scratchpad before responding, and trade speed for accuracy. Examples: OpenAI o1/o3, DeepSeek-R1, Gemini 2.0 Flash Thinking.
Embedding models: Convert text into dense vector representations for semantic search, RAG pipelines, clustering, and classification. They are not generative; they encode meaning into numbers. Essential for any RAG application. Examples: text-embedding-3, Cohere Embed v3, BGE-M3, E5-large.
The LLM landscape has consolidated around a handful of frontier labs, plus a vibrant open-source ecosystem.
Closed (API-based) providers:
- OpenAI: GPT-4o, o1, o3. Widest API ecosystem, strong function calling, DALL-E 3, Whisper, Sora. Most mature production tooling.
- Anthropic: Claude 3.5 Sonnet, Claude 3 Opus. 200K context window, exceptional coding and writing, Constitutional AI safety approach.
- Google DeepMind: Gemini 1.5 Pro (1M+ context), Gemini 2.0 Flash. Best-in-class for very long context, video understanding, and Google Search integration.
- Cohere: Command R+, Embed v3, Rerank 3. Enterprise RAG specialist with strong embeddings and reranking for production search.
- xAI: Grok-2, Grok-3. Real-time X/Twitter data access, strong reasoning capabilities.

Open-weight models:
- Meta Llama 3.1 (8B / 70B / 405B): One of the most widely used open-weight model families. 128K context. Free to fine-tune and deploy commercially under the Llama community licence, which carries some usage conditions. Underpins much of the open-source ecosystem.
- Mistral / Mixtral: Mistral Large 2, Mixtral 8x22B (MoE). Strong European alternative, efficient inference, partially open.
- DeepSeek: DeepSeek-V3, DeepSeek-R1. Trained at a fraction of frontier model cost. R1 rivals OpenAI o1 on reasoning benchmarks. Fully open weights.
- Qwen (Alibaba): Qwen2.5 (7B–72B). Strong multilingual and coding performance. Excellent for Asian language tasks.
- Gemma (Google): Gemma 2 (2B–27B). Lightweight, safety-tuned, well suited to on-device and edge use cases.
| Provider | Best model | Context | Open? | Best for |
|---|---|---|---|---|
| OpenAI | GPT-4o | 128K | No | All-round, tools, vision |
| Anthropic | Claude 3.5 Sonnet | 200K | No | Long docs, coding, safety |
| Google | Gemini 1.5 Pro | 1M+ | Partial | Very long context, video |
| Meta | Llama 3.1 405B | 128K | Yes | Self-hosting, fine-tuning |
| Mistral | Mistral Large 2 | 128K | Partial | EU compliance, efficient |
| DeepSeek | DeepSeek-R1 | 64K | Yes | Hard reasoning, maths |
Context window: The maximum number of tokens the model can process at once (input + output combined). GPT-4o: 128K tokens. Claude 3.5: 200K. Gemini 1.5 Pro: 1M+. A larger context lets the model process entire codebases, legal documents, or long conversations without losing earlier information.
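Because input and output share the same budget, applications usually check the total before sending a request and drop old conversation turns when it does not fit. A minimal sketch, assuming per-message token counts are already known; `fits_context` and `trim_history` are hypothetical helpers, not part of any provider SDK.

```python
def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved reply budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def trim_history(message_tokens: list, max_output_tokens: int, context_window: int) -> list:
    """Drop the oldest messages until the rest plus the reply budget fit."""
    kept = list(message_tokens)
    while kept and sum(kept) + max_output_tokens > context_window:
        kept.pop(0)  # oldest message first
    return kept


# Token counts for three conversation turns, oldest first (made-up numbers).
history = [50_000, 60_000, 30_000]
trimmed = trim_history(history, max_output_tokens=4_000, context_window=128_000)
```

Dropping the oldest turns is the simplest policy; summarising them instead is a common alternative when earlier context still matters.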
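Temperature and top-p: Temperature controls randomness: 0 = greedy decoding (effectively the same output every time), 1 = sampling from the model's unmodified distribution, >1 = more creative but less predictable. Top-p (nucleus sampling) samples only from the smallest set of most likely tokens whose cumulative probability reaches p. Use a temperature of 0–0.2 for code or factual tasks and 0.7–1.0 for creative writing.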
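The two sampling controls can be sketched directly on a vector of logits. This is a minimal illustration of temperature scaling followed by nucleus (top-p) filtering; the function names and example logits are assumptions, and production inference engines may apply the steps in a different order or combine them with other filters.

```python
import math
import random


def sampling_distribution(logits, temperature=1.0, top_p=1.0):
    """Return {token_index: probability} after temperature and top-p filtering."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the top token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in sorted(range(len(probs)), key=lambda i: probs[i], reverse=True):
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one token index from the filtered distribution."""
    dist = sampling_distribution(logits, temperature, top_p)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
greedy = sampling_distribution(logits, temperature=0)
nucleus = sampling_distribution(logits, temperature=1.0, top_p=0.9)
```

Lower temperatures sharpen the distribution toward the top token, while a lower `top_p` removes the unlikely tail entirely before sampling.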
Hallucination: The model generates confident-sounding but false information. Root cause: next-token prediction optimises for fluency, not factual accuracy. The model always outputs something plausible-looking, even when it doesn’t know the answer. Mitigated with RAG (grounding the answer in retrieved documents) and citations.
Quantization: Reduce weight precision (FP32 → FP16 → INT4) to shrink the memory footprint. A 70B model in FP16 needs about 140GB of VRAM for its weights (70B × 2 bytes); in 4-bit formats (GGUF/AWQ) it needs only ~35GB, which brings it within reach of high-end consumer or workstation hardware. Slight quality loss, major accessibility gain for self-hosted deployments.
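The memory figures above follow from simple arithmetic: parameter count times bytes per weight. The sketch below reproduces them; it counts weights only and ignores the KV cache, activations, and the small per-block overhead that real quantized formats add for scales and zero points.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9


fp16_gb = weight_memory_gb(70e9, "fp16")
int4_gb = weight_memory_gb(70e9, "int4")
```

For a 70B model this gives 140 GB at FP16 and 35 GB at 4-bit, matching the figures in the text.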
KV cache: During autoregressive generation, the KV cache stores the computed key/value attention pairs for past tokens so they don’t need recomputing at every step. This dramatically reduces inference latency for long contexts and is essential for any production LLM system.
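The saving can be made concrete by counting how many per-token key/value computations a single layer performs during generation, with and without a cache. This is a counting sketch only (no actual attention is computed), and `kv_computations` is a hypothetical helper.

```python
def kv_computations(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count per-token key/value computations for one layer during generation."""
    total = 0
    cached = 0            # positions whose K/V are already stored
    seq_len = prompt_len  # tokens the model must attend over at this step
    for _ in range(new_tokens):
        if use_cache:
            total += seq_len - cached  # only positions not yet in the cache
            cached = seq_len
        else:
            total += seq_len           # recompute K/V for the whole sequence
        seq_len += 1                   # the generated token joins the sequence
    return total


with_cache = kv_computations(prompt_len=100, new_tokens=50, use_cache=True)
without_cache = kv_computations(prompt_len=100, new_tokens=50, use_cache=False)
```

With the cache the work grows linearly with sequence length; without it the work grows quadratically. The price is memory, since the stored keys and values grow with every token in the context.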
Function calling (tool use): The model emits structured JSON to trigger external tools (web search, calculator, database query, API call). The application runs the tool and returns the result to the model, which incorporates it into its response. This is the foundation of agentic AI: the model gains the ability to act, not just generate text.
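The application side of this loop can be sketched as a small dispatcher: parse the JSON the model emitted, look up the named tool, run it, and return the result as a message for the model. The `{"name": ..., "arguments": ...}` shape, the two tools, and `run_tool_call` are assumptions for illustration; each provider defines its own exact schema.

```python
import json


# Hypothetical tools the application exposes to the model.
def add(a: float, b: float) -> float:
    return a + b


def word_count(text: str) -> int:
    return len(text.split())


TOOLS = {"add": add, "word_count": word_count}


def run_tool_call(model_output: str) -> str:
    """Parse a JSON tool call, run the tool, and return a JSON result message."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        # Report the problem back to the model instead of crashing.
        return json.dumps({"tool": name, "error": "unknown tool"})
    result = TOOLS[name](**call["arguments"])
    return json.dumps({"tool": name, "result": result})


# A made-up model response requesting a tool call.
reply = run_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}')
unknown = run_tool_call('{"name": "launch", "arguments": {}}')
```

The returned string is appended to the conversation so the model can use the result in its next turn. Restricting calls to an explicit `TOOLS` allow-list keeps the model from invoking arbitrary code.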
LLMs are general-purpose reasoning engines. Their applications span nearly every industry and function.
- Code generation & review: Write, debug, refactor, explain, and document code. GitHub Copilot, Cursor, Replit AI.
- Test generation: Automatically write unit tests from function signatures and docstrings.
- AI agents: Autonomous task execution — browse the web, run code, call APIs, manage files, book meetings.
- Documentation: Generate API docs, README files, and technical specs directly from code.
- Enterprise search (RAG): Q&A over internal documents, PDFs, and knowledge bases. Answer questions from company data, not just training data.
- Document processing: Extract, classify, summarise contracts, invoices, medical notes, and legal filings at scale.
- Customer support: 24/7 AI agents handling tier-1 support, routing and escalating complex cases to humans.
- Data analysis: Interpret data, generate SQL queries, explain trends, and build dashboard narratives.
- Clinical notes: Transcribe and structure physician notes, reducing documentation time by 50%+.
- Contract review: Flag risk clauses, compare against standard terms, summarise key provisions in seconds.
- Compliance: Screen communications, generate regulatory reports, monitor for policy violations.
- Financial analysis: Summarise earnings calls, extract KPIs from reports, generate market commentary.
Most engineers don’t train LLMs from scratch. They adapt existing models using one of three strategies. The right choice depends on your specific problem.
Prompt engineering: System prompts, few-shot examples, chain-of-thought instructions, and output format schemas. No training, no infrastructure. Often solves 80% of problems.
System: You are a JSON extractor. Output only valid JSON.
User: Extract name and email.
Text: "Hi, I'm Sarah (sarah@company.com), please follow up."
Assistant: {"name": "Sarah", "email": "sarah@company.com"}
# No training needed. No infrastructure. Solves format/tone/structure problems instantly.
Retrieval-augmented generation (RAG): Inject relevant context from a vector database into the prompt at inference time. This keeps the model's knowledge current without retraining. Best for knowledge-base Q&A, private document search, and factual grounding.
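The retrieve-then-prompt flow can be sketched with an in-memory list standing in for the vector database. The three-dimensional "embeddings" are hand-written stand-ins for a real embedding model, and `cosine`, `retrieve`, and `build_prompt` are hypothetical helpers written for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, store, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    ranked = sorted(store, key=lambda doc: cosine(query_vec, doc["embedding"]), reverse=True)
    return ranked[:k]


def build_prompt(question, docs):
    """Place the retrieved passages in the prompt, numbered for citation."""
    context = "\n".join(f"[{i + 1}] {doc['text']}" for i, doc in enumerate(docs))
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hand-written vectors stand in for the output of a real embedding model.
STORE = [
    {"text": "Refunds are processed within 14 days.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Our office is closed on public holidays.", "embedding": [0.0, 0.2, 0.9]},
    {"text": "Refund requests need an order number.", "embedding": [0.8, 0.3, 0.1]},
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the question below
top_docs = retrieve(query_vec, STORE, k=2)
prompt = build_prompt("How long do refunds take?", top_docs)
```

Only the two refund-related passages reach the prompt, and updating the store changes what the model can answer without touching its weights.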
Fine-tuning: Train the model on domain-specific data to change its behaviour, style, or domain knowledge. LoRA fine-tunes only small adapter layers, which makes it practical on a single consumer GPU for smaller models. Best for consistent output format, domain adaptation, or task specialisation.
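LoRA keeps the original weight matrix W frozen and learns a low-rank update, so the adapted layer computes W x + (alpha / r) · B (A x). The sketch below shows the parameter saving and the forward pass with plain lists; the matrix sizes, rank, and alpha are made-up values, and the helper names are assumptions for illustration.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple:
    """Return (full_params, lora_params) for one d_out x d_in weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W is frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


full, lora = lora_param_counts(4096, 4096, rank=8)

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights (identity, for readability)
A = [[1.0, 1.0]]              # rank x d_in
B_zero = [[0.0], [0.0]]       # d_out x rank; LoRA initialises B to zero
y_start = lora_forward([1.0, 2.0], W, A, B_zero, alpha=16, rank=1)
```

For a 4096 × 4096 matrix at rank 8, the adapter trains 65,536 parameters instead of about 16.8 million. Because B starts at zero, the adapted model initially behaves exactly like the base model.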
Choose RAG when:
- Private docs that change frequently
- Citations and source attribution needed
- Knowledge cutoff is a problem
- No GPU, no training budget

Choose fine-tuning when:
- Consistent output format or style
- Niche domain with specific vocabulary
- Latency matters (no retrieval step)
- Prompting alone cannot solve the task