GPUs for LLMs
Architecture, Commands & Utilization
Every AI engineer eventually hits a CUDA out-of-memory error at 2am. This guide explains why GPUs matter, the commands you actually need, and the optimization techniques that separate an expensive inference bill from an efficient one.
Why GPUs, Not CPUs?
A CPU has 8–64 powerful cores optimized for sequential, branching logic — great for running an operating system, bad for math at scale. A GPU has thousands of simpler cores optimized for one thing: doing the same operation on many pieces of data simultaneously (SIMD — Single Instruction, Multiple Data).
An LLM forward pass is, at its core, a sequence of matrix multiplications. Multiplying a 4096×4096 weight matrix against a batch of input vectors is exactly the kind of massively parallel, identical-operation workload GPUs were built for.
GPU Architecture 101
Three components determine how a GPU performs on LLM workloads: compute cores, memory bandwidth, and interconnect. Understanding each tells you exactly what you're paying for when you rent or buy a GPU.
VRAM — the number that actually limits you
VRAM (GPU memory) determines the largest model you can load. As a rough rule for inference: VRAM needed ≈ parameters × bytes-per-parameter, plus 10–20% overhead for KV cache and activations — the exact overhead depends on context length and batch size, so treat the table below as a starting estimate, not an exact figure.
| Model size | FP16 (2 bytes/param) | INT8 (1 byte/param) | INT4 (0.5 bytes/param) |
|---|---|---|---|
| 7B | ~16 GB | ~8 GB | ~4 GB |
| 13B | ~28 GB | ~14 GB | ~7 GB |
| 70B | ~150 GB | ~75 GB | ~38 GB |
| 405B | ~850 GB | ~425 GB | ~215 GB |
The NVIDIA GPU Lineup for AI
NVIDIA dominates AI compute (~80%+ market share) because of CUDA’s software ecosystem, not just hardware. Here's what each tier is actually for.
| GPU | VRAM | Memory BW | Best for | Typical use |
|---|---|---|---|---|
| H200 | 141 GB HBM3e | 4.8 TB/s | Largest models, training + inference | Frontier model training, 70B+ inference |
| H100 | 80 GB HBM3 | 3.35 TB/s | Production training & inference | Current industry standard for serious workloads |
| A100 | 40/80 GB HBM2e | 1.5–2 TB/s | Training, fine-tuning | Still widely used, cheaper than H100 |
| L40S | 48 GB GDDR6 | 864 GB/s | Inference-optimized | Cost-efficient serving, not ideal for training |
| RTX 4090 | 24 GB GDDR6X | 1 TB/s | Local dev, fine-tuning small models | Prototyping, 7B–13B LoRA fine-tuning |
| L4 | 24 GB GDDR6 | 300 GB/s | Low-cost cloud inference | Budget-friendly serving for smaller models |
Essential GPU Commands
These are the commands you'll actually type, in order of how often you'll need them.
1. nvidia-smi — your daily driver
$ nvidia-smi # Shows: GPU utilization %, memory used/total, temperature, # power draw, and every process currently using the GPU $ watch -n 1 nvidia-smi # Live-refreshing view, updates every 1 second — use this # while training to watch utilization in real time $ nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv # Scriptable output for logging/monitoring pipelines $ nvidia-smi -l 5 # Loop mode, refreshes every 5 seconds without "watch"
2. CUDA toolkit & driver info
$ nvcc --version # CUDA compiler version — must be compatible with your # PyTorch/TensorFlow build or you'll get silent failures $ nvidia-smi --query-gpu=driver_version --format=csv # Driver version — mismatched driver/CUDA is the #1 cause # of "CUDA not available" errors $ cat /usr/local/cuda/version.json # Installed CUDA toolkit version (path may vary by install)
3. PyTorch GPU commands
import torch torch.cuda.is_available() # True/False — sanity check #1 torch.cuda.device_count() # How many GPUs are visible torch.cuda.current_device() # Which GPU index is active torch.cuda.get_device_name(0) # "NVIDIA H100 80GB HBM3" torch.cuda.memory_allocated() # Bytes currently allocated torch.cuda.memory_reserved() # Bytes reserved by the allocator torch.cuda.empty_cache() # Release unused cached memory torch.cuda.synchronize() # Block until all GPU ops finish
4. Multi-GPU & environment control
$ export CUDA_VISIBLE_DEVICES=0,1 # Restrict a process to only see GPUs 0 and 1 $ CUDA_VISIBLE_DEVICES=2 python train.py # Run a script pinned to GPU index 2 only $ nvidia-smi topo -m # Shows interconnect topology between GPUs (NVLink vs PCIe) # — critical for understanding multi-GPU training speed $ torchrun --nproc_per_node=4 train.py # Launch distributed training across 4 GPUs on one node
RuntimeError: CUDA out of memory. Tried to allocate X GiB — happens when batch size, sequence length, or model size exceeds available VRAM. The traceback rarely tells you the real fix.NVIDIA's Software Stack
NVIDIA's actual moat isn't the silicon — it's a decade of software that makes that silicon usable. This is what each layer does.
| Component | What it does | Why it matters for LLMs |
|---|---|---|
| CUDA | Parallel computing platform & API | Foundation everything else builds on. Without it, the GPU is just silicon. |
| cuDNN | Optimized primitives for deep learning ops | Hand-tuned attention, convolution, and normalization kernels — far faster than naive implementations. |
| NCCL | Multi-GPU/multi-node communication library | Makes distributed training across hundreds of GPUs actually work efficiently. |
| TensorRT-LLM | Inference compiler & runtime | Compiles models into optimized engines — can cut inference latency significantly vs. raw PyTorch. |
| NVLink | High-speed GPU-to-GPU interconnect | 900 GB/s between GPUs (vs ~64 GB/s on PCIe) — essential for model-parallel training. |
| MIG | Multi-Instance GPU partitioning | Splits one physical GPU (A100/H100) into up to 7 isolated instances — better utilization for smaller inference workloads. |
NVIDIA Containers — NIM, NGC & Easy Deployment
Installing CUDA, cuDNN, the right PyTorch build, and a dozen other dependencies by hand is exactly the kind of fragile setup that breaks the moment you move from your laptop to a cloud GPU. NVIDIA's container ecosystem exists to make that problem disappear — and it has expanded significantly with the launch of NIM.
--gpus flag, instead of needing CUDA installed on every machine.NVIDIA Container Toolkit — the foundation
$ docker run --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi # Runs nvidia-smi inside a container with full GPU access — # the container sees the host's GPU directly, no passthrough hacks $ docker run --gpus '"device=0,1"' my-training-image # Restrict a container to specific GPU indices $ docker run --gpus all --shm-size=8g -p 8000:8000 my-inference-image # --shm-size matters: PyTorch dataloaders and multi-process # inference often need more shared memory than Docker's 64MB default
NVIDIA NIM — the big one for LLM deployment
NIM (NVIDIA Inference Microservices) is a catalog of containerized, pre-optimized models — Llama, Mistral, Mixtral, and NVIDIA's own models — that ship with TensorRT-LLM or vLLM already configured underneath. Instead of hand-tuning batching, KV cache, and quantization yourself, you pull a NIM container and get an OpenAI-compatible inference endpoint in minutes.
$ docker run --gpus all -p 8000:8000 \
-e NGC_API_KEY=$NGC_API_KEY \
nvcr.io/nim/meta/llama3-8b-instruct:latest
# Pulls and runs a pre-optimized Llama 3 8B endpoint
$ curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta/llama3-8b-instruct", "messages": [...]}'
# Standard OpenAI-style API — drop-in compatible with
# existing OpenAI SDK client code
| Approach | Setup time | Control | Best for |
|---|---|---|---|
| Raw PyTorch + manual CUDA setup | Hours–days | Full | Research, custom architectures |
| NGC container (PyTorch/Triton) | Minutes | High | Custom training/serving with matched dependencies |
| NVIDIA NIM | Minutes | Moderate | Fast production deployment of standard open models |
| Triton Inference Server | Hours | High | Serving multiple model types/frameworks at scale |
Utilization Techniques That Actually Move the Needle
These are the techniques engineers reach for, ranked by effort-to-impact ratio.
Mixed Precision Training (FP16 / BF16)
Train using 16-bit floats instead of 32-bit. Halves memory usage and roughly doubles throughput on Tensor Cores, with negligible accuracy loss when done correctly (loss scaling handles the precision risk). BF16 is generally preferred over FP16 on modern GPUs (A100+) because it has the same exponent range as FP32, avoiding overflow issues.
Quantization (INT8 / INT4)
Reduce weight precision further for inference. Since most LLM inference is memory-bandwidth bound, smaller weights mean less data movement and faster generation. A 70B model that needs 150GB in FP16 fits in ~38GB at INT4 — the difference between needing 2×H100s and fitting on a single 48GB workstation GPU (RTX 6000 Ada, A6000). It still won't fit on a 24GB consumer card like the RTX 4090 — for that you'd need a 7B–13B model instead.
Flash Attention
A fused attention algorithm that avoids materializing the full attention matrix in HBM, computing it in fast on-chip SRAM instead. Reduces memory reads/writes dramatically and speeds up both training and inference for long sequences — now standard in virtually every serious LLM framework.
Gradient Checkpointing
Trade compute for memory: instead of storing all intermediate activations for backpropagation, recompute them during the backward pass. Cuts activation memory significantly at the cost of ~20–30% more compute time. Use when you're memory-constrained but have compute headroom.
Parallelism Strategies
| Strategy | How it works | Use when |
|---|---|---|
| Data Parallelism | Same model copied across GPUs, each processes a different data batch | Model fits on one GPU; you want faster training via more throughput |
| Tensor Parallelism | Individual layers/matrices split across GPUs | Model is too large for one GPU's memory |
| Pipeline Parallelism | Different layers placed on different GPUs, data flows through in stages | Very deep models, combined with tensor parallelism for huge models |
| ZeRO / FSDP | Shards optimizer states, gradients, and parameters across GPUs | Training large models without full model replication overhead |
Monitoring & Avoiding Wasted Spend
GPU time is expensive. An idle H100 at low utilization is money burning for nothing — and it's surprisingly common.
pip install gpustat for a clean one-line summary per GPU.torch.profiler shows exactly which operations consume the most GPU time — essential before optimizing blindly.nvidia-smi shows high memory usage but low GPU utilization (the "util %" column), your bottleneck is somewhere else — usually data loading, CPU preprocessing, or a synchronization stall. Throwing a bigger GPU at this problem won't help; profile first.Choosing the Right Setup
| Your situation | Recommended setup | Why |
|---|---|---|
| Learning / prototyping | RTX 4090 or cloud T4/L4 | Cheap, sufficient for 7B models with quantization |
| Fine-tuning 7B–13B models | Single A100 (40–80GB) | Enough VRAM for LoRA/QLoRA without multi-GPU complexity |
| Production inference at scale | L40S or A100 cluster | Best cost-per-token for serving, not training-optimized pricing |
| Training large models from scratch | H100/H200 cluster with NVLink | Only realistic option for frontier-scale pre-training |
| Budget-constrained startup | Quantized open models on L4/L40S | INT4 quantization + efficient GPU = 10x cost reduction vs naive FP16 + H100 |