⚙️ Infrastructure Guide — 23 min read

GPUs for LLMs
Architecture, Commands & Utilization

Every AI engineer eventually hits a CUDA out-of-memory error at 2am. This guide explains why GPUs matter, the commands you actually need, and the optimization techniques that separate an expensive inference bill from an efficient one.

GPU architecture basics
15+ commands
NVIDIA stack explained
Production optimization
⚡ At a Glance
1
Why GPUs: LLMs are matrix multiplication at massive scale. GPUs do thousands of multiplications in parallel; CPUs do a handful.
2
Tensor Cores are the real reason modern GPUs train LLMs fast — not just more CUDA cores.
3
nvidia-smi is the single most important command you'll run daily. Learn it cold.
4
Quantization + Flash Attention are the two changes that cut inference cost the most for the least effort.
5
NVIDIA NIM packages an optimized inference engine (TensorRT-LLM/vLLM) into a container with an OpenAI-compatible API — minutes to deploy instead of hand-tuning a serving stack.

Why GPUs, Not CPUs?

A CPU has 8–64 powerful cores optimized for sequential, branching logic — great for running an operating system, bad for math at scale. A GPU has thousands of simpler cores optimized for one thing: doing the same operation on many pieces of data simultaneously (SIMD — Single Instruction, Multiple Data).

An LLM forward pass is, at its core, a sequence of matrix multiplications. Multiplying a 4096×4096 weight matrix against a batch of input vectors is exactly the kind of massively parallel, identical-operation workload GPUs were built for.

CPU vs GPU: Why It Matters for LLMs
💻 CPU
8–64 complex cores. Optimized for sequential logic, branching, low latency per task.
🎯 GPU
Thousands of simple cores. Optimized for identical operations on massive parallel data — exactly what matrix multiplication needs.
Concretely: A single H100 has 16,896 CUDA cores plus 528 Tensor Cores. A high-end server CPU has ~64 cores. For matrix-heavy LLM workloads, this isn't a small advantage — it's a 50–100x throughput difference.

GPU Architecture 101

Three components determine how a GPU performs on LLM workloads: compute cores, memory bandwidth, and interconnect. Understanding each tells you exactly what you're paying for when you rent or buy a GPU.

⚙️ CUDA Cores
General-purpose parallel cores. Handle standard FP32/FP64 arithmetic. Good for general compute, not optimal for the specific math LLMs need.
🎯 Tensor Cores
Specialized circuits that perform entire matrix multiply-accumulate operations in one step. This is what actually accelerates LLM training and inference — not raw CUDA core count.
💾 HBM Memory
High Bandwidth Memory stacked directly on the GPU die. An H100 has 80GB HBM3 at ~3.35 TB/s bandwidth — this is usually the actual bottleneck, not compute.
The bottleneck is usually memory, not compute
Most LLM inference is memory-bandwidth bound, not compute bound. The GPU spends more time moving weights from HBM to the compute cores than actually multiplying them. This is exactly why techniques like quantization (smaller weights = less data to move) speed up inference so dramatically — they reduce memory traffic, not just compute.

VRAM — the number that actually limits you

VRAM (GPU memory) determines the largest model you can load. As a rough rule for inference: VRAM needed ≈ parameters × bytes-per-parameter, plus 10–20% overhead for KV cache and activations — the exact overhead depends on context length and batch size, so treat the table below as a starting estimate, not an exact figure.

Model sizeFP16 (2 bytes/param)INT8 (1 byte/param)INT4 (0.5 bytes/param)
7B~16 GB~8 GB~4 GB
13B~28 GB~14 GB~7 GB
70B~150 GB~75 GB~38 GB
405B~850 GB~425 GB~215 GB

The NVIDIA GPU Lineup for AI

NVIDIA dominates AI compute (~80%+ market share) because of CUDA’s software ecosystem, not just hardware. Here's what each tier is actually for.

GPUVRAMMemory BWBest forTypical use
H200141 GB HBM3e4.8 TB/sLargest models, training + inferenceFrontier model training, 70B+ inference
H10080 GB HBM33.35 TB/sProduction training & inferenceCurrent industry standard for serious workloads
A10040/80 GB HBM2e1.5–2 TB/sTraining, fine-tuningStill widely used, cheaper than H100
L40S48 GB GDDR6864 GB/sInference-optimizedCost-efficient serving, not ideal for training
RTX 409024 GB GDDR6X1 TB/sLocal dev, fine-tuning small modelsPrototyping, 7B–13B LoRA fine-tuning
L424 GB GDDR6300 GB/sLow-cost cloud inferenceBudget-friendly serving for smaller models
Practical guidance
Training or fine-tuning at scale → H100/H200. Cost-efficient production inference → L40S or A100. Local prototyping and LoRA experiments → RTX 4090. Don’t pay for H100 pricing if your workload is inference-only and memory-bandwidth bound — an L40S is often the smarter spend.

Essential GPU Commands

These are the commands you'll actually type, in order of how often you'll need them.

1. nvidia-smi — your daily driver

$ nvidia-smi
# Shows: GPU utilization %, memory used/total, temperature,
# power draw, and every process currently using the GPU

$ watch -n 1 nvidia-smi
# Live-refreshing view, updates every 1 second — use this
# while training to watch utilization in real time

$ nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
# Scriptable output for logging/monitoring pipelines

$ nvidia-smi -l 5
# Loop mode, refreshes every 5 seconds without "watch"

2. CUDA toolkit & driver info

$ nvcc --version
# CUDA compiler version — must be compatible with your
# PyTorch/TensorFlow build or you'll get silent failures

$ nvidia-smi --query-gpu=driver_version --format=csv
# Driver version — mismatched driver/CUDA is the #1 cause
# of "CUDA not available" errors

$ cat /usr/local/cuda/version.json
# Installed CUDA toolkit version (path may vary by install)

3. PyTorch GPU commands

import torch

torch.cuda.is_available()        # True/False — sanity check #1
torch.cuda.device_count()        # How many GPUs are visible
torch.cuda.current_device()      # Which GPU index is active
torch.cuda.get_device_name(0)    # "NVIDIA H100 80GB HBM3"
torch.cuda.memory_allocated()    # Bytes currently allocated
torch.cuda.memory_reserved()     # Bytes reserved by the allocator
torch.cuda.empty_cache()         # Release unused cached memory
torch.cuda.synchronize()         # Block until all GPU ops finish

4. Multi-GPU & environment control

$ export CUDA_VISIBLE_DEVICES=0,1
# Restrict a process to only see GPUs 0 and 1

$ CUDA_VISIBLE_DEVICES=2 python train.py
# Run a script pinned to GPU index 2 only

$ nvidia-smi topo -m
# Shows interconnect topology between GPUs (NVLink vs PCIe)
# — critical for understanding multi-GPU training speed

$ torchrun --nproc_per_node=4 train.py
# Launch distributed training across 4 GPUs on one node
❌ Most common GPU error: CUDA out of memory
RuntimeError: CUDA out of memory. Tried to allocate X GiB — happens when batch size, sequence length, or model size exceeds available VRAM. The traceback rarely tells you the real fix.
Fix in order of effort: (1) reduce batch size, (2) enable gradient checkpointing, (3) switch to mixed precision (FP16/BF16), (4) use gradient accumulation to simulate larger batches, (5) quantize the model, (6) use a GPU with more VRAM.

NVIDIA's Software Stack

NVIDIA's actual moat isn't the silicon — it's a decade of software that makes that silicon usable. This is what each layer does.

NVIDIA AI Software Stack
Your Application (PyTorch / TensorFlow / vLLM)
TensorRT-LLM — inference optimization & compilation
cuDNN — optimized deep learning primitives (convolutions, attention)
NCCL — multi-GPU communication (all-reduce, broadcast)
CUDA — the parallel computing platform & programming model
GPU Hardware — CUDA cores, Tensor Cores, HBM, NVLink
ComponentWhat it doesWhy it matters for LLMs
CUDAParallel computing platform & APIFoundation everything else builds on. Without it, the GPU is just silicon.
cuDNNOptimized primitives for deep learning opsHand-tuned attention, convolution, and normalization kernels — far faster than naive implementations.
NCCLMulti-GPU/multi-node communication libraryMakes distributed training across hundreds of GPUs actually work efficiently.
TensorRT-LLMInference compiler & runtimeCompiles models into optimized engines — can cut inference latency significantly vs. raw PyTorch.
NVLinkHigh-speed GPU-to-GPU interconnect900 GB/s between GPUs (vs ~64 GB/s on PCIe) — essential for model-parallel training.
MIGMulti-Instance GPU partitioningSplits one physical GPU (A100/H100) into up to 7 isolated instances — better utilization for smaller inference workloads.

NVIDIA Containers — NIM, NGC & Easy Deployment

Installing CUDA, cuDNN, the right PyTorch build, and a dozen other dependencies by hand is exactly the kind of fragile setup that breaks the moment you move from your laptop to a cloud GPU. NVIDIA's container ecosystem exists to make that problem disappear — and it has expanded significantly with the launch of NIM.

📦 NVIDIA Container Toolkit
The foundation. Lets Docker containers access the host GPU directly via the --gpus flag, instead of needing CUDA installed on every machine.
📁 NGC Catalog
NVIDIA's registry of pre-built, performance-tuned containers for PyTorch, TensorFlow, Triton, and more — versions and drivers already matched and tested.
📱 NVIDIA NIM
Pre-packaged, production-ready microservices for deploying LLMs and other models — optimized inference engine included, exposed via an OpenAI-compatible API.

NVIDIA Container Toolkit — the foundation

$ docker run --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
# Runs nvidia-smi inside a container with full GPU access —
# the container sees the host's GPU directly, no passthrough hacks

$ docker run --gpus '"device=0,1"' my-training-image
# Restrict a container to specific GPU indices

$ docker run --gpus all --shm-size=8g -p 8000:8000 my-inference-image
# --shm-size matters: PyTorch dataloaders and multi-process
# inference often need more shared memory than Docker's 64MB default

NVIDIA NIM — the big one for LLM deployment

NIM (NVIDIA Inference Microservices) is a catalog of containerized, pre-optimized models — Llama, Mistral, Mixtral, and NVIDIA's own models — that ship with TensorRT-LLM or vLLM already configured underneath. Instead of hand-tuning batching, KV cache, and quantization yourself, you pull a NIM container and get an OpenAI-compatible inference endpoint in minutes.

$ docker run --gpus all -p 8000:8000 \
    -e NGC_API_KEY=$NGC_API_KEY \
    nvcr.io/nim/meta/llama3-8b-instruct:latest
# Pulls and runs a pre-optimized Llama 3 8B endpoint

$ curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta/llama3-8b-instruct", "messages": [...]}'
# Standard OpenAI-style API — drop-in compatible with
# existing OpenAI SDK client code
Why this matters for GPU utilization
NIM containers bake in the optimization techniques covered below — continuous batching, quantization-aware serving, optimized KV cache management — so you get production-grade throughput without manually wiring up TensorRT-LLM yourself. The trade-off is less control over the underlying serving stack in exchange for much faster time-to-deploy.
ApproachSetup timeControlBest for
Raw PyTorch + manual CUDA setupHours–daysFullResearch, custom architectures
NGC container (PyTorch/Triton)MinutesHighCustom training/serving with matched dependencies
NVIDIA NIMMinutesModerateFast production deployment of standard open models
Triton Inference ServerHoursHighServing multiple model types/frameworks at scale

Utilization Techniques That Actually Move the Needle

These are the techniques engineers reach for, ranked by effort-to-impact ratio.

Mixed Precision Training (FP16 / BF16)

Train using 16-bit floats instead of 32-bit. Halves memory usage and roughly doubles throughput on Tensor Cores, with negligible accuracy loss when done correctly (loss scaling handles the precision risk). BF16 is generally preferred over FP16 on modern GPUs (A100+) because it has the same exponent range as FP32, avoiding overflow issues.

Quantization (INT8 / INT4)

Reduce weight precision further for inference. Since most LLM inference is memory-bandwidth bound, smaller weights mean less data movement and faster generation. A 70B model that needs 150GB in FP16 fits in ~38GB at INT4 — the difference between needing 2×H100s and fitting on a single 48GB workstation GPU (RTX 6000 Ada, A6000). It still won't fit on a 24GB consumer card like the RTX 4090 — for that you'd need a 7B–13B model instead.

Flash Attention

A fused attention algorithm that avoids materializing the full attention matrix in HBM, computing it in fast on-chip SRAM instead. Reduces memory reads/writes dramatically and speeds up both training and inference for long sequences — now standard in virtually every serious LLM framework.

Gradient Checkpointing

Trade compute for memory: instead of storing all intermediate activations for backpropagation, recompute them during the backward pass. Cuts activation memory significantly at the cost of ~20–30% more compute time. Use when you're memory-constrained but have compute headroom.

Parallelism Strategies

StrategyHow it worksUse when
Data ParallelismSame model copied across GPUs, each processes a different data batchModel fits on one GPU; you want faster training via more throughput
Tensor ParallelismIndividual layers/matrices split across GPUsModel is too large for one GPU's memory
Pipeline ParallelismDifferent layers placed on different GPUs, data flows through in stagesVery deep models, combined with tensor parallelism for huge models
ZeRO / FSDPShards optimizer states, gradients, and parameters across GPUsTraining large models without full model replication overhead
“The fastest GPU optimization isn't a more expensive card — it's reducing how much data has to move through memory in the first place. Quantization and Flash Attention solve the same root problem from different angles.”

Monitoring & Avoiding Wasted Spend

GPU time is expensive. An idle H100 at low utilization is money burning for nothing — and it's surprisingly common.

📊 nvtop / gpustat
Better visual monitoring than raw nvidia-smi. pip install gpustat for a clean one-line summary per GPU.
🔍 PyTorch Profiler
torch.profiler shows exactly which operations consume the most GPU time — essential before optimizing blindly.
💰 DCGM
NVIDIA's Data Center GPU Manager — fleet-level monitoring for production clusters, integrates with Prometheus/Grafana.
Reality check
If nvidia-smi shows high memory usage but low GPU utilization (the "util %" column), your bottleneck is somewhere else — usually data loading, CPU preprocessing, or a synchronization stall. Throwing a bigger GPU at this problem won't help; profile first.

Choosing the Right Setup

Your situationRecommended setupWhy
Learning / prototypingRTX 4090 or cloud T4/L4Cheap, sufficient for 7B models with quantization
Fine-tuning 7B–13B modelsSingle A100 (40–80GB)Enough VRAM for LoRA/QLoRA without multi-GPU complexity
Production inference at scaleL40S or A100 clusterBest cost-per-token for serving, not training-optimized pricing
Training large models from scratchH100/H200 cluster with NVLinkOnly realistic option for frontier-scale pre-training
Budget-constrained startupQuantized open models on L4/L40SINT4 quantization + efficient GPU = 10x cost reduction vs naive FP16 + H100
❌ Most common mistake: renting H100s for inference-only workloads
H100 pricing is optimized for training throughput. If you're only serving inference traffic, you're paying for compute capability you're not using — remember, inference is memory-bandwidth bound, not compute bound.
Fix: Benchmark your actual workload on an L40S or A100 first. Quantize to INT8/INT4. Only move to H100/H200 if you've measured a genuine bottleneck that cheaper hardware can't solve.
Practice infrastructure questions in a live interview
The Interview Simulator asks you to reason through GPU selection, memory calculations, and optimization trade-offs — scored in real time by Claude.
Start Mock Interview →