⚙️ Infrastructure Guide — 23 min read

GPUs for LLMs
Architecture, Commands & Utilization

Every AI engineer eventually hits a CUDA out-of-memory error at 2am. This guide explains why GPUs matter, the commands you actually need, and the optimization techniques that separate an expensive inference bill from an efficient one.

GPU architecture basics

15+ commands

NVIDIA stack explained

Production optimization

⚡ At a Glance

Why GPUs: LLMs are matrix multiplication at massive scale. GPUs do thousands of multiplications in parallel; CPUs do a handful.

Tensor Cores are the real reason modern GPUs train LLMs fast — not just more CUDA cores.

nvidia-smi is the single most important command you'll run daily. Learn it cold.

Quantization + Flash Attention are the two changes that cut inference cost the most for the least effort.

NVIDIA NIM packages an optimized inference engine (TensorRT-LLM/vLLM) into a container with an OpenAI-compatible API — minutes to deploy instead of hand-tuning a serving stack.

Why GPUs, Not CPUs?

A CPU has 8–64 powerful cores optimized for sequential, branching logic — great for running an operating system, bad for math at scale. A GPU has thousands of simpler cores optimized for one thing: doing the same operation on many pieces of data simultaneously (SIMD — Single Instruction, Multiple Data).

An LLM forward pass is, at its core, a sequence of matrix multiplications. Multiplying a 4096×4096 weight matrix against a batch of input vectors is exactly the kind of massively parallel, identical-operation workload GPUs were built for.

CPU vs GPU: Why It Matters for LLMs

💻 CPU

8–64 complex cores. Optimized for sequential logic, branching, low latency per task.

🎯 GPU

Thousands of simple cores. Optimized for identical operations on massive parallel data — exactly what matrix multiplication needs.

Concretely: A single H100 has 16,896 CUDA cores plus 528 Tensor Cores. A high-end server CPU has ~64 cores. For matrix-heavy LLM workloads, this isn't a small advantage — it's a 50–100x throughput difference.

GPU Architecture 101

Three components determine how a GPU performs on LLM workloads: compute cores, memory bandwidth, and interconnect. Understanding each tells you exactly what you're paying for when you rent or buy a GPU.

⚙️ CUDA Cores

General-purpose parallel cores. Handle standard FP32/FP64 arithmetic. Good for general compute, not optimal for the specific math LLMs need.

🎯 Tensor Cores

Specialized circuits that perform entire matrix multiply-accumulate operations in one step. This is what actually accelerates LLM training and inference — not raw CUDA core count.

💾 HBM Memory

High Bandwidth Memory stacked directly on the GPU die. An H100 has 80GB HBM3 at ~3.35 TB/s bandwidth — this is usually the actual bottleneck, not compute.

The bottleneck is usually memory, not compute

Most LLM inference is memory-bandwidth bound, not compute bound. The GPU spends more time moving weights from HBM to the compute cores than actually multiplying them. This is exactly why techniques like quantization (smaller weights = less data to move) speed up inference so dramatically — they reduce memory traffic, not just compute.

VRAM — the number that actually limits you

VRAM (GPU memory) determines the largest model you can load. As a rough rule for inference: VRAM needed ≈ parameters × bytes-per-parameter, plus 10–20% overhead for KV cache and activations — the exact overhead depends on context length and batch size, so treat the table below as a starting estimate, not an exact figure.

Model size	FP16 (2 bytes/param)	INT8 (1 byte/param)	INT4 (0.5 bytes/param)
7B	~16 GB	~8 GB	~4 GB
13B	~28 GB	~14 GB	~7 GB
70B	~150 GB	~75 GB	~38 GB
405B	~850 GB	~425 GB	~215 GB

The NVIDIA GPU Lineup for AI

NVIDIA dominates AI compute (~80%+ market share) because of CUDA’s software ecosystem, not just hardware. Here's what each tier is actually for.

GPU	VRAM	Memory BW	Best for	Typical use
H200	141 GB HBM3e	4.8 TB/s	Largest models, training + inference	Frontier model training, 70B+ inference
H100	80 GB HBM3	3.35 TB/s	Production training & inference	Current industry standard for serious workloads
A100	40/80 GB HBM2e	1.5–2 TB/s	Training, fine-tuning	Still widely used, cheaper than H100
L40S	48 GB GDDR6	864 GB/s	Inference-optimized	Cost-efficient serving, not ideal for training
RTX 4090	24 GB GDDR6X	1 TB/s	Local dev, fine-tuning small models	Prototyping, 7B–13B LoRA fine-tuning
L4	24 GB GDDR6	300 GB/s	Low-cost cloud inference	Budget-friendly serving for smaller models

Practical guidance

Training or fine-tuning at scale → H100/H200. Cost-efficient production inference → L40S or A100. Local prototyping and LoRA experiments → RTX 4090. Don’t pay for H100 pricing if your workload is inference-only and memory-bandwidth bound — an L40S is often the smarter spend.

Essential GPU Commands

These are the commands you'll actually type, in order of how often you'll need them.

1. nvidia-smi — your daily driver

$ nvidia-smi
# Shows: GPU utilization %, memory used/total, temperature,
# power draw, and every process currently using the GPU

$ watch -n 1 nvidia-smi
# Live-refreshing view, updates every 1 second — use this
# while training to watch utilization in real time

$ nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
# Scriptable output for logging/monitoring pipelines

$ nvidia-smi -l 5
# Loop mode, refreshes every 5 seconds without "watch"

2. CUDA toolkit & driver info

$ nvcc --version
# CUDA compiler version — must be compatible with your
# PyTorch/TensorFlow build or you'll get silent failures

$ nvidia-smi --query-gpu=driver_version --format=csv
# Driver version — mismatched driver/CUDA is the #1 cause
# of "CUDA not available" errors

$ cat /usr/local/cuda/version.json
# Installed CUDA toolkit version (path may vary by install)

3. PyTorch GPU commands

import torch

torch.cuda.is_available()        # True/False — sanity check #1
torch.cuda.device_count()        # How many GPUs are visible
torch.cuda.current_device()      # Which GPU index is active
torch.cuda.get_device_name(0)    # "NVIDIA H100 80GB HBM3"
torch.cuda.memory_allocated()    # Bytes currently allocated
torch.cuda.memory_reserved()     # Bytes reserved by the allocator
torch.cuda.empty_cache()         # Release unused cached memory
torch.cuda.synchronize()         # Block until all GPU ops finish

4. Multi-GPU & environment control

$ export CUDA_VISIBLE_DEVICES=0,1
# Restrict a process to only see GPUs 0 and 1

$ CUDA_VISIBLE_DEVICES=2 python train.py
# Run a script pinned to GPU index 2 only

$ nvidia-smi topo -m
# Shows interconnect topology between GPUs (NVLink vs PCIe)
# — critical for understanding multi-GPU training speed

$ torchrun --nproc_per_node=4 train.py
# Launch distributed training across 4 GPUs on one node

❌ Most common GPU error: CUDA out of memory

RuntimeError: CUDA out of memory. Tried to allocate X GiB — happens when batch size, sequence length, or model size exceeds available VRAM. The traceback rarely tells you the real fix.

Fix in order of effort: (1) reduce batch size, (2) enable gradient checkpointing, (3) switch to mixed precision (FP16/BF16), (4) use gradient accumulation to simulate larger batches, (5) quantize the model, (6) use a GPU with more VRAM.

NVIDIA's Software Stack

NVIDIA's actual moat isn't the silicon — it's a decade of software that makes that silicon usable. This is what each layer does.

NVIDIA AI Software Stack

Your Application (PyTorch / TensorFlow / vLLM)

↑

TensorRT-LLM — inference optimization & compilation

↑

cuDNN — optimized deep learning primitives (convolutions, attention)

↑

NCCL — multi-GPU communication (all-reduce, broadcast)

↑

CUDA — the parallel computing platform & programming model

↑

GPU Hardware — CUDA cores, Tensor Cores, HBM, NVLink

Component	What it does	Why it matters for LLMs
CUDA	Parallel computing platform & API	Foundation everything else builds on. Without it, the GPU is just silicon.
cuDNN	Optimized primitives for deep learning ops	Hand-tuned attention, convolution, and normalization kernels — far faster than naive implementations.
NCCL	Multi-GPU/multi-node communication library	Makes distributed training across hundreds of GPUs actually work efficiently.
TensorRT-LLM	Inference compiler & runtime	Compiles models into optimized engines — can cut inference latency significantly vs. raw PyTorch.
NVLink	High-speed GPU-to-GPU interconnect	900 GB/s between GPUs (vs ~64 GB/s on PCIe) — essential for model-parallel training.
MIG	Multi-Instance GPU partitioning	Splits one physical GPU (A100/H100) into up to 7 isolated instances — better utilization for smaller inference workloads.

NVIDIA Containers — NIM, NGC & Easy Deployment

Installing CUDA, cuDNN, the right PyTorch build, and a dozen other dependencies by hand is exactly the kind of fragile setup that breaks the moment you move from your laptop to a cloud GPU. NVIDIA's container ecosystem exists to make that problem disappear — and it has expanded significantly with the launch of NIM.

📦 NVIDIA Container Toolkit

The foundation. Lets Docker containers access the host GPU directly via the --gpus flag, instead of needing CUDA installed on every machine.

📁 NGC Catalog

NVIDIA's registry of pre-built, performance-tuned containers for PyTorch, TensorFlow, Triton, and more — versions and drivers already matched and tested.

📱 NVIDIA NIM

Pre-packaged, production-ready microservices for deploying LLMs and other models — optimized inference engine included, exposed via an OpenAI-compatible API.

NVIDIA Container Toolkit — the foundation

$ docker run --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
# Runs nvidia-smi inside a container with full GPU access —
# the container sees the host's GPU directly, no passthrough hacks

$ docker run --gpus '"device=0,1"' my-training-image
# Restrict a container to specific GPU indices

$ docker run --gpus all --shm-size=8g -p 8000:8000 my-inference-image
# --shm-size matters: PyTorch dataloaders and multi-process
# inference often need more shared memory than Docker's 64MB default

NVIDIA NIM — the big one for LLM deployment

NIM (NVIDIA Inference Microservices) is a catalog of containerized, pre-optimized models — Llama, Mistral, Mixtral, and NVIDIA's own models — that ship with TensorRT-LLM or vLLM already configured underneath. Instead of hand-tuning batching, KV cache, and quantization yourself, you pull a NIM container and get an OpenAI-compatible inference endpoint in minutes.

$ docker run --gpus all -p 8000:8000 \
    -e NGC_API_KEY=$NGC_API_KEY \
    nvcr.io/nim/meta/llama3-8b-instruct:latest
# Pulls and runs a pre-optimized Llama 3 8B endpoint

$ curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta/llama3-8b-instruct", "messages": [...]}'
# Standard OpenAI-style API — drop-in compatible with
# existing OpenAI SDK client code

Why this matters for GPU utilization

NIM containers bake in the optimization techniques covered below — continuous batching, quantization-aware serving, optimized KV cache management — so you get production-grade throughput without manually wiring up TensorRT-LLM yourself. The trade-off is less control over the underlying serving stack in exchange for much faster time-to-deploy.

Approach	Setup time	Control	Best for
Raw PyTorch + manual CUDA setup	Hours–days	Full	Research, custom architectures
NGC container (PyTorch/Triton)	Minutes	High	Custom training/serving with matched dependencies
NVIDIA NIM	Minutes	Moderate	Fast production deployment of standard open models
Triton Inference Server	Hours	High	Serving multiple model types/frameworks at scale

Utilization Techniques That Actually Move the Needle

These are the techniques engineers reach for, ranked by effort-to-impact ratio.

Mixed Precision Training (FP16 / BF16)

Train using 16-bit floats instead of 32-bit. Halves memory usage and roughly doubles throughput on Tensor Cores, with negligible accuracy loss when done correctly (loss scaling handles the precision risk). BF16 is generally preferred over FP16 on modern GPUs (A100+) because it has the same exponent range as FP32, avoiding overflow issues.

Quantization (INT8 / INT4)

Reduce weight precision further for inference. Since most LLM inference is memory-bandwidth bound, smaller weights mean less data movement and faster generation. A 70B model that needs 150GB in FP16 fits in ~38GB at INT4 — the difference between needing 2×H100s and fitting on a single 48GB workstation GPU (RTX 6000 Ada, A6000). It still won't fit on a 24GB consumer card like the RTX 4090 — for that you'd need a 7B–13B model instead.

Flash Attention

A fused attention algorithm that avoids materializing the full attention matrix in HBM, computing it in fast on-chip SRAM instead. Reduces memory reads/writes dramatically and speeds up both training and inference for long sequences — now standard in virtually every serious LLM framework.

Gradient Checkpointing

Trade compute for memory: instead of storing all intermediate activations for backpropagation, recompute them during the backward pass. Cuts activation memory significantly at the cost of ~20–30% more compute time. Use when you're memory-constrained but have compute headroom.

Parallelism Strategies

Strategy	How it works	Use when
Data Parallelism	Same model copied across GPUs, each processes a different data batch	Model fits on one GPU; you want faster training via more throughput
Tensor Parallelism	Individual layers/matrices split across GPUs	Model is too large for one GPU's memory
Pipeline Parallelism	Different layers placed on different GPUs, data flows through in stages	Very deep models, combined with tensor parallelism for huge models
ZeRO / FSDP	Shards optimizer states, gradients, and parameters across GPUs	Training large models without full model replication overhead

“The fastest GPU optimization isn't a more expensive card — it's reducing how much data has to move through memory in the first place. Quantization and Flash Attention solve the same root problem from different angles.”

Monitoring & Avoiding Wasted Spend

GPU time is expensive. An idle H100 at low utilization is money burning for nothing — and it's surprisingly common.

📊 nvtop / gpustat

Better visual monitoring than raw nvidia-smi. pip install gpustat for a clean one-line summary per GPU.

🔍 PyTorch Profiler

torch.profiler shows exactly which operations consume the most GPU time — essential before optimizing blindly.

💰 DCGM

NVIDIA's Data Center GPU Manager — fleet-level monitoring for production clusters, integrates with Prometheus/Grafana.

Reality check

If nvidia-smi shows high memory usage but low GPU utilization (the "util %" column), your bottleneck is somewhere else — usually data loading, CPU preprocessing, or a synchronization stall. Throwing a bigger GPU at this problem won't help; profile first.

Choosing the Right Setup

Your situation	Recommended setup	Why
Learning / prototyping	RTX 4090 or cloud T4/L4	Cheap, sufficient for 7B models with quantization
Fine-tuning 7B–13B models	Single A100 (40–80GB)	Enough VRAM for LoRA/QLoRA without multi-GPU complexity
Production inference at scale	L40S or A100 cluster	Best cost-per-token for serving, not training-optimized pricing
Training large models from scratch	H100/H200 cluster with NVLink	Only realistic option for frontier-scale pre-training
Budget-constrained startup	Quantized open models on L4/L40S	INT4 quantization + efficient GPU = 10x cost reduction vs naive FP16 + H100

❌ Most common mistake: renting H100s for inference-only workloads

H100 pricing is optimized for training throughput. If you're only serving inference traffic, you're paying for compute capability you're not using — remember, inference is memory-bandwidth bound, not compute bound.

Fix: Benchmark your actual workload on an L40S or A100 first. Quantize to INT8/INT4. Only move to H100/H200 if you've measured a genuine bottleneck that cheaper hardware can't solve.

Practice infrastructure questions in a live interview

The Interview Simulator asks you to reason through GPU selection, memory calculations, and optimization trade-offs — scored in real time by Claude.

Start Mock Interview →

Create your free account

GPUs for LLMs: Architecture, Commands & Utilization Guide

GPUs for LLMs
Architecture, Commands & Utilization

Why GPUs, Not CPUs?

GPU Architecture 101

VRAM — the number that actually limits you

The NVIDIA GPU Lineup for AI

Essential GPU Commands

1. nvidia-smi — your daily driver

2. CUDA toolkit & driver info

3. PyTorch GPU commands

4. Multi-GPU & environment control

NVIDIA's Software Stack

NVIDIA Containers — NIM, NGC & Easy Deployment

NVIDIA Container Toolkit — the foundation

NVIDIA NIM — the big one for LLM deployment

Utilization Techniques That Actually Move the Needle

Mixed Precision Training (FP16 / BF16)

Quantization (INT8 / INT4)

Flash Attention

Gradient Checkpointing

Parallelism Strategies

Monitoring & Avoiding Wasted Spend

Choosing the Right Setup

You've reached the free preview

Create your free account

GPUs for LLMs: Architecture, Commands & Utilization Guide

GPUs for LLMsArchitecture, Commands & Utilization

Why GPUs, Not CPUs?

GPU Architecture 101

VRAM — the number that actually limits you

The NVIDIA GPU Lineup for AI

Essential GPU Commands

1. nvidia-smi — your daily driver

2. CUDA toolkit & driver info

3. PyTorch GPU commands

4. Multi-GPU & environment control

NVIDIA's Software Stack

NVIDIA Containers — NIM, NGC & Easy Deployment

NVIDIA Container Toolkit — the foundation

NVIDIA NIM — the big one for LLM deployment

Utilization Techniques That Actually Move the Needle

Mixed Precision Training (FP16 / BF16)

Quantization (INT8 / INT4)

Flash Attention

Gradient Checkpointing

Parallelism Strategies

Monitoring & Avoiding Wasted Spend

Choosing the Right Setup

You've reached the free preview

GPUs for LLMs
Architecture, Commands & Utilization