Running AI in production is expensive, and most teams discover this only after the bill arrives. A single GPT-4-class model serving moderate traffic can cost more per month than the rest of a backend engineering team's cloud budget. Token costs stack up invisibly, GPU hours overrun estimates, and vector database queries balloon as data grows.
The good news: most AI infrastructure cost problems are engineering problems, not budget problems. The ten techniques in this guide cover every layer of the AI stack, from how you select models to how you store embeddings, and in the right context a single technique can cut the cost of the layer it targets by 30–80%.
- Smaller models first
- Task-specific models
- Distilled model usage
- Dynamic model selection
- Cheap fallback models
- Tiered inference
- Prompt compression
- Output limits enforced
- Context pruning
- Response caching
- Embedding reuse
- Query deduplication
- Autoscaling enabled
- Spot instances usage
- GPU utilization tracking
- Multi-model usage
- Avoid single dependency
- Cost benchmarking
- Batch inference jobs
- Async processing pipelines
- Queue-based execution
- Stateless services
- Serverless where possible
- Efficient pipelines
- Vector DB tuning
- Data lifecycle policies
- Cold storage usage
- Cost dashboards
- Budget alerts setup
- Usage tracking
1. Model Selection
The most impactful cost lever is the one you set before writing a single line of inference code: which model are you calling? Most engineers default to the most capable frontier model because it is the easiest choice. Most of the time, it is also the most wasteful one.
Start with the smallest model that passes your eval
Define a minimum quality threshold for your task first, then find the cheapest model that clears it. For classification, extraction, and summarisation tasks, models like GPT-4o Mini, Claude Haiku, or Gemini Flash are often 10–20× cheaper than their flagship counterparts and often score within 2–5% of them on task-specific evals.
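The selection rule above can be sketched as a small helper. This is a minimal illustration, not a full eval harness: the `Candidate` type, the model labels, the prices, and the scores are all hypothetical placeholders that you would replace with results from your own eval set.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str                   # hypothetical label, not a real model ID
    cost_per_1m_tokens: float   # placeholder blended price in USD
    eval_score: float           # score on your own eval set, 0.0 to 1.0


def cheapest_passing(
    candidates: Sequence[Candidate], min_score: float
) -> Optional[Candidate]:
    """Return the cheapest candidate whose eval score clears the threshold."""
    passing = [c for c in candidates if c.eval_score >= min_score]
    return min(passing, key=lambda c: c.cost_per_1m_tokens) if passing else None


# Placeholder numbers for illustration only.
CANDIDATES = [
    Candidate("small", 0.50, 0.88),
    Candidate("medium", 3.00, 0.93),
    Candidate("flagship", 15.00, 0.95),
]

if __name__ == "__main__":
    choice = cheapest_passing(CANDIDATES, min_score=0.90)
    print(choice.name if choice else "no model passes")
```

Fixing the threshold before looking at prices is the point of the exercise: it keeps the decision anchored to task quality rather than to whichever model is most familiar.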
Use task-specific and distilled models
Fine-tuned smaller models frequently outperform general-purpose large models on narrow tasks. A 7B model fine-tuned on your domain data can match or beat GPT-4 on your specific benchmark at a fraction of the cost per token, so measure it on your own eval before assuming you need the larger model.
| Task type | Recommended model tier | Estimated cost saving vs GPT-4 |
|---|---|---|
| Intent classification | Fine-tuned 3B–7B model | 85–95% |
| Structured data extraction | GPT-4o Mini / Claude Haiku | 75–85% |
| RAG answer generation | GPT-4o Mini / Gemini Flash | 60–75% |
| Long-form content generation | GPT-4o / Claude Sonnet | 10–30% |
| Complex multi-step reasoning | Frontier model required | 0–15% |
2. Model Routing
Model routing is the architectural pattern that gives you the benefits of frontier models where you need them, and cheap models everywhere else — automatically. Instead of a single model endpoint, you build a routing layer that classifies incoming requests and dispatches them to the right model tier.
Dynamic model selection
A lightweight classifier (can itself be a tiny model) scores each request on complexity and routes it accordingly. Simple queries — FAQ lookups, short summaries — go to your cheapest tier. Complex multi-step reasoning or nuanced generation goes to the frontier model. A well-tuned router can send 60–80% of requests to cheaper tiers.
Cheap fallback models
If your primary model times out or is rate-limited, fall back to a cheaper model rather than returning an error. This improves reliability and keeps average cost down during traffic spikes. A low-confidence response is a different case: retry it on a more capable tier, since a cheaper model is unlikely to do better.
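One way to express the fallback path is an ordered chain of clients that is walked until one succeeds. The sketch below assumes each provider client is wrapped in a plain callable and that the wrapper raises a hypothetical `ModelUnavailable` exception on timeouts and rate limits; neither name comes from a real SDK.

```python
from typing import Callable, Sequence, Tuple


class ModelUnavailable(Exception):
    """Hypothetical error a client wrapper raises on timeout or rate limiting."""


def call_with_fallback(
    prompt: str,
    clients: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in order and return (name, response)."""
    last_error = None
    for name, call in clients:
        try:
            return name, call(prompt)
        except ModelUnavailable as exc:
            last_error = exc  # remember why this tier failed, then try the next
    raise RuntimeError("all configured models were unavailable") from last_error
```

Returning the name of the model that answered makes it possible to log how often the fallback is used, which is worth tracking because a busy fallback usually signals a capacity or rate-limit problem upstream.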
Tiered inference
Define explicit tiers — typically three — and map request categories to them at design time:
- Tier 1 (cheapest): rule-based or fine-tuned small model — handles deterministic, narrow tasks
- Tier 2 (mid): small frontier model (Haiku, Mini, Flash) — handles moderate complexity
- Tier 3 (expensive): large frontier model — reserved for genuinely hard problems
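The three tiers above can be wired to a router along these lines. This is a deliberately crude sketch: the scoring heuristic (request length plus a few reasoning keywords), the weights, and the `low` and `high` cut-offs are assumptions made for illustration, and a production router would more likely use a small trained classifier.

```python
import re

# Hypothetical keyword list that hints at multi-step reasoning.
REASONING_HINTS = re.compile(
    r"\b(why|prove|compare|step by step|trade-?offs?)\b", re.IGNORECASE
)


def complexity_score(request: str) -> float:
    """Crude 0.0 to 1.0 complexity estimate from length and keywords."""
    length_part = min(len(request.split()) / 200, 1.0)
    hint_part = 0.6 if REASONING_HINTS.search(request) else 0.0
    return min(length_part + hint_part, 1.0)


def pick_tier(request: str, low: float = 0.2, high: float = 0.6) -> int:
    """Map a request to tier 1 (cheapest), 2 (mid), or 3 (expensive)."""
    score = complexity_score(request)
    if score < low:
        return 1
    if score < high:
        return 2
    return 3


if __name__ == "__main__":
    for text in ("What are your opening hours?",
                 "Compare these two contracts and explain the trade-offs step by step"):
        print(pick_tier(text), text)
```

Whatever scoring method is used, the thresholds should be tuned against evals per tier, because a router that sends hard requests to tier 1 saves money only by lowering quality.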
3. Token Management
Every token is a cost. Prompt tokens and output tokens are both metered, and in most production systems there is significant waste on both sides. Token management is the discipline of eliminating that waste systematically.
Prompt compression
Long system prompts, verbose few-shot examples, and redundant context are the primary sources of prompt token waste. Techniques to reduce them:
- Prompt distillation — iteratively shorten prompts while running evals to ensure quality holds
- LLMLingua / selective compression — use a lightweight model to compress context before passing to the main model
- Retrieved context trimming — in RAG, retrieve more chunks but pass fewer, higher-scored ones to the model
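Retrieved context trimming, the last item above, can be sketched as follows. The example assumes each retrieved chunk arrives as a `(score, text)` pair from your reranker, and it estimates tokens with a rough four-characters-per-token heuristic; both are assumptions, and a real system would use the tokenizer of the target model.

```python
from typing import List, Sequence, Tuple


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def trim_context(
    chunks: Sequence[Tuple[float, str]],
    max_chunks: int = 3,
    token_budget: int = 1000,
) -> List[str]:
    """Keep the highest-scored chunks that fit the chunk and token limits."""
    kept: List[str] = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if len(kept) == max_chunks:
            break
        cost = estimate_tokens(text)
        if used + cost > token_budget:
            continue  # skip an oversized chunk but keep looking for smaller ones
        kept.append(text)
        used += cost
    return kept
```

Retrieving a wide candidate set and then trimming it keeps recall high at the retrieval stage while bounding what the model is actually billed for.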
Output limits enforced
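Always set max_tokens explicitly. Without a hard limit, a rambling or runaway response is billed in full. For structured outputs (JSON, classification labels), set very tight limits: a response that should be 20 tokens should not be allowed to run to 500. Because the limit truncates the output rather than shortening it, also ask for brevity in the prompt and check that the response finished before reaching the cap.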
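A simple way to make output limits hard to forget is to attach them to the task type rather than to each call site. The sketch below is illustrative: the task names, the limits, and the shape of the request dictionary are hypothetical and would need to match your own client code.

```python
# Hypothetical per-task output caps; tune these against real response lengths.
MAX_TOKENS_BY_TASK = {
    "classification": 10,
    "json_extraction": 200,
    "summary": 300,
}
DEFAULT_MAX_TOKENS = 256  # conservative cap for tasks without an explicit entry


def build_request(task: str, prompt: str) -> dict:
    """Build request parameters that always carry an explicit max_tokens."""
    return {
        "prompt": prompt,
        "max_tokens": MAX_TOKENS_BY_TASK.get(task, DEFAULT_MAX_TOKENS),
    }
```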
Context pruning
In multi-turn conversations, naive systems send the full conversation history with every turn, so the prompt grows with each turn and the cumulative tokens billed over the conversation grow quadratically with its length. Prune intelligently: summarise older turns, drop low-relevance exchanges, and keep only the context the model needs to respond correctly to the current message.
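The quadratic growth is easy to see with a little arithmetic. The sketch below assumes every turn adds the same number of tokens and ignores the cost of generating summaries, so it is a simplified model rather than a billing calculator.

```python
from typing import Optional


def cumulative_prompt_tokens(
    turns: int, tokens_per_turn: int, window: Optional[int] = None
) -> int:
    """Total prompt tokens sent over a whole conversation.

    On turn k the full history plus the new message is k turns long.
    With a window, at most `window` of the most recent turns are sent.
    """
    total = 0
    for k in range(1, turns + 1):
        sent_turns = k if window is None else min(k, window)
        total += sent_turns * tokens_per_turn
    return total


if __name__ == "__main__":
    full = cumulative_prompt_tokens(turns=20, tokens_per_turn=200)
    pruned = cumulative_prompt_tokens(turns=20, tokens_per_turn=200, window=4)
    print(full, pruned)  # prints "42000 14800"
```

In this simplified model, a 20-turn conversation with a four-turn window sends about 65% fewer prompt tokens than resending the full history, and the gap widens as conversations get longer.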
| Token waste source | Fix | Typical saving |
|---|---|---|
| Verbose system prompts | Prompt distillation + compression | 20–40% of prompt tokens |
| Full conversation history | Rolling summarisation + pruning | 30–60% of prompt tokens |
| No output limit set | Enforce max_tokens per task | 10–50% of output tokens |
| Over-retrieved RAG context | Trim to top-K ranked chunks | 25–45% of prompt tokens |
4. Caching Layer
Calling an LLM API for a question you have already answered is one of the most common and most avoidable AI infrastructure costs. A well-designed caching layer can eliminate 30–60% of API calls in most production applications.
Response caching
Cache exact-match and near-match responses. For exact matches, a simple key-value cache (Redis, DynamoDB) indexed on a hash of the prompt is sufficient. For semantic near-matches, store embeddings of past prompts and use cosine similarity to find cached responses above a threshold before calling the API.
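A minimal sketch of the exact-match case follows. It uses an in-memory dictionary as a stand-in for Redis or DynamoDB, and the `ExactMatchCache` class and its method names are hypothetical. The key covers the model and generation parameters as well as the prompt, on the assumption that a change to any of them should produce a fresh response.

```python
import hashlib
import json
from typing import Callable, Dict, Tuple


class ExactMatchCache:
    """Exact-match response cache keyed on a hash of the full request."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}  # stand-in for Redis or DynamoDB

    @staticmethod
    def key(model: str, prompt: str, params: dict) -> str:
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(
        self, model: str, prompt: str, params: dict, call: Callable[[str], str]
    ) -> Tuple[str, bool]:
        """Return (response, was_cache_hit); call the model only on a miss."""
        k = self.key(model, prompt, params)
        if k in self._store:
            return self._store[k], True
        response = call(prompt)
        self._store[k] = response
        return response, False
```

A production cache would also need an expiry policy so that stale answers age out; that is omitted here to keep the sketch short.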
Embedding reuse
Generating embeddings is cheap relative to text generation, but it adds up at scale and introduces latency. Once a document, chunk, or user query is embedded, store and reuse that embedding rather than embedding the same content again. This applies especially to your document corpus in RAG, where re-embedding unchanged documents on every ingestion run is a common waste pattern.
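Keying stored embeddings on a hash of the content makes reuse automatic, including across ingestion runs. In the sketch below the `EmbeddingStore` class is hypothetical, `embed_fn` stands in for whatever embedding client you use, and the in-memory dictionary stands in for a persistent store. The model name is part of the key on the assumption that vectors from different embedding models are not interchangeable.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingStore:
    """Embed each distinct piece of content once and reuse the vector."""

    def __init__(self, model: str, embed_fn: Callable[[str], List[float]]) -> None:
        self._model = model
        self._embed_fn = embed_fn
        self._vectors: Dict[str, List[float]] = {}  # stand-in for a real store
        self.embed_calls = 0  # how many times the embedding model was called

    def _key(self, text: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self._model}:{digest}"

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key not in self._vectors:
            self._vectors[key] = self._embed_fn(text)
            self.embed_calls += 1
        return self._vectors[key]
```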
Query deduplication
In high-traffic systems, many users ask semantically identical questions within short time windows. Detect and collapse these before they reach the LLM. A sliding-window deduplication queue with a semantic similarity threshold (e.g. cosine > 0.97) can serve multiple users from one API call.
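The sliding-window idea can be sketched with a deque of recent queries. This is a simplified illustration: the `SlidingWindowDeduper` class is hypothetical, the linear scan would not scale to large windows, and it only reuses responses that have already completed. Collapsing requests that are still in flight would need futures or a similar mechanism.

```python
import math
import time
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SlidingWindowDeduper:
    """Reuse a recent response when a new query is semantically near-identical."""

    def __init__(
        self,
        window_seconds: float = 30.0,
        threshold: float = 0.97,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._clock = clock
        self._recent: Deque[Tuple[float, List[float], str]] = deque()

    def lookup(self, embedding: List[float]) -> Optional[str]:
        now = self._clock()
        # Drop entries that have fallen out of the time window.
        while self._recent and now - self._recent[0][0] > self.window_seconds:
            self._recent.popleft()
        for _ts, seen, response in self._recent:
            if cosine(embedding, seen) > self.threshold:
                return response
        return None

    def record(self, embedding: List[float], response: str) -> None:
        self._recent.append((self._clock(), embedding, response))
```

The caller checks `lookup` before calling the LLM and calls `record` afterwards. The injectable `clock` exists only to make the window testable.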
5. Infrastructure Usage
Even when you are using the right models efficiently, the compute you provision underneath can be dramatically over-provisioned. Infrastructure efficiency is about using exactly the compute you need — no more.
Autoscaling enabled
AI workloads are rarely steady-state. Traffic spikes during business hours, drops overnight, and surges unpredictably. Configure autoscaling on every inference endpoint — scale to zero when possible, scale out quickly when traffic arrives. On Kubernetes, use KEDA with custom metrics (tokens/second, queue depth) rather than CPU/memory, which are poor proxies for LLM workload pressure.
Spot and preemptible instances
Training jobs and batch inference workloads are excellent candidates for spot instances. With checkpointing enabled, you can achieve 60–80% compute cost reduction versus on-demand pricing. For online serving, use spot for background workers and reserve on-demand only for latency-sensitive inference paths.
GPU utilization tracking
Low GPU utilization is one of the most expensive invisible problems in AI infrastructure. A team paying for a cluster of H100s at 20% average utilization is spending roughly 80% of its GPU budget on idle capacity. Track nvidia-smi metrics, model throughput, and batch size efficiency continuously, and right-size or consolidate before the monthly bill arrives.
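The cost of low utilization can be made concrete with two small formulas. The hourly rate, cluster size, and hours in the usage example are placeholder figures, not real pricing.

```python
def effective_cost_per_useful_hour(hourly_rate: float, utilization: float) -> float:
    """Price actually paid per hour of useful GPU work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


def idle_spend(hourly_rate: float, gpus: int, hours: float, utilization: float) -> float:
    """Money spent on GPU time that did no useful work."""
    if not 0 <= utilization <= 1:
        raise ValueError("utilization must be in [0, 1]")
    return hourly_rate * gpus * hours * (1 - utilization)


if __name__ == "__main__":
    # Placeholder figures: 8 GPUs at 4.00 USD/hour for a 720-hour month.
    print(effective_cost_per_useful_hour(4.0, 0.2))
    print(idle_spend(4.0, gpus=8, hours=720, utilization=0.2))
```

At 20% utilization, each hour of useful work costs five times the list price, which is often a stronger argument for consolidation than the raw utilization figure.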
6. Vendor Strategy
Single-vendor dependency is both a cost risk and a reliability risk. A deliberate multi-vendor strategy gives you pricing leverage, fallback options, and the ability to route to whichever provider offers the best price-performance for each task type.
Multi-model and multi-provider usage
Maintain integrations with at least two LLM providers. Route tasks to the cheapest provider that meets your quality threshold for that task. Providers update pricing and model capabilities constantly — what is cheapest today may not be cheapest in six months.
Cost benchmarking
Build a lightweight benchmarking harness that runs your real production prompts through competing models and providers monthly. Measure cost-per-1M tokens, latency, and quality score together — never evaluate cost in isolation.
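The core of such a harness is small. The sketch below shows only the cost calculation and the aggregation step; the function names and the shape of the run records are assumptions, and actually sending prompts and scoring quality is left to your own eval code.

```python
from statistics import mean
from typing import Dict, List


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1m: float,
    output_price_per_1m: float,
) -> float:
    """Cost of one request given per-1M-token prices for input and output."""
    return (
        input_tokens * input_price_per_1m + output_tokens * output_price_per_1m
    ) / 1_000_000


def summarise(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Aggregate per-request results so cost is never read in isolation.

    Each run is a dict with the keys cost_usd, latency_s, and quality.
    """
    return {
        "mean_cost_usd": mean([r["cost_usd"] for r in runs]),
        "mean_latency_s": mean([r["latency_s"] for r in runs]),
        "mean_quality": mean([r["quality"] for r in runs]),
    }
```

Running `summarise` once per model or provider over the same prompt set gives a like-for-like row for each candidate.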
| Provider | Strength | Best for |
|---|---|---|
| OpenAI | Broad capability, large ecosystem | General-purpose, function calling, structured output |
| Anthropic | Long context, instruction following | Document analysis, coding, multi-step reasoning |
| Google Vertex | Multimodal, GCP integration | GCP-native stacks, image+text tasks |
| Self-hosted (Ollama, vLLM) | No per-token fees; cost is the hardware you run | High-volume, latency-insensitive tasks |
| Groq / Cerebras | Extreme inference speed | Real-time, latency-critical use cases |
7. Batch Processing
Not every AI task needs a real-time response. Identifying workloads that can tolerate delay and shifting them to batch processing is one of the highest-ROI changes you can make: OpenAI and Anthropic both price their batch APIs at roughly 50% below standard rates, and self-hosted systems can achieve even greater savings through throughput optimisation.
Batch inference jobs
Tasks like nightly report generation, document summarisation pipelines, embedding refreshes, and offline scoring are natural batch workloads. OpenAI's Batch API and Anthropic's Message Batches API offer significant discounts versus real-time calls, and AWS SageMaker batch transform avoids paying for an always-on inference endpoint.
Async processing pipelines
Where responses do not need to be synchronous, decouple the request from the response. Accept the user request immediately, queue the LLM call, and deliver the result asynchronously (webhook, polling, WebSocket). This lets you bin-pack requests efficiently and use cheaper compute tiers.
Queue-based execution
A message queue (SQS, Pub/Sub, RabbitMQ) between your API layer and your inference workers lets you absorb traffic spikes without scaling up expensive GPU instances immediately. Workers consume from the queue at a controlled rate optimised for throughput, not latency.
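On the worker side, the throughput-oriented consumption pattern looks roughly like this. The sketch uses the standard library `queue.Queue` as a stand-in for SQS, Pub/Sub, or RabbitMQ, and the `drain_batch` helper and its defaults are hypothetical.

```python
import queue
from typing import Any, List


def drain_batch(q: "queue.Queue[Any]", max_batch: int = 8, timeout_s: float = 0.05) -> List[Any]:
    """Wait briefly for one item, then take whatever else is ready, up to max_batch."""
    batch: List[Any] = []
    try:
        batch.append(q.get(timeout=timeout_s))
        while len(batch) < max_batch:
            batch.append(q.get_nowait())
    except queue.Empty:
        pass  # queue ran dry; return the partial (possibly empty) batch
    return batch
```

A worker loop would call `drain_batch` repeatedly and send each non-empty batch to the model in a single pass, so batch size adapts to load: small batches when traffic is light, full batches during spikes.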
8. Architecture Design
Architectural decisions made early in a project have compounding cost effects. The patterns below are not just good engineering — they are directly tied to keeping infrastructure costs predictable and low at scale.
Stateless services
Stateless inference services can be scaled in and out instantly, share no session state between replicas, and are trivially deployable across spot instance pools. Avoid storing conversation state in-process — use Redis or DynamoDB for session storage and keep your inference containers entirely stateless.
Serverless where possible
For spiky or low-volume workloads, serverless inference (AWS Lambda + model endpoints, Cloud Run, Modal, Replicate) eliminates the cost of idle compute entirely. You pay only for actual invocations. The trade-off is cold start latency — acceptable for async tasks, problematic for latency-sensitive user-facing calls.
Efficient pipelines
Every unnecessary hop in your inference pipeline is latency and compute you are paying for. Audit your pipelines regularly: are you embedding content that you already have embeddings for? Are you calling the LLM for steps that could be handled by deterministic code? Is your retrieval returning chunks that the model then immediately discards?
9. Storage Optimization
AI applications often have large and fast-growing storage footprints — embedding vectors, training data, model artefacts, and inference logs. Unmanaged, these can become a significant and invisible line item.
Vector DB tuning
Vector databases are often configured for exact or near-100% recall when approximate nearest neighbour (ANN) search at 95–98% recall would cost a fraction as much with little perceptible difference in quality. Tune your index type (HNSW vs IVF) and its parameters (M and ef_construction for HNSW), and consider quantisation (int8 instead of float32), which cuts the memory footprint of the stored vectors by 4×.
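The 4× figure follows directly from the bytes per value, as the sketch below shows. It also includes a simple symmetric scalar quantiser to illustrate the idea. This is an assumption-laden toy, not how any particular vector database implements quantisation, and the footprint calculation covers raw vectors only, not index overhead such as HNSW graph links.

```python
from typing import List, Tuple


def vector_bytes(n_vectors: int, dim: int, bytes_per_value: int) -> int:
    """Raw storage for the vectors alone, excluding index structures."""
    return n_vectors * dim * bytes_per_value


def quantize_int8(vector: List[float]) -> Tuple[List[int], float]:
    """Symmetric scalar quantisation to integer codes in [-127, 127]."""
    peak = max(abs(v) for v in vector)
    scale = peak / 127 if peak > 0 else 1.0
    return [round(v / scale) for v in vector], scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Approximate reconstruction of the original float values."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    # Hypothetical corpus: one million 768-dimensional vectors.
    print(vector_bytes(1_000_000, 768, 4))  # float32
    print(vector_bytes(1_000_000, 768, 1))  # int8
```

The reconstruction error per value is bounded by half the scale, which is why recall should be re-measured on your own queries after enabling quantisation.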
Data lifecycle policies
Set explicit retention policies on all AI data: raw documents, embeddings, inference logs, and intermediate pipeline outputs. Most AI applications have no lifecycle policy at all — data accumulates indefinitely at full-price storage tiers. Automate tiering to cheaper storage classes (S3 Glacier, Coldline) for data older than 30–90 days.
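In practice lifecycle rules are usually configured in the storage service itself, but the decision logic is simple enough to sketch. The tier names and the 30-day and 90-day boundaries below are illustrative assumptions that mirror the range mentioned above.

```python
from datetime import datetime, timedelta, timezone


def storage_tier(
    last_accessed: datetime,
    now: datetime,
    infrequent_after_days: int = 30,
    archive_after_days: int = 90,
) -> str:
    """Pick a storage tier from the age of the last access."""
    age = now - last_accessed
    if age < timedelta(days=infrequent_after_days):
        return "standard"
    if age < timedelta(days=archive_after_days):
        return "infrequent_access"
    return "archive"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(storage_tier(now - timedelta(days=45), now))
```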
Cold storage usage
Training datasets, model checkpoints, and historical inference logs rarely need to be accessed after the initial use period. Move them to cold storage (S3 Glacier Deep Archive, Azure Archive) and save 70–90% on storage costs for that data. Maintain only the current production model artefacts in warm storage.
10. Monitoring Costs
You cannot optimise what you cannot see. Cost monitoring is the foundation that makes every other technique actionable — without it, you are flying blind and will not catch regressions when a new deployment pattern quietly doubles your LLM spend.
Cost dashboards
Build or configure dashboards that show cost broken down by: model, endpoint, feature, team, and customer (if multi-tenant). Aggregate dashboards that show only total spend are insufficient — you need to know which model call, which feature, and which user pattern is driving each cost component.
Budget alerts
Set budget alerts at 50%, 80%, and 100% of your monthly AI spend budget — not just the overall cloud budget. Configure alerts at the per-service and per-model level too. A single runaway batch job or a prompt injection attack causing token flooding should trigger an alert within minutes, not at month-end billing.
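The threshold logic can be sketched as a function that compares spend before and after each update, so every threshold fires once when it is crossed rather than on every check. The function name and signature are hypothetical; cloud billing tools offer their own equivalents.

```python
from typing import List, Sequence


def crossed_thresholds(
    previous_spend: float,
    current_spend: float,
    budget: float,
    thresholds: Sequence[float] = (0.5, 0.8, 1.0),
) -> List[float]:
    """Return the budget fractions crossed between two spend readings."""
    return [
        t for t in thresholds
        if previous_spend < t * budget <= current_spend
    ]


if __name__ == "__main__":
    # Spend jumped from 400 to 850 against a 1000 budget.
    print(crossed_thresholds(400, 850, 1000))  # prints "[0.5, 0.8]"
```

Calling this per service and per model, with a short polling interval, is what turns a month-end surprise into an alert within minutes.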
Usage tracking per feature
Tag every LLM call with the feature or user journey that generated it. This allows you to calculate cost-per-feature and cost-per-user, which are the metrics that actually inform prioritisation and pricing decisions. Without this data, cost optimisation efforts are guesswork.
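A minimal sketch of per-feature tagging follows. The `UsageLedger` class and its in-memory counters are hypothetical; in production the same records would more likely be emitted as tagged metrics or log events and aggregated in your observability stack.

```python
from collections import defaultdict
from typing import Dict


class UsageLedger:
    """Accumulate LLM cost per feature tag."""

    def __init__(self) -> None:
        self._cost_usd: Dict[str, float] = defaultdict(float)
        self._calls: Dict[str, int] = defaultdict(int)

    def record(
        self,
        feature: str,
        input_tokens: int,
        output_tokens: int,
        input_price_per_1m: float,
        output_price_per_1m: float,
    ) -> float:
        """Record one call under a feature tag and return its cost in USD."""
        cost = (
            input_tokens * input_price_per_1m + output_tokens * output_price_per_1m
        ) / 1_000_000
        self._cost_usd[feature] += cost
        self._calls[feature] += 1
        return cost

    def cost_by_feature(self) -> Dict[str, float]:
        """Total cost per feature, most expensive first."""
        return dict(sorted(self._cost_usd.items(), key=lambda kv: kv[1], reverse=True))

    def cost_per_call(self, feature: str) -> float:
        calls = self._calls.get(feature, 0)
        return self._cost_usd.get(feature, 0.0) / calls if calls else 0.0
```

Adding a user or tenant identifier alongside the feature tag extends the same structure to cost-per-user.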
Where to Start
| Technique | Implementation effort | Expected cost impact | Do first? |
|---|---|---|---|
| Model selection review | Low | 60–90% | ✅ Yes |
| Token management (output limits) | Very low | 10–50% | ✅ Yes |
| Exact-match response caching | Low | 20–40% | ✅ Yes |
| Cost monitoring & dashboards | Medium | Foundational | ✅ Yes |
| Batch processing for async tasks | Medium | 40–60% | After above |
| Model routing layer | Medium–High | 50–70% | After above |
| Autoscaling + spot instances | Medium | 50–80% | After above |
| Vector DB tuning | Medium | 30–60% | When scaling |
| Semantic caching | High | 20–50% | When scaling |
| Context pruning pipeline | High | 30–60% | When scaling |
Summary
AI infrastructure cost is an engineering discipline, not a finance problem. The ten techniques in this guide — model selection, model routing, token management, caching, infrastructure efficiency, vendor strategy, batch processing, architecture design, storage optimisation, and cost monitoring — together form a complete cost optimisation playbook that any AI engineering team can implement incrementally.
Start with the highest-ROI, lowest-effort changes: audit your model choices, set output limits everywhere, add exact-match caching, and wire up cost dashboards before the next billing cycle. Those four steps alone can cut AI spend by 40–70% in most production systems.