TL;DR
LLM inference costs are easy to ignore at small scale and impossible to ignore at large scale. The highest-impact techniques: prompt compression (biggest win), model routing (40–60% cost reduction), semantic caching (15–30% cache hit rate on FAQ-style apps), output length control, and async batching for non-real-time work.
These are the techniques that actually moved the needle when optimizing the AI cost structure of several production products.
Understand Your Cost Drivers First
Before optimizing anything, measure. Where are the tokens going? Which features consume the most? Which users generate the most calls? What percentage of calls are hitting expensive models when a cheaper one would do?
Instrument every AI call with: model used, input token count, output token count, feature identifier, and user segment. Build a dashboard. You will be surprised by what you find — usually 20% of features drive 80% of cost.
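As a minimal sketch of that instrumentation, the snippet below records the five fields listed above for each call and totals cost per feature. The model names and per-million-token prices are hypothetical placeholders, not real provider rates; in practice the records would be written to a log or metrics store rather than kept in a list.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
PRICES = {
    "small-model": {"input": 0.50, "output": 1.50},
    "large-model": {"input": 5.00, "output": 15.00},
}


@dataclass
class CallRecord:
    model: str
    input_tokens: int
    output_tokens: int
    feature: str
    user_segment: str

    def cost(self) -> float:
        """Cost of this call in USD under the assumed price table."""
        price = PRICES[self.model]
        return (self.input_tokens * price["input"]
                + self.output_tokens * price["output"]) / 1_000_000


def cost_by_feature(records):
    """Return (feature, total_cost) pairs, most expensive first."""
    totals = defaultdict(float)
    for record in records:
        totals[record.feature] += record.cost()
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


records = [
    CallRecord("large-model", 2000, 500, "chat", "free"),
    CallRecord("small-model", 300, 50, "intent", "free"),
    CallRecord("large-model", 4000, 800, "report", "paid"),
]
for feature, total in cost_by_feature(records):
    print(f"{feature}: ${total:.4f}")
```

The same aggregation can be keyed on user segment or model to answer the other questions in this section.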
Prompt Compression
The single highest-impact optimization in most systems is reducing input token count. Review every system prompt for redundancy. If you're sending 2,000 tokens of context that could be 400 tokens without losing meaning, that's an 80% reduction in input costs.
Compress retrieved RAG context intelligently: summarize or trim chunks that aren't directly relevant to the query. For structured data, use a compact structured format (JSON, XML) instead of verbose natural language, and compare token counts, since markup-heavy formats can use more tokens than the prose they replace. Every token you don't send is a token you don't pay for.
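One way to trim retrieved context is to keep only the most relevant chunks that fit a fixed token budget. The sketch below assumes the retriever returns a relevance score with each chunk; `estimate_tokens`, `select_chunks`, the 0.5 score cutoff, and the four-characters-per-token heuristic are all illustrative assumptions, and a real system would count tokens with the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); not a real tokenizer.
    return max(1, len(text) // 4)


def select_chunks(chunks, budget_tokens, min_score=0.5):
    """Keep the most relevant chunks that fit within the token budget.

    `chunks` is a list of (text, relevance_score) pairs. Returns the
    selected texts and the estimated number of tokens they use.
    """
    selected, used = [], 0
    for text, score in sorted(chunks, key=lambda c: c[1], reverse=True):
        if score < min_score:
            break  # remaining chunks are even less relevant
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # too big for the remaining budget; a smaller chunk may fit
        selected.append(text)
        used += cost
    return selected, used


chunks = [
    ("Reset your password from Settings > Security.", 0.92),
    ("Our company was founded in 2012 in Berlin.", 0.18),
    ("Password reset links expire after 30 minutes.", 0.81),
]
context, tokens_used = select_chunks(chunks, budget_tokens=50)
print(context, tokens_used)
```

Here the low-scoring chunk about the company's history is dropped before it ever reaches the prompt.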
Model Routing
Not every query needs GPT-4 or Claude Sonnet. Simple classification tasks, intent detection, and structured data extraction can often be handled by smaller, faster, cheaper models with no noticeable quality difference for the user.
Build a routing layer that classifies query complexity and routes accordingly. Simple queries go to a fast, cheap model. Complex reasoning, nuanced generation, or high-stakes outputs go to the best model. A well-tuned router can cut costs by 40–60% with minimal quality impact.
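A routing layer can start as a handful of heuristics before graduating to a trained classifier. The sketch below is one such starting point, not a tuned router: the model names, the task categories, the keyword hints, and the 150-word cutoff are all hypothetical and would need calibrating against your own traffic and quality checks.

```python
CHEAP_MODEL = "small-model"   # hypothetical model name
STRONG_MODEL = "large-model"  # hypothetical model name

# Task types the article lists as safe for smaller models.
SIMPLE_TASKS = {"classification", "intent_detection", "extraction"}
# Illustrative phrases that suggest multi-step reasoning or nuanced generation.
COMPLEX_HINTS = ("explain why", "compare", "step by step", "analyze", "draft")


def route(task_type: str, query: str, high_stakes: bool = False) -> str:
    """Pick a model name for a request using simple heuristics."""
    if high_stakes:
        return STRONG_MODEL  # never downgrade high-stakes outputs
    if task_type in SIMPLE_TASKS:
        return CHEAP_MODEL
    text = query.lower()
    if len(text.split()) > 150 or any(hint in text for hint in COMPLEX_HINTS):
        return STRONG_MODEL
    return CHEAP_MODEL


print(route("intent_detection", "Cancel my order"))
print(route("chat", "Compare these two contracts and explain why they differ"))
```

Logging the router's decision alongside each call makes it possible to audit how often cheap-model answers needed escalation.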
Semantic Caching
Many AI applications have clusters of semantically similar queries that return similar outputs. A user asking 'how do I reset my password?' and 'what's the process for password reset?' should probably get the same answer.
Semantic caching stores embeddings of previous queries alongside their responses. On a new query, check whether a semantically similar query was answered recently, and return the cached response if the similarity score is high enough. Cache hit rates of 15–30% are common in support and FAQ-style applications, which means 15–30% of calls cost almost nothing.
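The lookup described above can be sketched in a few lines. To stay self-contained, this example uses `toy_embed`, a bag-of-words stand-in for a real embedding model, together with a linear scan and a 0.8 cosine threshold; all of these are assumptions for illustration. A production cache would call an actual embedding model, use a vector index, and expire old entries.

```python
import math
import re
from collections import Counter

STOPWORDS = {"how", "do", "i", "my", "the", "for", "what's", "a", "to"}


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: counts of non-stopword tokens.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.8):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query):
        """Return the response of the most similar stored query, or None."""
        vec = self.embed(query)
        best_score, best_response = 0.0, None
        for stored_vec, response in self.entries:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))


cache = SemanticCache(toy_embed)
cache.put("how do I reset my password?", "Go to Settings > Security.")
print(cache.get("what's the process for password reset?"))
print(cache.get("how do I change my billing address?"))
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask, too high and the hit rate collapses.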
Output Length Control
Output tokens usually cost more than input tokens, often several times more per token, depending on the model. If your application doesn't need long responses, explicitly instruct the model to be concise. An instruction such as 'Answer in 2-3 sentences' in the system prompt reliably reduces output length and cost.
For structured outputs, use JSON mode or function calling — these produce shorter, predictable outputs compared to freeform generation and are easier to parse downstream.
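To make the effect of output length concrete, the sketch below compares the cost of a verbose and a concise response to the same prompt. The prices are hypothetical, and `build_request` returns a provider-agnostic dictionary whose field names are illustrative only; check your provider's documentation for its actual parameter that caps output tokens.

```python
# Hypothetical prices in USD per million tokens; check your provider's rates.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD under the assumed prices."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000


def build_request(user_message: str, max_output_tokens: int = 120) -> dict:
    """Build a provider-agnostic request dict (field names are illustrative)."""
    return {
        "system": "Answer in 2-3 sentences.",
        "user": user_message,
        "max_output_tokens": max_output_tokens,  # hard cap as a safety net
    }


verbose = request_cost(input_tokens=500, output_tokens=600)
concise = request_cost(input_tokens=500, output_tokens=80)
saving = 1 - concise / verbose
print(f"verbose=${verbose:.4f} concise=${concise:.4f} saving={saving:.0%}")
```

The instruction in the system prompt shapes the typical response, while the token cap bounds the worst case.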
Batching and Async Processing
Not every AI call needs to be synchronous. Background tasks such as summarization, classification, and content generation can be batched and processed during off-peak hours on lower-priority, lower-cost pricing tiers.
Anthropic and OpenAI both offer batch processing APIs at significant discounts for non-real-time workloads. If you're doing bulk document processing, nightly summaries, or any non-interactive AI work, batching is the easiest cost reduction available.
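Before calling any batch API, the work has to be separated into calls that must run immediately and calls that can wait. The sketch below shows only that planning step; the `interactive` flag, the job dictionaries, and the batch size are hypothetical, and submitting each batch is left to whichever provider API you use.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def plan(jobs, batch_size=100):
    """Split jobs into real-time calls and batches of deferrable calls."""
    realtime = [job for job in jobs if job["interactive"]]
    deferred = [job for job in jobs if not job["interactive"]]
    return realtime, list(chunked(deferred, batch_size))


# Ten hypothetical jobs; every fifth one is user-facing and cannot wait.
jobs = [{"id": i, "interactive": i % 5 == 0} for i in range(10)]
realtime, batches = plan(jobs, batch_size=3)
print(len(realtime), [len(batch) for batch in batches])
```

Tagging each call as interactive or deferrable at the point where it is created keeps this decision out of the request path.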
Key Takeaways
- Measure first: instrument every AI call with model, token counts, feature ID, and user segment before optimizing.
- Prompt compression is the single highest-impact optimization: reduce input tokens aggressively.
- Model routing can cut costs 40–60% by sending simple queries to cheaper models.
- Semantic caching achieves 15–30% cache hit rates in FAQ and support applications.
- Use the Anthropic and OpenAI batch APIs for all non-real-time AI work; both offer significant pricing discounts.