Today we're serving 3x the inference volume at 40% of what we were paying then. This is the complete story - every intervention, every tradeoff, every tool we used, and the actual numbers behind each change.
Where We Started: High Spend, Zero Visibility
We were running on a single GPU provider, using on-demand A100 80GB instances with a stock vLLM setup. No autoscaling. No batching optimization. No cost attribution by model or endpoint. The bill was high and growing - but more critically, it was completely opaque.
We knew total GPU spend. We didn't know which models, which product features, or which traffic patterns were driving it. Optimizing without cost attribution is guesswork, and guesswork at GPU prices is expensive.
The first intervention was instrumentation. Every inference request got tagged with model ID, endpoint name, prompt token count, completion token count, GPU node ID, and latency. We streamed this into ClickHouse for fast analytical queries and built a Grafana dashboard that gave us cost-per-request by model and endpoint within 24 hours.
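The per-request record can be sketched as follows. This is a minimal illustration, not our exact schema - field names and the helper are hypothetical, and the actual pipeline streamed these rows into ClickHouse:

```python
import time
from dataclasses import dataclass, asdict

# Hypothetical per-request record for cost attribution; field names
# are illustrative, not the exact schema we used in ClickHouse.
@dataclass
class InferenceRecord:
    model_id: str
    endpoint: str
    prompt_tokens: int
    completion_tokens: int
    gpu_node_id: str
    latency_ms: float
    ts: float

def tag_request(model_id: str, endpoint: str, prompt_tokens: int,
                completion_tokens: int, gpu_node_id: str,
                latency_ms: float) -> dict:
    """Build one row of the attribution stream as a plain dict."""
    return asdict(InferenceRecord(
        model_id=model_id,
        endpoint=endpoint,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        gpu_node_id=gpu_node_id,
        latency_ms=latency_ms,
        ts=time.time(),
    ))
```

With every request tagged this way, cost-per-request by model and endpoint is a single GROUP BY away.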
What we found immediately clarified everything. 70% of our inference volume was routed to our 70B parameter model for tasks that a properly prompted 8B model handles with comparable output quality. We were paying top-tier AI inference cost for commodity-quality work.
Intervention 1: Model Routing by Task Complexity (-35% Total Cost)
We built a lightweight complexity classifier - a small BERT-based model running on CPU - that scores each incoming request before routing. Short questions, simple classification, basic summarization: Llama 3.1 8B on RTX 4090 nodes. Complex reasoning, long-form generation, multi-step tasks: 70B model on A100 nodes.
The classifier adds 5-8ms per request, which is negligible against end-to-end inference latency. The result: 62% of our traffic shifted from the 70B model to the 8B model, and cost-per-request for that traffic fell to roughly 1/13th of what it was. That was the largest reduction from any one infrastructure change we've made: approximately 35% off total GPU spend.
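The routing decision itself is simple. Here is a sketch of the policy, assuming a classifier that returns a complexity score in [0, 1]; the threshold, model names, and pool names are illustrative:

```python
# Model tiers and hardware pools; names are illustrative.
ROUTES = {
    "simple": {"model": "llama-3.1-8b", "pool": "rtx4090"},
    "complex": {"model": "llama-3.1-70b", "pool": "a100"},
}

# Threshold tuned offline against labeled traffic (assumed value).
COMPLEXITY_THRESHOLD = 0.6

def route(request_text: str, score_fn) -> dict:
    """Pick a model tier from a CPU-side complexity score in [0, 1]."""
    score = score_fn(request_text)
    tier = "complex" if score >= COMPLEXITY_THRESHOLD else "simple"
    return ROUTES[tier]
```

The real classifier is a small BERT-based model; any scorer that is cheap on CPU and monotone in task difficulty works with this pattern.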
Intervention 2: Continuous Batching Tuning (-46% Cost-Per-Request on Existing Traffic)
We were running vLLM with default configuration. Our GPU utilization was averaging 38%. That means we were paying for 100% of the hardware and using 38% of it.
The tuning changes we made to our LLM serving platform configuration:
We increased --max-num-batched-tokens from 8,192 to 32,768, which allows larger, more efficient batches during high-traffic windows. We set --max-num-seqs to 256 to support higher concurrency. And we enabled prefix caching (--enable-prefix-caching), which gave immediate throughput gains on our RAG endpoints, where system prompts are shared across requests.
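Put together, the launch command looks roughly like this (the model name is illustrative; the flags are standard vLLM server options):

```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 256 \
  --enable-prefix-caching
```

The right values depend on your VRAM headroom and traffic shape - treat these as a starting point and re-measure utilization after each change.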
After tuning: GPU utilization climbed from 38% to 71%. Same hardware, same cost - nearly double the effective throughput. Cost-per-request dropped 46% on all existing traffic.
Intervention 3: Multi-Provider Cost Arbitrage via OneInfer (-28% Raw GPU Cost)
After optimizing at the model and batching layers, we turned to GPU procurement itself.
We integrated OneInfer's Smart Aggregator - a unified AI inference API that routes requests across multiple GPU cloud providers under a single endpoint. OneInfer's routing engine compares real-time pricing and availability across providers and routes each request to the cheapest GPU that meets our latency threshold.
In practice: during off-peak hours, traffic routes heavily to lower-cost GPU providers. During peak hours when cheaper capacity is constrained, OneInfer routes to premium capacity automatically. We configured a latency ceiling of 800ms TTFT, and cost optimization happens within that constraint transparently.
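The underlying policy - cheapest capacity that still meets the latency ceiling - can be illustrated in a few lines. This is a generic sketch of the idea, not OneInfer's API; provider names, prices, and latencies are made up:

```python
# Illustrative provider table; not real pricing or OneInfer's API.
PROVIDERS = [
    {"name": "provider-a", "price_per_mtok": 0.40, "p50_ttft_ms": 950},
    {"name": "provider-b", "price_per_mtok": 0.55, "p50_ttft_ms": 420},
    {"name": "provider-c", "price_per_mtok": 0.70, "p50_ttft_ms": 180},
]

TTFT_CEILING_MS = 800  # our configured latency ceiling

def pick_provider(providers: list, ceiling_ms: float) -> dict:
    """Cheapest provider whose observed TTFT meets the latency ceiling."""
    eligible = [p for p in providers if p["p50_ttft_ms"] <= ceiling_ms]
    if not eligible:
        # Nothing meets the ceiling: fall back to the lowest-latency option.
        return min(providers, key=lambda p: p["p50_ttft_ms"])
    return min(eligible, key=lambda p: p["price_per_mtok"])
```

Note the fallback branch: when cheap capacity is constrained at peak, the policy degrades to premium capacity rather than violating the latency constraint, which matches the behavior described above.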
Over 30 days post-integration, our effective cost-per-token dropped an additional 28% compared to our previous single-provider baseline. OneInfer's transparent pricing also eliminated the billing surprises we'd come to expect from our previous setup - every provider's cost is visible in a single unified dashboard.
Intervention 4: Quantization for Non-Critical Workloads (-40% GPU Count for Internal Tier)
For internal tooling - developer productivity features, internal search, content drafting assistance - we moved from FP16 to AWQ INT4 quantization. This halved the VRAM footprint, allowing us to run twice as many model instances per GPU node.
Quality tradeoff for internal workloads: imperceptible. For customer-facing features where output quality is a direct product differentiator, we kept FP16. For internal workloads where speed and cost matter more than marginal quality differences, AWQ was a clear win. GPU count required for our internal inference tier dropped by 40%.
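The back-of-envelope memory math is worth spelling out. Weight storage alone shrinks by more than half going from 16-bit to ~4.5-bit (INT4 weights plus scale/zero-point overhead - the overhead figure here is an assumption); the total per-instance footprint shrinks less once KV cache and activations are counted, which is why the practical result was a doubling of instances per node rather than a quadrupling:

```python
def weight_vram_gb(n_params: float, bits_per_param: float) -> float:
    """VRAM needed for model weights alone, in GB (decimal)."""
    return n_params * bits_per_param / 8 / 1e9

# Assuming an 8B-parameter model for illustration.
fp16_gb = weight_vram_gb(8e9, 16)   # FP16 weights
awq_gb = weight_vram_gb(8e9, 4.5)   # INT4 weights + assumed scale overhead
```

KV cache size scales with batch size and context length, not with weight precision, so it dominates the remaining footprint at high concurrency.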
The Full Accounting
Model routing: -35% total inference cost
Batching optimization: -46% cost-per-request on existing traffic
Multi-provider routing via OneInfer: -28% raw GPU cost
Quantization for internal workloads: -40% internal tier GPU count
Net result, six months later: a 60%+ reduction in total GPU spend while serving 3x the inference volume.
The lesson is that reducing LLM inference cost is a layered problem. No single intervention closes the gap. You need to optimize simultaneously at the model selection layer, the serving configuration layer, the hardware procurement layer, and the efficiency layer.
None of these changes required rebuilding our stack. They required measurement first, prioritization second, and systematic experimentation third. Start with instrumentation - you cannot optimize what you cannot see.
To explore how OneInfer handles the procurement and routing layer for your infrastructure, visit oneinfer.ai or get in touch with the team.