From Zero to Production: Deploying LLMs on Multi-GPU Clouds

You've picked your model. You've tested it locally. It runs beautifully on your machine with Ollama. Now you need to get it to production - serving real users, at real scale, with real reliability.

This is where most teams hit a wall. The gap between "it works on my laptop" and "it's serving 10,000 requests per day at 99.9% uptime" is filled with GPU memory management, batching configuration, autoscaling decisions, and multi-provider logistics that nobody documented in the README. This guide walks through the entire journey, from model selection to production-ready LLM deployment.

Step 1: Choose Your Model and Quantization

Before touching a GPU, nail down your model selection and serving format. These choices determine everything downstream.

The framework for production model selection is: pick the smallest model that meets your quality bar, then optimize infrastructure around it rather than upgrading the model when you hit performance problems. Model upgrades are expensive. Infrastructure optimization is compounding.

Llama 3.1 8B Instruct is the current sweet spot for cost-efficient open source model deployment. It fits in 16GB VRAM at FP16, runs at 2,000+ tokens/second with continuous batching on an A100, and matches or exceeds GPT-3.5-turbo on most practical production tasks.

For quantization, AWQ (Activation-aware Weight Quantization) is the production standard. It delivers INT4 memory efficiency with minimal quality regression and is supported natively by vLLM. AWQ-quantized Llama 3.1 8B fits on an RTX 4090 (24GB) - and at $0.29/hr on OneInfer, this is your most cost-efficient path to production LLM deployment.
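To see why the numbers above work out, a rough weight-memory estimate helps (weights only - KV cache and activation overhead come on top). The ~4.5 bits/weight figure for AWQ is an assumption that accounts for quantization scales:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM for model weights alone (excludes KV cache and activations)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * 1e9 * bytes_per_weight / 1024**3

fp16_gb = weight_memory_gb(8, 16)    # ~14.9 GB: tight but workable on a 16 GB card
awq_gb = weight_memory_gb(8, 4.5)    # ~4.2 GB: leaves most of a 24 GB RTX 4090 for KV cache
```

The headroom the AWQ variant leaves on a 24GB card is what makes high-concurrency batching possible on consumer hardware.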

Step 2: Configure Your Serving Framework

vLLM is the production standard for LLM serving. It implements PagedAttention for efficient KV cache management, continuous batching for high throughput, and tensor parallelism for splitting a single model across multiple GPUs. Here are the critical configuration parameters teams consistently get wrong:

--max-model-len: Sets your maximum context window. Larger values consume more KV cache VRAM. Start at 8,192 and increase only when your application genuinely requires longer contexts.

--gpu-memory-utilization: Default is 0.9. For dedicated inference nodes, you can push to 0.95. For shared environments, lower to 0.8 to leave headroom for memory fragmentation.

--max-num-batched-tokens: Controls maximum tokens per batch iteration. For latency-sensitive real-time applications, cap at 4,096. For throughput-optimized async workloads, push to 16,384+.

--enable-prefix-caching: Always enable for RAG applications with shared system prompts. This is a 20-40% throughput improvement that costs nothing to turn on.
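To build intuition for why --max-model-len drives KV cache VRAM, here is a back-of-the-envelope estimate using Llama 3.1 8B's architecture (32 layers, 8 KV heads with grouped-query attention, head dimension 128, FP16 cache). This is a sketch of the standard KV cache sizing formula, not vLLM's internal accounting:

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    # 2x for the separate K and V tensors stored per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, FP16
per_token = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
seq_gib = per_token * 8192 / 1024**3  # one full 8,192-token sequence: ~1 GiB
```

One maxed-out 8,192-token sequence costs about a gibibyte of cache - which is why doubling --max-model-len without a workload that needs it silently halves your achievable concurrency.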

Step 3: Match Hardware to Model

Treat hardware selection as a decision with long financial consequences, not a checkbox.

For Llama 3.1 8B (AWQ): RTX 4090 or L40S. OneInfer's RTX 4090 at $0.29/hr is the optimal price/performance point for high-volume, cost-sensitive LLM serving.

For Llama 3.1 70B: A100 80GB for the best balance of performance and cost. H100 SXM for maximum throughput where latency SLAs are tight. OneInfer's A100 80GB at $0.79/hr hits the best price/performance ratio for this model class.

For Mixtral 8x7B (MoE): A100 80GB minimum. The MoE architecture needs the full memory footprint despite activating only a subset of parameters per forward pass.
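A useful way to compare these options is to convert hourly price and sustained throughput into cost per million generated tokens. The throughput figures below are illustrative assumptions for batched serving, not benchmarks - substitute your own measured numbers:

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Convert an hourly GPU price and sustained throughput into $/1M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000

# Assumed sustained batched throughput (illustrative, not measured):
rtx4090_cost = cost_per_million_tokens(0.29, 1500)  # Llama 8B AWQ on RTX 4090
a100_cost = cost_per_million_tokens(0.79, 2000)     # Llama 8B FP16 on A100 80GB
```

Run this with your own throughput measurements from load testing (Step 7) and the hardware decision usually makes itself.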

Step 4: Go Multi-Provider From Day One

Don't launch on a single GPU provider. This is the most important production advice in this entire guide - and the advice that most teams ignore until after their first provider incident.

GPU capacity is not infinitely elastic. Providers run out of the specific SKU you need. They have incidents. They change spot pricing without notice. Every team searching for the best platform to run Llama 3 in production eventually discovers this - ideally before a production outage, not during one.

OneInfer's unified AI inference API is OpenAI-compatible, meaning you write your inference code once and the platform handles provider routing, failover, and cost optimization transparently. You're never rewriting provider-specific integration code, and you're never stuck during a provider incident with no alternative path.
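The failover pattern itself is worth understanding even if a platform handles it for you. Here is a generic client-side sketch - not OneInfer's actual routing logic, and the provider URLs are hypothetical placeholders:

```python
def complete_with_failover(prompt, provider_urls, call):
    """Try each OpenAI-compatible base URL in order; return the first success.

    `call(base_url, prompt)` performs the actual HTTP request (e.g. an OpenAI
    client configured with that base_url) and raises on failure.
    """
    last_err = None
    for base_url in provider_urls:
        try:
            return call(base_url, prompt)
        except Exception as err:  # provider incident, sold-out capacity, timeout
            last_err = err
    raise RuntimeError("all providers failed") from last_err

# Demo with a stub transport: the first (hypothetical) provider is down.
def stub_call(base_url, prompt):
    if base_url == "https://provider-a.example/v1":
        raise ConnectionError("provider incident")
    return f"completion from {base_url}"

result = complete_with_failover(
    "hello",
    ["https://provider-a.example/v1", "https://provider-b.example/v1"],
    stub_call,
)
```

The key design point: because every provider speaks the same OpenAI-compatible schema, failover is a URL swap, not an integration rewrite.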

This is what separates a production-grade AI model deployment platform from a collection of cloud accounts.

Step 5: Implement Production Observability

Your inference system needs four metric categories before going live:

Latency: Time-to-first-token (TTFT), inter-token latency, and total generation time. Track P50, P95, and P99 independently. P99 is your user experience floor.

Throughput: Requests/second, tokens generated/second, batch size distribution. These signals tell you whether you're compute-bound or I/O-bound under load.

Cost: Cost per 1,000 tokens, cost per request, GPU utilization percentage. Without cost metrics, you cannot measure optimization progress or catch cost regressions.

Quality: This is the one most teams skip. Sample 1-5% of production requests and evaluate outputs for format compliance, length distribution, and task-specific correctness. Arize AI and Weights & Biases both support production LLM monitoring with this level of observability.
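Computing the latency percentiles above from a sample window is straightforward. A minimal nearest-rank sketch follows - production systems typically use histogram-based estimators instead, but the semantics are the same:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: p in (0, 100] over a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# TTFT samples in milliseconds from a (synthetic) 10-request window
latencies_ms = [120, 95, 110, 480, 105, 98, 130, 101, 99, 1250]
p50 = percentile(latencies_ms, 50)  # the typical user
p99 = percentile(latencies_ms, 99)  # the worst user - your experience floor
```

Note how far P99 sits from P50 in even this tiny sample: averaging these values together is exactly how tail latency problems stay hidden.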

Step 6: Configure Proactive Autoscaling

GPU autoscaling is 10-100x slower than CPU autoscaling. A new GPU instance takes 30-120 seconds to be ready for traffic. This means you must scale proactively on leading indicators, not reactively on lagging ones.

Configure your autoscaler to trigger on queue wait time rather than GPU utilization. When request queue wait time exceeds 500ms - before users feel the impact - trigger a scale-up event. By the time the new instance is warm, you'll need it.

Maintain a minimum of one always-warm instance unless traffic genuinely drops to zero for sustained periods. The cost of one idle RTX 4090-hour ($0.29) is almost always less than the user experience cost of cold-start churn.
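The queue-wait trigger can be expressed as a simple decision function. The 500ms scale-up threshold and the always-warm floor come from the discussion above; the 50ms scale-in threshold is an illustrative hysteresis value, not a recommendation:

```python
def scale_decision(queue_wait_ms: float, instances: int, min_instances: int = 1) -> int:
    """Return the desired instance count from queue wait time (a leading indicator)."""
    if queue_wait_ms > 500:  # users will feel this in 60-120s: scale out now
        return instances + 1
    if queue_wait_ms < 50 and instances > min_instances:  # sustained idle: scale in
        return instances - 1
    return instances  # within the comfort band: hold steady
```

In practice you would evaluate this against a smoothed queue-wait signal (e.g. a 30-second rolling P95) rather than instantaneous readings, to avoid flapping.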

Step 7: Load Test Before Launch

Before real users hit your inference endpoint, run a structured step-load test.

Start at 10% of expected peak traffic, hold for 5 minutes, and check GPU utilization, latency distribution, and error rate. Step to 25%, hold, check. Continue to 50%, 75%, 100%. The step-up approach catches batching configuration issues, KV cache exhaustion under load, and memory leaks - all of which are invisible at low traffic and catastrophic at scale.
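The step schedule above is easy to generate programmatically and feed to whatever load tool you use; peak rate and hold duration are parameters you choose for your workload:

```python
def step_load_plan(peak_rps: float, hold_seconds: int = 300,
                   steps=(0.10, 0.25, 0.50, 0.75, 1.00)):
    """Return (target_rps, hold_seconds) stages for a step-up load test."""
    return [(round(peak_rps * frac, 2), hold_seconds) for frac in steps]

plan = step_load_plan(peak_rps=40)  # e.g. 40 req/s expected peak
# Stages: 4 -> 10 -> 20 -> 30 -> 40 req/s, each held for 5 minutes
```

Between stages, gate progression on your error rate and P99 latency: if either degrades at 50%, there is no point discovering what happens at 100% in front of users.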

Locust is the Python-native standard for this. For LLM-specific load testing, use realistic prompts from your actual use case - token length distribution matters significantly for batching behavior and can't be faked with uniform synthetic prompts.

Going from zero to production-grade LLM deployment has gotten significantly more tractable over the past year - but it still requires deliberate infrastructure decisions at every layer. Get those decisions right from the start, and your infrastructure becomes a competitive moat. Get them wrong, and you're re-architecting under live traffic.

For a fast path to production-grade open source model deployment across multiple GPU providers, explore OneInfer's platform or contact the team.

© 2025 OneInfer.AI - AI Inference Platform