Infrastructure
Beyond serverless inference, OneInfer gives you full control over the compute layer. Spin up GPU instances across 10+ cloud providers, deploy models to dedicated endpoints, configure intelligent load-balanced routing, and attach persistent storage — all managed through the same API surface.
GPU Instances
Provision on-demand GPU instances across multiple cloud providers with a single API call. Choose the GPU SKU, disk size, Docker image, and region. Instances support SSH access, custom startup scripts, and can be started, stopped, or restarted programmatically.
Providers
Novita, Together AI, RunPod, Hyperbolic, Fireworks, Vultr, Azure, Nebius, E2E Networks, Primeintellect, Verda.
SSH access
Each instance exposes an SSH endpoint. Upload your public key once and reuse it across instances.
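As a quick sketch of that workflow (the key path and instance address below are placeholders), you generate a keypair locally, upload the public half once, and reuse the private half for every instance:

```shell
# Generate a dedicated keypair for OneInfer instances (no passphrase, quiet mode).
ssh-keygen -t ed25519 -f ./oneinfer_key -N "" -q

# This is the public key you upload once and reuse across instances.
cat ./oneinfer_key.pub

# Connect to any instance with the private half. <instance-ip> and the
# username are placeholders — use the values shown for your instance.
# ssh -i ./oneinfer_key root@<instance-ip>
```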
Custom images
Pass any Docker image URL — your own registry or a public image with pre-installed dependencies.
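For example, a custom image is just an ordinary Docker build pushed to a registry you control; the registry URL below is a placeholder, and the pushed image URL is what you pass as `image_url` when creating an instance:

```shell
# Bake dependencies into a custom image on top of a public base.
cat > Dockerfile <<'EOF'
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
RUN pip install --no-cache-dir transformers accelerate
EOF

# Build and push to your own registry (placeholder URL shown), then use the
# pushed image URL as "image_url" in the create-instance request:
# docker build -t registry.example.com/team/inference:latest .
# docker push registry.example.com/team/inference:latest
```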
# Create a GPU instance
curl -X POST "https://api.oneinfer.ai/v1/developer/{developerId}/create-instance" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "provider_name": "novita",
    "instance_name": "my-inference-node",
    "gpu_id": "nvidia_a100_80gb",
    "gpu_num": 1,
    "disk_size": 100,
    "image_url": "pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime",
    "region": "us-west-1"
  }'
Dedicated Endpoints
A dedicated endpoint pins inference traffic to a specific model deployment running on a GPU instance you control. Unlike serverless inference, dedicated endpoints give you guaranteed capacity, predictable cold-start behaviour, and the ability to run private or fine-tuned models.
Predictable latency
No shared-resource contention. Traffic goes only to your deployment.
Private models
Run fine-tuned or proprietary models that are not available through the public provider APIs.
Endpoint ID
Pass the endpoint_id field in any inference request to pin it to a specific deployment.
vLLM compatible
Deployments use vLLM by default, so any vLLM-compatible model works out of the box.
# Use a dedicated endpoint in a chat completion
curl -X POST "https://api.oneinfer.ai/v1/ula/chat/completions" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "endpoint_id": "ep_abc123",
    "messages": [{ "role": "user", "content": "Hello" }]
  }'
Intelligent Routing
An intelligent endpoint sits in front of a pool of dedicated endpoints and automatically routes each request to the best available backend — based on latency, cost, and current load. This gives you horizontal scale across multiple GPU instances without any client-side logic.
Load balancing
Traffic is spread across healthy backends automatically.
Cost optimisation
The router prefers lower-cost backends when latency budgets allow.
Failover
Unhealthy backends are removed from the pool and re-added when they recover.
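As an illustrative sketch (the exact request shape is an assumption, and "iep_xyz789" is a placeholder ID), targeting an intelligent endpoint looks the same as targeting a dedicated one: pass its ID in the inference request and the router picks a backend per request.

```shell
# Route a chat completion through an intelligent endpoint (illustrative).
# The router chooses the backend; no client-side logic is needed.
curl -X POST "https://api.oneinfer.ai/v1/ula/chat/completions" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "endpoint_id": "iep_xyz789",
    "messages": [{ "role": "user", "content": "Hello" }]
  }'
```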
Persistent Storage
Create cloud storage volumes to persist model weights, datasets, checkpoints, and generated outputs across instance restarts. Storage is billed per GB and can be attached to any GPU instance.
Volume lifecycle
Create, list, inspect, and delete volumes via the Storage API.
Attach to instances
Mount a storage volume when creating or reconfiguring an instance.
Per-GB billing
You are charged only for the storage you provision, tracked through the credit system.
Use cases
Model weight caching, training dataset storage, fine-tune checkpoint persistence, output archiving.
# Create a storage volume
curl -X POST "https://api.oneinfer.ai/v1/storage" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "model-weights",
    "size_gb": 200,
    "region": "us-west-1"
  }'
Next Steps
- See full schemas in the Instance APIs and Storage APIs reference.
- Explore the GPU Marketplace to compare providers and pricing.
- Learn about model deployment in the Model Lifecycle section.