Infrastructure

Beyond serverless inference, OneInfer gives you full control over the compute layer. Spin up GPU instances across 10+ cloud providers, deploy models to dedicated endpoints, configure intelligent load-balanced routing, and attach persistent storage — all managed through the same API surface.

GPU Instances

Provision on-demand GPU instances across multiple cloud providers with a single API call. Choose the GPU SKU, disk size, Docker image, and region. Instances support SSH access, custom startup scripts, and can be started, stopped, or restarted programmatically.

Providers

Novita, Together AI, RunPod, Hyperbolic, Fireworks, Vultr, Azure, Nebius, E2E Networks, Primeintellect, Verda.

SSH access

Each instance exposes an SSH endpoint. Upload your public key once and reuse it across instances.
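The key-upload route below is an assumption modelled on the create-instance endpoint shown later in this page — the path `/ssh-keys` and the field names `name` and `public_key` are illustrative, not confirmed; check the API reference for the exact shape.

```shell
# Upload an SSH public key once, then reuse it across instances
# (hypothetical route and field names)
curl -X POST "https://api.oneinfer.ai/v1/developer/{developerId}/ssh-keys" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "laptop-key",
    "public_key": "ssh-ed25519 AAAA... user@host"
  }'
```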

Custom images

Pass any Docker image URL — your own registry or a public image with pre-installed dependencies.

# Create a GPU instance
curl -X POST "https://api.oneinfer.ai/v1/developer/{developerId}/create-instance" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "provider_name": "novita",
    "instance_name": "my-inference-node",
    "gpu_id": "nvidia_a100_80gb",
    "gpu_num": 1,
    "disk_size": 100,
    "image_url": "pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime",
    "region": "us-west-1"
  }'
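Instances can also be started, stopped, or restarted programmatically. The lifecycle routes below are assumptions patterned on the create-instance endpoint above — verify the exact paths in the API reference.

```shell
# Stop, then start, a running instance
# (hypothetical routes; {instanceId} comes from the create-instance response)
curl -X POST "https://api.oneinfer.ai/v1/developer/{developerId}/instances/{instanceId}/stop" \
  -H "Authorization: Bearer YOUR_TOKEN"

curl -X POST "https://api.oneinfer.ai/v1/developer/{developerId}/instances/{instanceId}/start" \
  -H "Authorization: Bearer YOUR_TOKEN"
```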

Browse available GPUs, compare pricing across providers, and check real-time availability in the GPU Marketplace console.

Dedicated Endpoints

A dedicated endpoint pins inference traffic to a specific model deployment running on a GPU instance you control. Unlike serverless inference, dedicated endpoints give you guaranteed capacity, predictable cold-start behaviour, and the ability to run private or fine-tuned models.

Predictable latency

No shared-resource contention. Traffic goes only to your deployment.

Private models

Run fine-tuned or proprietary models that are not available through the public provider APIs.

Endpoint ID

Pass the endpoint_id field in any inference request to pin it to a specific deployment.

vLLM compatible

Deployments use vLLM by default, so any vLLM-compatible model works out of the box.

# Use a dedicated endpoint in a chat completion
curl -X POST "https://api.oneinfer.ai/v1/ula/chat/completions" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "endpoint_id": "ep_abc123",
    "messages": [{ "role": "user", "content": "Hello" }]
  }'

Intelligent Routing

An intelligent endpoint sits in front of a pool of dedicated endpoints and automatically routes each request to the best available backend — based on latency, cost, and current load. This gives you horizontal scale across multiple GPU instances without any client-side logic.

Load balancing

Traffic is spread across healthy backends automatically.

Cost optimisation

The router prefers lower-cost backends when latency budgets allow.

Failover

Unhealthy backends are removed from the pool and re-added when they recover.

Manage intelligent endpoints in the console or via the Intelligent Endpoints API.
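As a sketch of what creating an intelligent endpoint over a pool of dedicated endpoints might look like — the route, payload fields (`endpoint_ids`, `strategy`), and strategy name here are assumptions, not the confirmed API; consult the Intelligent Endpoints API reference for the real shape.

```shell
# Create an intelligent endpoint that load-balances across two
# dedicated endpoints (hypothetical route and payload)
curl -X POST "https://api.oneinfer.ai/v1/intelligent-endpoints" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "chat-pool",
    "endpoint_ids": ["ep_abc123", "ep_def456"],
    "strategy": "latency_cost"
  }'
```

Clients then send requests to the intelligent endpoint exactly as they would to a single dedicated endpoint; the router picks the backend.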

Persistent Storage

Create cloud storage volumes to persist model weights, datasets, checkpoints, and generated outputs across instance restarts. Storage is billed per GB and can be attached to any GPU instance.

Volume lifecycle

Create, list, inspect, and delete volumes via the Storage API.

Attach to instances

Mount a storage volume when creating or reconfiguring an instance.

Per-GB billing

You are charged only for the storage you provision, tracked through the credit system.

Use cases

Model weight caching, training dataset storage, fine-tune checkpoint persistence, output archiving.

# Create a storage volume
curl -X POST "https://api.oneinfer.ai/v1/storage" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "model-weights",
    "size_gb": 200,
    "region": "us-west-1"
  }'
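A volume can be mounted when creating an instance. The sketch below extends the create-instance payload from the GPU Instances section; the `storage_id` and `mount_path` fields (and the `vol_xyz789` ID) are assumptions for illustration — verify the attachment mechanism in the Storage API reference.

```shell
# Create an instance with a storage volume attached
# (hypothetical "storage_id" and "mount_path" fields)
curl -X POST "https://api.oneinfer.ai/v1/developer/{developerId}/create-instance" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "provider_name": "novita",
    "instance_name": "training-node",
    "gpu_id": "nvidia_a100_80gb",
    "gpu_num": 1,
    "disk_size": 100,
    "image_url": "pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime",
    "region": "us-west-1",
    "storage_id": "vol_xyz789",
    "mount_path": "/mnt/model-weights"
  }'
```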

Next Steps