AI Models
Browse our collection of state-of-the-art AI models
One API. 200+ models. No infrastructure management. Browse the full catalog below and start making inference calls in minutes.
200+ models available. From frontier LLMs to specialised vision and audio models, all accessible through one unified endpoint.
<500ms median latency. Custom CUDA kernel optimisation and intelligent multi-cloud routing keep time-to-first-token consistently low.
Up to 80% cost reduction. Teams migrating from direct cloud GPU billing to OneInfer's subscription model typically cut inference spend by 60–80%.
1 line to switch models. Swap from GPT-4o to Llama 3, or from Whisper to a custom ASR model, with a single parameter change.
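The one-parameter switch can be sketched as follows. This is a minimal illustration assuming an OpenAI-compatible chat-completions payload; the helper function and the exact model identifiers are hypothetical, not documented OneInfer values.

```python
# Sketch: switching models is a single-parameter change.
# The payload schema assumes an OpenAI-compatible chat-completions format;
# model ids ("gpt-4o", "llama-3-70b") are illustrative placeholders.

def build_chat_request(model: str, prompt: str) -> dict:
    """Build a chat-completions payload; only `model` varies between providers."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

gpt_request = build_chat_request("gpt-4o", "Summarise this document.")
llama_request = build_chat_request("llama-3-70b", "Summarise this document.")

# The two requests are identical apart from the model identifier.
```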
OneInfer hosts the full range of major large language models, including GPT-4o, Claude 3, Llama 3, Mistral, Gemma, and dozens of fine-tuned variants. Every text model is available through the same chat completions API endpoint, with per-token pricing that scales linearly and no cold-start penalty. For production workloads requiring consistent throughput, dedicated endpoints isolate your traffic from shared capacity and guarantee SLA-backed response times.
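A request to the shared chat completions endpoint might look like the sketch below. The base URL, header names, and request fields are assumptions in the common OpenAI-compatible style, not confirmed OneInfer API details; substitute the values from your own account.

```python
import json

# Hypothetical request shape for the unified chat completions endpoint.
# URL and schema are assumed (OpenAI-compatible style), not documented values.
API_URL = "https://api.oneinfer.example/v1/chat/completions"  # placeholder URL
API_KEY = "YOUR_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
payload = {
    "model": "claude-3-sonnet",  # any model id from the catalog
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 256,
}
body = json.dumps(payload)

# An HTTP client of your choice would POST `body` with `headers` to API_URL,
# e.g. requests.post(API_URL, headers=headers, data=body).
```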
Vision-language models like GPT-4V, LLaVA, and CogVLM are available for image understanding, document analysis, and visual question answering. Image generation models including Stable Diffusion XL, FLUX, and DALL-E 3 are exposed via a standardised image generation endpoint. All vision models accept base64-encoded or URL-referenced images, and the response format is consistent across providers so switching models requires no code changes beyond the model identifier.
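The two image input forms mentioned above (URL-referenced and base64-encoded) can be sketched like this. The content-part field names follow the widely used OpenAI-style vision schema and should be treated as assumptions rather than a documented OneInfer structure.

```python
import base64

# Sketch of the two image input forms: URL-referenced and base64 data URI.
# Field names ("type", "image_url") follow a common vision-message schema
# and are assumptions, not confirmed OneInfer fields.

def image_part_from_url(url: str) -> dict:
    """Reference an image by URL."""
    return {"type": "image_url", "image_url": {"url": url}}

def image_part_from_bytes(data: bytes, mime: str = "image/png") -> dict:
    """Embed raw image bytes as a base64 data URI."""
    b64 = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        image_part_from_url("https://example.com/chart.png"),
    ],
}
```

Because the response format is consistent across providers, the same message structure works whether the `model` field names GPT-4V, LLaVA, or CogVLM.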
Speech-to-text models including Whisper large-v3, Distil-Whisper, and provider-specific ASR models are accessible for transcription, translation, and voice activity detection. Text-to-speech synthesis is available with multiple voice options and sample rates. All audio models accept standard audio formats — MP3, WAV, FLAC, and OGG — and return structured JSON responses with transcripts, timestamps, and confidence scores where supported.
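A transcription response with segments, timestamps, and confidence scores might be consumed as in this sketch. The field names (`text`, `segments`, `start`, `end`, `confidence`) mirror common Whisper-style output and are assumptions, not a documented OneInfer schema.

```python
# Hypothetical transcription response in a Whisper-style shape; field names
# are assumptions for illustration, not a confirmed OneInfer schema.
sample_response = {
    "text": "Hello world.",
    "segments": [
        {"start": 0.0, "end": 1.2, "text": "Hello world.", "confidence": 0.97},
    ],
}

def low_confidence_segments(resp: dict, threshold: float = 0.8) -> list:
    """Return segments whose reported confidence falls below the threshold.

    Segments without a confidence score (the text notes scores are only
    returned "where supported") are treated as acceptable.
    """
    return [
        s for s in resp.get("segments", [])
        if s.get("confidence", 1.0) < threshold
    ]
```

A filter like this is a simple way to flag stretches of audio worth re-transcribing or reviewing manually.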
Embedding models including text-embedding-3-large, Cohere Embed v3, and BGE are available for semantic search, retrieval-augmented generation (RAG), and clustering workloads. OneInfer's embedding endpoint returns normalised vectors in a consistent schema regardless of which provider generates them, so you can benchmark models against each other without rewriting your vector store integration. Batch embedding requests are supported for processing large document corpora efficiently.
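One practical consequence of the endpoint returning normalised vectors: cosine similarity reduces to a plain dot product, so ranking retrieved documents needs no extra normalisation step. The vectors below are toy values, not real model output.

```python
import math

# For unit-length vectors, cosine similarity == dot product.
# Toy vectors below stand in for real embedding output.

def dot(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))

def normalise(v: list) -> list:
    """Scale a vector to unit length (the form the endpoint already returns)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

doc = normalise([0.2, 0.7, 0.1])
query = normalise([0.25, 0.6, 0.05])
similarity = dot(query, doc)  # already the cosine similarity for unit vectors
```

Because the schema is identical across providers, the same scoring code works when benchmarking text-embedding-3-large against Cohere Embed v3 or BGE.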
Every model has its per-token or per-request pricing shown directly in the catalog below. There are no egress fees, no minimum spend commitments on the pay-as-you-go tier, and no markup for routing across providers. Subscription plans offer a monthly credit allocation at a predictable flat rate, which is how most production teams cut their AI infrastructure bill by 60% or more compared with paying cloud providers directly.
Pricing is updated in real time as providers adjust their rates. The catalog always reflects the current cost, so you can make accurate cost projections before committing to a model. For high-volume workloads, dedicated endpoint pricing is available on request and typically offers a further 20–30% reduction over pay-as-you-go rates.
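A cost projection from the catalog's per-token rates can be a simple back-of-the-envelope calculation. The rates in this sketch are placeholders, not actual OneInfer prices; plug in the figures shown in the catalog for your chosen model.

```python
# Back-of-the-envelope monthly cost projection from per-token pricing.
# The rates used below are placeholder values, not real catalog prices.

def monthly_cost(requests_per_day: int, in_tokens: int, out_tokens: int,
                 price_in_per_1k: float, price_out_per_1k: float,
                 days: int = 30) -> float:
    """Estimate monthly spend from per-1k-token input/output prices."""
    per_request = (in_tokens / 1000) * price_in_per_1k \
                + (out_tokens / 1000) * price_out_per_1k
    return requests_per_day * days * per_request

# e.g. 10k requests/day, 500 input + 200 output tokens each,
# at placeholder rates of $0.005 / $0.015 per 1k tokens:
estimate = monthly_cost(10_000, 500, 200, 0.005, 0.015)  # -> 1650.0 (USD)
```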