Model Lifecycle
OneInfer covers the full model lifecycle beyond just inference. Evaluate models against standard and custom benchmarks, fine-tune them on your data, optimise inference kernels for your target hardware, and model expected throughput before you spend a dollar on compute — all from the same platform.
Evaluations & Benchmarking
Run structured evaluations against any model to measure accuracy, reasoning ability, and task-specific performance. Use built-in benchmark datasets or upload your own. Compare results across multiple models side by side to inform model and provider selection.
Standard benchmarks
MMLU, HellaSwag, MATH, TruthfulQA, and other widely-used evaluation datasets are available out of the box.
Custom datasets
Upload your own evaluation set with expected outputs and a scoring rubric (automatic, manual, or custom).
Multi-model comparison
Run the same benchmark across multiple models simultaneously and view a side-by-side results table.
Historical tracking
Every evaluation job is stored with its full results, so you can track model performance over time.
Sampling & preview
Preview a sample of any benchmark before committing to a full evaluation run.
Progress callbacks
Evaluation jobs are async. Poll for status or receive progress updates via webhook (see the sketch below).
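For reference, here is a minimal submit-and-poll sketch of an evaluation run. OneInfer's real endpoint paths, field names, and status values may differ; everything in the snippet (the base URL, `/v1/evaluations`, the `benchmark`, `models`, and `sample_limit` fields) is an illustrative assumption rather than the documented API.

```python
import time
import requests

API_BASE = "https://api.oneinfer.example/v1"   # placeholder base URL
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}

# Submit one benchmark run across two models at once.
# Endpoint path and field names are illustrative assumptions.
resp = requests.post(
    f"{API_BASE}/evaluations",
    headers=HEADERS,
    json={
        "benchmark": "mmlu",               # built-in dataset
        "models": ["model-a", "model-b"],  # multi-model comparison
        "sample_limit": 50,                # preview-sized run
    },
)
resp.raise_for_status()
job_id = resp.json()["id"]                 # assumed response field

# Jobs are async: poll until the run reaches a terminal state.
# (A webhook URL could be registered instead of polling.)
while True:
    job = requests.get(f"{API_BASE}/evaluations/{job_id}", headers=HEADERS).json()
    if job["status"] in ("succeeded", "failed"):
        break
    time.sleep(10)

print(job["results"])  # per-model scores for the side-by-side table
```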
Fine-tuning
Fine-tune foundation models on your domain-specific data using Unsloth, an optimised fine-tuning framework that significantly reduces GPU memory usage and training time compared to standard approaches. Submit a job, monitor its progress, download the output adapter, and deploy it to a dedicated endpoint.
Unsloth backend
Faster training with lower VRAM usage than standard LoRA fine-tuning — typically 2–5× faster per step.
Job management
Submit jobs, poll for status, view logs, and retrieve output files through a consistent API, as sketched after this list.
Output adapters
Download the trained LoRA adapter weights for use with your own infrastructure.
Direct deployment
Deploy the fine-tuned model directly to a OneInfer dedicated endpoint without any additional setup.
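A sketch of that lifecycle end to end, under the same caveat: the `/v1/fine-tunes` path, field names, status values, and model IDs below are assumptions for illustration, not OneInfer's documented API.

```python
import time
import requests

API_BASE = "https://api.oneinfer.example/v1"   # placeholder base URL
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}

# Submit a fine-tuning job against a previously uploaded dataset.
# Path, fields, and IDs are illustrative assumptions.
job = requests.post(
    f"{API_BASE}/fine-tunes",
    headers=HEADERS,
    json={
        "base_model": "example-base-8b",   # hypothetical base model ID
        "training_file": "file_abc123",    # hypothetical uploaded dataset ID
    },
).json()

# Poll until training finishes (a webhook could replace polling here too).
while job["status"] not in ("succeeded", "failed"):
    time.sleep(30)
    job = requests.get(f"{API_BASE}/fine-tunes/{job['id']}", headers=HEADERS).json()

# Option 1: download the trained LoRA adapter for your own infrastructure.
# (Hypothetical download path.)
adapter = requests.get(f"{API_BASE}/fine-tunes/{job['id']}/adapter", headers=HEADERS)
with open("adapter.safetensors", "wb") as f:
    f.write(adapter.content)

# Option 2: deploy the fine-tuned model straight to a dedicated endpoint.
requests.post(
    f"{API_BASE}/endpoints",
    headers=HEADERS,
    json={"model": job["fine_tuned_model"]},  # assumed response field
)
```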
Kernel Optimisation
Inference performance varies significantly based on how well the model's compute graph is compiled for a given GPU. OneInfer's kernel optimisation service automatically generates hardware-specific inference kernels for your model and target GPU SKU — reducing latency and increasing throughput without any manual tuning.
Automatic
Submit a model and target GPU. The service generates and validates optimised kernel configs.
Async job
Kernel generation runs as a background job. Poll the status endpoint until completion, as in the sketch below.
Apply to endpoints
Once generated, the optimised kernel is applied to your dedicated endpoint automatically.
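The same async job pattern applies here. As before, the `/v1/kernel-optimizations` path, field names, and IDs are illustrative assumptions, not the documented API.

```python
import time
import requests

API_BASE = "https://api.oneinfer.example/v1"   # placeholder base URL
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}

# Request optimised kernels for a model on a specific GPU SKU.
job = requests.post(
    f"{API_BASE}/kernel-optimizations",
    headers=HEADERS,
    json={"model": "example-base-8b", "gpu": "H100-SXM"},  # hypothetical IDs
).json()

# Kernel generation runs in the background: poll until it completes.
while job["status"] not in ("succeeded", "failed"):
    time.sleep(30)
    job = requests.get(
        f"{API_BASE}/kernel-optimizations/{job['id']}", headers=HEADERS
    ).json()

# On success, the optimised kernel is applied to the dedicated
# endpoint automatically; no further action is needed here.
print(job["status"])
```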
Hardware Intelligence
Before provisioning compute, use OneInfer's hardware intelligence tools to model expected throughput and latency for a given workload on any supported GPU SKU. Estimates are based on roofline performance analysis across a database of 2,800+ GPU configurations covering NVIDIA, AMD, Intel, Google TPUs, and edge hardware; a simplified sketch of the underlying arithmetic follows the feature list below.
Throughput estimation
Estimate tokens/second for a given model and batch size on any GPU before paying for it.
Bottleneck analysis
Understand whether a workload is compute-bound or memory-bandwidth-bound on specific hardware.
GPU comparison
Compare VRAM, memory bandwidth, TFLOPS, and estimated cost efficiency across 2,800+ SKUs.
Hardware coverage
NVIDIA (data centre & consumer), AMD ROCm, Intel Arc, Google TPU v4/v5, Jetson, Hailo NPUs.
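To make the estimates concrete, here is a first-order version of the roofline arithmetic behind a decode throughput estimate. This is a deliberately simplified sketch, not OneInfer's actual model: it assumes FP16 weights and roughly 2 FLOPs per parameter per generated token, and it ignores KV-cache traffic, kernel efficiency, and multi-GPU overheads. The example GPU figures approximate an H100-class part.

```python
# First-order roofline estimate for autoregressive decode throughput.

def estimate_decode_tokens_per_s(
    params_b: float,        # model size in billions of parameters
    batch_size: int,        # sequences decoding concurrently
    peak_tflops: float,     # peak dense FP16 throughput, TFLOPS
    bandwidth_gb_s: float,  # memory bandwidth, GB/s
) -> dict:
    weight_bytes = params_b * 1e9 * 2      # FP16: 2 bytes per parameter
    flops_per_token = 2 * params_b * 1e9   # ~2 FLOPs per parameter per token

    # Memory bound: each decode step streams all weights once and
    # yields one token per sequence in the batch.
    mem_bound = batch_size * bandwidth_gb_s * 1e9 / weight_bytes
    # Compute bound: total FLOP budget divided by FLOPs per token.
    compute_bound = peak_tflops * 1e12 / flops_per_token

    bottleneck = "memory-bandwidth" if mem_bound < compute_bound else "compute"
    return {"tokens_per_s": min(mem_bound, compute_bound), "bottleneck": bottleneck}

# An 8B-parameter model at batch 8 on a 989 TFLOPS / 3350 GB/s GPU:
# decode is memory-bandwidth-bound at roughly 1,700 tokens/s.
print(estimate_decode_tokens_per_s(params_b=8, batch_size=8,
                                   peak_tflops=989, bandwidth_gb_s=3350))
```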
Next Steps
- Deploy a fine-tuned model to a Dedicated Endpoint.
- Use the GET Models API to discover available base models for fine-tuning.
- Explore the Evaluations console to benchmark before and after fine-tuning.