Model Lifecycle
OneInfer covers the full model lifecycle beyond just inference. Evaluate models against standard and custom benchmarks, fine-tune them on your data, optimise inference kernels for your target hardware, and model expected throughput before you spend a dollar on compute — all from the same platform.
Evaluations & Benchmarking
Run structured evaluations against any model to measure accuracy, reasoning ability, and task-specific performance. Use built-in benchmark datasets or upload your own. Compare results across multiple models side by side to inform model and provider selection.
Standard benchmarks
MMLU, HellaSwag, MATH, TruthfulQA, and other widely-used evaluation datasets are available out of the box.
Custom datasets
Upload your own evaluation set with expected outputs and a scoring rubric (automatic, manual, or custom).
Multi-model comparison
Run the same benchmark across multiple models simultaneously and view a side-by-side results table.
Historical tracking
Every evaluation job is stored with its full results, so you can track model performance over time.
Sampling & preview
Preview a sample of any benchmark before committing to a full evaluation run.
Progress callbacks
Evaluation jobs are async. Poll for status or receive progress updates via webhook (see the sketch below).
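For reference, here is a minimal submit-and-poll sketch of an evaluation run. OneInfer's real endpoint paths, field names, and status values may differ; everything in the snippet (the base URL, `/v1/evaluations`, the `benchmark`, `models`, and `sample_limit` fields) is an illustrative assumption rather than the documented API.

```python
import time
import requests

API_BASE = "https://api.oneinfer.example/v1"   # placeholder base URL
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}

# Submit one benchmark run across two models at once.
# Endpoint path and field names are illustrative assumptions.
resp = requests.post(
    f"{API_BASE}/evaluations",
    headers=HEADERS,
    json={
        "benchmark": "mmlu",               # built-in dataset
        "models": ["model-a", "model-b"],  # multi-model comparison
        "sample_limit": 50,                # preview-sized run
    },
)
resp.raise_for_status()
job_id = resp.json()["id"]                 # assumed response field

# Jobs are async: poll until the run reaches a terminal state.
# (A webhook URL could be registered instead of polling.)
while True:
    job = requests.get(f"{API_BASE}/evaluations/{job_id}", headers=HEADERS).json()
    if job["status"] in ("succeeded", "failed"):
        break
    time.sleep(10)

print(job["results"])  # per-model scores for the side-by-side table
```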
Fine-tuning
Fine-tune foundation models on your domain-specific data using Unsloth, an optimised fine-tuning framework that significantly reduces GPU memory usage and training time compared to standard approaches. Submit a job, monitor its progress, download the output adapter, and deploy it to a dedicated endpoint.
Unsloth backend
Faster training with lower VRAM usage than standard LoRA fine-tuning — typically 2–5× faster per step.
Job management
Submit jobs, poll for status, view logs, and retrieve output files through a consistent API, as sketched after this list.
Output adapters
Download the trained LoRA adapter weights for use with your own infrastructure.
Direct deployment
Deploy the fine-tuned model directly to a OneInfer dedicated endpoint without any additional setup.
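A sketch of that lifecycle end to end, under the same caveat: the `/v1/fine-tunes` path, field names, status values, and model IDs below are assumptions for illustration, not OneInfer's documented API.

```python
import time
import requests

API_BASE = "https://api.oneinfer.example/v1"   # placeholder base URL
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}

# Submit a fine-tuning job against a previously uploaded dataset.
# Path, fields, and IDs are illustrative assumptions.
job = requests.post(
    f"{API_BASE}/fine-tunes",
    headers=HEADERS,
    json={
        "base_model": "example-base-8b",   # hypothetical base model ID
        "training_file": "file_abc123",    # hypothetical uploaded dataset ID
    },
).json()

# Poll until training finishes (a webhook could replace polling here too).
while job["status"] not in ("succeeded", "failed"):
    time.sleep(30)
    job = requests.get(f"{API_BASE}/fine-tunes/{job['id']}", headers=HEADERS).json()

# Option 1: download the trained LoRA adapter for your own infrastructure.
# (Hypothetical download path.)
adapter = requests.get(f"{API_BASE}/fine-tunes/{job['id']}/adapter", headers=HEADERS)
with open("adapter.safetensors", "wb") as f:
    f.write(adapter.content)

# Option 2: deploy the fine-tuned model straight to a dedicated endpoint.
requests.post(
    f"{API_BASE}/endpoints",
    headers=HEADERS,
    json={"model": job["fine_tuned_model"]},  # assumed response field
)
```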
Kernel Optimisation
Inference performance varies significantly based on how well the model's compute graph is compiled for a given GPU. OneInfer's kernel optimisation service automatically generates hardware-specific inference kernels for your model and target GPU SKU — reducing latency and increasing throughput without any manual tuning.
Automatic
Submit a model and target GPU. The service generates and validates optimised kernel configs.
Async job
Kernel generation runs as a background job. Poll the status endpoint until completion, as in the sketch below.
Apply to endpoints
Once generated, the optimised kernel is applied to your dedicated endpoint automatically.
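The same async job pattern applies here. As before, the `/v1/kernel-optimizations` path, field names, and IDs are illustrative assumptions, not the documented API.

```python
import time
import requests

API_BASE = "https://api.oneinfer.example/v1"   # placeholder base URL
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}

# Request optimised kernels for a model on a specific GPU SKU.
job = requests.post(
    f"{API_BASE}/kernel-optimizations",
    headers=HEADERS,
    json={"model": "example-base-8b", "gpu": "H100-SXM"},  # hypothetical IDs
).json()

# Kernel generation runs in the background: poll until it completes.
while job["status"] not in ("succeeded", "failed"):
    time.sleep(30)
    job = requests.get(
        f"{API_BASE}/kernel-optimizations/{job['id']}", headers=HEADERS
    ).json()

# On success, the optimised kernel is applied to the dedicated
# endpoint automatically; no further action is needed here.
print(job["status"])
```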
Hardware Intelligence
Before provisioning compute, use OneInfer's hardware intelligence tools to model expected throughput and latency for a given workload on any supported GPU SKU. Estimates are based on roofline performance analysis across a database of 2,800+ GPU configurations covering NVIDIA, AMD, Intel, Google TPUs, and edge hardware; a simplified sketch of the underlying arithmetic follows the feature list below.
Throughput estimation
Estimate tokens/second for a given model and batch size on any GPU before paying for it.
Bottleneck analysis
Understand whether a workload is compute-bound or memory-bandwidth-bound on specific hardware.
GPU comparison
Compare VRAM, memory bandwidth, TFLOPS, and estimated cost efficiency across 2,800+ SKUs.
Hardware coverage
NVIDIA (data centre & consumer), AMD ROCm, Intel Arc, Google TPU v4/v5, Jetson, Hailo NPUs.
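To make the estimates concrete, here is a first-order version of the roofline arithmetic behind a decode throughput estimate. This is a deliberately simplified sketch, not OneInfer's actual model: it assumes FP16 weights and roughly 2 FLOPs per parameter per generated token, and it ignores KV-cache traffic, kernel efficiency, and multi-GPU overheads. The example GPU figures approximate an H100-class part.

```python
# First-order roofline estimate for autoregressive decode throughput.

def estimate_decode_tokens_per_s(
    params_b: float,        # model size in billions of parameters
    batch_size: int,        # sequences decoding concurrently
    peak_tflops: float,     # peak dense FP16 throughput, TFLOPS
    bandwidth_gb_s: float,  # memory bandwidth, GB/s
) -> dict:
    weight_bytes = params_b * 1e9 * 2      # FP16: 2 bytes per parameter
    flops_per_token = 2 * params_b * 1e9   # ~2 FLOPs per parameter per token

    # Memory bound: each decode step streams all weights once and
    # yields one token per sequence in the batch.
    mem_bound = batch_size * bandwidth_gb_s * 1e9 / weight_bytes
    # Compute bound: total FLOP budget divided by FLOPs per token.
    compute_bound = peak_tflops * 1e12 / flops_per_token

    bottleneck = "memory-bandwidth" if mem_bound < compute_bound else "compute"
    return {"tokens_per_s": min(mem_bound, compute_bound), "bottleneck": bottleneck}

# An 8B-parameter model at batch 8 on a 989 TFLOPS / 3350 GB/s GPU:
# decode is memory-bandwidth-bound at roughly 1,700 tokens/s.
print(estimate_decode_tokens_per_s(params_b=8, batch_size=8,
                                   peak_tflops=989, bandwidth_gb_s=3350))
```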
Next Steps
- Deploy a fine-tuned model to a Dedicated Endpoint.
- Use the GET Models API to discover available base models for fine-tuning.
- Explore the Evaluations console to benchmark before and after fine-tuning.