Audio

Speech-to-Text

A step-by-step guide to uploading an audio file, calling the transcription API, and reading the returned text in your app.

Speech-to-Text uses the same universal audio endpoint at /v1/ula/generate-audio, but the request format differs from TTS: you must send multipart/form-data with an audio file and the transcription fields.

How the Flow Works

  1. Create or copy your OneInfer API key.
  2. Call /v1/ula/oauth-authentication to get a Bearer token.
  3. POST multipart/form-data to /v1/ula/generate-audio.
  4. Attach the audio as the file field and send model, provider, and (optionally) language alongside it.
  5. Read the transcription text from data.text in the response.

Required Form Fields

bash
file=@/path/to/audio.mp3
model=saarika:v2.5
provider=sarvam
language=en-IN

Common fields: file is the uploaded audio, model selects the transcription model, provider selects the backend, and language is optional; pass it to help the model with locale detection.

Python End-to-End Example

python
import requests

BASE_URL = "https://api.oneinfer.ai"
API_KEY = "YOUR_API_KEY"
AUDIO_PATH = "sample.mp3"

# Step 1: exchange API key for Bearer token
auth_response = requests.post(
    f"{BASE_URL}/v1/ula/oauth-authentication",
    params={"api_key": API_KEY},
    timeout=30,
)
auth_response.raise_for_status()
token = auth_response.json()["access_token"]

headers = {
    "Authorization": f"Bearer {token}",
}

# Step 2: upload audio for transcription
with open(AUDIO_PATH, "rb") as audio_file:
    response = requests.post(
        f"{BASE_URL}/v1/ula/generate-audio",
        headers=headers,
        data={
            "model": "saarika:v2.5",
            "provider": "sarvam",
            "language": "en-IN",
        },
        files={
            "file": (AUDIO_PATH, audio_file, "audio/mpeg"),
        },
        timeout=120,
    )

response.raise_for_status()
payload = response.json()

print("Transcription:")
print(payload["data"]["text"])

What the Response Looks Like

json
{
  "api_details": {
    "api_status": "success",
    "api_message": "API has return response successfully."
  },
  "data": {
    "id": "3ab483fe-2444-40f7-9cf2-720eeee21125",
    "provider": "sarvam",
    "model": "saaras:v3",
    "text": "This is the transcribed output from your uploaded audio file.",
    "finish_reason": "stop",
    "usage": {
      "prompt_tokens": 29,
      "completion_tokens": 0,
      "total_tokens": 29,
      "cache_input_tokens": 0
    }
  },
  "error": {}
}

In most clients, the value you care about is data.text. That is the recognized transcript you can store, display, summarize, or pass into a chat workflow.
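
If you want to guard against a failed call before reading data.text, a small helper like the sketch below can check api_details.api_status first. The field names come from the sample response above; the exact shape of the error object on failure is an assumption.

python
def extract_transcript(payload: dict) -> str:
    """Return data.text, raising if the API reported a failure."""
    status = payload.get("api_details", {}).get("api_status")
    if status != "success":
        # The error object's shape on failure is assumed here.
        raise RuntimeError(f"Transcription failed: {payload.get('error')}")
    return payload["data"]["text"]

In the end-to-end example above, calling extract_transcript(payload) would replace the direct payload["data"]["text"] access.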

Practical Tips

  • Do not send Content-Type: application/json for STT. Let your client build the multipart boundary automatically.
  • Pass language when you know the locale to improve recognition consistency.
  • Use clean audio input when possible. Background noise and overlapping speakers will reduce accuracy.
  • Check HTTP status codes before reading the body, especially for file upload errors like 400 or 415 (see the sketch after this list).
  • If you are building a voice app, feed data.text directly into your LLM step after transcription completes.
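
To make the status-code tip concrete, here is a minimal sketch that wraps the upload and surfaces 400 and 415 failures with the response body attached. It reuses the endpoint and form fields from the example above; the server's error-body format is an assumption.

python
import requests

def transcribe_file(path: str, headers: dict,
                    base_url: str = "https://api.oneinfer.ai") -> str:
    # Let requests build the multipart boundary; do not set Content-Type yourself.
    with open(path, "rb") as audio_file:
        response = requests.post(
            f"{base_url}/v1/ula/generate-audio",
            headers=headers,
            data={"model": "saarika:v2.5", "provider": "sarvam", "language": "en-IN"},
            files={"file": (path, audio_file, "audio/mpeg")},
            timeout=120,
        )
    # Surface upload-specific failures (bad request, unsupported media type)
    # with the body attached, since it usually explains what was rejected.
    if response.status_code in (400, 415):
        raise ValueError(f"Upload rejected ({response.status_code}): {response.text}")
    response.raise_for_status()
    return response.json()["data"]["text"]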