Audio
Speech-to-Text
Step-by-step guide to upload an audio file, call the transcription API, and read the returned text in your app.
Speech-to-Text uses the same universal audio endpoint at `/v1/ula/generate-audio`, but the request format differs from TTS: you must send `multipart/form-data` with an audio file and the transcription fields.

How the Flow Works
- Create or copy your OneInfer API key.
- Call `/v1/ula/oauth-authentication` to get a Bearer token.
- POST `multipart/form-data` to `/v1/ula/generate-audio`.
- Attach the audio file as `file` and send the transcription model fields alongside it.
- Read the transcription text from `data.text` in the response.
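The steps above can be wrapped as a reusable helper. This is a minimal sketch assuming the `requests` library; the `transcribe` name and the `session` parameter (added so the flow can be exercised without network access) are our own additions, not part of the API.

```python
import requests

BASE_URL = "https://api.oneinfer.ai"

def transcribe(api_key, audio_path, model="saarika:v2.5",
               provider="sarvam", language=None, session=None):
    """Two-step flow: exchange the API key for a Bearer token,
    then upload the audio as multipart/form-data."""
    http = session or requests

    # Step 1: API key -> Bearer token
    auth = http.post(
        f"{BASE_URL}/v1/ula/oauth-authentication",
        params={"api_key": api_key},
        timeout=30,
    )
    auth.raise_for_status()
    token = auth.json()["access_token"]

    # Step 2: multipart upload with the transcription fields
    data = {"model": model, "provider": provider}
    if language:
        data["language"] = language
    with open(audio_path, "rb") as audio_file:
        resp = http.post(
            f"{BASE_URL}/v1/ula/generate-audio",
            headers={"Authorization": f"Bearer {token}"},
            data=data,
            files={"file": (audio_path, audio_file, "audio/mpeg")},
            timeout=120,
        )
    resp.raise_for_status()
    return resp.json()["data"]["text"]
```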
Required Form Fields
```bash
file=@/path/to/audio.mp3
model=saarika:v2.5
provider=sarvam
language=en-IN
```

Common fields: `file` is the uploaded audio, `model` selects the transcription model, `provider` selects the backend, and `language` is optional if you want to help the model with locale detection.
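As a sketch of how these form fields translate into a `requests` call, the helper below (the `build_stt_fields` name is our own) derives the MIME type with the standard `mimetypes` module; any extension it does not recognize falls back to `application/octet-stream`.

```python
import mimetypes
import os

def build_stt_fields(audio_path, model, provider, language=None):
    """Map the documented form fields onto the data= argument requests
    uses for multipart/form-data, plus a filename and guessed MIME type
    for the files= entry."""
    data = {"model": model, "provider": provider}
    if language:  # optional locale hint
        data["language"] = language
    mime = mimetypes.guess_type(audio_path)[0] or "application/octet-stream"
    return data, os.path.basename(audio_path), mime
```

You would then pass `files={"file": (filename, open(audio_path, "rb"), mime)}` alongside `data=data` and let the client build the multipart body.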
Python End-to-End Example
```python
import requests

BASE_URL = "https://api.oneinfer.ai"
API_KEY = "YOUR_API_KEY"
AUDIO_PATH = "sample.mp3"

# Step 1: exchange API key for Bearer token
auth_response = requests.post(
    f"{BASE_URL}/v1/ula/oauth-authentication",
    params={"api_key": API_KEY},
    timeout=30,
)
auth_response.raise_for_status()
token = auth_response.json()["access_token"]

headers = {
    "Authorization": f"Bearer {token}",
}

# Step 2: upload audio for transcription
with open(AUDIO_PATH, "rb") as audio_file:
    response = requests.post(
        f"{BASE_URL}/v1/ula/generate-audio",
        headers=headers,
        data={
            "model": "saarika:v2.5",
            "provider": "sarvam",
            "language": "en-IN",
        },
        files={
            "file": (AUDIO_PATH, audio_file, "audio/mpeg"),
        },
        timeout=120,
    )

response.raise_for_status()
payload = response.json()

print("Transcription:")
print(payload["data"]["text"])
```

What the Response Looks Like
```json
{
  "api_details": {
    "api_status": "success",
    "api_message": "API has return response successfully."
  },
  "data": {
    "id": "3ab483fe-2444-40f7-9cf2-720eeee21125",
    "provider": "sarvam",
    "model": "saaras:v3",
    "text": "This is the transcribed output from your uploaded audio file.",
    "finish_reason": "stop",
    "usage": {
      "prompt_tokens": 29,
      "completion_tokens": 0,
      "total_tokens": 29,
      "cache_input_tokens": 0
    }
  },
  "error": {}
}
```

In most clients, the value you care about is `data.text`. That is the recognized transcript you can store, display, summarize, or pass into a chat workflow.
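A defensive way to read this payload is to check `api_details.api_status` and the `error` object before touching `data.text`. The helper below is a sketch based only on the field names in the sample response above; `extract_transcript` is our own name.

```python
def extract_transcript(payload):
    """Return data.text from a successful response, raising if the
    API reported an error (field names follow the sample response)."""
    if payload.get("error"):  # non-empty error object means failure
        raise RuntimeError(f"STT error: {payload['error']}")
    status = payload.get("api_details", {}).get("api_status")
    if status != "success":
        raise RuntimeError(f"unexpected api_status: {status!r}")
    return payload["data"]["text"]
```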
Practical Tips
- Do not send `Content-Type: application/json` for STT. Let your client build the multipart boundary automatically.
- Pass `language` when you know the locale to improve recognition consistency.
- Use clean audio input when possible. Background noise and overlapping speakers will reduce accuracy.
- Check HTTP status codes before reading the body, especially for file upload errors like `400` or `415`.
- If you are building a voice app, feed `data.text` directly into your LLM step after transcription completes.
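The status-code tip can be sketched as a small triage function. Only `400` and `415` are documented above; the handling of `401`, `429`, and `5xx` follows conventional HTTP semantics and is an assumption, not documented endpoint behavior.

```python
def classify_upload_status(status):
    """Rough triage for an STT upload response by HTTP status code."""
    if 200 <= status < 300:
        return "ok"
    if status == 400:
        return "bad-request"        # malformed fields or file: fix, do not retry
    if status == 415:
        return "unsupported-media"  # wrong audio format: re-encode, do not retry
    if status == 401:
        return "reauthenticate"     # Bearer token likely expired: redo step 1
    if status == 429 or status >= 500:
        return "retry"              # transient: back off and retry
    return "client-error"
```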