Audio

Speech-to-Text

A step-by-step guide to uploading an audio file, calling the transcription API, and reading the returned text in your app.

Speech-to-Text uses the same universal audio endpoint at /v1/ula/generate-audio, but the request format differs from TTS: you must send multipart/form-data with an audio file and the transcription fields.

How the Flow Works

  1. Create or copy your OneInfer API key.
  2. Call /v1/ula/oauth-authentication to get a Bearer token.
  3. POST multipart/form-data to /v1/ula/generate-audio.
  4. Attach the audio as the file field and send model, provider, and (optionally) language alongside it.
  5. Read the transcription text from data.text in the response.

Required Form Fields

bash
file=@/path/to/audio.mp3
model=saarika:v2.5
provider=sarvam
language=en-IN

Common fields: file is the uploaded audio, model selects the transcription model, provider selects the backend, and language is optional; pass it to help the model with locale detection.

Python End-to-End Example

python
import requests

BASE_URL = "https://api.oneinfer.ai"
API_KEY = "YOUR_API_KEY"
AUDIO_PATH = "sample.mp3"

# Step 1: exchange API key for Bearer token
auth_response = requests.post(
    f"{BASE_URL}/v1/ula/oauth-authentication",
    params={"api_key": API_KEY},
    timeout=30,
)
auth_response.raise_for_status()
token = auth_response.json()["access_token"]

headers = {
    "Authorization": f"Bearer {token}",
}

# Step 2: upload audio for transcription
with open(AUDIO_PATH, "rb") as audio_file:
    response = requests.post(
        f"{BASE_URL}/v1/ula/generate-audio",
        headers=headers,
        data={
            "model": "saarika:v2.5",
            "provider": "sarvam",
            "language": "en-IN",
        },
        files={
            "file": (AUDIO_PATH, audio_file, "audio/mpeg"),
        },
        timeout=120,
    )

response.raise_for_status()
payload = response.json()

print("Transcription:")
print(payload["data"]["text"])

What the Response Looks Like

json
{
  "api_details": {
    "api_status": "success",
    "api_message": "API has return response successfully."
  },
  "data": {
    "id": "3ab483fe-2444-40f7-9cf2-720eeee21125",
    "provider": "sarvam",
    "model": "saaras:v3",
    "text": "This is the transcribed output from your uploaded audio file.",
    "finish_reason": "stop",
    "usage": {
      "prompt_tokens": 29,
      "completion_tokens": 0,
      "total_tokens": 29,
      "cache_input_tokens": 0
    }
  },
  "error": {}
}

In most clients, the value you care about is data.text. That is the recognized transcript you can store, display, summarize, or pass into a chat workflow.
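
If you want to guard against a failed call before reading data.text, a small helper like the sketch below can check api_details.api_status first. The field names come from the sample response above; the exact shape of the error object on failure is an assumption.

python
def extract_transcript(payload: dict) -> str:
    """Return data.text, raising if the API reported a failure."""
    status = payload.get("api_details", {}).get("api_status")
    if status != "success":
        # The error object's shape on failure is assumed here.
        raise RuntimeError(f"Transcription failed: {payload.get('error')}")
    return payload["data"]["text"]

In the end-to-end example above, calling extract_transcript(payload) would replace the direct payload["data"]["text"] access.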

Practical Tips

  • Do not send Content-Type: application/json for STT. Let your client build the multipart boundary automatically.
  • Pass language when you know the locale to improve recognition consistency.
  • Use clean audio input when possible. Background noise and overlapping speakers will reduce accuracy.
  • Check HTTP status codes before reading the body, especially for file upload errors like 400 or 415 (see the sketch after this list).
  • If you are building a voice app, feed data.text directly into your LLM step after transcription completes.
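
To make the status-code tip concrete, here is a minimal sketch that wraps the upload and surfaces 400 and 415 failures with the response body attached. It reuses the endpoint and form fields from the example above; the server's error-body format is an assumption.

python
import requests

def transcribe_file(path: str, headers: dict,
                    base_url: str = "https://api.oneinfer.ai") -> str:
    # Let requests build the multipart boundary; do not set Content-Type yourself.
    with open(path, "rb") as audio_file:
        response = requests.post(
            f"{base_url}/v1/ula/generate-audio",
            headers=headers,
            data={"model": "saarika:v2.5", "provider": "sarvam", "language": "en-IN"},
            files={"file": (path, audio_file, "audio/mpeg")},
            timeout=120,
        )
    # Surface upload-specific failures (bad request, unsupported media type)
    # with the body attached, since it usually explains what was rejected.
    if response.status_code in (400, 415):
        raise ValueError(f"Upload rejected ({response.status_code}): {response.text}")
    response.raise_for_status()
    return response.json()["data"]["text"]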