Audio-to-Audio Assistant
Build a real-time audio-to-audio assistant with a speech-to-speech pipeline powered by Saaras STT, VAD turn detection, Sarvam 105B reasoning, and Bulbul text-to-speech. This setup is ideal for low-latency voice assistants that listen, think, and answer naturally.
What you'll build
This page shows how to build a real-time voice assistant where microphone audio is buffered with voice activity detection, transcribed by Saaras, answered by Sarvam 105B, and spoken back with Bulbul. The goal is a speech-to-speech experience that feels conversational: each stage starts as soon as its input is ready, instead of every stage running serially only after the previous one has finished the whole turn.
Related docs: Audio Generation API, Speech-to-Text Guide, Text-to-Speech Guide, Get Models, Get Supported Voice for Audio Models.
Pipeline overview
- Speech start: the client starts streaming microphone frames.
- VAD buffering and end-of-turn: VAD collects voiced frames and closes the user turn after sustained silence.
- Saaras transcription: the completed audio buffer is sent to Saaras to obtain clean transcript text.
- Sarvam 105B reasoning: the transcript and history are sent to the LLM using a streaming chat response.
- Bulbul TTS: streamed response text is chunked into speakable phrases and synthesized into audio.
- Playback and interruption: audio is played as it arrives, but if a new user turn begins, playback is stopped and the assistant yields immediately.
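Taken together, the stages compose into a single turn handler. The sketch below stubs out the Saaras, Sarvam, and Bulbul calls (they are wired in for real in the sections that follow) so only the control flow is visible:

```python
from typing import Callable, Iterator

def run_turn(
    utterance_pcm: bytes,
    transcribe: Callable[[bytes], str],
    reply_stream: Callable[[str], Iterator[str]],
    synthesize: Callable[[str], bytes],
) -> list[bytes]:
    """One user turn: STT -> LLM -> TTS, returning playable audio chunks."""
    transcript = transcribe(utterance_pcm)
    audio_chunks = []
    for text_chunk in reply_stream(transcript):
        # Each text chunk becomes audio as soon as it is available.
        audio_chunks.append(synthesize(text_chunk))
    return audio_chunks

# Stubbed stages keep the pipeline shape visible without any network calls.
chunks = run_turn(
    b"\x00\x00" * 160,
    transcribe=lambda pcm: "hello",
    reply_stream=lambda text: iter(["Hi ", "there."]),
    synthesize=lambda text: text.encode(),
)
print(b"".join(chunks))  # -> b'Hi there.'
```

The stub signatures are illustrative; the real `transcribe_utterance`, `stream_assistant_reply`, and `synthesize_chunk` functions built below slot into the same positions.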
Step-by-step guide
Capture microphone frames and close turns with VAD
Real-time voice assistants need fast turn-taking. VAD runs alongside incoming audio and decides when enough silence has occurred to trigger the assistant pipeline.
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; higher is stricter about what counts as speech
sample_rate = 16000
silence_frames_to_close = 12  # e.g. 12 x 20 ms frames = 240 ms of trailing silence

class TurnBuffer:
    def __init__(self):
        self.frames = []
        self.trailing_silence = 0
        self.is_open = False

    def push(self, pcm_frame: bytes):
        # webrtcvad expects 16-bit mono PCM frames of exactly 10, 20, or 30 ms.
        speaking = vad.is_speech(pcm_frame, sample_rate)
        if speaking:
            self.is_open = True
            self.trailing_silence = 0
            self.frames.append(pcm_frame)
            return None
        if self.is_open:
            self.frames.append(pcm_frame)
            self.trailing_silence += 1
            if self.trailing_silence >= silence_frames_to_close:
                # Sustained silence: close the turn and hand back the full utterance.
                utterance = b"".join(self.frames)
                self.frames = []
                self.trailing_silence = 0
                self.is_open = False
                return utterance
        return None

Transcribe the finished utterance with Saaras
After VAD closes the turn, convert the buffered PCM into a WAV payload and send it to the OneInfer audio endpoint using a Saaras transcription model.
import io
import wave

import requests

BASE_URL = "https://api.oneinfer.ai"
TOKEN = "YOUR_BEARER_TOKEN"

def pcm_to_wav_bytes(pcm_bytes: bytes, sample_rate: int = 16000) -> bytes:
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()

def transcribe_utterance(pcm_bytes: bytes) -> str:
    wav_bytes = pcm_to_wav_bytes(pcm_bytes)
    response = requests.post(
        f"{BASE_URL}/v1/ula/generate-audio",
        headers={"Authorization": f"Bearer {TOKEN}"},
        data={
            "model": "saaras:v3",
            "provider": "sarvam",
            "language": "en-IN",
        },
        files={
            "file": ("turn.wav", wav_bytes, "audio/wav"),
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["data"]["text"]

Stream the response from Sarvam 105B
Once you have the transcript, send it to Sarvam 105B with streaming enabled so the assistant can begin speaking before the full answer is finished.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.oneinfer.ai/v1",
)

SYSTEM_PROMPT = (
    "You are a real-time voice assistant. "
    "Respond naturally, keep answers concise, and optimize for speech."
)

def stream_assistant_reply(history: list[dict], transcript: str):
    stream = client.chat.completions.create(
        model="sarvam-105b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            *history,
            {"role": "user", "content": transcript},
        ],
        stream=True,
        max_tokens=256,
        temperature=0.4,
        # Non-standard parameters must be passed via extra_body with the OpenAI SDK.
        extra_body={"provider": "sarvam"},
    )
    full_text = ""
    for chunk in stream:
        if not chunk.choices:
            continue  # some stream events carry no choices
        delta = chunk.choices[0].delta.content or ""
        if delta:
            full_text += delta
            yield delta
    history.append({"role": "user", "content": transcript})
    history.append({"role": "assistant", "content": full_text})

Chunk the streamed text and synthesize Bulbul audio
Converting every token to audio is too granular. Instead, buffer tokens into short sentence-like chunks and synthesize each chunk with Bulbul.
import base64

def iter_speakable_chunks(token_stream):
    buffer = ""
    for token in token_stream:
        buffer += token
        if any(buffer.endswith(mark) for mark in [". ", "? ", "! ", "\n"]):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()

def synthesize_chunk(text: str) -> bytes:
    response = requests.post(
        f"{BASE_URL}/v1/ula/generate-audio",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        json={
            "provider": "sarvam",
            "model": "bulbul:v3",
            "prompt": text,
            "voice_id": "shubh",
            "format": "mp3",
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    audio = response.json()["data"]["audios"][0]
    return base64.b64decode(audio["base64_data"])

Fetch valid Bulbul voices from Get Supported Voice for Audio Models before hard-coding a voice_id.
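If you do fetch the voice list at startup, a small guard keeps an invalid hard-coded voice_id from failing every TTS call. The payload shape below is an assumption for illustration; confirm the actual fields against the Get Supported Voice for Audio Models reference.

```python
def pick_voice(voices: list[dict], preferred: str) -> str:
    """Return the preferred voice_id if supported, else fall back to the first one."""
    ids = [v["voice_id"] for v in voices]
    if not ids:
        raise ValueError("no voices returned for this model")
    return preferred if preferred in ids else ids[0]

# Hypothetical response shape from the voices endpoint:
supported = [{"voice_id": "shubh"}, {"voice_id": "anya"}]
print(pick_voice(supported, "anya"))     # -> anya
print(pick_voice(supported, "missing"))  # -> shubh
```

Resolve the voice once at startup rather than per chunk, since the supported list rarely changes within a session.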
Coordinate playback and interruption over WebSockets
Your WebSocket loop ties the whole assistant together: microphone frames flow in, transcript and assistant events flow out, and playback is interrupted when the user starts speaking again.
async def handle_audio_turn(websocket, pcm_utterance: bytes, history: list[dict]):
    # Note: transcribe_utterance and synthesize_chunk block on HTTP; in production,
    # offload them with asyncio.to_thread so the event loop keeps servicing frames.
    transcript = transcribe_utterance(pcm_utterance)
    await websocket.send_json({
        "type": "transcript",
        "text": transcript,
    })
    token_stream = stream_assistant_reply(history, transcript)
    for text_chunk in iter_speakable_chunks(token_stream):
        if await user_started_speaking_again(websocket):
            # Barge-in: the user spoke, so stop synthesizing and yield the floor.
            await websocket.send_json({"type": "playback_stopped"})
            return
        audio_bytes = synthesize_chunk(text_chunk)
        await websocket.send_bytes(audio_bytes)
    await websocket.send_json({"type": "assistant_turn_complete"})

# Client responsibilities:
# 1. stream PCM frames from the microphone
# 2. play assistant audio bytes immediately
# 3. notify the server when new speech interrupts playback

Latency and production notes
- Optimize VAD thresholds for your environment so speech end is detected quickly without clipping the speaker.
- Use short, natural response chunks for Bulbul to balance speed and speech quality.
- Keep prompts concise so Sarvam 105B responds quickly enough for conversational turn-taking.
- Separate client and server concerns cleanly: browser or app handles microphone and playback, backend handles STT, LLM orchestration, and TTS.
- Verify final Saaras, Sarvam, and Bulbul model identifiers with Get Models before deployment.
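As a concrete example of the first bullet: assuming 20 ms frames (one of webrtcvad's supported durations) at 16 kHz mono 16-bit PCM, the silence threshold from the VAD step translates directly into end-of-turn latency:

```python
FRAME_MS = 20                  # webrtcvad accepts only 10, 20, or 30 ms frames
SAMPLE_RATE = 16000
SILENCE_FRAMES_TO_CLOSE = 12

# Bytes per frame: samples per frame x 2 bytes per 16-bit sample.
frame_bytes = SAMPLE_RATE * FRAME_MS // 1000 * 2

# How long the assistant waits after the user stops before the pipeline fires.
end_of_turn_delay_ms = FRAME_MS * SILENCE_FRAMES_TO_CLOSE

print(frame_bytes, end_of_turn_delay_ms)  # -> 640 240
```

Lowering SILENCE_FRAMES_TO_CLOSE makes turn-taking snappier but risks clipping slow speakers mid-sentence, so tune it against real recordings from your environment.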