Audio-to-Audio Assistant
Build a real-time audio-to-audio assistant with a speech-to-speech pipeline powered by Saaras STT, VAD turn detection, Sarvam 105B reasoning, and Bulbul text-to-speech. This setup is ideal for low-latency voice assistants that listen, think, and answer naturally.
What you'll build
This page shows how to build a real-time voice assistant where microphone audio is buffered with voice activity detection, transcribed by Saaras, answered by Sarvam 105B, and spoken back with Bulbul. The goal is a speech-to-speech experience that feels conversational: each stage starts as soon as its input is ready, instead of every stage running serially only after the previous one has finished the whole turn.
Related docs: Audio Generation API, Speech-to-Text Guide, Text-to-Speech Guide, Get Models, Get Supported Voice for Audio Models.
Pipeline overview
- Speech start: the client starts streaming microphone frames.
- VAD buffering and end-of-turn: VAD collects voiced frames and closes the user turn after sustained silence.
- Saaras transcription: the completed audio buffer is sent to Saaras to obtain clean transcript text.
- Sarvam 105B reasoning: the transcript and history are sent to the LLM using a streaming chat response.
- Bulbul TTS: streamed response text is chunked into speakable phrases and synthesized into audio.
- Playback and interruption: audio is played as it arrives, but if a new user turn begins, playback is stopped and the assistant yields immediately.
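Taken together, the stages compose into a single turn handler. The sketch below stubs out the Saaras, Sarvam, and Bulbul calls (they are wired in for real in the sections that follow) so only the control flow is visible:

```python
from typing import Callable, Iterator

def run_turn(
    utterance_pcm: bytes,
    transcribe: Callable[[bytes], str],
    reply_stream: Callable[[str], Iterator[str]],
    synthesize: Callable[[str], bytes],
) -> list[bytes]:
    """One user turn: STT -> LLM -> TTS, returning playable audio chunks."""
    transcript = transcribe(utterance_pcm)
    audio_chunks = []
    for text_chunk in reply_stream(transcript):
        # Each text chunk becomes audio as soon as it is available.
        audio_chunks.append(synthesize(text_chunk))
    return audio_chunks

# Stubbed stages keep the pipeline shape visible without any network calls.
chunks = run_turn(
    b"\x00\x00" * 160,
    transcribe=lambda pcm: "hello",
    reply_stream=lambda text: iter(["Hi ", "there."]),
    synthesize=lambda text: text.encode(),
)
print(b"".join(chunks))  # -> b'Hi there.'
```

The stub signatures are illustrative; the real `transcribe_utterance`, `stream_assistant_reply`, and `synthesize_chunk` functions built below slot into the same positions.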
Step-by-step guide
Capture microphone frames and close turns with VAD
Real-time voice assistants need fast turn-taking. VAD runs alongside incoming audio and decides when enough silence has occurred to trigger the assistant pipeline.
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; higher is stricter about what counts as speech
sample_rate = 16000
silence_frames_to_close = 12  # e.g. 12 x 20 ms frames = 240 ms of trailing silence

class TurnBuffer:
    def __init__(self):
        self.frames = []
        self.trailing_silence = 0
        self.is_open = False

    def push(self, pcm_frame: bytes):
        # webrtcvad expects 16-bit mono PCM frames of exactly 10, 20, or 30 ms.
        speaking = vad.is_speech(pcm_frame, sample_rate)
        if speaking:
            self.is_open = True
            self.trailing_silence = 0
            self.frames.append(pcm_frame)
            return None
        if self.is_open:
            self.frames.append(pcm_frame)
            self.trailing_silence += 1
            if self.trailing_silence >= silence_frames_to_close:
                # Sustained silence: close the turn and hand back the full utterance.
                utterance = b"".join(self.frames)
                self.frames = []
                self.trailing_silence = 0
                self.is_open = False
                return utterance
        return None

Transcribe the finished utterance with Saaras
After VAD closes the turn, convert the buffered PCM into a WAV payload and send it to the OneInfer audio endpoint using a Saaras transcription model.
import io
import wave

import requests

BASE_URL = "https://api.oneinfer.ai"
TOKEN = "YOUR_BEARER_TOKEN"

def pcm_to_wav_bytes(pcm_bytes: bytes, sample_rate: int = 16000) -> bytes:
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()

def transcribe_utterance(pcm_bytes: bytes) -> str:
    wav_bytes = pcm_to_wav_bytes(pcm_bytes)
    response = requests.post(
        f"{BASE_URL}/v1/ula/generate-audio",
        headers={"Authorization": f"Bearer {TOKEN}"},
        data={
            "model": "saaras:v3",
            "provider": "sarvam",
            "language": "en-IN",
        },
        files={
            "file": ("turn.wav", wav_bytes, "audio/wav"),
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["data"]["text"]

Stream the response from Sarvam 105B
Once you have the transcript, send it to Sarvam 105B with streaming enabled so the assistant can begin speaking before the full answer is finished.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.oneinfer.ai/v1",
)

SYSTEM_PROMPT = (
    "You are a real-time voice assistant. "
    "Respond naturally, keep answers concise, and optimize for speech."
)

def stream_assistant_reply(history: list[dict], transcript: str):
    stream = client.chat.completions.create(
        model="sarvam-105b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            *history,
            {"role": "user", "content": transcript},
        ],
        stream=True,
        max_tokens=256,
        temperature=0.4,
        # Non-standard parameters must be passed via extra_body with the OpenAI SDK.
        extra_body={"provider": "sarvam"},
    )
    full_text = ""
    for chunk in stream:
        if not chunk.choices:
            continue  # some stream events carry no choices
        delta = chunk.choices[0].delta.content or ""
        if delta:
            full_text += delta
            yield delta
    history.append({"role": "user", "content": transcript})
    history.append({"role": "assistant", "content": full_text})

Chunk the streamed text and synthesize Bulbul audio
Converting every token to audio is too granular. Instead, buffer tokens into short sentence-like chunks and synthesize each chunk with Bulbul.
import base64

def iter_speakable_chunks(token_stream):
    buffer = ""
    for token in token_stream:
        buffer += token
        if any(buffer.endswith(mark) for mark in [". ", "? ", "! ", "\n"]):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()

def synthesize_chunk(text: str) -> bytes:
    response = requests.post(
        f"{BASE_URL}/v1/ula/generate-audio",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        json={
            "provider": "sarvam",
            "model": "bulbul:v3",
            "prompt": text,
            "voice_id": "shubh",
            "format": "mp3",
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    audio = response.json()["data"]["audios"][0]
    return base64.b64decode(audio["base64_data"])

Fetch valid Bulbul voices from Get Supported Voice for Audio Models before hard-coding a voice_id.
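If you do fetch the voice list at startup, a small guard keeps an invalid hard-coded voice_id from failing every TTS call. The payload shape below is an assumption for illustration; confirm the actual fields against the Get Supported Voice for Audio Models reference.

```python
def pick_voice(voices: list[dict], preferred: str) -> str:
    """Return the preferred voice_id if supported, else fall back to the first one."""
    ids = [v["voice_id"] for v in voices]
    if not ids:
        raise ValueError("no voices returned for this model")
    return preferred if preferred in ids else ids[0]

# Hypothetical response shape from the voices endpoint:
supported = [{"voice_id": "shubh"}, {"voice_id": "anya"}]
print(pick_voice(supported, "anya"))     # -> anya
print(pick_voice(supported, "missing"))  # -> shubh
```

Resolve the voice once at startup rather than per chunk, since the supported list rarely changes within a session.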
Coordinate playback and interruption over WebSockets
Your WebSocket loop ties the whole assistant together: microphone frames flow in, transcript and assistant events flow out, and playback is interrupted when the user starts speaking again.
async def handle_audio_turn(websocket, pcm_utterance: bytes, history: list[dict]):
    # Note: transcribe_utterance and synthesize_chunk block on HTTP; in production,
    # offload them with asyncio.to_thread so the event loop keeps servicing frames.
    transcript = transcribe_utterance(pcm_utterance)
    await websocket.send_json({
        "type": "transcript",
        "text": transcript,
    })
    token_stream = stream_assistant_reply(history, transcript)
    for text_chunk in iter_speakable_chunks(token_stream):
        if await user_started_speaking_again(websocket):
            # Barge-in: the user spoke, so stop synthesizing and yield the floor.
            await websocket.send_json({"type": "playback_stopped"})
            return
        audio_bytes = synthesize_chunk(text_chunk)
        await websocket.send_bytes(audio_bytes)
    await websocket.send_json({"type": "assistant_turn_complete"})

# Client responsibilities:
# 1. stream PCM frames from the microphone
# 2. play assistant audio bytes immediately
# 3. notify the server when new speech interrupts playback

Latency and production notes
- Optimize VAD thresholds for your environment so speech end is detected quickly without clipping the speaker.
- Use short, natural response chunks for Bulbul to balance speed and speech quality.
- Keep prompts concise so Sarvam 105B responds quickly enough for conversational turn-taking.
- Separate client and server concerns cleanly: browser or app handles microphone and playback, backend handles STT, LLM orchestration, and TTS.
- Verify final Saaras, Sarvam, and Bulbul model identifiers with Get Models before deployment.
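As a concrete example of the first bullet: assuming 20 ms frames (one of webrtcvad's supported durations) at 16 kHz mono 16-bit PCM, the silence threshold from the VAD step translates directly into end-of-turn latency:

```python
FRAME_MS = 20                  # webrtcvad accepts only 10, 20, or 30 ms frames
SAMPLE_RATE = 16000
SILENCE_FRAMES_TO_CLOSE = 12

# Bytes per frame: samples per frame x 2 bytes per 16-bit sample.
frame_bytes = SAMPLE_RATE * FRAME_MS // 1000 * 2

# How long the assistant waits after the user stops before the pipeline fires.
end_of_turn_delay_ms = FRAME_MS * SILENCE_FRAMES_TO_CLOSE

print(frame_bytes, end_of_turn_delay_ms)  # -> 640 240
```

Lowering SILENCE_FRAMES_TO_CLOSE makes turn-taking snappier but risks clipping slow speakers mid-sentence, so tune it against real recordings from your environment.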