Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial

This paper presents a technical tutorial demonstrating that building enterprise-grade realtime voice agents requires a cascaded streaming pipeline (STT → LLM → TTS) rather than native speech-to-speech models, achieving sub-second latency through the systematic integration of components such as Deepgram, vLLM, and ElevenLabs.

Jielin Qiu, Zixiang Chen, Liangwei Yang, Ming Zhu, Zhiwei Liu, Juntao Tan, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang

Published 2026-03-06

Imagine you want to build a super-smart, instant-response phone assistant for a hospital or a bank. This assistant needs to listen to you, think about your request, check a database, and speak back to you—all in real-time, without making you wait.

This paper is a "cookbook" for building that assistant from scratch. The authors, a team from Salesforce AI Research, discovered that while there are many fancy new "all-in-one" AI models, they are actually too slow for real conversations. Instead, they built a faster system by connecting three specialized experts together.

Here is the breakdown of their findings and how they built it, using simple analogies.

1. The Big Misconception: The "Super-Model" Trap

The authors started by looking at the newest, most advanced AI models that can talk directly (Speech-to-Speech). Think of one of these as a genius polymath who tries to listen, think, and speak all at the same time.

  • The Problem: This genius is too slow. When you ask a question, they take about 13 seconds to start speaking. That's like ordering coffee and waiting 13 seconds just for the barista to say, "I'm thinking about your order."
  • The Reason: These models try to generate the audio word-by-word while thinking. It's like trying to write a novel and read it aloud simultaneously; the reading part drags the writing part down.
  • The Missing Skill: These "genius" models also can't use tools (like checking a calendar or a database), which is essential for real business tasks.

2. The Solution: The "Assembly Line" (Cascaded Pipeline)

Instead of one slow genius, the authors built a high-speed assembly line with three specialized workers. This is the industry standard, but they explain exactly how to make it feel instant.

Imagine a relay race where the baton is passed before the previous runner even finishes the race:

  1. The Listener (STT - Speech-to-Text):

    • Role: A super-fast stenographer.
    • Action: As soon as you start speaking, they start typing what they hear. They don't wait for you to finish the whole sentence; they type the first few words immediately.
    • Speed: Very fast (about 0.3 seconds).
  2. The Thinker (LLM - Large Language Model):

    • Role: The smart brain.
    • Action: As soon as the Listener types the first few words, the Thinker starts generating a response. They don't wait for the whole sentence to be finished. They start "thinking out loud" (generating text tokens) immediately.
    • Special Power: If the Thinker needs to check a database (e.g., "Is Dr. Smith available?"), they can pause, ask the tool, get the answer, and continue talking.
  3. The Speaker (TTS - Text-to-Speech):

    • Role: The voice actor.
    • Action: As soon as the Thinker finishes a complete sentence, they pass it to the Speaker. The Speaker starts reading that sentence out loud immediately, while the Thinker is still working on the next sentence.
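The three-worker relay above can be sketched with `asyncio` queues, where each stage starts consuming its input before the previous stage has finished. Everything here is illustrative: the three stage functions are stand-ins for real streaming clients (e.g. Deepgram, vLLM, ElevenLabs), and the data they pass along are placeholder strings rather than real transcripts or audio.

```python
import asyncio

async def stt(audio_chunks, text_q):
    # Stand-in for streaming speech-to-text: emit a partial transcript
    # for every audio chunk as it arrives, without waiting for the end.
    async for chunk in audio_chunks:
        await text_q.put(f"word-from-{chunk}")
    await text_q.put(None)  # end-of-utterance marker

async def llm(text_q, token_q):
    # Stand-in for a streaming LLM: start "generating" as soon as the
    # first partial transcript shows up on the queue.
    while (text := await text_q.get()) is not None:
        await token_q.put(text.upper())  # placeholder for a generated token
    await token_q.put(None)

async def tts(token_q, spoken):
    # Stand-in for streaming text-to-speech: "speak" tokens on arrival.
    while (tok := await token_q.get()) is not None:
        spoken.append(tok)

async def pipeline(chunks):
    # Wire the three workers together; all run concurrently, passing
    # the baton through queues instead of waiting for full completion.
    text_q, token_q, spoken = asyncio.Queue(), asyncio.Queue(), []

    async def audio():
        for c in chunks:
            yield c

    await asyncio.gather(stt(audio(), text_q), llm(text_q, token_q),
                         tts(token_q, spoken))
    return spoken
```

The point of the sketch is the wiring, not the stages: replace each placeholder function with a real streaming client and the concurrency structure stays the same.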

3. The Secret Sauce: "Pipelining"

The magic isn't just that these workers are fast; it's that they work simultaneously.

  • Old Way (Turn-based): You speak → Wait for Listener to finish → Wait for Thinker to finish → Wait for Speaker to finish. (Total time: ~1.6 seconds).
  • New Way (Streaming/Pipelining):
    • You are still talking.
    • The Listener is typing the first half of your sentence.
    • The Thinker is already generating the first half of the answer.
    • The Speaker is already reading the first half of the answer.

Because of this overlap, the user hears the first word of the answer in under 1 second (specifically, about 0.75 seconds). It feels as though the AI already had the answer ready the moment you finished your question.
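A back-of-the-envelope comparison shows where the speedup comes from: sequentially you pay each stage's full duration, while pipelined you only pay each stage's time-to-first-output before the user hears audio. The per-stage numbers below are illustrative, chosen only to line up with the article's headline figures (~1.6 s vs. ~0.75 s); real values depend on the models and the network.

```python
# Illustrative (not measured) per-stage timings, in seconds.
full = {"stt": 0.5, "llm": 0.7, "tts": 0.4}      # time to process the whole input
first = {"stt": 0.3, "llm": 0.3, "tts": 0.15}    # time to emit the first output

# Turn-based: each stage waits for the previous one to finish completely.
sequential = sum(full.values())

# Pipelined: each stage hands off as soon as it produces its first output,
# so time-to-first-audio is the sum of first-output latencies only.
pipelined = sum(first.values())

print(f"sequential: {sequential:.2f}s, pipelined: {pipelined:.2f}s")
# → sequential: 1.60s, pipelined: 0.75s
```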

4. The "Sentence Buffer": The Traffic Cop

There is one tricky part: The Thinker generates text word-by-word, but the Speaker needs full sentences to sound natural.

The authors built a Sentence Buffer (a smart traffic cop).

  • It catches the words coming from the Thinker.
  • It holds them in a little waiting room.
  • As soon as it sees a period, exclamation point, or question mark (a complete sentence), it releases that sentence to the Speaker.
  • Meanwhile, the Thinker is already working on the next sentence in the background.
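A minimal version of this traffic cop fits in a few lines of Python. This sketch splits on sentence-final punctuation with a regex lookbehind; a production buffer would also handle abbreviations ("Dr."), decimals like "3.5", and other streaming edge cases.

```python
import re

class SentenceBuffer:
    """Accumulate LLM tokens; release only complete sentences to TTS."""

    def __init__(self):
        self._buf = ""

    def feed(self, token):
        """Add a token; return any complete sentences now ready to speak."""
        self._buf += token
        # Split after sentence-final punctuation followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", self._buf)
        # The last part may be an unfinished sentence; keep it buffered.
        *done, self._buf = parts
        return done

    def flush(self):
        """At end of generation, release whatever is left over."""
        rest, self._buf = self._buf.strip(), ""
        return [rest] if rest else []
```

Usage: the LLM's token stream is fed through `feed()`, each returned sentence goes straight to the TTS worker, and `flush()` catches any trailing fragment once generation ends.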

5. The Result: An Enterprise-Ready Agent

The team built a complete system that can handle complex tasks, like a hospital receptionist:

  • User: "I need to cancel my appointment with Dr. Smith for Tuesday."
  • Agent:
    1. Listens and types the request.
    2. Thinks: "I need to check the database."
    3. Calls the tool: "Is the appointment there? Yes."
    4. Thinks: "Okay, cancel it and confirm."
    5. Speaks back: "I've cancelled your appointment. Is there anything else?"

All of this happens in less than a second.
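The receptionist flow above maps onto a simple agent loop: each LLM turn either requests a tool call (which is executed and its result fed back into the conversation) or produces the final text for the TTS stage. The function names, message format, and tool here are hypothetical stand-ins, not the authors' actual API.

```python
def cancel_appointment(doctor, day):
    # Hypothetical stand-in for a real scheduling-database call.
    return {"found": True, "cancelled": True}

TOOLS = {"cancel_appointment": cancel_appointment}

def run_agent(llm_step, user_text):
    """Loop one LLM turn at a time until the model gives a final answer."""
    messages = [{"role": "user", "content": user_text}]
    while True:
        reply = llm_step(messages)           # one LLM turn (mocked in tests)
        if reply.get("tool"):                # model asked to call a tool
            result = TOOLS[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": result})
        else:                                # final answer → hand off to TTS
            return reply["content"]
```

With a real model, `llm_step` would be a streaming chat-completion call; the loop structure (tool call, feed result back, continue) is the same.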

Summary: What Should You Take Away?

  • Don't look for a "magic bullet" model: The newest "all-in-one" voice AI models are too slow for real-time conversation.
  • Connect the dots: The best way to build a fast voice agent is to chain together three fast, specialized tools (Listening, Thinking, Speaking).
  • Overlap is key: The secret to "real-time" isn't speed; it's pipelining. Start the next step before the current step is finished.
  • It's about the "Brain," not the "Voice": The hard part of building an agent isn't making it sound human; it's making it smart enough to use tools and solve problems. The voice is just the interface.

The authors have released all their code as a free, step-by-step tutorial so anyone can build this "assembly line" for themselves.