Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial

This paper presents a technical tutorial demonstrating that building enterprise-grade realtime voice agents requires a cascaded streaming pipeline (STT → LLM → TTS) rather than native speech-to-speech models, achieving sub-second latency through the systematic integration of components such as Deepgram, vLLM, and ElevenLabs.

Jielin Qiu, Zixiang Chen, Liangwei Yang, Ming Zhu, Zhiwei Liu, Juntao Tan, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang

Published 2026-03-06

Imagine you want to build a super-smart, instant-response phone assistant for a hospital or a bank. This assistant needs to listen to you, think about your request, check a database, and speak back to you—all in real-time, without making you wait.

This paper is a "cookbook" for building that assistant from scratch. The authors, a team from Salesforce AI Research, discovered that while there are many fancy new "all-in-one" AI models, they are actually too slow for real conversations. Instead, they built a faster system by connecting three specialized experts together.

Here is the breakdown of their findings and how they built it, using simple analogies.

1. The Big Misconception: The "Super-Model" Trap

The authors started by looking at the newest, most advanced AI models that can talk directly (Speech-to-Speech). Think of one of these as a genius polymath who tries to listen, think, and speak all at the same time.

  • The Problem: This genius is too slow. When you ask a question, they take about 13 seconds to start speaking. That's like ordering coffee and waiting 13 seconds just for the barista to say, "I'm thinking about your order."
  • The Reason: These models try to generate the audio word-by-word while thinking. It's like trying to write a novel and read it aloud simultaneously; the reading part drags the writing part down.
  • The Missing Skill: These "genius" models also can't use tools (like checking a calendar or a database), which is essential for real business tasks.

2. The Solution: The "Assembly Line" (Cascaded Pipeline)

Instead of one slow genius, the authors built a high-speed assembly line with three specialized workers. This is the industry standard, but they explain exactly how to make it feel instant.

Imagine a relay race where the baton is passed before the previous runner even finishes the race:

  1. The Listener (STT - Speech-to-Text):

    • Role: A super-fast stenographer.
    • Action: As soon as you start speaking, they start typing what they hear. They don't wait for you to finish the whole sentence; they type the first few words immediately.
    • Speed: Very fast (about 0.3 seconds).
  2. The Thinker (LLM - Large Language Model):

    • Role: The smart brain.
    • Action: As soon as the Listener types the first few words, the Thinker starts generating a response. They don't wait for the whole sentence to be finished. They start "thinking out loud" (generating text tokens) immediately.
    • Special Power: If the Thinker needs to check a database (e.g., "Is Dr. Smith available?"), they can pause, ask the tool, get the answer, and continue talking.
  3. The Speaker (TTS - Text-to-Speech):

    • Role: The voice actor.
    • Action: As soon as the Thinker finishes a complete sentence, they pass it to the Speaker. The Speaker starts reading that sentence out loud immediately, while the Thinker is still working on the next sentence.
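The three-worker relay above can be sketched with `asyncio` queues, where each stage starts consuming its input before the previous stage has finished. Everything here is illustrative: the three stage functions are stand-ins for real streaming clients (e.g. Deepgram, vLLM, ElevenLabs), and the data they pass along are placeholder strings rather than real transcripts or audio.

```python
import asyncio

async def stt(audio_chunks, text_q):
    # Stand-in for streaming speech-to-text: emit a partial transcript
    # for every audio chunk as it arrives, without waiting for the end.
    async for chunk in audio_chunks:
        await text_q.put(f"word-from-{chunk}")
    await text_q.put(None)  # end-of-utterance marker

async def llm(text_q, token_q):
    # Stand-in for a streaming LLM: start "generating" as soon as the
    # first partial transcript shows up on the queue.
    while (text := await text_q.get()) is not None:
        await token_q.put(text.upper())  # placeholder for a generated token
    await token_q.put(None)

async def tts(token_q, spoken):
    # Stand-in for streaming text-to-speech: "speak" tokens on arrival.
    while (tok := await token_q.get()) is not None:
        spoken.append(tok)

async def pipeline(chunks):
    # Wire the three workers together; all run concurrently, passing
    # the baton through queues instead of waiting for full completion.
    text_q, token_q, spoken = asyncio.Queue(), asyncio.Queue(), []

    async def audio():
        for c in chunks:
            yield c

    await asyncio.gather(stt(audio(), text_q), llm(text_q, token_q),
                         tts(token_q, spoken))
    return spoken
```

The point of the sketch is the wiring, not the stages: replace each placeholder function with a real streaming client and the concurrency structure stays the same.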

3. The Secret Sauce: "Pipelining"

The magic isn't just that these workers are fast; it's that they work simultaneously.

  • Old Way (Turn-based): You speak → Wait for Listener to finish → Wait for Thinker to finish → Wait for Speaker to finish. (Total time: ~1.6 seconds).
  • New Way (Streaming/Pipelining):
    • You are still talking.
    • The Listener is typing the first half of your sentence.
    • The Thinker is already generating the first half of the answer.
    • The Speaker is already reading the first half of the answer.

Because of this overlap, the user hears the first word of the answer in under 1 second (specifically, about 0.75 seconds). It feels as though the AI already had the answer ready the moment you finished your question.
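A back-of-the-envelope comparison shows where the speedup comes from: sequentially you pay each stage's full duration, while pipelined you only pay each stage's time-to-first-output before the user hears audio. The per-stage numbers below are illustrative, chosen only to line up with the article's headline figures (~1.6 s vs. ~0.75 s); real values depend on the models and the network.

```python
# Illustrative (not measured) per-stage timings, in seconds.
full = {"stt": 0.5, "llm": 0.7, "tts": 0.4}      # time to process the whole input
first = {"stt": 0.3, "llm": 0.3, "tts": 0.15}    # time to emit the first output

# Turn-based: each stage waits for the previous one to finish completely.
sequential = sum(full.values())

# Pipelined: each stage hands off as soon as it produces its first output,
# so time-to-first-audio is the sum of first-output latencies only.
pipelined = sum(first.values())

print(f"sequential: {sequential:.2f}s, pipelined: {pipelined:.2f}s")
# → sequential: 1.60s, pipelined: 0.75s
```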

4. The "Sentence Buffer": The Traffic Cop

There is one tricky part: The Thinker generates text word-by-word, but the Speaker needs full sentences to sound natural.

The authors built a Sentence Buffer (a smart traffic cop).

  • It catches the words coming from the Thinker.
  • It holds them in a little waiting room.
  • As soon as it sees a period, exclamation point, or question mark (a complete sentence), it releases that sentence to the Speaker.
  • Meanwhile, the Thinker is already working on the next sentence in the background.
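A minimal version of this traffic cop fits in a few lines of Python. This sketch splits on sentence-final punctuation with a regex lookbehind; a production buffer would also handle abbreviations ("Dr."), decimals like "3.5", and other streaming edge cases.

```python
import re

class SentenceBuffer:
    """Accumulate LLM tokens; release only complete sentences to TTS."""

    def __init__(self):
        self._buf = ""

    def feed(self, token):
        """Add a token; return any complete sentences now ready to speak."""
        self._buf += token
        # Split after sentence-final punctuation followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", self._buf)
        # The last part may be an unfinished sentence; keep it buffered.
        *done, self._buf = parts
        return done

    def flush(self):
        """At end of generation, release whatever is left over."""
        rest, self._buf = self._buf.strip(), ""
        return [rest] if rest else []
```

Usage: the LLM's token stream is fed through `feed()`, each returned sentence goes straight to the TTS worker, and `flush()` catches any trailing fragment once generation ends.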

5. The Result: An Enterprise-Ready Agent

The team built a complete system that can handle complex tasks, like a hospital receptionist:

  • User: "I need to cancel my appointment with Dr. Smith for Tuesday."
  • Agent:
    1. Listens and types the request.
    2. Thinks: "I need to check the database."
    3. Calls the tool: "Is the appointment there? Yes."
    4. Thinks: "Okay, cancel it and confirm."
    5. Speaks back: "I've cancelled your appointment. Is there anything else?"

All of this happens in less than a second.
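The receptionist flow above maps onto a simple agent loop: each LLM turn either requests a tool call (which is executed and its result fed back into the conversation) or produces the final text for the TTS stage. The function names, message format, and tool here are hypothetical stand-ins, not the authors' actual API.

```python
def cancel_appointment(doctor, day):
    # Hypothetical stand-in for a real scheduling-database call.
    return {"found": True, "cancelled": True}

TOOLS = {"cancel_appointment": cancel_appointment}

def run_agent(llm_step, user_text):
    """Loop one LLM turn at a time until the model gives a final answer."""
    messages = [{"role": "user", "content": user_text}]
    while True:
        reply = llm_step(messages)           # one LLM turn (mocked in tests)
        if reply.get("tool"):                # model asked to call a tool
            result = TOOLS[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": result})
        else:                                # final answer → hand off to TTS
            return reply["content"]
```

With a real model, `llm_step` would be a streaming chat-completion call; the loop structure (tool call, feed result back, continue) is the same.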

Summary: What Should You Take Away?

  • Don't look for a "magic bullet" model: The newest "all-in-one" voice AI models are too slow for real-time conversation.
  • Connect the dots: The best way to build a fast voice agent is to chain together three fast, specialized tools (Listening, Thinking, Speaking).
  • Overlap is key: The secret to "real-time" isn't speed; it's pipelining. Start the next step before the current step is finished.
  • It's about the "Brain," not the "Voice": The hard part of building an agent isn't making it sound human; it's making it smart enough to use tools and solve problems. The voice is just the interface.

The authors have released all their code as a free, step-by-step tutorial so anyone can build this "assembly line" for themselves.