SARAH: Spatially Aware Real-time Agentic Humans

SARAH is a real-time, fully causal framework that generates spatially aware, full-body conversational motions for embodied agents by combining a causal transformer-based VAE with a flow matching model, enabling natural gaze and orientation adjustments in streaming VR applications at over 300 FPS.

Evonne Ng, Siwei Zhang, Zhang Chen, Michael Zollhoefer, Alexander Richard

Published 2026-02-23

Imagine you are wearing a VR headset, and you are having a conversation with a virtual human. In most current VR experiences, that virtual person is a bit like a statue with a voice. No matter where you walk around them, they keep staring straight ahead. If you move to their left, they don't turn to look at you. If you step back, they don't step forward to keep the conversation comfortable. It feels stiff and unnatural, like talking to a mannequin.

This paper introduces SARAH (Spatially Aware Real-time Agentic Humans), a new system designed to fix this. Think of SARAH not as a statue but as a real, attentive friend who moves, looks at you, and reacts to your position instantly.

Here is how it works, broken down with some everyday analogies:

1. The Problem: The "Zombie" Avatar

Current virtual agents are like zombies that only know how to talk. They have great voices and hand gestures, but they lack "spatial awareness." They don't know you are walking around them. They don't know when to look you in the eye or when to turn away to give you space.

2. The Solution: The "Dance Partner"

SARAH is like a perfect dance partner.

  • Listening to the Music (Audio): Just like a dancer moves to the beat of the music, SARAH moves its hands and body to match the rhythm and emotion of the conversation.
  • Watching the Floor (Spatial Awareness): Unlike a zombie, SARAH is constantly watching where you are standing. If you walk to the left, SARAH turns its head and body to face you. If you step back, SARAH might lean in slightly to keep the conversation intimate. It understands the "personal space" bubble.
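To make the "watching the floor" idea concrete, here is a tiny sketch of the geometry involved: turning to face whoever you are talking to is just an `atan2` over the listener's position relative to the agent. The function name and coordinate convention are illustrative assumptions, not SARAH's actual API.

```python
import math

def yaw_toward_user(agent_pos, user_pos):
    """Yaw angle (radians) the agent should face to look at the user.

    Hypothetical helper: the paper conditions motion on the user's
    position, but this exact function is an illustration, not SARAH's code.
    agent_pos / user_pos are (x, z) floor coordinates; yaw 0 faces +z.
    """
    dx = user_pos[0] - agent_pos[0]
    dz = user_pos[1] - agent_pos[1]
    return math.atan2(dx, dz)

# User steps to the agent's left and slightly forward: agent turns left.
angle = yaw_toward_user((0.0, 0.0), (-1.0, 1.0))
print(round(angle, 3))  # → -0.785 (45 degrees to the left)
```

A real system would feed a signal like this into the motion model as conditioning rather than snapping the body to the angle directly, which is what keeps the turn looking natural instead of robotic.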

3. How It Thinks: The "Instant Chef"

To do this in real-time (without lag), SARAH uses a special type of brain architecture.

  • The Old Way (The Slow Cooker): Many previous AI models were like a chef who waits until the entire meal is cooked before serving it. They need to see the future (what you will do next) to plan their moves. This is too slow for VR and causes lag.
  • The SARAH Way (The Live Stream Chef): SARAH is like a chef cooking a live stream. They only look at what is happening right now and what happened a second ago. They make decisions instantly, frame by frame, so there is no delay. This is called "causal" processing.
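In transformer terms, the "live stream chef" idea is a causal attention mask: each frame may look at itself and the past, never the future. The sketch below shows only the mask, an illustrative fragment rather than SARAH's actual architecture.

```python
import numpy as np

def causal_mask(n_frames):
    """Lower-triangular attention mask: frame t may attend only to
    frames 0..t. This is the standard construction behind 'causal'
    processing; SARAH's full transformer internals are not shown here.
    """
    return np.tril(np.ones((n_frames, n_frames), dtype=bool))

m = causal_mask(4)
print(m.astype(int))
# Row t = what frame t is allowed to see: nothing from the future.
```

Because no frame ever waits on future input, the model can emit motion the moment each new audio/position sample arrives, which is exactly what removes the lag.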

4. The Secret Sauce: The "Latent Space"

To move so fast, SARAH doesn't calculate every single muscle movement from scratch every time.

  • The Analogy: Imagine trying to describe a complex dance to a friend. You could describe every single finger movement (too slow!). Or, you could describe the "vibe" and the "key poses" (fast!).
  • The Tech: SARAH uses a VAE (Variational Autoencoder). Think of this as a compression tool: it takes the complex 3D movement and squishes it into a simple "summary code" (latent tokens). A Flow Matching model (like a smooth river current) then predicts the next move in that compressed space. This lets it run at over 300 frames per second, well above the refresh rate of any VR headset.
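The compress-then-predict pipeline above can be sketched with toy stand-ins. Everything here is a placeholder for learned networks (the real encoder and velocity field are trained models), but the shape of the computation, encode to a small latent, then take a cheap integration step to get the next latent, is the point.

```python
import numpy as np

def encode(pose):
    """Toy 'VAE encoder': squash a 64-number pose into an 8-number
    summary code. A real VAE is a learned network; this just averages
    groups of values to show the compression idea."""
    return pose.reshape(8, -1).mean(axis=1)

def flow_step(latent, dt=0.1):
    """One Euler step along a velocity field, the basic move in flow
    matching. The field here (-latent) is a stand-in for the model's
    learned output."""
    velocity = -latent
    return latent + dt * velocity

rng = np.random.default_rng(0)
pose = rng.normal(size=64)   # e.g. 64 joint-angle values for one frame
z = encode(pose)             # 64 numbers -> 8: the "summary code"
z_next = flow_step(z)        # predict the next latent in one cheap step
print(z.shape, z_next.shape)  # → (8,) (8,)
```

Working in the 8-dimensional latent instead of the 64-dimensional pose is what makes each prediction step cheap enough to repeat hundreds of times per second.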

5. The "Eye Contact" Dial

One of the coolest features is that you can control how much eye contact the agent makes.

  • The Analogy: Think of eye contact like a volume knob on a radio.
    • Low Volume (0.0): The agent is shy or distracted; it looks away or around the room.
    • Medium Volume (0.8): The agent is a normal, polite conversationalist, looking at you most of the time but glancing away naturally.
    • High Volume (1.0): The agent is staring intensely at you, never looking away.
  • Why it matters: In real life, staring too much is creepy, and looking away too much is rude. SARAH lets you adjust this dial to fit your comfort level, making the interaction feel more human.
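One simple way to picture the dial is as a blend weight between a "look at the user" direction and a "look around the room" direction. The linear blend below is an illustrative simplification I am assuming for the sketch, not the paper's actual gaze mechanism.

```python
import numpy as np

def gaze_direction(to_user, wander, contact=0.8):
    """Blend a look-at-user vector with a look-around vector.

    `contact` plays the role of the eye-contact dial: 0.0 mostly
    wanders, 1.0 locks onto the user. Returns a unit direction.
    Hypothetical helper, not SARAH's API.
    """
    d = contact * np.asarray(to_user, float) + (1 - contact) * np.asarray(wander, float)
    return d / np.linalg.norm(d)

# Dial at 1.0: gaze locks onto the user's direction.
print(gaze_direction([1, 0, 0], [0, 1, 0], contact=1.0))  # → [1. 0. 0.]
```

In practice the dial would modulate a learned distribution over gaze targets rather than a fixed blend, which is what lets glances away still look spontaneous at medium settings.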

6. The Result: A Real Conversation

The researchers tested SARAH on a dataset called Embody 3D, which captures real people walking around and talking to each other (not just sitting in chairs).

  • The Test: They compared SARAH to other AI models.
  • The Winner: SARAH was 3 times faster than the best non-real-time models and produced much more natural movement. It didn't just wave its hands; it turned its whole body, shifted its weight, and maintained eye contact naturally, just like a human would.

In Summary

SARAH is the first system that lets a virtual human act like a real person in a virtual room. It listens to the conversation, watches where you walk, and moves its body instantly to match. It's the difference between talking to a puppet and talking to a friend.

The paper proves that we can finally build virtual agents that don't just speak, but exist in your space, ready to engage with you in real-time.
