SARAH: Spatially Aware Real-time Agentic Humans

SARAH is a real-time, fully causal framework that generates spatially aware, full-body conversational motions for embodied agents by combining a causal transformer-based VAE with a flow matching model, enabling natural gaze and orientation adjustments in streaming VR applications at over 300 FPS.

Evonne Ng, Siwei Zhang, Zhang Chen, Michael Zollhoefer, Alexander Richard

Published 2026-02-23

Imagine you are wearing a VR headset, and you are having a conversation with a virtual human. In most current VR experiences, that virtual person is a bit like a statue with a voice. No matter where you walk around them, they keep staring straight ahead. If you move to their left, they don't turn to look at you. If you step back, they don't step forward to keep the conversation comfortable. It feels stiff and unnatural, like talking to a mannequin.

This paper introduces SARAH (Spatially Aware Real-time Agentic Humans), a new system designed to fix this. Think of SARAH not as a statue but as a real, attentive friend who moves, looks at you, and reacts to your position instantly.

Here is how it works, broken down with some everyday analogies:

1. The Problem: The "Zombie" Avatar

Current virtual agents are like zombies that only know how to talk. They have great voices and hand gestures, but they lack "spatial awareness." They don't know you are walking around them. They don't know when to look you in the eye or when to turn away to give you space.

2. The Solution: The "Dance Partner"

SARAH is like a perfect dance partner.

  • Listening to the Music (Audio): Just like a dancer moves to the beat of the music, SARAH moves its hands and body to match the rhythm and emotion of the conversation.
  • Watching the Floor (Spatial Awareness): Unlike a zombie, SARAH is constantly watching where you are standing. If you walk to the left, SARAH turns its head and body to face you. If you step back, SARAH might lean in slightly to keep the conversation intimate. It understands the "personal space" bubble.
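To make the "watching the floor" idea concrete, here is a tiny sketch of the geometry involved: turning to face whoever you are talking to is just an `atan2` over the listener's position relative to the agent. The function name and coordinate convention are illustrative assumptions, not SARAH's actual API.

```python
import math

def yaw_toward_user(agent_pos, user_pos):
    """Yaw angle (radians) the agent should face to look at the user.

    Hypothetical helper: the paper conditions motion on the user's
    position, but this exact function is an illustration, not SARAH's code.
    agent_pos / user_pos are (x, z) floor coordinates; yaw 0 faces +z.
    """
    dx = user_pos[0] - agent_pos[0]
    dz = user_pos[1] - agent_pos[1]
    return math.atan2(dx, dz)

# User steps to the agent's left and slightly forward: agent turns left.
angle = yaw_toward_user((0.0, 0.0), (-1.0, 1.0))
print(round(angle, 3))  # → -0.785 (45 degrees to the left)
```

A real system would feed a signal like this into the motion model as conditioning rather than snapping the body to the angle directly, which is what keeps the turn looking natural instead of robotic.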

3. How It Thinks: The "Instant Chef"

To do this in real-time (without lag), SARAH uses a special type of brain architecture.

  • The Old Way (The Slow Cooker): Many previous AI models were like a chef who waits until the entire meal is cooked before serving it. They need to see the future (what you will do next) to plan their moves. This is too slow for VR and causes lag.
  • The SARAH Way (The Live Stream Chef): SARAH is like a chef cooking a live stream. They only look at what is happening right now and what happened a second ago. They make decisions instantly, frame by frame, so there is no delay. This is called "causal" processing.
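In transformer terms, the "live stream chef" idea is a causal attention mask: each frame may look at itself and the past, never the future. The sketch below shows only the mask, an illustrative fragment rather than SARAH's actual architecture.

```python
import numpy as np

def causal_mask(n_frames):
    """Lower-triangular attention mask: frame t may attend only to
    frames 0..t. This is the standard construction behind 'causal'
    processing; SARAH's full transformer internals are not shown here.
    """
    return np.tril(np.ones((n_frames, n_frames), dtype=bool))

m = causal_mask(4)
print(m.astype(int))
# Row t = what frame t is allowed to see: nothing from the future.
```

Because no frame ever waits on future input, the model can emit motion the moment each new audio/position sample arrives, which is exactly what removes the lag.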

4. The Secret Sauce: The "Latent Space"

To move so fast, SARAH doesn't calculate every single muscle movement from scratch every time.

  • The Analogy: Imagine trying to describe a complex dance to a friend. You could describe every single finger movement (too slow!). Or, you could describe the "vibe" and the "key poses" (fast!).
  • The Tech: SARAH uses a VAE (Variational Autoencoder). Think of this as a compression tool: it takes the complex 3D movement and squishes it into a simple "summary code" (latent tokens). A Flow Matching model (like a smooth river current) then predicts the next move in that compressed space. This lets it run at over 300 frames per second, well above the refresh rate of any VR headset.
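The compress-then-predict pipeline above can be sketched with toy stand-ins. Everything here is a placeholder for learned networks (the real encoder and velocity field are trained models), but the shape of the computation, encode to a small latent, then take a cheap integration step to get the next latent, is the point.

```python
import numpy as np

def encode(pose):
    """Toy 'VAE encoder': squash a 64-number pose into an 8-number
    summary code. A real VAE is a learned network; this just averages
    groups of values to show the compression idea."""
    return pose.reshape(8, -1).mean(axis=1)

def flow_step(latent, dt=0.1):
    """One Euler step along a velocity field, the basic move in flow
    matching. The field here (-latent) is a stand-in for the model's
    learned output."""
    velocity = -latent
    return latent + dt * velocity

rng = np.random.default_rng(0)
pose = rng.normal(size=64)   # e.g. 64 joint-angle values for one frame
z = encode(pose)             # 64 numbers -> 8: the "summary code"
z_next = flow_step(z)        # predict the next latent in one cheap step
print(z.shape, z_next.shape)  # → (8,) (8,)
```

Working in the 8-dimensional latent instead of the 64-dimensional pose is what makes each prediction step cheap enough to repeat hundreds of times per second.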

5. The "Eye Contact" Dial

One of the coolest features is that you can control how much eye contact the agent makes.

  • The Analogy: Think of eye contact like a volume knob on a radio.
    • Low Volume (0.0): The agent is shy or distracted; it looks away or around the room.
    • Medium Volume (0.8): The agent is a normal, polite conversationalist, looking at you most of the time but glancing away naturally.
    • High Volume (1.0): The agent is staring intensely at you, never looking away.
  • Why it matters: In real life, staring too much is creepy, and looking away too much is rude. SARAH lets you adjust this dial to fit your comfort level, making the interaction feel more human.
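One simple way to picture the dial is as a blend weight between a "look at the user" direction and a "look around the room" direction. The linear blend below is an illustrative simplification I am assuming for the sketch, not the paper's actual gaze mechanism.

```python
import numpy as np

def gaze_direction(to_user, wander, contact=0.8):
    """Blend a look-at-user vector with a look-around vector.

    `contact` plays the role of the eye-contact dial: 0.0 mostly
    wanders, 1.0 locks onto the user. Returns a unit direction.
    Hypothetical helper, not SARAH's API.
    """
    d = contact * np.asarray(to_user, float) + (1 - contact) * np.asarray(wander, float)
    return d / np.linalg.norm(d)

# Dial at 1.0: gaze locks onto the user's direction.
print(gaze_direction([1, 0, 0], [0, 1, 0], contact=1.0))  # → [1. 0. 0.]
```

In practice the dial would modulate a learned distribution over gaze targets rather than a fixed blend, which is what lets glances away still look spontaneous at medium settings.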

6. The Result: A Real Conversation

The researchers tested SARAH on a dataset called Embody 3D, which captures real people walking around and talking to each other (not just sitting in chairs).

  • The Test: They compared SARAH to other AI models.
  • The Winner: SARAH was 3 times faster than the best non-real-time models and produced much more natural movement. It didn't just wave its hands; it turned its whole body, shifted its weight, and maintained eye contact naturally, just like a human would.

In Summary

SARAH is the first system that lets a virtual human act like a real person in a virtual room. It listens to the conversation, watches where you walk, and moves its body instantly to match. It's the difference between talking to a puppet and talking to a friend.

The paper proves that we can finally build virtual agents that don't just speak, but exist in your space, ready to engage with you in real-time.
