Imagine you are talking to a robot in a virtual world. You say, "It's going to rain tomorrow."
If you say it with a cheerful, bouncy voice, you might be excited about a picnic.
If you say it with a heavy, sighing voice, you might be sad about a canceled trip.
If you say it with a sharp, angry voice, you might be furious about getting wet.
In the real world, humans instantly understand the mood behind the words. But in most current Virtual Reality (VR) games and apps, the robot only hears the words. It's as if the robot's hearing were filtered to strip away all the music and tone, leaving only the lyrics. So, no matter how you say it, the robot just replies, "Yes, rain is water falling from the sky." It's technically correct, but emotionally dead.
This paper, "Reading the Mood Behind Words," is about teaching VR robots to listen to the music, not just the lyrics.
The Problem: The "Flat" Robot
The authors argue that current VR agents are like musicians who have only ever seen the score. They can read the sheet music (the text) perfectly, but they have never heard the tempo or the emotion of an actual performance. Because they miss the "prosody" (the rhythm, pitch, and tone of your voice), they often give responses that feel robotic, stiff, or even rude, even if the words are polite.
The Solution: The "Emotion-Injecting" Pipeline
The researchers built a new system that acts like a translator for feelings. Here is how it works, using a simple analogy (a short code sketch of the same pipeline follows the list):
- The Listener (The Microphone): You speak into the VR headset.
- The Mood Detective (The AI): Before the robot even reads what you said, a special AI (called a Speech Emotion Recognition, or SER, model) listens to how you said it. It acts like a detective looking for clues in your voice. Is it happy? Sad? Angry?
- The Note-Taker (The Prompt): This AI writes a little sticky note saying, "User sounds Sad," and sticks it right onto your sentence before giving it to the main robot brain.
- The Brain (The LLM): The main robot brain sees the sentence and the sticky note. Now, instead of just saying, "It's raining," it says, "Oh no, it sounds like you're having a tough day. I hope the rain doesn't ruin your plans."
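For the technically curious, here is a minimal sketch of that four-step pipeline in Python. The Hugging Face model name, the sticky-note wording, and the `user_turn.wav` file path are illustrative assumptions, not details taken from the paper, and the LLM call is stubbed so the sketch stays self-contained:

```python
# A minimal sketch of the emotion-injection pipeline, assuming an
# off-the-shelf Hugging Face SER checkpoint and a stubbed LLM call.
from transformers import pipeline

# Step 2 (the Mood Detective): a speech emotion recognition model
# that classifies the audio itself, not the words in it.
ser = pipeline("audio-classification", model="superb/wav2vec2-base-superb-er")

def build_prompt(audio_path: str, transcript: str) -> str:
    """Steps 2-3: detect the mood, then write the 'sticky note'
    onto the transcript before the LLM ever sees it."""
    top = ser(audio_path)[0]  # e.g. {"label": "sad", "score": 0.91}
    return f"[User sounds {top['label']}] {transcript}"

# Step 4 (the Brain): hand the annotated prompt to whatever LLM
# drives the agent. Stubbed here; in practice this would be an
# API or local model call.
def respond(prompt: str) -> str:
    return f"(LLM reply conditioned on: {prompt})"

print(respond(build_prompt("user_turn.wav", "It's going to rain tomorrow.")))
```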
The Experiment: The "Neutral Sentence" Test
To prove this works, the researchers deliberately avoided sentences that were obviously emotional (like "I am so happy!"). Instead, they used boring, neutral sentences like, "The professor changed the classroom." That way, any emotion the robot picked up could only have come from the voice, never from the words.
- Group A (The Text-Only Robot): Heard the sentence. Responded with a boring fact.
- Group B (The Mood-Aware Robot): Heard the sentence plus the "Angry" sticky note from the voice. Responded with, "That sounds frustrating! Did you have to move all your stuff?"
Even though the words were the same, the Mood-Aware Robot felt like a real friend because it understood the vibe.
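At the prompt level, the whole experiment boils down to one difference. Here is a sketch of the two conditions (the bracketed label format is an assumption, not the paper's exact wording):

```python
# The same neutral sentence goes to both robots; only the injected
# mood label differs, so any change in the reply comes from prosody.
neutral_sentence = "The professor changed the classroom."

prompt_a = neutral_sentence                           # Group A: text only
prompt_b = f"[User sounds angry] {neutral_sentence}"  # Group B: mood-aware
```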
The Results: Humans Prefer the "Feeling" Robot
They tested this with 30 people. The results were overwhelming:
- 93% of people said they preferred the robot that listened to their mood.
- People felt the mood-aware robot was more human, more engaging, and more empathetic.
- Even though the words themselves carried no emotional clues (they were deliberately neutral), the robot was "right" about the feeling, which made the conversation feel natural.
Interestingly, some people thought the "Text-Only" robot was slightly more "attractive" or "fun" at first glance (maybe because it was simpler), but when asked who they would actually want to keep talking to, almost everyone chose the Mood-Aware one. It's the difference between a polite stranger who nods at you and a friend who actually cares how you're feeling.
The Big Takeaway
This paper proves that for VR agents to feel like real social partners, they can't just be text processors. They need to be mood readers.
Just like you wouldn't want a therapist who only reads your words but ignores your tears or your laughter, you don't want a VR companion that ignores the tone of your voice. By teaching robots to "read the mood behind the words," we can make virtual interactions feel less like talking to a computer and more like talking to a person.