Do What I Say: A Spoken Prompt Dataset for Instruction-Following

This paper introduces DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts for evaluating Speech Large Language Models under realistic spoken-instruction conditions. Its headline finding: text prompts generally outperform spoken ones, except in tasks that require speech output.

Maike Züfle, Sara Papi, Fabian Retkowski, Szymon Mazurek, Marek Kasztelnik, Alexander Waibel, Luisa Bentivogli, Jan Niehues

Published Wed, 11 Ma

Imagine you've built a brilliant, all-knowing robot assistant. You've tested it a thousand times by typing commands into a keyboard, and it passes every test with flying colors. You think, "Perfect! This robot is ready for the world."

But then, you try talking to it out loud. Suddenly, the robot stammers, misunderstands your casual requests, or gives you the wrong answer. Why? Because you only tested it on how it reads, not how it listens.

This is exactly the problem the paper "Do What I Say" (or DOWIS) addresses.

The Problem: The "Text-Only" Blind Spot

Most researchers test "Speech Large Language Models" (AI that can hear and speak) using text prompts. It's like testing a chef's ability to cook a meal by only reading the recipe to them, never letting them hear you say, "Hey, could you chop these onions?"

The authors argue this is unfair. In the real world, we talk to our devices. We say, "Summarize this meeting," or "Translate what that person just said." If we only test AI with typed text, we get an overly optimistic view of how smart it really is.

The Solution: DOWIS (The "Voice-First" Toolkit)

The researchers created DOWIS, a massive new dataset. Think of it as a universal remote control for testing AI voices.

  • It's Multilingual: They didn't just do English. They covered 11 languages (like German, Spanish, Russian, etc.), making sure the AI works across different cultures.
  • It's Diverse: They didn't just say "Translate this." They recorded 10 different ways of asking for the same thing, for example:
    • Formal: "Please provide a translation."
    • Casual: "Hey, can you translate this?"
    • Short: "Translate."
    • Detailed: "Translate this, but keep the tone friendly."
  • It's Human: Unlike other datasets where computers read text to simulate voices, DOWIS used real humans recording these prompts on their phones and laptops. This captures the natural "noise" and rhythm of real human speech.
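To make the bullet points above concrete, here is a rough sketch of what one DOWIS-style entry might look like as a data structure. The field names and values are purely illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass

# Hypothetical record layout for one recorded prompt.
# All field names are illustrative -- the real DOWIS schema may differ.
@dataclass
class PromptRecord:
    task: str          # e.g. "translation", "summarization"
    language: str      # one of the 11 covered languages
    style: str         # "formal", "casual", "short", "detailed", ...
    text_prompt: str   # the written form of the instruction
    audio_path: str    # path to the human recording of the same instruction

# An invented example entry pairing a written prompt with its recording.
example = PromptRecord(
    task="translation",
    language="German",
    style="casual",
    text_prompt="Hey, can you translate this?",
    audio_path="recordings/de/casual_001.wav",
)
print(example.style)  # -> casual
```

The key design point captured here is the pairing: every instruction exists both as text and as a human recording, so a model can be tested on the exact same content in either modality.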

The Experiment: Putting AI to the Test

The authors took two of the smartest AI models available today (named Phi-4 and Qwen) and put them through the DOWIS gauntlet. They asked the models to do nine different tasks, from translating languages to summarizing meetings and answering questions.
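Conceptually, the comparison boils down to scoring each model on each task twice, once per prompt modality, and looking at the difference. The toy sketch below uses invented scores purely for illustration, not the paper's actual numbers or evaluation harness:

```python
# Toy sketch: compare average scores for text vs. spoken prompts.
# All scores below are made up for illustration only.
results = [
    {"task": "translation",   "modality": "text",   "score": 0.82},
    {"task": "translation",   "modality": "spoken", "score": 0.71},
    {"task": "summarization", "modality": "text",   "score": 0.78},
    {"task": "summarization", "modality": "spoken", "score": 0.66},
]

def average(modality):
    """Mean score over all tasks for one prompt modality."""
    scores = [r["score"] for r in results if r["modality"] == modality]
    return sum(scores) / len(scores)

# A positive gap means the model did better when instructions were typed.
gap = average("text") - average("spoken")
print(round(gap, 3))
```

With these made-up numbers the "text advantage" shows up as a positive gap; the paper's finding is that this gap is large for text-output tasks and vanishes for speech-output ones.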

The Big Reveal: The "Text vs. Voice" Gap

Here is what they found, using a simple analogy:

1. The "Text" Advantage (The Optimistic Lie)
When the AI was given instructions via text, it performed brilliantly. It was like a student acing a written exam.

  • The Catch: When the exact same instructions were given via spoken voice, the AI's performance dropped significantly, especially for tasks where the answer is written text (like translation or summarization).
  • The Metaphor: It's like a musician who can play perfectly when reading sheet music but freezes up when someone hums the tune to them. The text prompts were hiding the AI's weaknesses.

2. The "Voice" Exception
Interestingly, when the task required the AI to speak back (like Text-to-Speech or translating spoken words to spoken words), the gap disappeared. The AI performed just as well with voice prompts as it did with text.

  • Why? It seems the AI is comfortable "thinking" in text but struggles to "listen" to spoken instructions when it needs to output a written answer.

3. The "Style" Factor
The way you ask matters.

  • Formal and Detailed instructions worked best. The AI loves clear, structured orders.
  • Casual and Short instructions (like "Hey, do this thing") were the hardest for the AI to follow. It's like the AI gets confused by slang or brevity.

4. The Gender Bias
They also noticed that the AI sometimes performed slightly differently depending on whether a man or a woman was speaking the prompt. This suggests the AI might have hidden biases based on the speaker's voice, a problem that needs fixing.

The Takeaway

The paper concludes that we cannot trust current AI evaluations that only use text. If we want to build AI that truly understands us in the real world, we need to test it the way we actually use it: by talking to it.

DOWIS is the tool that finally lets researchers say, "Okay, let's stop testing the robot's reading skills and start testing its listening skills." It ensures that when we finally talk to our AI assistants, they won't just be smart on paper—they'll be smart in the conversation.