Do What I Say: A Spoken Prompt Dataset for Instruction-Following

This paper introduces DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts for evaluating Speech Large Language Models under realistic spoken-instruction conditions. Its headline finding: text prompts generally outperform spoken ones, except in tasks that require speech output.

Maike Züfle, Sara Papi, Fabian Retkowski, Szymon Mazurek, Marek Kasztelnik, Alexander Waibel, Luisa Bentivogli, Jan Niehues

Published Wed, 11 Ma

Imagine you've built a brilliant, all-knowing robot assistant. You've tested it a thousand times by typing commands into a keyboard, and it passes every test with flying colors. You think, "Perfect! This robot is ready for the world."

But then, you try talking to it out loud. Suddenly, the robot stammers, misunderstands your casual requests, or gives you the wrong answer. Why? Because you only tested it on how it reads, not how it listens.

This is exactly the problem the paper "Do What I Say" (or DOWIS) addresses.

The Problem: The "Text-Only" Blind Spot

Most researchers test "Speech Large Language Models" (AI that can hear and speak) using text prompts. It's like testing a chef's ability to cook a meal by only reading the recipe to them, never letting them hear you say, "Hey, could you chop these onions?"

The authors argue this is unfair. In the real world, we talk to our devices. We say, "Summarize this meeting," or "Translate what that person just said." If we only test AI with typed text, we get an overly optimistic view of how smart it really is.

The Solution: DOWIS (The "Voice-First" Toolkit)

The researchers created DOWIS, a massive new dataset. Think of it as a universal remote control for testing AI voices.

  • It's Multilingual: They didn't just do English. They covered 11 languages (like German, Spanish, Russian, etc.), making sure the AI works across different cultures.
  • It's Diverse: They didn't just say "Translate this." They recorded 10 different ways of asking for the same thing, for example:
    • Formal: "Please provide a translation."
    • Casual: "Hey, can you translate this?"
    • Short: "Translate."
    • Detailed: "Translate this, but keep the tone friendly."
  • It's Human: Unlike other datasets where computers read text to simulate voices, DOWIS used real humans recording these prompts on their phones and laptops. This captures the natural "noise" and rhythm of real human speech.
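To make the bullet points above concrete, here is a rough sketch of what one DOWIS-style entry might look like as a data structure. The field names and values are purely illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass

# Hypothetical record layout for one recorded prompt.
# All field names are illustrative -- the real DOWIS schema may differ.
@dataclass
class PromptRecord:
    task: str          # e.g. "translation", "summarization"
    language: str      # one of the 11 covered languages
    style: str         # "formal", "casual", "short", "detailed", ...
    text_prompt: str   # the written form of the instruction
    audio_path: str    # path to the human recording of the same instruction

# An invented example entry pairing a written prompt with its recording.
example = PromptRecord(
    task="translation",
    language="German",
    style="casual",
    text_prompt="Hey, can you translate this?",
    audio_path="recordings/de/casual_001.wav",
)
print(example.style)  # -> casual
```

The key design point captured here is the pairing: every instruction exists both as text and as a human recording, so a model can be tested on the exact same content in either modality.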

The Experiment: Putting AI to the Test

The authors took two of the smartest AI models available today (named Phi-4 and Qwen) and put them through the DOWIS gauntlet. They asked the models to do nine different tasks, from translating languages to summarizing meetings and answering questions.
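Conceptually, the comparison boils down to scoring each model on each task twice, once per prompt modality, and looking at the difference. The toy sketch below uses invented scores purely for illustration, not the paper's actual numbers or evaluation harness:

```python
# Toy sketch: compare average scores for text vs. spoken prompts.
# All scores below are made up for illustration only.
results = [
    {"task": "translation",   "modality": "text",   "score": 0.82},
    {"task": "translation",   "modality": "spoken", "score": 0.71},
    {"task": "summarization", "modality": "text",   "score": 0.78},
    {"task": "summarization", "modality": "spoken", "score": 0.66},
]

def average(modality):
    """Mean score over all tasks for one prompt modality."""
    scores = [r["score"] for r in results if r["modality"] == modality]
    return sum(scores) / len(scores)

# A positive gap means the model did better when instructions were typed.
gap = average("text") - average("spoken")
print(round(gap, 3))
```

With these made-up numbers the "text advantage" shows up as a positive gap; the paper's finding is that this gap is large for text-output tasks and vanishes for speech-output ones.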

The Big Reveal: The "Text vs. Voice" Gap

Here is what they found, using a simple analogy:

1. The "Text" Advantage (The Optimistic Lie)
When the AI was given instructions via text, it performed brilliantly. It was like a student acing a written exam.

  • The Catch: When the exact same instructions were given via spoken voice, the AI's performance dropped significantly, especially for tasks where the answer is written text (like translation or summarization).
  • The Metaphor: It's like a musician who can play perfectly when reading sheet music but freezes up when someone hums the tune to them. The text prompts were hiding the AI's weaknesses.

2. The "Voice" Exception
Interestingly, when the task required the AI to speak back (like Text-to-Speech or translating spoken words to spoken words), the gap disappeared. The AI performed just as well with voice prompts as it did with text.

  • Why? It seems the AI is comfortable "thinking" in text but struggles to "listen" to spoken instructions when it needs to output a written answer.

3. The "Style" Factor
The way you ask matters.

  • Formal and Detailed instructions worked best. The AI loves clear, structured orders.
  • Casual and Short instructions (like "Hey, do this thing") were the hardest for the AI to follow. It's like the AI gets confused by slang or brevity.

4. The Gender Bias
They also noticed that the AI sometimes performed slightly differently depending on whether a man or a woman was speaking the prompt. This suggests the AI might have hidden biases based on the speaker's voice, a problem that needs fixing.

The Takeaway

The paper concludes that we cannot trust current AI evaluations that only use text. If we want to build AI that truly understands us in the real world, we need to test it the way we actually use it: by talking to it.

DOWIS is the tool that finally lets researchers say, "Okay, let's stop testing the robot's reading skills and start testing its listening skills." It ensures that when we finally talk to our AI assistants, they won't just be smart on paper—they'll be smart in the conversation.