Let's Talk, Not Type: An Oral-First Multi-Agent Architecture for Guaraní

Imagine you are trying to have a deep, relaxed conversation with a friend. You pause to think, you interrupt each other to say "Wait, I meant this," and you use slang or mix languages because that's how you actually speak.

Now, imagine that same conversation with a machine. Most voice assistants today (like Siri or Alexa) are bad at this. They treat your voice like a typewriter. They wait for you to stop talking completely, type out what you said, look up the answer in a textbook, and then read it back to you. If you pause for a second to think, the assistant assumes you're done and cuts you off. If you mix languages or use a local dialect, it gets confused and asks you to "speak clearly" or "try again."

This paper argues that this "text-first" way of building AI is broken for many people, especially those who speak Guaraní, a language widely spoken in Paraguay. Guaraní is a language that lives in the air and in conversation, not just on paper. For many Guaraní speakers, the written word is often in Spanish (the "official" language), while their daily life, jokes, and family stories happen in Guaraní.

The authors propose a new way to build AI called an "Oral-First Multi-Agent Architecture."

Here is the simple explanation of their idea, using a few creative analogies:

The Problem: The "Translator" vs. The "Conversationalist"

Think of current AI as a strict librarian. You whisper a request, the librarian writes it down, checks the index cards (text), and hands you a book. If you whisper too softly or pause, the librarian stops listening.

The authors say we need a team of friends instead. They call this a "Multi-Agent System." Instead of one giant brain trying to do everything, they propose a team of six specialized "friends" (agents) who work together to have a real conversation.

The Team of Six Friends (The Agents)

The Listener (The Patient Friend):
- What they do: They don't just wait for silence; they know the difference between a pause for thought and the end of a sentence.
- The Analogy: Guaraní words can contain a "glottal stop," a brief catch in the throat in the middle of a word. A standard robot can mistake this for silence and cut you off. This "Listener" friend knows, "Oh, that's part of the word, keep listening." They hold the floor for you so you don't get interrupted.
The Cultural Interpreter (The Local Guide):
- What they do: They understand the meaning and the culture, not just the dictionary definition.
- The Analogy: If you say, "Let's go to the purahei," a standard robot might be confused. This agent knows that in Guaraní culture, this might mean "let's listen to music" or "let's sing." They understand slang, mixing Spanish and Guaraní (called Jopará), and regional quirks. They don't try to translate you into "perfect" English; they understand you.
The Memory Keeper (The Note-Taker):
- What they do: They remember what you talked about five minutes ago.
- The Analogy: If you say, "I don't like this," a standard robot asks, "What is 'this'?" The Memory Keeper remembers you were talking about a specific song and knows you want to skip that song. They keep the story flowing so you don't have to repeat yourself.
The Guardian (The Gatekeeper):
- What they do: They protect your privacy and your data.
- The Analogy: This is the most important friend. Before the team does anything, the Guardian checks: "Did the user say it's okay to save this voice recording?" "Is this a private family story?" They ensure the community owns their own data, not a big tech company. They are the "No" button that cannot be ignored.
The Conversationalist (The Voice):
- What they do: They speak back to you in a natural way.
- The Analogy: Instead of a robotic "Task completed," they say, "Okay, I'm playing that song now. Is that good?" They sound like a human joining the chat, not a computer reading a status report.
The Specialists (The Doers):
- What they do: They actually do the tasks (like playing music or opening a browser).
- The Analogy: These are the workers who actually turn the knobs and press the buttons, but they only do it after the other friends have agreed it's safe and correct.

Why This Matters

The paper argues that for AI to be truly fair, it can't just be a machine that turns speech into text and back again. It has to be a "conversation-to-conversation" system.

Current AI: Treats spoken language like a broken text message.
This New AI: Treats spoken language like a living, breathing conversation.

By building this system specifically for Guaraní, the authors hope to show that AI can respect how people actually live and speak. It's not just about translating words; it's about respecting the rhythm, the pauses, the privacy, and the culture of the people using it. It's about making technology that feels like a friend, not a boss.

Based on the position paper "Let's Talk, Not Type: An Oral-First Multi-Agent Architecture for Guaraní," here is a detailed technical summary covering the problem, methodology, contributions, results, and significance.

1. Problem Statement

Current Artificial Intelligence (AI) and Human-Computer Interaction (HCI) systems are predominantly text-first, relying on pipelines that transcribe speech to text, parse it, and generate a response. This architecture fails to support primarily oral languages and indigenous communities, particularly those exhibiting diglossia (a sociolinguistic situation where two language varieties are used in different domains, e.g., Guaraní for oral/informal use and Spanish for formal/written use).

Key issues identified:

Interaction Mismatch: Standard voice assistants (e.g., Alexa) function as command-and-control interfaces (wake word $\rightarrow$ short request $\rightarrow$ single response) rather than natural dialogue. They lack mechanisms for turn-taking, interruption handling, and repair (clarification).
Cultural Exclusion: In Paraguay, while Guaraní is widely spoken, digital interfaces default to Spanish due to institutional biases and literacy requirements. Bilingual speakers must either bear the cognitive burden of switching to Spanish or abandon the technology.
Data Sovereignty: Existing systems often lack robust mechanisms for indigenous data governance, failing to respect community norms regarding privacy, consent, and data retention.
Low-Resource Limitations: Low-resource languages often lack the conversational infrastructure (dialogue state tracking, repair strategies) necessary for robust multi-turn interaction, not just basic speech recognition.

2. Methodology: Oral-First Multi-Agent Architecture (MAS)

The authors propose a Multi-Agent System (MAS) that decouples language understanding, conversation state, action execution, and governance. Instead of a monolithic Large Language Model (LLM), the system utilizes six specialized, cooperating agents to treat speech as a "first-class" design requirement.

The Six-Agent Architecture:

The Listener (Speech Interface Agent):
- Function: Captures audio and manages turn-taking.
- Technical Specifics: Uses Voice Activity Detection (VAD) tuned to Guaraní phonetics (e.g., distinguishing the puso, or glottal stop, from actual turn completion). It prevents premature interruption by respecting natural pause durations and floor-holding cues.
The Cultural Interpreter (Guaraní Understanding Agent):
- Function: Maps spoken input to abstract intents (e.g., PLAY_MUSIC).
- Technical Specifics: Trained on authentic, community-verified speech (including Jopará, the mixed Guaraní-Spanish dialect) rather than synthetic translations. It handles code-switching, loanwords, and regional variations.
The Memory Keeper (Conversation State Agent):
- Function: Maintains shared context (common ground) across turns.
- Technical Specifics: Performs anaphora resolution (e.g., understanding "this" refers to the current song) and tracks dialogue flow, pending actions, and context to enable fluid multi-turn exchanges.
The Guardian (Permission & Governance Agent):
- Function: A sovereign layer that mediates consent and data privacy.
- Technical Specifics: Operates independently from the execution layer. It checks every action against community-defined norms (e.g., "Do not store this audio") before allowing other agents to proceed. This ensures Indigenous Data Sovereignty.
The Conversationalist (Response Agent):
- Function: Generates conversational audio output.
- Technical Specifics: Produces responses grounded in dialogue state and action outcomes (confirmations, denials, repair prompts). Responses are vetted by the Guardian agent before delivery.
The Specialists (Action Agents):
- Function: Execute specific domain tasks (e.g., Media Agent for Spotify, Browser Agent).
- Technical Specifics: Modular, narrow-expertise agents that allow the system to scale (e.g., adding weather or smart home control) without redesigning the core conversational logic.
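
As one illustration of the Listener's turn-taking logic described above, the following is a minimal sketch of pause-aware endpointing. It assumes an upstream VAD already labels each audio frame as speech or non-speech. The `Frame` type, the function names, and both thresholds (`glottal_max_ms`, `end_of_turn_ms`) are hypothetical placeholders, not values from the paper; real thresholds would have to be tuned on community-recorded Guaraní speech.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    is_speech: bool        # label from any upstream VAD (assumed to exist)
    duration_ms: int = 20  # hypothetical frame length


def classify_gap(silence_ms: int, glottal_max_ms: int = 120,
                 end_of_turn_ms: int = 900) -> str:
    """Label a run of silence. Both thresholds are illustrative guesses."""
    if silence_ms <= glottal_max_ms:
        return "within_word"      # short closure, e.g. a glottal stop
    if silence_ms < end_of_turn_ms:
        return "thinking_pause"   # speaker is holding the floor
    return "turn_end"


def find_turn_end(frames: List[Frame], glottal_max_ms: int = 120,
                  end_of_turn_ms: int = 900) -> Optional[int]:
    """Return the index of the frame where the turn is judged complete."""
    silence_ms = 0
    heard_speech = False
    for index, frame in enumerate(frames):
        if frame.is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence counter
        elif heard_speech:  # ignore silence before the first speech frame
            silence_ms += frame.duration_ms
            gap = classify_gap(silence_ms, glottal_max_ms, end_of_turn_ms)
            if gap == "turn_end":
                return index
    return None  # speaker still holds the floor


def make_frames(*runs) -> List[Frame]:
    """Build frames from (is_speech, frame_count) runs."""
    return [Frame(is_speech) for is_speech, count in runs for _ in range(count)]


# Speech, a 60 ms closure, speech, a 400 ms pause, speech, then 1 s of silence.
frames = make_frames((True, 10), (False, 3), (True, 10),
                     (False, 20), (True, 5), (False, 50))
patient = find_turn_end(frames)                        # waits for the long silence
impatient = find_turn_end(frames, end_of_turn_ms=300)  # cuts in during the pause
```

With the default thresholds the turn ends only inside the final long silence; lowering `end_of_turn_ms` to 300 makes the detector fire inside the 400 ms thinking pause, which is the premature cut-off the Listener is meant to avoid.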

Training Data Strategy:
The methodology emphasizes community-led data collection over synthetic generation. It references initiatives like Mozilla Common Voice for acoustic modeling and Aikuaa (a community-led project using mingas or collaborative gatherings) to capture authentic Jopará, pragmatic nuances, and multi-turn conversational data.

3. Key Contributions

Architectural Shift: Proposes a shift from text-first pipelines (speech is transcribed to text, parsed, and answered) to an Oral-First Multi-Agent Architecture that prioritizes conversational dynamics (turn-taking, repair, shared context) over mere transcription accuracy.
Decoupling Governance: Introduces a distinct Permission & Governance Agent that acts as a sovereign gatekeeper, ensuring that data privacy and indigenous data sovereignty are hard-coded into the system logic rather than treated as an afterthought.
Handling Diglossia: Provides a technical framework specifically designed to handle the sociolinguistic reality of Paraguay, where users fluidly switch between Guaraní (oral) and Spanish (formal/written), and where Jopará is the norm.
New Evaluation Metrics: Moves beyond Word Error Rate (WER) to propose four conversational success metrics:
1. Task Success Rate (TSR): Success in multi-turn goals.
2. Repair Success Rate: Ability to recover from misunderstandings without user abandonment.
3. Perceived Sovereignty: Qualitative trust that data is controlled by the user/community.
4. Latency/Timing: Response speed aligned with cultural turn-taking norms.
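
The two quantitative metrics in this list could be computed from logged sessions roughly as follows. This is a sketch under stated assumptions: the summary does not give formulas, so the definitions here (successful sessions over all sessions, repaired misunderstandings over all misunderstandings) and the `Session` record are hypothetical. Perceived Sovereignty is qualitative and Latency/Timing depends on culturally specific norms, so neither is modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Session:
    goal_achieved: bool     # did the multi-turn goal succeed?
    misunderstandings: int  # turns where the system misunderstood the user
    repaired: int           # misunderstandings recovered without abandonment


def task_success_rate(sessions: List[Session]) -> Optional[float]:
    """Assumed definition: fraction of sessions whose goal was achieved."""
    if not sessions:
        return None
    return sum(s.goal_achieved for s in sessions) / len(sessions)


def repair_success_rate(sessions: List[Session]) -> Optional[float]:
    """Assumed definition: repaired misunderstandings over all misunderstandings."""
    total = sum(s.misunderstandings for s in sessions)
    if total == 0:
        return None  # nothing needed repair, so the rate is undefined
    return sum(s.repaired for s in sessions) / total


# Invented example data, for illustration only.
sessions = [
    Session(goal_achieved=True, misunderstandings=2, repaired=2),
    Session(goal_achieved=False, misunderstandings=1, repaired=0),
    Session(goal_achieved=True, misunderstandings=0, repaired=0),
]
tsr = task_success_rate(sessions)
rsr = repair_success_rate(sessions)
```

Returning `None` rather than 0 for empty inputs keeps "no data" distinct from "always failed", which matters when session counts are small.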

4. Results and Validation

Note: As this is a position paper, it proposes a framework rather than reporting empirical experimental results from a deployed system.

Simulation of Interaction: The paper provides a conceptual walkthrough (Table 1) demonstrating how the agents collaborate to handle context resolution and repair (e.g., a user saying "No, I don't like this" triggers the State Agent to resolve "this" to the current song and the Media Agent to skip).
Feasibility Argument: The authors argue that recent work on multi-agent task-solving indicates that decomposing interactions improves success rates compared to monolithic LLMs.
Gap Identification: The paper highlights that while speech datasets exist, there is a critical lack of multi-turn, spontaneous spoken interaction data for Guaraní, identifying this as the primary bottleneck for future implementation.
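
The conceptual walkthrough described above (the user says "No, I don't like this", the State Agent resolves "this", and the Media Agent skips) can be sketched in a few functions. Everything here is a hypothetical stand-in: the keyword matcher replaces a trained understanding model, the `policy` dictionary replaces community-defined governance rules, and the song names are invented. It is not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DialogueState:
    current_song: Optional[str] = None
    queue: List[str] = field(default_factory=list)


def interpret(utterance: str) -> dict:
    """Cultural Interpreter stand-in: keyword matching, not a trained model."""
    if "don't like this" in utterance.lower():
        return {"intent": "SKIP", "target": "this"}
    return {"intent": "UNKNOWN", "target": None}


def resolve(intent: dict, state: DialogueState) -> dict:
    """Memory Keeper: resolve the anaphor 'this' to the song now playing."""
    if intent["target"] == "this":
        return {**intent, "target": state.current_song}
    return intent


def guardian_allows(intent: dict, policy: dict) -> bool:
    """Guardian: approve an action only if the policy lists its intent."""
    return intent["intent"] in policy["allowed_intents"]


def media_agent(state: DialogueState) -> str:
    """Specialist: skip to the next queued song and report the outcome."""
    skipped = state.current_song
    state.current_song = state.queue.pop(0) if state.queue else None
    return f"Skipped {skipped}. Now playing {state.current_song}."


def handle_turn(utterance: str, state: DialogueState, policy: dict) -> str:
    intent = resolve(interpret(utterance), state)
    if intent["intent"] == "UNKNOWN" or intent["target"] is None:
        return "Sorry, what do you mean?"    # repair prompt
    if not guardian_allows(intent, policy):  # the gate runs before any action
        return "I can't do that here."
    return media_agent(state)


state = DialogueState(current_song="Song A", queue=["Song B"])
policy = {"allowed_intents": {"SKIP"}}
reply = handle_turn("No, I don't like this", state, policy)
```

The design point the sketch tries to show is ordering: the Guardian check sits between understanding and execution, so a denied action never reaches the Specialist and the dialogue state stays unchanged.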

5. Significance

Culturally Grounded AI: The work argues that for AI to be truly equitable, it must adapt to the communicative practices of the community rather than forcing oral languages into text-centric models.
Indigenous Data Sovereignty: By placing a governance agent at the core of the architecture, the paper offers a blueprint for respecting indigenous rights over voice data, addressing the risk of cultural extraction.
Beyond Accessibility: It moves the discourse from "accessibility for low-literacy users" to "designing for oral-first cultures," acknowledging that oral traditions rely on narrative, repetition, and distributed memory rather than list-based externalization.
Scalability: The modular MAS approach allows for the gradual addition of capabilities (weather, smart home) without compromising the core conversational integrity, making it a viable path for low-resource language development.

In conclusion, the paper posits that treating spoken conversation as a first-class design requirement is essential for empowering indigenous communities and ensuring that digital ecosystems reflect, rather than overwrite, diverse linguistic practices.
