🧠 The Big Idea: The "Chef and the Recipe" Problem
Imagine you have a brilliant Chef (the AI model) and a Customer (the user). The Chef wants to cook the perfect dish, but sometimes the Customer's order is vague, or the Chef just doesn't have the right technique for a specific ingredient.
In the past, when the dish didn't turn out right, researchers tried two separate fixes:
- The "Better Recipe" Approach (Prompt Engineering): They kept the Chef exactly the same but tried to rewrite the recipe instructions to be clearer.
- The Problem: If the Chef doesn't know how to chop a specific vegetable, no amount of better instructions will help. The Chef hits a "skill ceiling."
- The "Training the Chef" Approach (Test-Time Training): They kept the recipe exactly the same but tried to tweak the Chef's brain (weights) to learn from the mistake.
- The Problem: If the recipe was confusing to begin with, the Chef might learn the wrong lesson. They might start chopping onions like apples because the instructions were ambiguous. This leads to "overfitting" (memorizing noise instead of learning).
The Paper's Insight:
The authors argue that these two problems are coupled. You can't fix the Chef's skills if the recipe is confusing, and you can't fix the recipe if the Chef lacks the basic skills. You need to fix both at the same time.
They call this ROSA2: A system that simultaneously refines the Words (the instructions) and the Weights (the Chef's brain) in a single, coordinated dance.
🎭 The Analogy: The "Lost Hiker and the Map"
Let's try a different analogy to see why doing them separately fails.
Imagine a Hiker (the AI) trying to reach a hidden Treasure (the correct answer).
- The Map is the Prompt (Words).
- The Hiker's Legs are the Model Parameters (Weights).
Scenario A: Only Fixing the Map (Prompt Only)
The Hiker keeps tripping over rocks. You keep rewriting the map to say "Watch out for rocks!" but the Hiker's legs are weak and they still fall.
- Result: You hit a Deficit Trap. The instructions are perfect, but the Hiker physically can't make it.
Scenario B: Only Fixing the Legs (Weights Only)
The Hiker has strong legs, but the map says "Walk North" when the treasure actually lies to the north-north-east. The Hiker runs fast in the wrong direction, gets lost, and you try to train their legs to run even faster in that wrong direction.
- Result: You hit an Overfitting Trap. The Hiker gets really good at running in the wrong direction because the map was misleading.
The ROSA2 Solution: The "Co-Adaptation"
ROSA2 acts like a Smart Guide standing next to the Hiker.
- Step 1 (The Guide speaks): "Hey, the map is confusing. Let's redraw the arrow to point exactly North-North-East." (Refining the Words).
- Step 2 (The Guide trains): "Now that the direction is clear, let's strengthen your legs to run that specific path efficiently." (Updating the Weights).
By doing both, the Hiker doesn't just run faster; they run in the right direction. The clearer map makes the leg training effective, and the stronger legs make the new map useful.
🚀 How It Works (The "Secret Sauce")
The paper introduces a mathematical framework that treats the interaction as a joint optimization.
The "Textual Gradient" (Cleaning the Signal):
When the AI makes a mistake, ROSA2 doesn't just say "Try again." It analyzes why the user's feedback was confusing and rewrites the user's next question to be crystal clear.
- Metaphor: It's like a translator who hears a mumbled request and repeats it back clearly before the chef starts cooking. This "cleans" the learning signal.
The "Parameter Update" (The Muscle Memory):
Once the question is clear, the system tweaks the AI's internal settings to handle that specific type of question better.
- Metaphor: Now that the chef knows exactly what to do, they practice that specific move until it's muscle memory.
The Magic Result:
Because the instructions are clear before the training happens, the AI learns much faster. The paper argues mathematically that this reduces the total amount of parameter "tweaking" needed to reach a correct answer.
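The paper's actual update rules aren't reproduced here, but the alternating "clean the words, then update the weights" loop can be illustrated with a toy numeric sketch. Everything below is invented for illustration: the "prompt" `p` and the "weight" `w` are each a single number, the "model" is just `w * (x + p)`, and each step is ordinary gradient descent on squared error. The point is only the structure: refine the prompt with the weights frozen, then update the weights against the now-clearer input.

```python
# Toy sketch of words-and-weights co-adaptation (all names and the
# model form are illustrative, not taken from the paper).

def co_adapt(x, target, w=0.5, p=0.0, lr=0.05, steps=200):
    """Alternate prompt refinement and weight updates on a 1-D toy model."""
    for _ in range(steps):
        # Step 1: refine the "words" (prompt p) with the weight frozen.
        err = w * (x + p) - target
        p -= lr * err * w          # gradient of err^2 w.r.t. p (factor 2 folded into lr)

        # Step 2: update the "weight" now that the prompt is clearer.
        err = w * (x + p) - target
        w -= lr * err * (x + p)    # gradient of err^2 w.r.t. w
    return w, p

w, p = co_adapt(x=2.0, target=6.0)
print(f"w={w:.3f}, p={p:.3f}")  # prediction w*(x+p) converges toward the target
```

In this toy, fixing only `p` or only `w` can still reach the target, but the joint loop mirrors the paper's claim: a clearer input makes each weight update point in a more useful direction.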
📊 The Proof: Does It Actually Work?
The researchers tested this on some very hard puzzles (like advanced math and coding) and found:
- 30% Smarter: On math tests, ROSA2 got 30% more questions right than the best previous methods.
- 40% Faster: It took 40% fewer conversation turns to solve a problem.
- Why? Because the AI didn't waste time arguing with a confusing prompt or getting stuck in a loop of bad guesses.
- No Heavy Lifting: It didn't require a supercomputer. The memory cost was almost the same as the standard AI.
💡 The Takeaway
The paper teaches us that context is king.
If you want an AI to learn from a conversation, you can't just tweak its brain. You have to make sure the conversation itself is clear first.
ROSA2 is the first system to realize that Words (the prompt) and Weights (the model) are a team. By helping the team communicate better while they train, they reach the finish line faster and with fewer mistakes.
In short: Don't just teach the student (the AI) harder; make sure the textbook (the prompt) is written clearly first. Do both, and you get a genius.