ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning

Imagine you have a brilliant, world-class chef (the AI model) who knows how to cook almost anything. They can make a perfect steak or a complex soufflé. But when you ask them to cook a very specific, regional dish (say, a traditional financial calculation for a bank), they keep making the same silly mistakes. They might forget to add a key ingredient or use the wrong heat setting.

Usually, to fix this, you'd have to send the chef to a months-long, expensive culinary school (called Fine-Tuning) to retrain them. But what if the chef is a "black box" you can't touch? You can't send them to school; you can only give them orders.

This is the problem the ASDA paper tackles.

The Problem: The "One-Size-Fits-All" Instruction

Current methods try to fix the chef by giving them a giant, messy paragraph of instructions like: "Remember to be careful with money, check your math, and don't forget the rules."

The researchers found this doesn't work well. It's like shouting a whole paragraph of advice at a chef while they are chopping onions. They miss the details, get confused, and still make mistakes.

The Solution: ASDA (The "Recipe Card" System)

The authors created a system called ASDA (Automated Skill Distillation and Adaptation). Instead of rewriting the chef's brain, they create a set of specialized, step-by-step recipe cards that the chef can glance at while cooking.

Here is how it works, using our kitchen analogy:

1. The "Taste-Test" (Failure Analysis)

Imagine a Head Chef (a smarter AI) watching the Student Chef (the model you are trying to fix) try to solve financial math problems.

The Student Chef fails a problem.
The Head Chef doesn't just say "Wrong!" They analyze why.
Example: "Ah, you tried to calculate the interest for the whole year at once, but you should have calculated it month-by-month and added them up."

2. The "Recipe Card" Creation (Skill Distillation)

Instead of just telling the student to "do better," the Head Chef writes a specific recipe card for that exact mistake.

Title: "How to Calculate Bond Prices with Forward Rates."
The Rule: "Never use a single average rate. You must multiply the rates step-by-step."
The Example: "Here is a worked-out example of how to do it correctly."
The Code: "Here is the exact code snippet to use."

These aren't just text; they are structured "skills" that the AI can actually execute, like a tool in a toolbox.
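To make the "Bond Price" recipe card concrete, here is a minimal Python sketch of the kind of code such a card might carry. It is not taken from the paper: the function names, the annual-coupon convention, and the sample rates are all assumptions chosen for illustration. It prices a bond by compounding one-year forward rates in sequence, and compares that with the shortcut the card warns against (discounting everything at one average rate).

```python
def discount_factors(forward_rates):
    """Discount factor for each year, composing one-year forward rates in sequence."""
    factors = []
    growth = 1.0
    for f in forward_rates:
        growth *= 1.0 + f              # multiply the rates step by step
        factors.append(1.0 / growth)
    return factors


def bond_price(face, coupon_rate, forward_rates):
    """Price an annual-coupon bond maturing in len(forward_rates) years."""
    coupon = face * coupon_rate
    dfs = discount_factors(forward_rates)
    cash_flows = [coupon] * len(dfs)
    cash_flows[-1] += face             # principal is repaid with the last coupon
    return sum(cf * df for cf, df in zip(cash_flows, dfs))


def bond_price_average_rate(face, coupon_rate, forward_rates):
    """The mistake the card warns against: one arithmetic-average rate for all years."""
    n = len(forward_rates)
    avg = sum(forward_rates) / n
    coupon = face * coupon_rate
    coupons = sum(coupon / (1.0 + avg) ** t for t in range(1, n + 1))
    return coupons + face / (1.0 + avg) ** n


forwards = [0.02, 0.03, 0.04]          # hypothetical one-year forward rates
correct = bond_price(100.0, 0.05, forwards)
shortcut = bond_price_average_rate(100.0, 0.05, forwards)
print(f"sequential: {correct:.4f}   average-rate shortcut: {shortcut:.4f}")
```

The two prices agree only when the forward curve is flat; with any slope, the shortcut drifts away from the sequential result, which is exactly the kind of quiet error a recipe card is meant to catch.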

3. The "Smart Librarian" (Inference)

When the Student Chef gets a new question, a Librarian (a selector AI) looks at the question and says, "Oh, this is about bonds! I need to pull out the 'Bond Price' recipe card and hand it to the chef before they start cooking."
The chef then reads the card, follows the steps, and gets the answer right.

The Magic: It Learns by Itself (Self-Teaching)

The coolest part of this paper is that you don't even need a "Head Chef" (a super-smart AI) to do this.

The Student Chef can look at its own mistakes, realize, "Hey, I keep messing up this specific step," and write its own recipe card.
The researchers found that even when the model taught itself, it achieved about 73% of the improvement it got from a super-smart teacher.
Analogy: It's like a student realizing, "I keep forgetting to carry the one in math," and writing a sticky note for themselves that says, "REMEMBER TO CARRY THE ONE!" The student didn't need a genius teacher to tell them what to do; they just needed to organize their own knowledge.

Why This Matters

No Surgery Required: You don't need to change the AI's brain (weights). You just give it better tools (skills).
Audit Trail: Because these "skills" are written as clear text and code, a human can read them. A bank compliance officer can look at the recipe card and say, "Yes, this is the correct financial rule." This is huge for regulated industries like finance and law.
Cheap and Fast: It costs about $13 and takes about 6 hours to create these skills for a specific model. It's much cheaper than fine-tuning the model.

The Catch (Limitations)

The paper notes that these "recipe cards" are model-specific.

If you write a recipe card for the "Student Chef" (a weaker model), it might confuse a "Master Chef" (a stronger model).
Analogy: If you give a Michelin-star chef a recipe card that says "Don't burn the toast," the reminder is at best useless and at worst a distraction that makes the dish come out worse. The skills are tailored to the specific weaknesses of the model they were made for.

Summary

ASDA is like giving a smart but slightly clumsy AI a set of customized cheat sheets based on its own past mistakes. Instead of trying to retrain the AI's brain, you just hand it the right tool at the right time. It's a cheap, safe, and transparent way to turn AI models into experts in specific fields like finance, without needing to touch the model's weights.

1. Problem Statement

Large Language Models (LLMs) struggle with financial reasoning, a domain requiring the simultaneous mastery of multi-step quantitative calculations and deep domain-specific judgment. Existing benchmarks (e.g., FAMMA) show that standard frontier models achieve only 38–45% accuracy, primarily failing due to domain-knowledge gaps and incorrect procedure selection.

Current adaptation methods face significant limitations:

Fine-tuning: Requires expensive compute, produces "model-locked" expertise that becomes obsolete with model updates, and is impossible for organizations using black-box commercial APIs (no weight access).
Training-Free Prompt Optimization: Methods like GEPA and ACE optimize flat text strings (monolithic instructions). The authors argue these lack the modularity and executability required for complex, multi-step domain reasoning, yielding only marginal gains.

The core challenge is to adapt LLMs to specialized financial domains without modifying model weights, using only black-box API access, while creating structured, auditable, and reusable knowledge artifacts.

2. Methodology: The ASDA Framework

ASDA (Automated Skill Distillation and Adaptation) is a training-free framework that automatically generates executable agent skills (structured Markdown files containing reasoning procedures, code templates, and examples) through iterative error analysis. It operates via a Teacher-Student architecture:

A. Core Components

Teacher Model: Analyzes the student model's failures on financial tasks.
Student Model: The target model to be adapted (e.g., via API).
Skill Library ($K$): A hierarchical collection of skill files organized by financial subfield (e.g., fixed_income) and error type (e.g., wrong_method_selection).
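One way to picture the skill library is as a mapping keyed by (subfield, error type). The sketch below is only an illustration under stated assumptions: the `Skill` and `SkillLibrary` classes and the `subfield/error_type.md` path layout are hypothetical and are not described in the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Skill:
    subfield: str      # e.g. "fixed_income"
    error_type: str    # e.g. "wrong_method_selection"
    body: str          # Markdown content of the skill file

    @property
    def path(self) -> str:
        # Hypothetical layout: one Markdown file per (subfield, error_type).
        return f"{self.subfield}/{self.error_type}.md"


class SkillLibrary:
    """In-memory stand-in for the skill library K."""

    def __init__(self) -> None:
        self._skills = {}  # (subfield, error_type) -> Skill

    def add(self, skill: Skill) -> None:
        self._skills[(skill.subfield, skill.error_type)] = skill

    def get(self, subfield: str, error_type: str) -> Optional[Skill]:
        return self._skills.get((subfield, error_type))

    def in_subfield(self, subfield: str) -> list:
        """All skills filed under one financial subfield, sorted by error type."""
        found = [s for (sf, _), s in self._skills.items() if sf == subfield]
        return sorted(found, key=lambda s: s.error_type)


library = SkillLibrary()
library.add(Skill("fixed_income", "wrong_method_selection", "# Forward-rate pricing"))
print(library.get("fixed_income", "wrong_method_selection").path)
```

Keying on the pair rather than on a free-form title keeps lookups cheap and makes each file's scope obvious to a human reviewer.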

B. Phase 1: Skill Warm-Up

Failure Analysis: For every incorrect student answer, the Teacher model receives the question, the student's reasoning trace, and the ground truth. It performs a structured diagnosis to identify the root cause (e.g., "Lacks knowledge that forward rates must be composed sequentially").
Clustering: Failures are clustered by (subfield, error_type).
Skill Synthesis: The Teacher generates a Skill File for each cluster. Each file contains:
- Patterns: Specific failure scenarios with "When to Use" conditions.
- Procedures: Step-by-step reasoning logic.
- Code Templates: Executable Python code (often using Program-of-Thought).
- Common Bugs: Explicit warnings against specific mistakes.
Injection: At inference, an LLM-based Selector reads a navigation map (SKILL.md) to identify relevant skills for a given question and injects them into the student's prompt.
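The Phase 1 steps above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: in ASDA the Teacher LLM writes the skill content and an LLM-based Selector chooses skills, whereas here both are replaced by plain Python stand-ins. The function names, the dictionary fields, the Markdown section headings, and the sample subfield "derivatives" and error type "knowledge_gap" are all assumptions.

```python
from collections import defaultdict

FENCE = "`" * 3  # code-fence marker, built this way so the example nests in Markdown


def cluster_failures(diagnoses):
    """Group diagnosed failures by (subfield, error_type)."""
    clusters = defaultdict(list)
    for d in diagnoses:
        clusters[(d["subfield"], d["error_type"])].append(d)
    return dict(clusters)


def render_skill(title, patterns, procedure, code_template, common_bugs):
    """Assemble one skill file with the four sections listed above."""
    parts = [f"# {title}", "", "## Patterns"]
    parts += [f"- When to use: {p}" for p in patterns]
    parts += ["", "## Procedure"]
    parts += [f"{i}. {step}" for i, step in enumerate(procedure, start=1)]
    parts += ["", "## Code Template", FENCE + "python", code_template, FENCE]
    parts += ["", "## Common Bugs"]
    parts += [f"- {bug}" for bug in common_bugs]
    return "\n".join(parts)


def build_prompt(question, skills, select):
    """Prepend the skills chosen by `select` to the student's prompt."""
    chosen = select(question, skills)
    blocks = [skills[key] for key in chosen]
    return "\n\n".join(blocks + [f"Question: {question}"])


def keyword_selector(question, skills):
    # Stand-in for the LLM-based Selector: matches on the subfield name only.
    q = question.lower()
    return [key for key in skills if key[0].replace("_", " ") in q]


diagnoses = [
    {"qid": 1, "subfield": "fixed_income", "error_type": "wrong_method_selection",
     "root_cause": "Averaged forward rates instead of composing them"},
    {"qid": 2, "subfield": "fixed_income", "error_type": "wrong_method_selection",
     "root_cause": "Used one spot rate for every cash flow"},
    {"qid": 3, "subfield": "derivatives", "error_type": "knowledge_gap",
     "root_cause": "Did not apply put-call parity"},
]

skills = {
    key: render_skill(
        title=f"{key[0]} / {key[1]}",
        patterns=["Question matches this cluster"],          # Teacher-written in ASDA
        procedure=["Identify the cash flows", "Apply the method the pattern calls for"],
        code_template="# teacher-written template goes here",
        common_bugs=[d["root_cause"] for d in members],
    )
    for key, members in cluster_failures(diagnoses).items()
}

print(build_prompt("Price this fixed income bond from forward rates.", skills, keyword_selector))
```

Passing the selector in as a callable keeps the injection step independent of how selection is done, so the keyword stand-in could be swapped for a model call without touching the rest.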

C. Phase 2: Dual-Phase Iterative Refinement

To address coverage gaps and prevent regressions (where skills hurt previously correct answers), ASDA runs an iterative loop:

Evidence Collection: Questions are categorized into:
- $Q^+$: Correct with skills.
- $Q^-$: Correct without skills but incorrect with them (regressions).
- $Q_{gap}$: Incorrect both with and without skills (uncovered failures).
Attribution: The Teacher identifies which specific skill file is responsible for the outcome of each question.
Coverage Phase: For skills associated with $Q_{gap}$, the Teacher proposes expansions (new patterns or refined procedures) to fix the failures. Updates are accepted only if they pass a verification threshold ($\tau_{cov}$).
Safety Phase: For skills associated with $Q^-$, the Teacher proposes repairs to remove overfitting or misleading guidance while preserving performance on $Q^+$. Updates are accepted only if they recover negative cases without degrading positive ones ($\tau_{safe}$).
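The evidence split and the two acceptance checks might look like the sketch below. The summary does not give the exact form of the thresholds, so this code makes explicit assumptions: $\tau_{cov}$ is read as the minimum fraction of targeted gap questions an expansion must fix, and $\tau_{safe}$ as the minimum fraction of $Q^+$ that must stay correct after a repair that recovers at least one regression. All function and parameter names are hypothetical.

```python
def categorize(outcomes):
    """Split question ids into (Q+, Q-, Q_gap).

    `outcomes` maps qid -> (correct_without_skills, correct_with_skills).
    """
    q_pos, q_neg, q_gap = set(), set(), set()
    for qid, (without_skills, with_skills) in outcomes.items():
        if with_skills:
            q_pos.add(qid)      # correct with skills
        elif without_skills:
            q_neg.add(qid)      # regression: skills broke a correct answer
        else:
            q_gap.add(qid)      # wrong either way: not yet covered
    return q_pos, q_neg, q_gap


def accept_coverage_update(targeted, now_correct, tau_cov):
    """Assumed rule: the expansion fixes at least tau_cov of the targeted gap questions."""
    if not targeted:
        return False
    return len(targeted & now_correct) / len(targeted) >= tau_cov


def accept_safety_update(q_neg, q_pos, now_correct, tau_safe):
    """Assumed rule: recover at least one regression and keep >= tau_safe of Q+ correct."""
    if not (q_neg & now_correct):
        return False
    if not q_pos:
        return True
    return len(q_pos & now_correct) / len(q_pos) >= tau_safe


outcomes = {
    "q1": (True, True),
    "q2": (True, False),
    "q3": (False, False),
    "q4": (False, True),
}
q_pos, q_neg, q_gap = categorize(outcomes)
print(sorted(q_pos), sorted(q_neg), sorted(q_gap))
print(accept_coverage_update(q_gap, {"q3"}, tau_cov=0.5))
print(accept_safety_update(q_neg, q_pos, {"q1", "q2", "q4"}, tau_safe=1.0))
```

Setting `tau_safe=1.0` under this reading means a repair is rejected if it breaks even one previously correct answer, which is the strictest interpretation of "without degrading positive ones".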

3. Key Contributions

First Black-Box Skill Generation System: ASDA is the first framework to automatically generate executable agent skills for domain reasoning using only black-box LLM access, outperforming the training-free baselines it was compared against (GEPA and ACE).
Self-Teaching Capability: The framework can function entirely via self-teaching (the student acts as its own teacher). This achieves ~73% of the teacher-guided gain without needing a superior, more expensive teacher model, making it practical for enterprise deployment.
Auditable & Version-Controlled Artifacts: Unlike fine-tuned weights, the output is a library of human-readable Markdown files compatible with the Agent Skills open standard. These can be reviewed by domain experts, version-controlled, and regenerated for new model releases.
Cost Efficiency: The distillation pipeline costs approximately $13 and 6 hours of wall-clock time per configuration, offering a highly scalable alternative to fine-tuning.

4. Experimental Results

Evaluated on the FAMMA benchmark (1,945 financial questions across 8 subfields):

Performance Gains:
- Arithmetic Reasoning: ASDA achieved a +17.33 percentage point (pp) improvement over the baseline (Haiku 3.5), significantly outperforming GEPA (+1.33 pp) and ACE (+3.30 pp).
- Non-Arithmetic Reasoning: Achieved a +5.95 pp improvement.
- Iterative Refinement: Gains increased from the Warm-Up stage (+8.67 pp) to Epoch 2 (+17.33 pp), with diminishing returns or overfitting observed after Epoch 3.
Self-Teaching Ablation: Using the student model as its own teacher yielded +6.33 pp, which is 73% of the +8.67 pp Warm-Up gain (6.33 / 8.67 ≈ 0.73), suggesting that much of the value lies in the structure of the distillation process rather than in the teacher's superior knowledge.
Cross-Model Transfer: Skills distilled from Haiku 3.5 failed when applied to the stronger Haiku 4.5 (resulting in a -2.33 pp regression). This confirms that skills are model-specific artifacts of a model's failure distribution and must be regenerated for each deployed model.
Question Type: Skills were more effective on multiple-choice questions (constrained answer space) than on open-ended generation.
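As a quick sanity check on how the reported gains relate to each other, the ratios can be computed directly from the figures listed above. The dictionary keys are labels made up for this sketch; the numbers are the percentage-point gains quoted in this section.

```python
# Percentage-point gains over the Haiku 3.5 baseline, as quoted above.
gains_pp = {
    "asda_warm_up": 8.67,
    "asda_epoch_2": 17.33,
    "self_teaching": 6.33,
    "gepa": 1.33,
    "ace": 3.30,
}

# Self-teaching gain as a share of each teacher-guided stage.
share_of_warm_up = gains_pp["self_teaching"] / gains_pp["asda_warm_up"]
share_of_epoch_2 = gains_pp["self_teaching"] / gains_pp["asda_epoch_2"]

# How many times larger the Epoch 2 gain is than each prompt-optimization baseline.
vs_gepa = gains_pp["asda_epoch_2"] / gains_pp["gepa"]
vs_ace = gains_pp["asda_epoch_2"] / gains_pp["ace"]

print(f"self-teaching / warm-up: {share_of_warm_up:.2f}")
print(f"self-teaching / epoch 2: {share_of_epoch_2:.2f}")
print(f"epoch 2 vs GEPA: {vs_gepa:.1f}x, vs ACE: {vs_ace:.1f}x")
```

The 73% figure corresponds to the Warm-Up gain as the denominator; measured against the Epoch 2 gain the share is smaller, so it matters which stage a comparison refers to.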

5. Significance and Implications

New Paradigm for Adaptation: ASDA shifts the paradigm from "optimizing weights" or "optimizing flat prompts" to managing a dynamic, inspectable knowledge layer between the model and the application.
Regulatory Compliance: For highly regulated industries (finance, legal, healthcare), ASDA provides a path to domain adaptation where the "knowledge" (the skill files) is transparent, auditable, and can be certified by compliance teams, unlike opaque neural weights.
Operational Feasibility: The low cost and lack of weight access requirements make this solution viable for organizations relying on commercial APIs, allowing them to maintain performance across base-model upgrades by regenerating the skill library for each new model.
Limitations: The approach relies on the ability to cluster errors cleanly; it is less effective when errors are highly dispersed (as in non-arithmetic tasks), and it has so far been evaluated only on the FAMMA dataset and the Claude model family.

In conclusion, ASDA demonstrates that failure-driven distillation can externalize latent domain knowledge into explicit, executable procedures, offering a practical, cost-effective, and auditable solution for adapting LLMs to complex reasoning tasks without retraining.