Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents

This paper introduces the Determinism-Faithfulness Assurance Harness (DFAH), a framework and set of financial benchmarks demonstrating that decision determinism and task accuracy in LLM agents are uncorrelated, so each must be measured independently to support reliable regulatory audit replay in financial services.

Raffi Khatchadourian

Published Tue, 10 Ma

Imagine you work at a bank, and you've hired a team of super-smart AI assistants to help you decide which transactions are safe and which ones look suspicious. These AI agents are like junior analysts: they look at data, ask questions, and make a final call.

But there's a huge problem. In the world of banking, if a regulator (the "boss" who checks your work) asks, "Show me exactly how you decided to flag this transaction," you have to be able to hit "replay" and get the exact same answer every single time.

This paper, written by Raffi Khatchadourian from IBM, introduces a new tool called DFAH (Determinism-Faithfulness Assurance Harness). Think of DFAH as a rigorous "Replay Test" lab for these AI agents.

Here is the breakdown of what they found, using simple analogies:

1. The Two Big Rules of the Game

The paper says that for an AI to be safe for banking, it needs to pass two very different tests:

  • The "Replay" Test (Determinism): If you ask the AI the same question 10 times, does it give you the exact same answer 10 times?
  • The "Truth" Test (Faithfulness/Accuracy): Is the answer actually correct? Did it base its decision on real facts, or did it just make things up?

The Big Surprise: The researchers found that these two things do not go hand-in-hand.

  • You can have an AI that is perfectly consistent (it gives the same answer every time) but is completely wrong (it's consistently wrong).
  • You can have an AI that is very smart (it gets the right answer often) but is unpredictable (sometimes it says "Yes," sometimes "No," even when the facts are the same).

The Analogy: Imagine a broken clock.

  • High Determinism, Low Accuracy: A clock that is stuck at 4:00 PM. It is 100% consistent (it always says 4:00), but it is wrong 23 hours a day.
  • Low Determinism, High Accuracy: A clock that is running perfectly but has a loose second hand that jumps around. It tells the right time on average, but if you check it twice in a row, the seconds might look different.
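The broken-clock intuition can be made concrete with a few lines of code. This is a minimal sketch (the function names, the "flag"/"clear" labels, and the run data are invented for illustration, not taken from DFAH itself) showing how determinism and accuracy are computed as two independent numbers over repeated runs of the same question:

```python
from collections import Counter

def determinism(runs):
    """Fraction of runs that agree with the most common answer.
    1.0 means every run gave the identical answer."""
    counts = Counter(runs)
    return counts.most_common(1)[0][1] / len(runs)

def accuracy(runs, truth):
    """Fraction of runs whose answer is actually correct."""
    return sum(r == truth for r in runs) / len(runs)

# The "stuck clock": perfectly consistent, always wrong.
stuck = ["flag"] * 10
# The "jumpy clock": usually right, but not repeatable.
jumpy = ["clear"] * 7 + ["flag"] * 3

print(determinism(stuck), accuracy(stuck, "clear"))  # 1.0 0.0
print(determinism(jumpy), accuracy(jumpy, "clear"))  # 0.7 0.7
```

The stuck clock scores perfect determinism with zero accuracy; the jumpy one scores well on accuracy but fails the replay bar. Neither number predicts the other, which is the paper's core point.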

2. The "Small vs. Big" AI Showdown

The researchers tested different sizes of AI models (from small, 7-billion-parameter models to massive, "frontier" models).

  • The Small, Rigid Models (Tier 1): Think of these as robotic assembly line workers. They are very strict. If you give them a task, they follow a rigid script.
    • Result: They are incredibly consistent (94–100% replayable). But because they are so rigid, they often miss the nuance and get the answer wrong (only 20–40% accuracy). They are like a robot that always says "Investigate" no matter what the transaction is, just because that's the script.
  • The Big, Smart Models (Frontier Models): Think of these as creative consultants. They think deeply, try different approaches, and use many tools to solve a problem.
    • Result: They are much smarter and get the right answer more often. But because they are creative, they take different "paths" to get there. Sometimes they check Tool A first, then Tool B; other times it's Tool B then Tool A. This makes them unpredictable for a strict audit. They might give the right answer, but they can't prove exactly how they got there every single time.

The Catch: No model was found that was both a "perfect robot" (100% consistent) AND a "perfect genius" (100% accurate). You have to choose your trade-off.
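One way to see why "Tool A then Tool B" versus "Tool B then Tool A" breaks an audit, even when the final answer matches, is to fingerprint the whole tool-call trace. This is a hypothetical sketch, not the paper's actual harness; the tool names and transaction IDs are made up:

```python
import hashlib
import json

def trace_fingerprint(tool_calls):
    """Hash the ordered sequence of (tool, arguments) pairs so two
    runs can be compared for exact replayability."""
    canonical = json.dumps(tool_calls, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

run_a = [("sanctions_check", {"id": "tx-42"}), ("risk_score", {"id": "tx-42"})]
run_b = [("risk_score", {"id": "tx-42"}), ("sanctions_check", {"id": "tx-42"})]

# Same tools, possibly the same verdict -- but the order differs,
# so the audit trail does not replay identically.
print(trace_fingerprint(run_a) == trace_fingerprint(run_b))  # False
```

Under a strict replay standard, a single transposed tool call changes the fingerprint, which is exactly why the "creative" frontier models score poorly on replayability despite their higher accuracy.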

3. Why This Matters for Banks

As in the movie The Big Short, a bank can get in trouble not because it made a mistake, but because it couldn't explain why it made a decision.

  • The Regulatory Nightmare: If a regulator asks, "Why did you reject this loan?" and the AI says, "Well, last time I ran this, I said yes, but today I said no because I felt like it," the bank is in trouble.
  • The Solution: The paper suggests that for high-stakes, rule-based tasks (like checking for money laundering), banks should use the small, rigid models. Even if they aren't perfect geniuses, their "stuck clock" consistency means you can prove to the regulator exactly what happened.
  • The Human Safety Net: For the "creative" big models, you can't trust them to work alone. You need a human to sit between the AI and the final decision to ensure the AI didn't just "hallucinate" a reason.

4. The "Pass/Fail" Test

The paper introduces a new way of thinking about success:

  • Pass@k (The Optimist's View): "If I try 10 times, will I get at least one right answer?" (Good for coding, bad for banking).
  • Pass^k (The Regulator's View): "If I try 10 times, will I get the exact same right answer 10 times?" (This is what banks need).
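The contrast between the two views fits in a few lines. This is a stripped-down sketch: the published metrics are usually estimated over many sampled attempts, but a boolean version per task captures the difference:

```python
def pass_at_k(results):
    """Pass@k (optimist's view): at least one of k attempts is correct."""
    return any(results)

def pass_hat_k(results):
    """Pass^k (regulator's view): every one of k attempts is correct."""
    return all(results)

attempts = [True, True, False, True]  # 3 of 4 runs got the right answer
print(pass_at_k(attempts))   # True  -- good enough for a coding assistant
print(pass_hat_k(attempts))  # False -- fails the audit bar
```

A model that is right three times out of four looks strong under Pass@k and unusable under Pass^k, which is why the paper argues banks need the second metric.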

The Bottom Line

This paper is a warning label for the financial industry. It says: "Don't just look at how smart your AI is. Look at how consistent it is."

If you are building an AI for a bank, you can't just pick the "smartest" model available. You have to pick the one that can be "replayed" perfectly, even if it means accepting a model that is a bit less "creative." The DFAH tool is the new ruler they built to measure this consistency, ensuring that when the regulators come knocking, the bank can hit "replay" and show them the exact same story every time.