TAO: Tolerance-Aware Optimistic Verification for Floating-Point Neural Networks

TAO is a scalable verification protocol for floating-point neural networks that ensures output integrity on untrusted heterogeneous hardware by combining sound theoretical bounds with tight empirical error profiles to accept tolerant results and resolve disputes via a lightweight, Merkle-anchored game, all without requiring trusted execution environments or deterministic kernels.

Jianzhu Yao, Hongxu Su, Taobo Liao, Zerui Cheng, Huan Zhang, Xuechao Wang, Pramod Viswanath

Published 2026-03-03

Imagine you order a custom cake from a famous bakery. You pay them to bake it using a specific recipe (the Model) and specific ingredients (the Inputs). When the cake arrives, you want to be sure they actually used your recipe and didn't swap it for a cheaper, store-bought one, or sneak in some extra sugar to cut costs.

The problem is, you can't go into their kitchen to watch them bake. And even if you could, baking is messy. If two bakers use the same recipe, one might stir the batter slightly faster or measure the flour with a slightly different scoop. The result is almost identical, but not exactly the same down to the microscopic grain.

In the world of Artificial Intelligence (AI), this is a huge problem. Companies run AI models on powerful, expensive computers (GPUs) that they don't own. Floating-point math on these machines is so fast and parallel that even if you run the exact same task twice, the additions can happen in a different order, and the tiny numbers inside can differ in the last few decimal places.

The Old Ways (Why they failed):

  • The "Trust Me" Badge: "Just trust us, we have a secure room." (Too risky; you have to trust the hardware maker).
  • The "Slow Motion" Camera: "We will re-run the whole thing very slowly to make sure it's perfect." (Too slow; it kills the speed of the AI).
  • The "Math Proof": "We will prove every single math step." (Too heavy; it takes hours to prove a 5-second calculation).

Enter TAO: The "Tolerance-Aware" Cake Inspector

The paper introduces TAO (Tolerance-Aware Optimistic Verification). Think of TAO as a smart, fair judge who understands that perfection isn't the goal; correctness is.

Here is how TAO works, broken down into simple steps:

1. The "Optimistic" Start (The Happy Path)

When an AI company (the Proposer) finishes a task, they say, "Here is the result!" They post a digital receipt.

  • The Rule: If no one complains within a short time window (like 10 minutes), the result is accepted as true, and the company gets paid.
  • Why? Most of the time, everyone is honest. This keeps things fast and cheap.
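The happy path above can be sketched in a few lines. This is a minimal illustration, not the paper's actual on-chain logic; names like `CHALLENGE_WINDOW` and the SHA-256 commitment are assumptions for the sketch.

```python
import hashlib
import time

class OptimisticClaim:
    """Sketch of the happy path: the Proposer posts a digest of its output
    (the "digital receipt"), and the claim is accepted if nobody challenges
    it before the window closes. Illustrative only."""

    CHALLENGE_WINDOW = 600  # seconds; e.g. a 10-minute window (assumed value)

    def __init__(self, output_bytes: bytes):
        # Hash commitment to the claimed output, cheap to post and verify.
        self.commitment = hashlib.sha256(output_bytes).hexdigest()
        self.posted_at = time.time()
        self.challenged = False

    def challenge(self) -> None:
        # A Challenger flags the claim, triggering the dispute game.
        self.challenged = True

    def is_final(self, now: float) -> bool:
        # Accepted (and the Proposer paid) only if the full window
        # elapsed with no challenge.
        return (not self.challenged) and (now - self.posted_at >= self.CHALLENGE_WINDOW)
```

In the common case no one calls `challenge()`, so finalizing a result costs one hash and one timer, which is what keeps the protocol fast and cheap.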

2. The "Dispute Game" (If Someone Complains)

If a customer (the Challenger) thinks the result is wrong, they don't just say "You're lying!" They start a Dispute Game.

  • The Analogy: Imagine the AI calculation is a long train journey with 1,000 stops (operators). The Challenger says, "The train didn't go where it should have."
  • The Strategy: Instead of checking every single stop, they play a game of "Hot or Cold."
    • The Proposer splits the journey in half and reports the train's position at the midpoint.
    • The Challenger compares it to their own computation. "The first half matches, so the error must be in the second half."
    • They split that half again and compare the new midpoint.
    • They keep halving, like zooming in with a microscope, until they corner the one single stop where the two journeys diverge.
  • The Magic: Each round halves the search space, so pinpointing the exact culprit takes only a handful of rounds (logarithmic time), even in a massive model.
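The "Hot or Cold" game above is a binary search over the execution trace. Here is a small sketch under the assumption that both sides can expose their intermediate states and that the final states are known to disagree; `agree` and `find_faulty_step` are hypothetical names, not the paper's API.

```python
def find_faulty_step(proposer_states, challenger_states, agree):
    """Binary search for the first step where two same-length execution
    traces stop agreeing. `agree(a, b)` decides whether two intermediate
    states match (possibly within a tolerance). Assumes the final states
    are already known to disagree."""
    lo, hi = 0, len(proposer_states) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if agree(proposer_states[mid], challenger_states[mid]):
            lo = mid + 1   # traces match up to mid: the error is later
        else:
            hi = mid       # divergence is already visible at mid
    return lo  # the single disputed operator, found in O(log n) rounds

# A 1,000-stop journey where stop 700 is the first to go wrong:
good = list(range(1000))
bad = good[:700] + [x + 5 for x in good[700:]]
print(find_faulty_step(good, bad, lambda a, b: a == b))  # → 700
```

Ten comparisons instead of a thousand: that is why disputes stay cheap even for huge models.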

3. The "Tolerance" Check (The Secret Sauce)

Once they find that single stop (operator), they have to decide: Is this error a mistake, or just normal baking variation?

This is where TAO gets clever. It uses two rulers to measure the error:

  • Ruler A (The Theoretical Worst-Case): "Mathematically, the most this number could ever wiggle is 10%." This is a very loose, safe ruler. If the error is bigger than this, it's definitely a lie.
  • Ruler B (The Empirical "Real World" Ruler): "In the real world, on real computers, this number never wiggles more than 0.001%." This is a super-tight ruler based on testing the same math on thousands of different computers.

The Verdict:

  • If the error is huge (bigger than Ruler A), the Proposer is caught cheating immediately.
  • If the error is tiny (within Ruler B), it's just normal computer "noise," and the result is accepted.
  • If the error is in the "Gray Zone" (too big for Ruler B, but still within Ruler A), a small group of independent judges (a Committee) votes on it.
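The three-way verdict reduces to comparing one error against two thresholds. A minimal sketch, with illustrative bound values (the real bounds are derived per-operator; the names here are not from the paper):

```python
from enum import Enum

class Verdict(Enum):
    HONEST = "accept"       # within the empirical bound: normal FP noise
    COMMITTEE = "escalate"  # gray zone: independent judges vote
    CHEATING = "slash"      # exceeds the sound theoretical bound

def tolerance_check(claimed: float, reference: float,
                    theo_bound: float, emp_bound: float) -> Verdict:
    """Two-ruler test on a single disputed operator.
    theo_bound  = Ruler A, the sound worst-case error.
    emp_bound   = Ruler B, the tight empirically profiled error.
    Illustrative sketch; real bounds are per-operator, not scalars."""
    err = abs(claimed - reference)
    if err > theo_bound:
        return Verdict.CHEATING   # impossible under any honest FP execution
    if err <= emp_bound:
        return Verdict.HONEST     # indistinguishable from hardware jitter
    return Verdict.COMMITTEE      # plausible but suspicious: vote on it
```

Note the ordering of the checks: Ruler A is checked first because exceeding it is conclusive on its own, while everything under Ruler B can be accepted without any human (or committee) involvement at all.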

4. Why This Matters

  • Speed: Because TAO accepts tiny, natural errors, it doesn't need to force computers to be slow and perfectly synchronized. It lets them run at full speed.
  • Fairness: It stops companies from swapping models or cutting corners, but it doesn't punish them for the natural "jitter" of floating-point math.
  • No Trust Needed: You don't need to trust the company or the hardware. You just need to trust the math and the game rules.

The Bottom Line

TAO is like a smart referee for AI. It knows that computers aren't perfect, so it doesn't demand perfection. Instead, it demands that the result stays within a "safe zone" of acceptable error. If someone tries to cheat by making a huge mistake, the game catches them instantly. If they just make a tiny, natural mistake, the game lets them pass.

This allows us to use powerful, fast AI on any computer in the world, with the confidence that the results are real, without slowing everything down to a crawl.
