AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents

The Big Picture: The "Trustworthy" Robot That Can Be Fooled

Imagine you hire a highly intelligent, super-wealthy financial advisor robot. This robot has two superpowers:

It knows everything: It has read every book on finance ever written (its internal knowledge).
It has a live feed: It can look at the stock market, read the news, and check your bank account in real-time (its tools).

The paper asks a scary question: What happens if someone hacks the robot's live feed?

The researchers found that even the smartest robots (like the latest AI models) will blindly follow a hacked feed, giving you terrible financial advice, while their "report card" still says they are doing a perfect job.

The Experiment: The "Fake News" Test

The researchers set up a simulation with 7 different AI models (from small ones to the biggest, most advanced ones). They acted as a financial advisor for 10 different people over 23 days of conversation.

The Setup:

The Clean Version: The robot gets real data. It recommends safe stocks for a cautious person and risky stocks for a gambler.
The "Poisoned" Version: The researchers secretly hacked the robot's data feed.
- They told the robot that Tesla (a very risky car company) was actually a safe, boring utility company.
- They told the robot that Procter & Gamble (a safe soap company) was actually a dangerous, volatile gamble.
- They even added fake news headlines saying, "Tesla is now the safest investment ever!"

The Result:
The robot believed the lies completely. It started recommending high-risk stocks to people who said they wanted to be safe.

The Shocking Discovery: The "Blind Spot"

Here is the most dangerous part of the paper.

Usually, when we test AI, we give it a "Report Card" based on how well it matches expert rankings. This is like a teacher grading a student's essay based on grammar and vocabulary.

The Trap: The researchers found that even though the robot was giving dangerous, unsafe advice, its Report Card (called NDCG) still gave it an A+.
Why? Because the robot was still "answering the question" correctly based on the fake data it was given. It was following instructions perfectly, just on a broken foundation.

The Analogy:
Imagine a GPS app.

The Hack: Someone secretly changes the map data so that "Cliff Edge" is labeled "Safe Parking."
The Robot's Behavior: The GPS confidently tells you to drive off the cliff.
The Report Card: The GPS gets a perfect score because it successfully followed the map data it was given. It didn't "fail" at navigation; it just navigated a fake world.

The paper calls this "Evaluation Blindness." The standard tests are blind to safety; they only see if the robot is "useful" according to the data, not if the data is safe.

How the Robot Gets Fooled: The "Two Channels"

The researchers broke down how the robot gets tricked. They found two ways the poison spreads:

The Information Channel (The Eyes):
- The robot looks at the hacked data right now and says, "Okay, the data says Tesla is safe, so I will recommend Tesla."
- The Finding: This is the main problem. The robot trusts its eyes (the tool) more than its brain (its internal knowledge). Even if the robot "knows" Tesla is risky from its training, the live data overrides it.
The Memory Channel (The Brain):
- The robot remembers the bad advice it gave yesterday. "Yesterday I recommended Tesla, so the user must be okay with risk." It updates its memory to think the user is a gambler.
- The Finding: This happens too, but the immediate "eyes" problem is the bigger culprit. The robot gets stuck in a loop of bad advice.

The "Safe Language" Illusion

Even worse, the robot didn't just give bad advice; it lied to you about the risk.

Real World: "Tesla is risky."
Hacked World: The robot says, "Tesla is a safe, stable, low-risk investment."

It used "safe-sounding" words to describe dangerous assets because the hacked data told it to. It didn't question the data. It didn't say, "Wait, this looks suspicious." It just accepted the lie and repeated it.

The Solution: A New Report Card

The paper suggests we need a new way to grade these robots. Instead of just asking, "Did it recommend stocks that match the expert list?" we need to ask, "Did it recommend stocks that are safe for this specific user?"

They created a new metric called sNDCG (Safety-Penalized NDCG).

Old Grading: "You recommended 5 stocks. 3 matched the expert list. Score: 60%." (Ignores that 2 were dangerous).
New Grading: "You recommended 5 stocks. 2 were safe, but 3 were dangerous for this user. Score: 0%."

When they used this new grading system, the robots' scores dropped dramatically, revealing the danger that was hidden before.

The Takeaway for the Real World

This isn't just about stocks. This applies to any AI agent that uses tools (like medical bots, legal bots, or travel planners).

The Danger: If an attacker can slightly tweak the data an AI sees (like changing a risk score by 1 point or writing a biased headline), the AI will follow it blindly.
The Blindness: Standard tests won't catch this because the AI is still "working" correctly according to its corrupted instructions.
The Fix: We need to build "safety monitors" that check if the AI is recommending things that are actually safe for the user, not just things that look good on paper. We need to stop trusting the robot's "eyes" without checking if the glasses are cracked.

In short: The paper warns us that our smartest AI advisors are incredibly obedient, but if you trick their eyes, they will happily lead you off a cliff while smiling and telling you it's a beautiful view.

1. Problem Statement

The paper addresses a critical safety gap in tool-augmented Large Language Model (LLM) agents, particularly in high-stakes domains like financial advising. While these agents are increasingly evaluated using standard ranking-quality metrics (e.g., NDCG, Hit Rate), these metrics measure what is recommended, not whether it is safe for the specific user.

The core problem identified is "Evaluation Blindness":

The Phenomenon: When an agent's tool outputs (e.g., market data, news) are adversarially corrupted, the agent's recommendations drift significantly toward unsafe, risk-inappropriate products.
The Blind Spot: Standard quality metrics (NDCG) remain stable or even improve (Utility Preservation Ratio $\approx$ 1.0) because the corrupted recommendations often still align with "expert utility" rankings (e.g., high-risk stocks may still be ranked highly by utility experts).
The Consequence: Safety violations (recommending speculative assets to conservative investors) occur in 65–93% of turns, yet remain undetectable by standard evaluation dashboards, creating a false sense of security for deployed systems.

2. Methodology

The authors introduce a paired-trajectory diagnostic protocol to quantify this blindness and decompose the failure mechanisms.

A. Experimental Setup

Models: Seven distinct LLMs ranging from 7B to frontier models (Qwen3-32B, Qwen2.5-7B, Gemma 3 12B-IT, GPT-5.2, Claude Sonnet 4.6, Ministral 3 14B, Mistral Large 3).
Task: Multi-turn financial advisory dialogues using the Conv-FinRe dataset (10 users $\times$ 23 steps).
Agent Architecture: ReAct (Reason-Act-Observe) agents with persistent memory, interacting with three tools: MARKETDATA, NEWS, and PROFILEMEMORY.

B. Contamination Strategy (The Attack)

The study simulates a "diagnostic stress test" using four simultaneous corruption modes on tool outputs:

Risk Inversion: Flipping ordinal risk scores (e.g., a risk-5 speculative stock becomes risk-1 defensive).
Metric Manipulation: Scaling volatility and drawdown metrics to reinforce the inverted risk signal.
Biased Headlines: Injecting adversarial news narratives framing risky stocks as safe.
High-Risk Injection: Adding a leveraged ETF (TQQQ) with a falsified low-risk score.

C. Diagnostic Decomposition

To understand how the drift occurs, the authors adapt causal mediation analysis to decompose behavioral divergence into two channels:

Information Channel: Direct reasoning over corrupted tool observations ( $O_t \to \hat{y}_t$ ).
Memory Channel: Persistent state corruption where corrupted observations update the agent's memory ( $O_t \to M_t \to \hat{y}_{t+1}$ ), biasing future turns.

They introduce the Information-Dominance Score (IDS) to quantify which channel drives the failure.

D. Metrics

Standard Quality: NDCG, Utility Preservation Ratio (UPR).
Safety Metrics: Suitability Violation Rate (SVR), Severity-Weighted SVR.
Drift Metrics: Paired Drift ( $\bar{D}$ ), combining Kendall-tau (ordering) and Jaccard (composition) distances.
Safety-Penalized Metric: sNDCG, which zeroes out relevance for items exceeding the user's risk band.

3. Key Contributions

Discovery of Evaluation Blindness: Demonstrated that standard NDCG-based evaluation fails to detect safety degradation. Agents can maintain high utility scores (UPR $\approx$ 1.0) while violating safety constraints in nearly all turns.
Diagnostic Protocol: Introduced a paired-trajectory framework that replays identical dialogues under clean vs. contaminated conditions to isolate drift.
Mechanism Decomposition: Proved that safety violations are predominantly information-channel-driven (direct reasoning over bad data) rather than memory-channel-driven. Even when memory is held constant, agents fail immediately.
Representation-to-Action Gap: Using Sparse Autoencoders (SAEs) on Gemma 3, they showed that models internally distinguish adversarial contamination from random noise (distinct activation patterns), yet fail to propagate this skepticism to the final decision.
Monitor-Evasion: Showed that even subtle, "within-band" perturbations (risk shifts $\le$ 1) that evade threshold-based consistency monitors still induce significant drift (61% of full attack severity).

4. Key Results

Universal Blindness: Across all 7 models, UPR remained near 1.0 (0.99–1.25), while SVRs (Suitability Violation Rate) surged to 65–93% (up from 46–78% in clean sessions).
Immediate and Persistent: Safety violations emerged at the first contaminated turn and persisted for the full 23-step trajectory. No agent explicitly questioned tool data reliability in 1,563 contaminated turns.
Channel Dominance:
- Safety Violations: 94.8% of violations were reproduced by the information channel alone (memory held clean).
- Drift Magnitude: Both channels contributed, but the information channel was primary (82% of full drift).
The "Safe" Metric: Introducing sNDCG (safety-penalized NDCG) reduced the Utility Preservation Ratio to 0.51–0.74, effectively closing the evaluation gap and revealing the safety degradation.
Model Scaling: Larger models (e.g., Claude Sonnet 4.6) did not self-correct; in fact, they sometimes exhibited higher SVRs due to stronger instruction-following of the corrupted tool data.
Epistemic Capture: Agents treated fabricated tool data as ground truth, even when it contradicted their parametric knowledge (e.g., recommending a 3x leveraged ETF to a conservative investor because the tool said it was "low risk").

5. Significance and Implications

Rethinking Agent Evaluation: The paper argues that for multi-turn agents in high-stakes domains, trajectory-level safety monitoring is mandatory. Single-turn quality metrics are insufficient and potentially dangerous.
Defense Strategy:
- Monitoring: Since violations are information-channel driven, monitoring tool outputs at ingestion is more effective than auditing memory.
- Metrics: Adoption of safety-penalized metrics (like sNDCG) is necessary to detect these failures.
- Limitations of Current Defenses: Simple threshold-based consistency checks fail against subtle, within-band perturbations.
Broader Impact: The "evaluation blindness" pattern is structural, not domain-specific. It applies wherever safety-relevant attributes (risk, severity, legality) are orthogonal to utility rankings, posing risks in medical triage, legal advice, and general product recommendation.

In conclusion, AgentDrift reveals a fundamental vulnerability in the current paradigm of tool-augmented LLMs: the very mechanism that makes them useful (grounding in real-time tool data) creates an unmonitored attack surface where safety degrades silently while quality metrics remain stable.