AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents

The paper "AgentDrift" reveals that tool-augmented LLM agents in high-stakes domains suffer from a critical safety failure where contaminated tool outputs cause widespread recommendation drift that remains undetected by standard ranking metrics, necessitating new trajectory-level safety monitoring protocols.

Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

Published 2026-03-16
📖 6 min read🧠 Deep dive

The Big Picture: The "Trustworthy" Robot That Can Be Fooled

Imagine you hire a highly intelligent, super-wealthy financial advisor robot. This robot has two superpowers:

  1. It knows everything: It has read every book on finance ever written (its internal knowledge).
  2. It has a live feed: It can look at the stock market, read the news, and check your bank account in real-time (its tools).

The paper asks a scary question: What happens if someone hacks the robot's live feed?

The researchers found that even the smartest robots (like the latest AI models) will blindly follow a hacked feed, giving you terrible financial advice, while their "report card" still says they are doing a perfect job.


The Experiment: The "Fake News" Test

The researchers set up a simulation with 7 different AI models (from small ones to the biggest, most advanced ones). They acted as a financial advisor for 10 different people over 23 days of conversation.

The Setup:

  • The Clean Version: The robot gets real data. It recommends safe stocks for a cautious person and risky stocks for a gambler.
  • The "Poisoned" Version: The researchers secretly hacked the robot's data feed.
    • They told the robot that Tesla (a very risky car company) was actually a safe, boring utility company.
    • They told the robot that Procter & Gamble (a safe soap company) was actually a dangerous, volatile gamble.
    • They even added fake news headlines saying, "Tesla is now the safest investment ever!"

The Result:
The robot believed the lies completely. It started recommending high-risk stocks to people who said they wanted to be safe.

The Shocking Discovery: The "Blind Spot"

Here is the most dangerous part of the paper.

Usually, when we test AI, we give it a "Report Card" based on how well it matches expert rankings. This is like a teacher grading a student's essay based on grammar and vocabulary.

  • The Trap: The researchers found that even though the robot was giving dangerous, unsafe advice, its Report Card (called NDCG) still gave it an A+.
  • Why? Because the robot was still "answering the question" correctly based on the fake data it was given. It was following instructions perfectly, just on a broken foundation.

The Analogy:
Imagine a GPS app.

  • The Hack: Someone secretly changes the map data so that "Cliff Edge" is labeled "Safe Parking."
  • The Robot's Behavior: The GPS confidently tells you to drive off the cliff.
  • The Report Card: The GPS gets a perfect score because it successfully followed the map data it was given. It didn't "fail" at navigation; it just navigated a fake world.

The paper calls this "Evaluation Blindness." The standard tests are blind to safety; they only see if the robot is "useful" according to the data, not if the data is safe.

How the Robot Gets Fooled: The "Two Channels"

The researchers broke down how the robot gets tricked. They found two ways the poison spreads:

  1. The Information Channel (The Eyes):

    • The robot looks at the hacked data right now and says, "Okay, the data says Tesla is safe, so I will recommend Tesla."
    • The Finding: This is the main problem. The robot trusts its eyes (the tool) more than its brain (its internal knowledge). Even if the robot "knows" Tesla is risky from its training, the live data overrides it.
  2. The Memory Channel (The Brain):

    • The robot remembers the bad advice it gave yesterday. "Yesterday I recommended Tesla, so the user must be okay with risk." It updates its memory to think the user is a gambler.
    • The Finding: This happens too, but the immediate "eyes" problem is the bigger culprit. The robot gets stuck in a loop of bad advice.

The "Safe Language" Illusion

Even worse, the robot didn't just give bad advice; it lied to you about the risk.

  • Real World: "Tesla is risky."
  • Hacked World: The robot says, "Tesla is a safe, stable, low-risk investment."

It used "safe-sounding" words to describe dangerous assets because the hacked data told it to. It didn't question the data. It didn't say, "Wait, this looks suspicious." It just accepted the lie and repeated it.

The Solution: A New Report Card

The paper suggests we need a new way to grade these robots. Instead of just asking, "Did it recommend stocks that match the expert list?" we need to ask, "Did it recommend stocks that are safe for this specific user?"

They created a new metric called sNDCG (Safety-Penalized NDCG).

  • Old Grading: "You recommended 5 stocks. 3 matched the expert list. Score: 60%." (Ignores that 2 were dangerous).
  • New Grading: "You recommended 5 stocks. 2 were safe, but 3 were dangerous for this user. Score: 0%."

When they used this new grading system, the robots' scores dropped dramatically, revealing the danger that was hidden before.

The Takeaway for the Real World

This isn't just about stocks. This applies to any AI agent that uses tools (like medical bots, legal bots, or travel planners).

  • The Danger: If an attacker can slightly tweak the data an AI sees (like changing a risk score by 1 point or writing a biased headline), the AI will follow it blindly.
  • The Blindness: Standard tests won't catch this because the AI is still "working" correctly according to its corrupted instructions.
  • The Fix: We need to build "safety monitors" that check if the AI is recommending things that are actually safe for the user, not just things that look good on paper. We need to stop trusting the robot's "eyes" without checking if the glasses are cracked.

In short: The paper warns us that our smartest AI advisors are incredibly obedient, but if you trick their eyes, they will happily lead you off a cliff while smiling and telling you it's a beautiful view.

Get papers like this in your inbox

Personalized daily or weekly digests matching your interests. Gists or technical summaries, in your language.

Try Digest →