Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models

This paper presents a longitudinal study of GPT, Llama, and Qwen models, revealing that continuous updates and increased model sizes do not consistently enhance adversarial robustness against misclassification, jailbreaks, and hallucinations, and can sometimes exacerbate existing vulnerabilities.

Yugeng Liu, Tianshuo Cong, Zhengyu Zhao, Michael Backes, Yun Shen, Yang Zhang

Published Wed, 11 Ma

Imagine you have a very smart, helpful robot assistant. Every few months, the company that built it sends out a "software update" to make it smarter, kinder, or better at its job. You might assume that with every update, the robot gets safer and harder to trick.

This paper is like a group of security experts who decided to test that assumption. They didn't just check the robot once; they watched it over time, checking every single version of three famous robot families (GPT, Llama, and Qwen) as they grew up.

Here is what they found, explained simply:

1. The "New Car Smell" Trap

Usually, when you buy a new car, you expect it to be safer than the old model. You expect the brakes to work better and the airbags to be smarter.

The researchers found that LLMs (Large Language Models) are different. Just because a model gets a new version number (like going from "Version 1.0" to "Version 2.0") doesn't mean it's safer. In fact, sometimes the new version is more fragile than the old one.

  • The Analogy: Imagine a chef who gets a new, expensive knife. You'd expect them to chop vegetables better. But sometimes, in their excitement to use the new knife, they accidentally chop their own finger. Similarly, when model developers tweak a model to fix one problem (like stopping it from saying bad words), they sometimes accidentally break something else (like making it bad at math or grammar).

2. The Three Ways Robots Get Tricked

The researchers tested the robots using three specific "tricks" (attacks):

  • The "Confusion" Trick (Misclassification): Imagine asking the robot, "Is this sentence nice?" and it says "No" when it's clearly nice. The researchers found that newer models sometimes get more confused by simple tricks than older ones.
  • The "Jailbreak" Trick: This is like trying to convince a strict bouncer to let you into a VIP club by dressing up or using a fake ID. The researchers tried to "jailbreak" the models to make them say things they aren't supposed to (like how to make a bomb or be mean).
    • The Twist: For some models, the newer versions were actually better at resisting these jailbreaks. But for others, the new versions were surprisingly easy to trick.
  • The "Lying" Trick (Hallucination): This is when the robot makes up facts that sound real but are completely fake. Imagine the robot confidently telling you that the moon is made of cheese. The study found that newer models didn't necessarily stop lying; in some cases, they started lying more in specific situations.
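Studies like this one typically compare versions by measuring an attack success rate (ASR): the fraction of adversarial prompts that fool a given model. As a minimal sketch of the idea (the function names, example responses, and the refusal-prefix check are all hypothetical stand-ins, not the paper's actual pipeline):

```python
# Minimal sketch: comparing attack success rate (ASR) across two model
# versions. Real evaluations query live models; here we use canned
# responses as hypothetical stand-ins.

def attack_success_rate(responses, is_attack_success):
    """Fraction of adversarial prompts whose response counts as a success."""
    successes = sum(1 for r in responses if is_attack_success(r))
    return successes / len(responses)

def is_jailbroken(response):
    # Toy heuristic: any reply that isn't a refusal counts as a success.
    return not response.startswith("I can't")

# Hypothetical outputs from two versions of the same model family.
responses_v1 = ["Sure, here is how...", "I can't help with that.", "Sure, step 1..."]
responses_v2 = ["I can't help with that.", "I can't help with that.", "Sure, here..."]

asr_v1 = attack_success_rate(responses_v1, is_jailbroken)  # 2/3
asr_v2 = attack_success_rate(responses_v2, is_jailbroken)  # 1/3
print(f"v1 ASR: {asr_v1:.2f}, v2 ASR: {asr_v2:.2f}")
```

The paper's core observation is that when you run this comparison longitudinally, the newer version's ASR does not reliably go down; for some attack types it goes up.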

3. Bigger Isn't Always Better

There's a common belief that "bigger is better." If a robot has a bigger brain (more parameters), it should be smarter and safer, right?

  • The Reality: The researchers found that bigger models are not automatically safer. Sometimes, a giant 70-billion-parameter model is actually easier to trick than a smaller, simpler one. It's like having a giant, complex castle; it has more doors and windows for a thief to sneak through, even if the walls are thicker.

4. The "Patch" Problem

Software companies often release "minor updates" (patches) to fix small bugs. The researchers watched these updates happen week by week.

  • The Finding: Sometimes, a minor update meant to fix a small issue actually made the robot worse at other things. It's like a mechanic fixing a squeaky door on your car, but in the process, they accidentally loosen the bolts on the engine. The car still runs, but it's now less reliable.

The Big Takeaway

The main message of this paper is: Don't assume a new version is automatically safer.

  • For Users: If you are using an AI, don't just trust the "Newest Version." You still need to be careful and check if it's behaving well.
  • For Builders: The companies making these AI models need to stop just chasing "smarter" or "faster" updates. They need to treat safety and robustness as a separate, critical goal. They need to test their models against tricks before releasing them, ensuring that fixing one bug doesn't create three new ones.

In short: Upgrading your AI is like upgrading your house. Just because you installed a new, fancy front door doesn't mean you didn't accidentally leave the back window wide open. You have to check the whole house every time you make a change.