Expert Selections In MoE Models Reveal (Almost) As Much As Text

This paper demonstrates that expert routing decisions in Mixture-of-Experts (MoE) language models leak substantial information: learned attacks can reconstruct up to 91.2% of the original text tokens from routing data alone, establishing expert selections as sensitive information comparable to the underlying text itself.

Amir Nuriyev, Gabriel Kulp

Published Fri, 13 Ma

Here is an explanation of the paper using simple language and creative analogies.

The Big Idea: The "Secret Menu" Leak

Imagine you order a complex meal at a restaurant. You don't just get the food; you also get a receipt that lists exactly which chefs in the kitchen cooked which parts of your meal.

  • The Meal: The text you type into an AI (like a secret password or a private email).
  • The Chefs: The "Experts" inside a special type of AI called a Mixture-of-Experts (MoE) model.
  • The Receipt: The Routing Trace (the list of which chefs were chosen).

The Paper's Discovery:
The authors found that if a hacker steals the "Receipt" (the list of which experts were chosen), they can almost perfectly reconstruct the "Meal" (your original text) without ever seeing the text itself.

In fact, the list of chefs chosen tells them 91% to 94% of what you wrote. It's like looking at a receipt that says "Chef A made the steak, Chef B made the salad, and Chef C made the dessert," and being able to guess the exact ingredients and recipe just from knowing who cooked what.


How It Works: The "Specialized Kitchen" Analogy

1. The MoE Model (The Super-Kitchen)

Modern AI models are huge. To make them faster, engineers built "Mixture-of-Experts" models.

  • Old Way: Imagine a kitchen with 100 chefs, and every chef touches every dish you order. It's slow and messy.
  • MoE Way: Imagine a kitchen with 100 chefs, but for every dish, a "Manager" (the Router) picks only the top 4 chefs who are best at that specific task. The other 96 chefs do nothing.
  • The Leak: The Manager has to tell the 4 chosen chefs, "Hey, you're on!" This signal (the Expert Selection) is what the hackers steal.
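The "Manager" above is just a scoring function followed by a top-k pick. Here is a minimal, dependency-free sketch of that routing step (the expert count, embedding size, and weights are made-up illustration values, not the paper's model):

```python
def route(token_embedding, expert_weights, k=4):
    """Toy MoE router: score every expert, keep the top-k.

    token_embedding: the token's hidden state (list of floats)
    expert_weights:  one weight vector per expert (the router matrix)
    Returns the indices of the chosen experts -- exactly the
    "routing trace" an attacker would try to capture.
    """
    scores = [sum(w * x for w, x in zip(weights, token_embedding))
              for weights in expert_weights]
    # The softmax over scores only sets mixing weights; the *selection*
    # (and hence the leak) depends only on the score ranking.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# 8 experts, 3-dim embeddings (tiny made-up numbers for illustration)
experts = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0],
           [0, 1, 1], [1, 0, 1], [1, 1, 1], [-1, 0, 0]]
trace = route([0.9, 0.1, 0.2], experts, k=4)
print(trace)  # [0, 3, 5, 6] -- the "receipt" for this one token
```

Note that `route` returns only expert indices, no text, yet for a fixed model the same token in the same context tends to produce the same indices. That determinism is what the attack exploits.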

2. The Attack (Cracking the Code)

The researchers asked: "If we only know which 4 chefs were called, can we guess what the customer ordered?"

  • The Simple Attempt (The "Guessing Game"): They first tried a simple computer program (a 3-layer MLP) that looked at one dish at a time. It was okay at guessing, getting about 63% right. It was like guessing the main course based on the chef, but often getting the side dish wrong.
  • The Smart Attempt (The "Detective"): They then used a much smarter AI (a Transformer decoder) that looked at the whole order at once. It realized patterns: "If Chef A and Chef B work together on the first three dishes, they usually make a specific type of Italian meal."
  • The Result: This smart detective got 91.2% of the words exactly right, rising to 94.8% if you allowed 10 guesses per word (top-1 vs. top-10 accuracy).
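The core of the attack is learning a mapping from routing signatures back to tokens. The sketch below is a deliberately crude stand-in for the paper's learned attackers (a 3-layer MLP per token, a Transformer decoder over the whole sequence): a plain frequency table built from text the attacker controls. All names and the tiny "corpus" are hypothetical; only the idea, that routes index tokens, comes from the paper.

```python
from collections import Counter, defaultdict

def train_attacker(corpus):
    """Build a lookup from routing signature -> most common token.

    corpus: (token, expert_set) pairs the attacker collects by running
    text it already knows through the model and recording the routes.
    A learned model generalizes far better, but the principle is the
    same: routing signatures act like fingerprints for tokens.
    """
    table = defaultdict(Counter)
    for token, experts in corpus:
        table[frozenset(experts)][token] += 1
    return {sig: counts.most_common(1)[0][0] for sig, counts in table.items()}

def attack(model, trace):
    """Reconstruct text from a routing trace alone."""
    return [model.get(frozenset(experts), "<unk>") for experts in trace]

# Step 1: the attacker profiles the model on text it controls...
profile = [("the", {0, 3}), ("cat", {1, 5}), ("sat", {2, 4}), ("the", {0, 3})]
attacker = train_attacker(profile)

# Step 2: ...then decodes a victim's stolen trace, never seeing the text.
print(attack(attacker, [{0, 3}, {1, 5}, {2, 4}]))  # ['the', 'cat', 'sat']
```

The paper's Transformer attacker improves on this table in exactly the way the "Detective" analogy suggests: it conditions each guess on the whole sequence of routes, not one signature at a time.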

The Takeaway: The path the data takes through the AI is just as sensitive as the data itself.


How Do Hackers Get the "Receipt"? (The Attack Surfaces)

You might ask, "How does a hacker get this list of chefs?" The paper suggests a few realistic scenarios:

  1. The "Bad Neighbor" (Distributed Inference):
    Imagine the AI is running on a cloud server shared by many companies. If a hacker rents a tiny slice of that server, they might be able to see the internal traffic logs. They see, "Oh, the AI just asked Expert #5 and Expert #12 to work," and they log that down.

  2. The "Power Meter" (Side Channels):
    Different chefs use different amounts of electricity or make different amounts of noise. A hacker with physical access to the server room (or a co-located machine) could measure power spikes or electromagnetic waves to figure out which "chefs" are active.

  3. The "Assembly Line" (Pipeline Parallelism):
    If the AI is split across many computers (like an assembly line), and the hacker controls one computer, they can see which "parts" of the product are arriving at their station, revealing which experts were used.


Can We Stop It? (The Defenses)

The paper suggests that we need to treat these "Routing Traces" (the receipts) as highly secret, just like the text itself.

  • Don't Print the Receipt: Don't log or export the list of which experts were chosen.
  • Add Noise: Make the kitchen chaotic. Sometimes have the Manager pick a random chef just to confuse the observer, or have all chefs do a little "dummy work" so you can't tell who is actually cooking.
  • Shuffle the Names: Periodically rename the chefs. Today, Chef #1 is the "Steak Chef"; tomorrow, Chef #1 is the "Salad Chef." This breaks the pattern the hacker is trying to learn.
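The last two defenses are easy to sketch in code. Both functions below are hypothetical mechanisms inspired by the paper's suggestions (the exact schemes, parameter names, and the probability value are assumptions, not the paper's design):

```python
import random

def noisy_route(chosen, n_experts, p_swap=0.25, rng=random):
    """'Add Noise': with probability p_swap, replace one chosen expert
    with a random one, so observed traces no longer map cleanly to
    tokens. Costs some model quality in exchange for privacy."""
    chosen = list(chosen)
    if rng.random() < p_swap:
        idx = rng.randrange(len(chosen))
        chosen[idx] = rng.randrange(n_experts)
    return chosen

def permute_ids(chosen, permutation):
    """'Shuffle the Names': relabel expert IDs with a secret permutation,
    invalidating whatever signature-to-token mapping the attacker
    learned. Rotating the permutation periodically keeps them guessing."""
    return [permutation[e] for e in chosen]

# A secret relabeling of 8 experts (illustrative values)
secret = {0: 5, 1: 2, 2: 7, 3: 0, 4: 6, 5: 1, 6: 3, 7: 4}
print(permute_ids([0, 3, 5, 6], secret))  # [5, 0, 1, 3]
```

Note the trade-off: `noisy_route` degrades the model (a wrong chef cooks sometimes), while `permute_ids` is free at inference time but only helps if the attacker cannot re-profile the model after each reshuffle.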

Why This Matters

For a long time, people thought the "internal wiring" of an AI was safe because it wasn't the final text. This paper proves that the path the data takes is a secret too.

If you are using an AI to process sensitive data (like medical records or legal contracts), you can't just protect the input and output. You also have to protect the invisible "routing decisions" happening inside the machine, or else the "receipt" might give away your secrets.

Summary in One Sentence

Just knowing which "specialist" an AI uses to process a word is often enough to guess the word itself, meaning the internal routing signals of these models are a major privacy leak.