Expert Selections In MoE Models Reveal (Almost) As Much As Text

This paper demonstrates that expert routing decisions in Mixture-of-Experts (MoE) language models leak substantial information: learned attacks can reconstruct up to 91.2% of the original text tokens from routing data alone, establishing expert selections as sensitive information comparable to the underlying text itself.

Amir Nuriyev, Gabriel Kulp

Published Fri, 13 Ma

Here is an explanation of the paper using simple language and creative analogies.

The Big Idea: The "Secret Menu" Leak

Imagine you order a complex meal at a restaurant. You don't just get the food; you also get a receipt that lists exactly which chefs in the kitchen cooked which parts of your meal.

  • The Meal: The text you type into an AI (like a secret password or a private email).
  • The Chefs: The "Experts" inside a special type of AI called a Mixture-of-Experts (MoE) model.
  • The Receipt: The Routing Trace (the list of which chefs were chosen).

The Paper's Discovery:
The authors found that if a hacker steals the "Receipt" (the list of which experts were chosen), they can almost perfectly reconstruct the "Meal" (your original text) without ever seeing the text itself.

In fact, the list of chefs chosen tells them 91% to 94% of what you wrote. It's like looking at a receipt that says "Chef A made the steak, Chef B made the salad, and Chef C made the dessert," and being able to guess the exact ingredients and recipe just from knowing who cooked what.


How It Works: The "Specialized Kitchen" Analogy

1. The MoE Model (The Super-Kitchen)

Modern AI models are huge. To make them faster, engineers built "Mixture-of-Experts" models.

  • Old Way: Imagine a kitchen with 100 chefs, and every chef touches every dish you order. It's slow and messy.
  • MoE Way: Imagine a kitchen with 100 chefs, but for every dish, a "Manager" (the Router) picks only the top 4 chefs who are best at that specific task. The other 96 chefs do nothing.
  • The Leak: The Manager has to tell the 4 chosen chefs, "Hey, you're on!" This signal (the Expert Selection) is what the hackers steal.
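The "Manager" above is just a scoring function followed by a top-k pick. Here is a minimal, dependency-free sketch of that routing step (the expert count, embedding size, and weights are made-up illustration values, not the paper's model):

```python
def route(token_embedding, expert_weights, k=4):
    """Toy MoE router: score every expert, keep the top-k.

    token_embedding: the token's hidden state (list of floats)
    expert_weights:  one weight vector per expert (the router matrix)
    Returns the indices of the chosen experts -- exactly the
    "routing trace" an attacker would try to capture.
    """
    scores = [sum(w * x for w, x in zip(weights, token_embedding))
              for weights in expert_weights]
    # The softmax over scores only sets mixing weights; the *selection*
    # (and hence the leak) depends only on the score ranking.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# 8 experts, 3-dim embeddings (tiny made-up numbers for illustration)
experts = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0],
           [0, 1, 1], [1, 0, 1], [1, 1, 1], [-1, 0, 0]]
trace = route([0.9, 0.1, 0.2], experts, k=4)
print(trace)  # [0, 3, 5, 6] -- the "receipt" for this one token
```

Note that `route` returns only expert indices, no text, yet for a fixed model the same token in the same context tends to produce the same indices. That determinism is what the attack exploits.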

2. The Attack (Cracking the Code)

The researchers asked: "If we only know which 4 chefs were called, can we guess what the customer ordered?"

  • The Simple Attempt (The "Guessing Game"): They first tried a simple computer program (a 3-layer MLP) that looked at one dish at a time. It was okay at guessing, getting about 63% right. It was like guessing the main course based on the chef, but often getting the side dish wrong.
  • The Smart Attempt (The "Detective"): They then used a much smarter AI (a Transformer decoder) that looked at the whole order at once. It realized patterns: "If Chef A and Chef B work together on the first three dishes, they usually make a specific type of Italian meal."
  • The Result: This smart detective got 91.2% of the words exactly right, rising to 94.8% if you allowed 10 guesses per word (top-1 vs. top-10 accuracy).
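The core of the attack is learning a mapping from routing signatures back to tokens. The sketch below is a deliberately crude stand-in for the paper's learned attackers (a 3-layer MLP per token, a Transformer decoder over the whole sequence): a plain frequency table built from text the attacker controls. All names and the tiny "corpus" are hypothetical; only the idea, that routes index tokens, comes from the paper.

```python
from collections import Counter, defaultdict

def train_attacker(corpus):
    """Build a lookup from routing signature -> most common token.

    corpus: (token, expert_set) pairs the attacker collects by running
    text it already knows through the model and recording the routes.
    A learned model generalizes far better, but the principle is the
    same: routing signatures act like fingerprints for tokens.
    """
    table = defaultdict(Counter)
    for token, experts in corpus:
        table[frozenset(experts)][token] += 1
    return {sig: counts.most_common(1)[0][0] for sig, counts in table.items()}

def attack(model, trace):
    """Reconstruct text from a routing trace alone."""
    return [model.get(frozenset(experts), "<unk>") for experts in trace]

# Step 1: the attacker profiles the model on text it controls...
profile = [("the", {0, 3}), ("cat", {1, 5}), ("sat", {2, 4}), ("the", {0, 3})]
attacker = train_attacker(profile)

# Step 2: ...then decodes a victim's stolen trace, never seeing the text.
print(attack(attacker, [{0, 3}, {1, 5}, {2, 4}]))  # ['the', 'cat', 'sat']
```

The paper's Transformer attacker improves on this table in exactly the way the "Detective" analogy suggests: it conditions each guess on the whole sequence of routes, not one signature at a time.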

The Takeaway: The path the data takes through the AI is just as sensitive as the data itself.


How Do Hackers Get the "Receipt"? (The Attack Surfaces)

You might ask, "How does a hacker get this list of chefs?" The paper suggests a few realistic scenarios:

  1. The "Bad Neighbor" (Distributed Inference):
    Imagine the AI is running on a cloud server shared by many companies. If a hacker rents a tiny slice of that server, they might be able to see the internal traffic logs. They see, "Oh, the AI just asked Expert #5 and Expert #12 to work," and they log that down.

  2. The "Power Meter" (Side Channels):
    Different chefs use different amounts of electricity or make different amounts of noise. A hacker with physical access to the server room (or a co-located machine) could measure power spikes or electromagnetic waves to figure out which "chefs" are active.

  3. The "Assembly Line" (Pipeline Parallelism):
    If the AI is split across many computers (like an assembly line), and the hacker controls one computer, they can see which "parts" of the product are arriving at their station, revealing which experts were used.


Can We Stop It? (The Defenses)

The paper suggests that we need to treat these "Routing Traces" (the receipts) as highly secret, just like the text itself.

  • Don't Print the Receipt: Don't log or export the list of which experts were chosen.
  • Add Noise: Make the kitchen chaotic. Sometimes have the Manager pick a random chef just to confuse the observer, or have all chefs do a little "dummy work" so you can't tell who is actually cooking.
  • Shuffle the Names: Periodically rename the chefs. Today, Chef #1 is the "Steak Chef"; tomorrow, Chef #1 is the "Salad Chef." This breaks the pattern the hacker is trying to learn.
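The last two defenses are easy to sketch in code. Both functions below are hypothetical mechanisms inspired by the paper's suggestions (the exact schemes, parameter names, and the probability value are assumptions, not the paper's design):

```python
import random

def noisy_route(chosen, n_experts, p_swap=0.25, rng=random):
    """'Add Noise': with probability p_swap, replace one chosen expert
    with a random one, so observed traces no longer map cleanly to
    tokens. Costs some model quality in exchange for privacy."""
    chosen = list(chosen)
    if rng.random() < p_swap:
        idx = rng.randrange(len(chosen))
        chosen[idx] = rng.randrange(n_experts)
    return chosen

def permute_ids(chosen, permutation):
    """'Shuffle the Names': relabel expert IDs with a secret permutation,
    invalidating whatever signature-to-token mapping the attacker
    learned. Rotating the permutation periodically keeps them guessing."""
    return [permutation[e] for e in chosen]

# A secret relabeling of 8 experts (illustrative values)
secret = {0: 5, 1: 2, 2: 7, 3: 0, 4: 6, 5: 1, 6: 3, 7: 4}
print(permute_ids([0, 3, 5, 6], secret))  # [5, 0, 1, 3]
```

Note the trade-off: `noisy_route` degrades the model (a wrong chef cooks sometimes), while `permute_ids` is free at inference time but only helps if the attacker cannot re-profile the model after each reshuffle.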

Why This Matters

For a long time, people thought the "internal wiring" of an AI was safe because it wasn't the final text. This paper proves that the path the data takes is a secret too.

If you are using an AI to process sensitive data (like medical records or legal contracts), you can't just protect the input and output. You also have to protect the invisible "routing decisions" happening inside the machine, or else the "receipt" might give away your secrets.

Summary in One Sentence

Just knowing which "specialist" an AI uses to process a word is often enough to guess the word itself, meaning the internal routing signals of these models are a major privacy leak.