Towards Worst-Case Guarantees with Scale-Aware Interpretability

This paper proposes a research agenda for "scale-aware interpretability" that adapts the renormalization framework from statistical physics to develop formal tools capable of providing worst-case guarantees on neural network behavior by explicitly tracking how features compose across different resolutions.

Original authors: Lauren Greenspan, David Berman, Aryeh Brill, Ro Jefferson, Artemy Kolchinsky, Jennifer Lin, Andrew Mack, Anindita Maiti, Fernando E. Rosas, Alexander Stapleton, Lucas Teixeira, Dmitry Vaintrob

Published 2026-02-06

CC BY 4.0

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to understand how a massive, complex machine works: say, a giant, self-assembling robot made of millions of tiny gears. Currently, AI researchers are trying to figure out what this robot is thinking by looking at the individual gears. But there's a problem: there are too many gears, and looking at every single one is impossible. Plus, if you zoom in too close, you start seeing dust and scratches that don't actually matter to how the robot moves. You get lost in the noise.

This paper proposes a new way to look at these AI "robots" (neural networks) by borrowing a powerful idea from physics called Renormalization.

Here is the breakdown of their idea using simple analogies:

1. The Problem: Getting Lost in the Details

Think of an AI model like a high-resolution photograph. If you zoom in all the way to a single pixel, you just see a colored dot. It doesn't tell you if the picture is of a cat or a dog. But if you zoom out, you see shapes, then objects, then the whole scene.

Current tools for understanding AI often try to look at the "pixels" (individual numbers inside the computer) or the "shapes" (features) without a clear rule for how much to zoom out. They might miss the big picture because they are too focused on tiny details, or they might miss dangerous small details because they are too focused on the big picture. In short, they lack a principled notion of "scale."

2. The Solution: The "Zoom Lens" from Physics

The authors suggest using Renormalization, a concept physicists use to understand how things work at different sizes.

The Analogy: Imagine you are looking at a forest.
- Microscopic view: You see individual leaves, twigs, and bugs.
- Macroscopic view: You see the shape of the forest, the wind moving through the trees, and the overall ecosystem.
- Renormalization is the mathematical rulebook that tells you: "If you zoom out to this level, you can safely ignore the individual leaves because they don't change the shape of the forest. But if you zoom out too far, you might miss a fire starting in a specific patch."

The paper argues that AI models naturally organize information in layers, just like a forest has layers of leaves, branches, and the whole tree. We need a tool that respects this natural "zooming" process.

3. The Goal: "Scale-Aware" Understanding

The authors want to build a new kind of "microscope" for AI that has a dial.

- Turning the dial (Coarse-Graining): This is the act of grouping tiny details together into bigger, simpler concepts.
- The "Separation of Scales" Guarantee: This is the most important part. They want to prove mathematically that if you zoom out to a certain level, the tiny, messy details (the "noise") cannot suddenly change the big picture.

Why does this matter for safety?
Imagine you are driving a car. You care about the road ahead (the big picture). You don't need to worry about every single grain of dust on the asphalt (the tiny details).

- Current worry: What if a tiny, invisible grain of dust (a hidden trick in the AI) suddenly causes the car to crash?
- The Renormalization Promise: If we use this new framework, we can say: "We have zoomed out enough to see the road. We have mathematically proven that any dust smaller than this size cannot possibly change the car's path. Therefore, we are safe."

4. Two Ways to Do It

The paper suggests two ways to apply this:

- Implicit Renormalization (The Natural Way): AI models already do this automatically when they learn. For example, in image generation, the AI first learns the general shape of a face, then the eyes, then the eyelashes. The authors want to study how the AI naturally "zooms out" on its own.
- Explicit Renormalization (The Tool Way): This is about building new software tools (like a better version of current "feature finders") that force the AI to show us its work at different zoom levels. Instead of just finding one "feature," the tool would show you the "forest," then the "tree," then the "branch," and tell you which level is safe to ignore.

5. The Call to Action

The authors are calling for physicists, computer scientists, and AI safety experts to work together. They believe that by combining the math of physics with the tools of AI, we can finally build AI systems that we can trust.

In short: They want to stop trying to understand AI by counting every single grain of sand. Instead, they want to build a map that tells us exactly which grains of sand matter and which ones we can safely ignore, giving us a mathematical guarantee that the AI won't surprise us with a hidden trick.

Technical Summary: Towards Worst-Case Guarantees with Scale-Aware Interpretability

Problem Statement

Current AI interpretability methods, such as Sparse Autoencoders (SAEs), rely heavily on engineering artifacts and theoretical hypotheses that lack rigorous guarantees regarding their faithfulness to model internals or their robustness to distributional shifts. A critical limitation is the inability to formally bound the influence of fine-grained details (treated as noise) on macroscopic, safety-relevant behaviors. Existing tools often fail to account for the hierarchical, multi-scale structure inherent in natural data and neural network (NN) representations. Consequently, they struggle to provide "worst-case guarantees" that fine-grained fluctuations cannot significantly alter coarse-grained observables, leaving systems vulnerable to steganography, distributional shifts, and hidden causal mechanisms.

Methodology and Framework

The paper proposes Scale-Aware Interpretability, a research agenda that adapts the renormalisation group (RG) framework from statistical physics to the domain of neural networks. Rather than claiming modern NNs are strictly renormalizable in a field-theoretic sense, the authors posit that the RG framework offers a necessary language and set of design constraints to formalize three key aspects currently handled poorly:

- Scale: The granularity or resolution at which features are observed.
- Relevance: Which degrees of freedom (features) matter at a specific scale.
- Coarse-graining: The systematic ignoring of irrelevant degrees of freedom.
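To make these three terms concrete, here is a minimal Python sketch of coarse-graining on a toy one-dimensional signal. It is an illustration only, not a method from the paper: block averaging stands in for the coarse-graining rule, the block size plays the role of scale, and the names `coarse_grain` and `fine_fluctuation` are hypothetical.

```python
from statistics import fmean


def coarse_grain(values, block):
    """Average consecutive blocks of `block` entries (a toy block-averaging step)."""
    if block <= 0 or len(values) % block != 0:
        raise ValueError("len(values) must be a positive multiple of block")
    return [fmean(values[i:i + block]) for i in range(0, len(values), block)]


def fine_fluctuation(values, block):
    """Detail discarded by coarse_grain: each entry minus its block mean."""
    coarse = coarse_grain(values, block)
    return [v - coarse[i // block] for i, v in enumerate(values)]


fine = [1.0, 3.0, 2.0, 2.0, 10.0, 12.0, 9.0, 9.0]
level1 = coarse_grain(fine, 2)    # scale 2: pairs of entries
level2 = coarse_grain(level1, 2)  # scale 4: pairs of pairs

# Perturb only the within-block detail: +0.5 and -0.5 inside each pair,
# so every block mean is preserved.
perturbed = [v + (0.5 if i % 2 == 0 else -0.5) for i, v in enumerate(fine)]

print(level1, level2)
print(coarse_grain(perturbed, 2) == level1)
```

In this toy case the within-block residuals are the irrelevant degrees of freedom: any perturbation that preserves each block mean leaves every coarser level unchanged. Real networks offer no such exact invariance, which is why the agenda asks for bounds instead.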

The methodology distinguishes between two types of renormalisation in NNs:

- Implicit Renormalisation: The natural process by which NNs coarse-grain data during training and inference (e.g., diffusion models organizing data by noise levels, or language models tracking context stability). This is driven by the model's own dynamics and architecture.
- Explicit Renormalisation: Post-hoc interpretability tools (like SAEs or spectral truncation) that impose scale parameters and coarse-graining rules to extract interpretable structures.

The core technical proposal involves constructing an RG-like scheme for NNs that satisfies three conditions:

- Defining Coarse-Grainings: Identifying "model-natural" scales (e.g., kernel eigenmodes, diffusion time, context length) and cutoffs that respect the model's implicit hierarchy.
- Effective Degrees of Freedom: Reducing the high-dimensional model to a smaller set of effective features whose behavior predicts macroscopic observables within a specified error budget. This involves establishing a relevance ordering where features are ranked by their contribution to long-range observables.
- Separation of Scales: Establishing a property where microscopic details (irrelevant subspace) can vary within a bounded range without materially changing the coarse behavior of the system. This is formalized as hierarchical conditional independence, where coarse variables act as sufficient statistics for finer variables.
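The second condition, a relevance ordering with an error budget, can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that each candidate feature already comes with a non-negative relevance score (for example an eigenvalue of an activation covariance or kernel matrix) and that the error budget is the fraction of total score that may be discarded. `split_by_relevance` and `tail_budget` are hypothetical names, not the paper's.

```python
def split_by_relevance(scores, tail_budget):
    """Split scores into (relevant, irrelevant), both sorted in descending order.

    The irrelevant part is the longest run of smallest scores whose sum
    stays within tail_budget * sum(scores).
    """
    if not 0.0 <= tail_budget < 1.0:
        raise ValueError("tail_budget must lie in [0, 1)")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")

    ordered = sorted(scores, reverse=True)  # relevance ordering
    limit = tail_budget * sum(ordered)

    tail = 0.0
    cut = len(ordered)
    # Walk upward from the smallest score, discarding while within budget.
    while cut > 0 and tail + ordered[cut - 1] <= limit:
        tail += ordered[cut - 1]
        cut -= 1
    return ordered[:cut], ordered[cut:]


# Assumed toy spectrum: three large modes and a small spectral tail.
relevant, irrelevant = split_by_relevance(
    [0.5, 8.0, 0.25, 4.0, 0.25, 3.0], tail_budget=0.0625
)
print(relevant, irrelevant)
```

Staying under a score budget like this is only a proxy. The third condition asks for more: a bound on how far the discarded tail can move a chosen observable, not just on how much variance it carries.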

Key Contributions

The paper does not present new experimental results but rather synthesizes scattered research threads into a unified theoretical agenda. Its primary contributions are:

- Formalizing the Renormalisation Analogy: The authors map RG concepts (UV/IR cutoffs, relevant/irrelevant operators, fixed points, universality classes) to NN interpretability. They argue that "features" should be viewed as effective degrees of freedom that emerge at specific scales, rather than static atomic units.
- Identifying Failure Modes of Current Tools: The paper critiques existing methods (like SAEs) for lacking canonicity (different runs yield different decompositions), completeness (missing entangled features), and faithfulness (optimizing for reconstruction rather than causal structure). It argues that without a separation of scales, these tools cannot guarantee that ignored features do not impact safety-critical outputs.
- Proposing Research Artifacts: To bridge the gap between theory and practice, the authors propose two specific artifacts analogous to "Toy Models of Superposition" (TMS) and SAEs:
  - Toy Model of Renormalisation (TMR): A synthetic model organism (e.g., using hierarchical data distributions) to generate hypotheses about how features compose and coarsen, allowing for provable bounds on fine-grained influence.
  - General Renormalisation Tool (GRT): A scalable, post-hoc tool (analogous to SAEs) that extracts multi-scale, interpretable structures from real models, potentially using techniques like real-space mutual information (RSMI) or lattice RG on activation graphs.
- Surveying Existing Work: The paper reviews literature in kernel renormalisation (NNGP, NTK, spectral gaps) and data-space renormalisation (hierarchical data models, fractal structures, information-theoretic coarse-graining), demonstrating that the theoretical foundations for this agenda already exist in physics and machine learning but have not been synthesized for AI safety.
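As a rough idea of what hierarchical synthetic data for a TMR-style experiment might look like, the Python sketch below samples leaves from a noisy label tree and coarse-grains them back by majority vote. This construction is an assumption made for illustration. The summary above does not specify the authors' toy model, and `sample_hierarchy`, `majority_coarse_grain`, and all parameters are hypothetical.

```python
import random


def sample_hierarchy(depth, branching, flip_prob, rng):
    """Sample a binary root label and push it down a tree with label noise.

    Each child copies its parent, flipping with probability flip_prob.
    Returns (root_label, leaf_labels).
    """
    root = rng.choice([0, 1])
    level = [root]
    for _ in range(depth):
        next_level = []
        for parent in level:
            for _ in range(branching):
                flipped = rng.random() < flip_prob
                next_level.append(1 - parent if flipped else parent)
        level = next_level
    return root, level


def majority_coarse_grain(labels, branching):
    """One coarse-graining step: replace each block of siblings by its majority label."""
    if branching % 2 == 0 or len(labels) % branching != 0:
        raise ValueError("need odd branching and len(labels) divisible by it")
    return [
        int(2 * sum(labels[i:i + branching]) > branching)
        for i in range(0, len(labels), branching)
    ]


rng = random.Random(0)
root, leaves = sample_hierarchy(depth=3, branching=3, flip_prob=0.0, rng=rng)

level = leaves
while len(level) > 1:
    level = majority_coarse_grain(level, branching=3)

# With flip_prob=0.0 every leaf copies the root, so coarse-graining recovers it.
print(level == [root])
```

Raising `flip_prob` above zero turns the leaf labels into fine-grained noise around the coarse label, which gives a controlled setting for asking how much leaf-level variation a coarse readout can tolerate.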

Results and Claims

The paper does not report empirical results from a new tool or model. Instead, its "results" are theoretical arguments and a synthesis of existing evidence:

- Theoretical Feasibility: The authors argue that the renormalisation framework is mature enough in physics to be adapted for NNs, citing successful applications in diffusion models, kernel theory, and information-theoretic compression.
- Necessity of Scale-Awareness: They argue that current interpretability tools often fail because they do not respect the model's implicit scales. For instance, treating all neurons as equal ignores the fact that some directions in activation space are "relevant" (large eigenvalues) while others are "irrelevant" (spectral tails).
- Potential for Guarantees: The paper claims that a successful RG-based framework could provide worst-case guarantees. Specifically, it aims to prove statements of the form: "Conditional on an effective coarse description, perturbations confined to the irrelevant subspace cannot change observable X by more than $\epsilon$."
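One possible way to write such a statement down, offered as a sketch with notation that is assumed here and not taken from the paper: let $h$ be an activation vector, $P$ the orthogonal projector onto the irrelevant subspace, $O$ the observable (called X above), and $r$ the largest perturbation norm considered. The first line is the general form of the guarantee. The second works it out for the special case of a linear readout $O(h) = w^\top h$, using $\delta = P\delta$ and the symmetry of $P$.

```latex
\sup_{\delta \,:\, P\delta = \delta,\ \|\delta\|_2 \le r}
  \bigl| O(h + \delta) - O(h) \bigr| \le \epsilon

% Worked special case: linear readout O(h) = w^\top h
\bigl| O(h + \delta) - O(h) \bigr|
  = \bigl| w^\top \delta \bigr|
  = \bigl| w^\top P \delta \bigr|
  = \bigl| (P w)^\top \delta \bigr|
  \le \|P w\|_2 \, \|\delta\|_2
  \le r \, \|P w\|_2
```

For a linear readout the guarantee therefore holds with $\epsilon = r\,\|Pw\|_2$, and when $Pw \neq 0$ the bound is attained at $\delta = r\,Pw/\|Pw\|_2$. The separation-of-scales question then reduces to whether the readout has little weight in the irrelevant subspace. Nonlinear observables need more than Cauchy-Schwarz, for instance a Lipschitz bound restricted to that subspace.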

Significance

The paper positions itself as a call to action for interdisciplinary coordination between physics, neuroscience, computer science, and AI safety. Its significance lies in:

- Shifting the Goal: Moving interpretability from "finding human-understandable features" to "providing robust, theory-backed guarantees" about what a model does and does not do.
- Addressing Safety: By formalizing the separation of scales, the framework aims to prevent dangerous behaviors (e.g., deception, steganography) from hiding in the "irrelevant" fine-grained details that current tools discard.
- Unifying Disparate Fields: It seeks to bridge the gap between theoretical physics (renormalisation, universality) and practical AI safety, suggesting that the "messy" nature of NNs may actually be amenable to the same statistical tools used to understand complex physical systems.

The authors remain modest about their claims, acknowledging that NNs may not exhibit strict universality or criticality in all regimes. They emphasize that the proposed agenda is a path toward developing tools that are "faithful" and "robust," rather than claiming that current methods are already sufficient or that the physics analogy is a perfect one-to-one mapping. The ultimate goal is to build a framework where interpretability is not just an engineering heuristic, but a discipline grounded in statistical physics capable of bounding the influence of discarded information.