Imagine you are looking at a picture of a dart shape (a four-sided shape with a pointy tail). Now, imagine someone paints over the empty space inside the "tail" part of the dart, turning it into a solid triangle.
Your brain has to make a quick choice:
- The "Local" View: "Hey, that's a dart! The tail is missing, so I should see the empty space." (This is the concave view).
- The "Global" View: "No, that's a solid triangle! The empty space is just background noise." (This is the convex view).
Humans usually default to the "Global" view. We see the solid triangle because our brains love simple, convex shapes. This is a principle from Gestalt psychology: in figure-ground perception, convex regions tend to be seen as the figure, and concave gaps as background.
This paper asks a big question: Do AI Vision Transformers (like the model BEiT) have this same "brain rule," and if so, exactly where inside their digital brain does this decision happen?
Here is the breakdown of their discovery, using simple analogies.
1. The Experiment: The "Dart" Test
The researchers created a special test. They showed the AI a dart shape but masked (hid) the part that makes it look like a dart. They forced the AI to "guess" what was under the mask.
- If the AI guessed a triangle, it was following the "Global Rule" (Convexity).
- If the AI guessed the dart shape, it was following the "Local Evidence" (Concavity).
The Result: The AI almost always guessed the triangle. It had the same "bias" as humans. But why?
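The paper's actual pipeline isn't reproduced here, but the forced-choice logic behind the dart test can be sketched in a few lines. This is a toy illustration: the function name and the logit values are made up for the example, not taken from the paper.

```python
# Hypothetical sketch of the forced-choice readout in the "dart test".
# The model assigns a score (logit) to each candidate completion of the
# masked region; whichever scores higher counts as the model's "guess".

def forced_choice(logit_triangle: float, logit_dart: float) -> str:
    """Classify the model's guess for the masked region."""
    if logit_triangle > logit_dart:
        return "triangle (global/convex)"
    return "dart (local/concave)"

# A convexity-biased model gives the convex completion the higher logit.
print(forced_choice(logit_triangle=3.1, logit_dart=1.4))
# -> triangle (global/convex)
```

The interesting quantity throughout the paper is effectively this logit *difference* (triangle minus dart), tracked layer by layer.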
2. The Investigation: Opening the Black Box
The researchers didn't just look at the final answer; they looked at the AI's internal "thought process" layer by layer. They used a technique called Logit Attribution, which measures how much each internal component (each attention head and MLP) pushes the final answer toward "triangle" or toward "dart." It's like putting a microphone on every part of the AI to hear what it's "saying" about the shape.
The Timeline of the Decision:
- Early Layers (The Confused Crowd): In the beginning, the AI is undecided. It's like a room full of people arguing. Some are saying "It's a dart!" and others are saying "It's a triangle!" The noise is balanced.
- Late Layers (The Verdict): By the end, the AI has clearly decided on the triangle. The "Triangle" voice has won.
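The math behind logit attribution is surprisingly simple, and a toy version fits in a few lines. The sketch below assumes a transformer whose residual stream is the sum of per-component outputs (standard for this model family); all tensors are random stand-ins, not BEiT activations, and the sizes are illustrative.

```python
import numpy as np

# Toy direct logit attribution. In a real model, component_outputs would be
# the per-head and per-MLP writes to the residual stream at the readout
# position, and logit_diff_direction would come from the unembedding:
# (direction for "triangle") minus (direction for "dart").
rng = np.random.default_rng(0)
d_model, n_components = 16, 6                      # hypothetical sizes
component_outputs = rng.normal(size=(n_components, d_model))
logit_diff_direction = rng.normal(size=d_model)

# Because the readout is linear, each component's contribution to the final
# triangle-vs-dart logit difference is just its output projected onto the
# direction. Positive = votes "triangle", negative = votes "dart".
contributions = component_outputs @ logit_diff_direction

# Sanity check: the per-component attributions sum exactly to the total.
total = component_outputs.sum(axis=0) @ logit_diff_direction
assert np.isclose(contributions.sum(), total)
```

Plotting `contributions` per layer is what produces the "confused crowd early, clear verdict late" timeline described above.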
3. The Smoking Gun: The "Seed"
The most exciting part of the paper is finding who started the argument.
They discovered that the decision isn't a slow build-up. It starts with a single, tiny component in the very first layer of the AI.
- The Character: An attention head named L0H9 (layer 0, head 9).
- The Action: This tiny component acts like a seed. As soon as the image is seen, L0H9 whispers, "Hey, let's lean toward the triangle idea."
- The Effect: It's a very weak whisper at first, but it sets the stage. As the signal moves through the deeper layers, other parts of the AI hear this whisper and amplify it until the whole system is convinced it's a triangle.
4. The Magic Trick: Editing the Brain
To prove this wasn't just a coincidence, the researchers performed "brain surgery" on the AI.
They found that L0H9 was the "convexity seed." So, they turned its volume down (they "downscaled" it).
- Before: The AI saw a triangle.
- After: With the "seed" silenced, the AI suddenly saw the dart.
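Mechanically, "turning down the volume" on one head just means multiplying that head's output by a factor less than 1 before it is added back into the residual stream. Here is a minimal numpy sketch of that edit; the head count, dimensions, and the 0.1 scale factor are assumptions for illustration, not values from the paper.

```python
import numpy as np

# Hedged sketch of downscaling a single attention head, assuming the layer's
# output is the sum of per-head outputs written into the residual stream.

def combine_heads(head_outputs: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Sum per-head outputs, each multiplied by its scale (1.0 = untouched)."""
    return (head_outputs * scales[:, None]).sum(axis=0)

n_heads, d_model = 12, 16                     # illustrative sizes
rng = np.random.default_rng(9)
head_outputs = rng.normal(size=(n_heads, d_model))

scales = np.ones(n_heads)
scales[9] = 0.1            # quiet the hypothetical "seed" head (head 9)
edited = combine_heads(head_outputs, scales)
baseline = combine_heads(head_outputs, np.ones(n_heads))

# Only head 9's contribution changes; every other head is left alone.
assert np.allclose(baseline - edited, 0.9 * head_outputs[9])
```

In practice this kind of edit is applied with a forward hook on the model at inference time, so the weights themselves never change.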
It's as if they found the specific switch in a car's engine that makes it prefer driving on the highway. When they turned that switch off, the car suddenly started taking the scenic back road instead.
Why Does This Matter?
This is huge for two reasons:
- It's Not Magic: It proves that the AI's "intuition" isn't some mysterious, unchangeable magic. It's a mechanical process driven by specific, identifiable components of the network.
- Safety and Control: In the real world, sometimes you don't want the AI to follow the "Global Rule."
- Example: In medical imaging, a tumor might look like a weird, concave shape. If the AI's "Global Rule" (convexity bias) is too strong, it might ignore the tumor and just see "normal tissue."
- Because we know exactly which "seed" (L0H9) causes this bias, we can tweak it. We can tell the AI, "Ignore the global rule for a second, look closely at the local details."
The Takeaway
The paper shows that AI vision models have learned human-like "rules of thumb" for seeing shapes. But unlike a human brain, where these rules are hard to change, we can find the exact digital "seed" that starts the rule and turn it up or down. We can literally edit the AI's perception to make it see the world differently.