Determinants of visual ambiguity resolution

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are walking down a foggy road at night. You see a dark, blurry shape in the distance. Is it a person? A mailbox? A stray dog? Your brain is trying to guess, but the information is incomplete. This is visual ambiguity.

This paper is like a massive experiment where researchers handed nearly 1,000 people a giant box of "foggy photos" (called Mooney images: black-and-white pictures that look like random blobs until you know what they are) and asked them to guess what the objects were. Then they showed them the clear, normal photo of the same object and asked them to look at the foggy photo again.

Here is what they discovered, explained through simple analogies:

1. The "High-Level" vs. "Low-Level" Puzzle

Think of your brain's visual system like a two-step detective process:

High-Level Features (The Big Picture): These are the general ideas, like "it's an animal" or "it's a vehicle."
Low-Level Features (The Details): These are the specific lines, curves, and edges.

The Finding: When the foggy photo is first shown, your brain is desperate for the Big Picture. If the foggy photo has lost the "animal-ness" or "vehicle-ness" (the high-level features), you can't guess it, no matter how hard you try. The low-level details (the squiggly lines) don't matter much yet because you don't have a hypothesis to test.

The Twist: Once you see the clear photo and "get it," your brain changes tactics. Suddenly, those specific squiggly lines (low-level features) become super important. Your brain now uses the clear photo as a template and checks: "Do the lines in this foggy blob match the lines of the dog I just saw?"

Analogy: Imagine trying to identify a friend in a crowd wearing a mask.

Before you know who it is: You need to see their face shape or height (High-Level). If the mask hides their whole face, you are stuck.
After you know it's your friend: You stop looking at the face shape. Instead, you look for a specific scar on their chin or the way they hold their coffee cup (Low-Level). Now that you have the "answer," the tiny details confirm your guess.

2. The "U-Shaped" Surprise

You might think that the more information you get, the easier it is to guess. If the fog clears a little bit, you should get a little better at guessing, right?

The Finding: Not exactly. The relationship is U-shaped.

Scenario A (Total Confusion): You guess wildly wrong. Then you see the clear photo. The information gain is huge. You go from "I have no idea" to "Oh, it's a toaster!" Result: You feel very clear about what it is.
Scenario B (Total Clarity): You guessed correctly (or very close) before seeing the clear photo. The clear photo just confirms your guess. The information gain is tiny. Result: You still feel very clear about what it is.
Scenario C (The Middle Ground): You guessed something sort of related but wrong. Then you see the clear photo. It doesn't fully confirm your guess, but it changes your mind. This "middle ground" of information gain actually makes you feel less certain than the other two extremes.

Analogy: Think of it like solving a riddle.

If the answer is completely different from what you thought, the "Aha!" moment is huge and satisfying.
If the answer is exactly what you thought, the "Aha!" moment is a satisfying nod of confirmation.
But if the answer is almost what you thought, but slightly different, it leaves you feeling confused and unsure. "Wait, was it a cat or a raccoon?" That middle ground is the bottom of the "U."

3. The "Group Consensus" Effect

Before seeing the clear photo, if you asked 100 people to name the blob, they would give many different answers (high "entropy," or disagreement). One says "cloud," another says "shoe," another says "rock."

After seeing the clear photo, if you ask them again, they largely agree: "It's a shoe." The group becomes much more consistent. The researchers found that this shift from chaos to agreement follows once the "key" (the clear photo) is revealed.

The Big Takeaway

Our brains are not just passive cameras that record what we see. They are active guessing machines.

When we are confused: We rely on our brain's "big picture" predictions. If the picture is too blurry to match a big idea, we are stuck.
When we learn the answer: Our brain switches to "detail mode," checking the tiny lines to confirm the match.
The learning curve: The foggy photo ends up looking clearest when we were either totally wrong (and got a big correction) or already right (and got confirmation). Being "sort of right" is the most confusing state of all.

This study helps us understand how our brains turn a blurry, confusing world into a clear, understandable one, and why getting more information doesn't always make us feel clearer.

1. Problem Statement

Natural visual perception is inherently ambiguous due to occlusion, variable lighting, and sensory noise. While humans generally resolve this ambiguity effectively, the specific cognitive and input-related determinants that dictate why some ambiguous images remain unidentifiable while others are resolved immediately are not fully understood.

The Gap: Existing research often relies on clear images or complex natural scenes, failing to isolate the specific mechanisms of ambiguity resolution in a controlled setting.
The Question: What visual features drive subjective clarification? How does the acquisition of new information (disambiguation) alter the perceptual process and subsequent identification?

2. Methodology

Dataset and Stimuli

Stimuli Generation: The authors created a large-scale, open dataset of 1,854 ambiguous Mooney images (two-tone, black-and-white) derived from the THINGSplus object database.
Transformation: Original greyscale images were converted to Mooney images using a Gaussian blur and manual intensity thresholding to create binary images that obscure object identity until disambiguated.
Participants: 947 participants (after exclusions from an initial 1,065 recruited via Prolific) completed the task. All were native English speakers, aged 18–35.
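The transformation step described above (Gaussian blur followed by intensity thresholding) can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' pipeline: the function names (`gaussian_kernel`, `blur_rows`, `to_mooney`), the `sigma` value, and the fixed `threshold` are all assumptions made for the example. In the study the thresholds were set manually, so a single global value is only a stand-in.

```python
import math

def gaussian_kernel(sigma, radius):
    """Return a normalized 1-D Gaussian kernel of length 2*radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xi]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred

def transpose(image):
    """Swap rows and columns of a list-of-lists image."""
    return [list(col) for col in zip(*image)]

def to_mooney(image, sigma=1.0, threshold=0.5):
    """Blur a greyscale image (values in [0, 1]) and binarize it.

    sigma and threshold are hypothetical defaults, not values from the paper.
    """
    kernel = gaussian_kernel(sigma, radius=max(1, int(3 * sigma)))
    # A 2-D Gaussian blur is separable: blur rows, then columns.
    blurred = transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))
    return [[1 if v >= threshold else 0 for v in row] for row in blurred]

# Toy 8x8 "photo": a bright 4x4 square on a dark background.
image = [[1.0 if 2 <= r <= 5 and 2 <= c <= 5 else 0.0 for c in range(8)]
         for r in range(8)]
for row in to_mooney(image):
    print("".join("#" if v else "." for v in row))
```

Raising `sigma` removes more fine detail before thresholding, which is one way such a transform can obscure object identity.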

Experimental Design

The experiment followed a three-phase trial structure for each image:

Pre-disambiguation: Presentation of the ambiguous Mooney image. Participants indicated if they identified the object (Yes/No) and provided a verbal label.
Disambiguation: Presentation of the unambiguous, original greyscale image.
Post-disambiguation: Re-presentation of the Mooney image. Participants repeated the identification and labeling task.

Computational and Analytical Approaches

Feature Preservation Analysis: The authors used CORnet-S, a deep convolutional neural network (DNN) mimicking the primate ventral visual stream (V1, V2, V4, IT). They extracted feature representations for both the Mooney and unambiguous versions of each image.
- Metric: A Preservation Index was calculated as the Pearson correlation between feature vectors of the Mooney and unambiguous images at each network layer.
Regression & Variance Partitioning: Using the 49-dimensional behavioral embedding from the THINGS database (categorized into visual and semantic dimensions), they performed multiple regression to determine how much variance in subjective identification was explained by semantic vs. visual features.
Semantic Metrics:
- Semantic Distance: Cosine dissimilarity between the participant's verbal label and the true object label (using WordNet embeddings).
- Semantic Entropy: Shannon entropy calculated from the distribution of labels provided by participants for a given image (measuring response consistency).
Information Gain Modeling: The relationship between the change in semantic distance/entropy (gain) and subsequent subjective identification was modeled using Ordinary Least Squares (OLS) regression, including quadratic terms to test for non-linearity.
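As a rough illustration of the semantic distance metric listed above, the sketch below computes cosine dissimilarity between two label vectors. The three-dimensional vectors in `toy_embeddings` are made up for the example and are not taken from the embeddings the authors used; `cosine_dissimilarity` is a hypothetical helper name.

```python
import math

def cosine_dissimilarity(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy embeddings (illustrative values only).
toy_embeddings = {
    "dog":     [0.9, 0.1, 0.0],
    "wolf":    [0.8, 0.2, 0.1],
    "toaster": [0.0, 0.1, 0.9],
}

target = toy_embeddings["dog"]
for label, vector in toy_embeddings.items():
    print(label, round(cosine_dissimilarity(vector, target), 3))
```

Under this definition, a label identical to the target scores 0, and labels pointing in unrelated directions score close to 1. The "gain" described above would then be the drop in this distance from the pre- to the post-disambiguation label.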

3. Key Results

A. Behavioral Validation

Disambiguation Effect: Subjective identification rates increased significantly from 47% (pre) to 85.9% (post). Reaction times decreased, and naming accuracy improved, confirming the Mooney images successfully induced ambiguity that was resolved by the unambiguous cue.

B. Feature Preservation and Identification

Feature Loss: The Mooney transformation impaired higher-level features (IT layer) significantly more than lower-level features (V1). The preservation index dropped from V1 ($M=0.72$) to IT ($M=0.24$).
Pre-Disambiguation: Subjective identification was strongly driven by the preservation of high-level features (IT, V4).
Post-Disambiguation: A critical shift occurred. After seeing the unambiguous image, the correlation between identification and low-level features (V1, V2, V4) increased, while the reliance on high-level features (IT) decreased.
- Interpretation: The visual system shifts from top-down guessing (relying on high-level priors) to bottom-up matching (verifying low-level details against the newly acquired prior).
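The preservation index reported in this section is a Pearson correlation between the feature vectors of the Mooney image and the unambiguous image at each layer. A minimal sketch of that computation is shown below. The helper names (`pearson`, `preservation_index`) and the tiny feature vectors are assumptions for illustration; real layer activations from CORnet-S would be far larger and are not reproduced here.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

def preservation_index(mooney_features, original_features):
    """Per-layer correlation between Mooney and original feature vectors."""
    return {layer: pearson(mooney_features[layer], original_features[layer])
            for layer in original_features}

# Hypothetical flattened activations for one image (illustrative values only).
original = {"V1": [0.2, 0.8, 0.5, 0.1], "IT": [0.9, 0.1, 0.4, 0.7]}
mooney = {"V1": [0.3, 0.7, 0.5, 0.2], "IT": [0.2, 0.6, 0.5, 0.3]}

for layer, value in preservation_index(mooney, original).items():
    print(layer, round(value, 2))
```

Averaging such per-image values within each layer would give layer-level means of the kind reported above.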

C. Semantic Dimensions

Semantic Dominance: Regression analysis revealed that semantic dimensions (e.g., "animal," "valuable") explained the majority of variance in subjective identification (approx. 62–64%) compared to visual dimensions (approx. 22–28%), both before and after disambiguation.
Semantic Shift: Disambiguation led to a significant reduction in semantic distance (labels became closer to the target) and semantic entropy (participants became more consistent in their labeling).
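To make the entropy reduction concrete, the sketch below computes Shannon entropy over the labels given for a single image before and after disambiguation. The label lists are invented for illustration, `semantic_entropy` is a hypothetical helper name, and the use of base-2 logarithms (bits) is an assumption, since the summary does not state the base.

```python
import math
from collections import Counter

def semantic_entropy(labels):
    """Shannon entropy (in bits) of the label distribution for one image."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical responses from four participants for the same image.
pre_labels = ["cloud", "shoe", "dog", "shoe"]
post_labels = ["shoe", "shoe", "shoe", "shoe"]

print("pre :", semantic_entropy(pre_labels))
print("post:", semantic_entropy(post_labels))
```

Entropy is zero when every participant gives the same label and grows as responses spread across more labels, so a drop from pre to post indicates that labeling became more consistent.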

D. Non-Linear Relationship (The U-Shaped Curve)

The relationship between information gain (reduction in semantic distance/entropy) and subsequent subjective identification was non-linear (U-shaped).
High Identification: Occurred when information gain was either minimal (the initial guess was already close to the truth, and the unambiguous image confirmed it) or maximal (the initial guess was far off, and the unambiguous image provided a drastic correction).
Low Identification: Occurred with moderate information gain, where the new information neither strongly confirmed nor strongly violated the initial prediction, potentially creating a state of uncertainty or partial mismatch.
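One way to test for the U-shape described above is to fit a regression with a quadratic term and check the sign of its coefficient. The sketch below does this with ordinary least squares solved via the normal equations. It is an illustration only: the data points are synthetic (generated from a known parabola, not taken from the study), and `fit_quadratic` is a hypothetical helper, not the authors' analysis code.

```python
def fit_quadratic(xs, ys):
    """OLS fit of y = b0 + b1*x + b2*x**2; returns [b0, b1, b2]."""
    rows = [[1.0, x, x * x] for x in xs]
    # Normal equations: (X^T X) b = X^T y, stored as an augmented 3x4 matrix.
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * y for r, y in zip(rows, ys))]
         for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

# Synthetic example: identification rate vs. normalized information gain.
gain = [0.0, 0.25, 0.5, 0.75, 1.0]
identification = [0.9, 0.6, 0.5, 0.6, 0.9]

b0, b1, b2 = fit_quadratic(gain, identification)
print("quadratic coefficient:", round(b2, 3))
print("curve minimum at gain =", round(-b1 / (2 * b2), 3))
```

A positive quadratic coefficient with the minimum inside the observed range of gain is the pattern a U-shaped relationship would produce. A negative coefficient would indicate an inverted U instead.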

4. Key Contributions

Large-Scale Dataset: The release of a curated dataset of 1,854 Mooney images paired with >100,000 behavioral ratings, enabling robust statistical analysis of ambiguity resolution.
Mechanism of Shift: Empirical evidence that ambiguity resolution involves a dynamic reorganization of perceptual inference: a shift from high-level hypothesis generation (top-down) to low-level template matching (bottom-up) once a prior is established.
Non-Linearity of Learning: The finding that information gain does not linearly improve perception. Instead, identification is highest when new information either strongly confirms or strongly violates prior expectations, challenging linear models of perceptual learning.
Semantic vs. Visual: Demonstration that while high-level semantic information drives initial identification, the resolution of ambiguity relies heavily on the re-engagement of low-level visual features to match the newly formed high-level prior.

5. Significance and Implications

Predictive Processing Framework: The findings strongly support the predictive processing and analysis-by-synthesis frameworks. They illustrate how the brain minimizes prediction error by first generating high-level hypotheses and then, upon receiving disambiguating input, shifting focus to verifying low-level sensory evidence against those hypotheses.
Reverse Hierarchy Theory: The results align with the theory that detailed recognition requires a feedback loop to re-recruit low-level areas ("vision with scrutiny") after an initial high-level gist ("vision at a glance").
Perceptual Learning: The U-shaped relationship suggests that perceptual learning is not a simple accumulation of data but depends on the size of the prediction error. It implies that moderate information gain (where the new information neither strongly confirms nor strongly violates the initial prediction) may be the most detrimental to subjective clarity.
Future Directions: The study highlights the need for neuroimaging to confirm the proposed neural mechanisms (e.g., pattern completion, reactivation of priors) and suggests that future research should explore how these mechanisms generalize to other forms of real-world ambiguity beyond Mooney images.

In conclusion, the paper provides a comprehensive computational and behavioral model of how humans resolve visual ambiguity, revealing that the process is a flexible, dynamic interplay between top-down semantic expectations and bottom-up sensory verification, governed by non-linear information dynamics.