Here is an explanation of the paper, translated into everyday language with some creative analogies.
The Big Idea: It's Not What You See, It's Who You Are
Imagine you are walking into a messy kitchen.
- To a Chef, that scene is a treasure map of tools: a knife is for chopping, a pot is for boiling, and a cutting board is for prep.
- To a Security Guard, that same scene is a list of threats: the knife is a weapon, the pot is a potential projectile, and the clutter is a tripping hazard.
- To a 4-year-old, that scene is a playground: the chair is a climbing frame, the table is a fort, and the floor is a race track.
The paper argues that Vision-Language Models (AI that "sees" and "talks") work exactly like this. They don't just take a photo, analyze the shapes, and say, "That is a table." Instead, they instantly ask, "Who is looking at this?" and then rewrite the entire description of the world based on that answer.
The researchers call this "Context-Dependent Affordance Computation."
- Affordance: What an object allows you to do (a chair affords sitting; a door affords opening).
- Context-Dependent: The answer changes completely depending on your goal.
The Experiment: The "7 Personas" Test
The researchers took a standard dataset of 3,200 photos (from the famous COCO dataset) and showed them to two different AI models. But they didn't just ask, "What do you see?"
Instead, they pretended to be 7 different people looking at the same photo:
- Neutral: Just an objective observer.
- Chef: Looking for food prep.
- Security Guard: Looking for dangers.
- Child: Looking for fun toys.
- Wheelchair User: Looking for obstacles or paths.
- Emergency Survivor: Looking for survival tools in 30 seconds.
- Relaxer: Looking for comfort.
They asked the AI to describe the objects and what you could do with them for each of these 7 personas.
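The setup above can be sketched as a simple prompting loop. The persona descriptions and prompt template below are illustrative guesses, not the paper's exact wording, and the image ID is made up:

```python
# Hypothetical sketch of the 7-persona setup.
# Persona wording and template are illustrative, not the paper's actual prompts.
PERSONAS = {
    "neutral": "an objective observer",
    "chef": "a chef preparing a meal",
    "guard": "a security guard assessing threats",
    "child": "a curious 4-year-old looking for fun",
    "wheelchair": "a wheelchair user checking paths and obstacles",
    "survivor": "someone with 30 seconds to find survival tools",
    "relaxer": "someone looking for a comfortable place to rest",
}

def build_prompts(image_id: str) -> dict:
    """Return one prompt per persona, all about the same image."""
    return {
        name: (
            f"You are {role}. Looking at image {image_id}, "
            "list the objects you notice and what each one lets you do."
        )
        for name, role in PERSONAS.items()
    }

prompts = build_prompts("coco_000042")
print(len(prompts))  # 7 -- one prompt per persona, same photo
```

The key design point is that only the persona string changes between queries; the image stays fixed, so any difference in the answers comes from the "who is looking" part alone.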
The Shocking Result: The "90% Drift"
The results were massive. When the AI switched from the "Chef" persona to the "Security Guard" persona, 90% of the description changed.
- The "Chef" saw: A cutting board, a knife, a stove.
- The "Security Guard" saw: A weapon, a fire hazard, a potential barricade.
- The "Child" saw: A climbing surface, a hiding spot.
The researchers measured this with a math tool called Jaccard similarity: the number of words two lists share, divided by the total number of distinct words across both lists. The overlap was only about 9%. This means that roughly 91% of the words the AI used to describe the scene changed simply because the "goal" of the viewer changed.
The Analogy: Imagine you have a photo of a forest.
- If you ask a Lumberjack, he sees "timber, logs, and axes."
- If you ask a Birdwatcher, she sees "nests, branches, and flight paths."
- If you ask a Hiker, he sees "trails, elevation, and shade."
This paper's results suggest that for these AIs, the "Lumberjack" and the "Birdwatcher" are seeing two completely different forests, not just the same forest with different labels.
The Hidden Structure: The "Culinary Manifold"
The researchers didn't just stop at "it changes." They used a mathematical technique called Tucker decomposition (a way of breaking a large multi-dimensional table of numbers into a small "core" plus one pattern per axis) to find the structure behind the changes. They found that the AI's brain organizes the world into specific "dimensions" or "lanes":
- The "Culinary Manifold": When the AI is in "Chef mode," it jumps into a totally separate lane of thinking that has almost nothing in common with other modes. It's like a secret room in the AI's mind that only opens for cooking.
- The "Access Axis": This is a sliding scale between "Open/Playful" (like a child seeing a slide) and "Blocked/Obstructed" (like a wheelchair user seeing a wall).
This suggests the AI isn't just randomly guessing; it has learned a structured way of seeing the world that prioritizes function over geometry.
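Tucker decomposition itself is standard linear algebra. Below is a toy sketch using a plain higher-order SVD (one common way to compute a Tucker decomposition); the personas × scenes × affordance-features tensor and its shape are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy tensor: 7 personas x 5 scenes x 4 affordance features (made-up data).
T = rng.standard_normal((7, 5, 4))

def unfold(t, mode):
    """Flatten the tensor into a matrix with `mode` as the rows."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def mode_product(t, M, mode):
    """Multiply matrix M into the tensor along the given axis."""
    return np.moveaxis(np.tensordot(M, t, axes=(1, mode)), 0, mode)

# One orthogonal factor matrix per axis, from the SVD of each unfolding.
factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0]
           for m in range(T.ndim)]

# Core tensor: project T onto the factor bases along every axis.
core = T
for m, U in enumerate(factors):
    core = mode_product(core, U.T, m)

# Sanity check: at full rank, core + factors reconstruct T exactly.
T_hat = core
for m, U in enumerate(factors):
    T_hat = mode_product(T_hat, U, m)
print(np.allclose(T, T_hat))  # True
```

The "lanes" the paper describes correspond to directions in these factor matrices: if one persona (say, the chef) loads heavily on a component no other persona touches, that component is its own separate manifold.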
Why This Matters: The "Just-in-Time" World
Currently, most robots and AI systems try to build a static map of the world. They try to create one perfect, 3D model of a room that is true for everyone, all the time.
The paper argues this is the wrong approach.
If 90% of what matters in a room depends on what you are trying to do, then building a "perfect static map" is a waste of energy. You are spending 90% of your computing power describing things that don't matter for your current task.
The New Idea: "Just-in-Time" (JIT) Ontology
Instead of building a full map, the AI should only build the parts of the world it needs right now, based on the task.
- Old Way: "Here is a 3D model of the kitchen with every object labeled."
- New Way (JIT): "I am a chef. I only need to know where the knives and pots are. Ignore the rest."
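A minimal sketch of what a JIT filter might look like: instead of labeling the whole scene up front, keep only the objects relevant to the current task. The task-to-object table below is invented for illustration:

```python
# Hypothetical task-relevance table -- invented for illustration.
TASK_RELEVANT = {
    "chef":  {"knife", "pot", "cutting board", "stove"},
    "guard": {"knife", "exit", "window", "lock"},
    "child": {"chair", "table", "ball"},
}

def jit_view(scene_objects, task):
    """Return only the objects that matter for the current task,
    rather than building a full labeled map of the scene."""
    relevant = TASK_RELEVANT.get(task, set())
    return [obj for obj in scene_objects if obj in relevant]

scene = ["knife", "chair", "pot", "dust", "table", "stove"]
print(jit_view(scene, "chef"))   # ['knife', 'pot', 'stove']
print(jit_view(scene, "child"))  # ['chair', 'table']
```

Note that "dust" never appears in any view: under a JIT ontology, objects irrelevant to every active task simply never get represented.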
The Bottom Line
This paper suggests that intelligence isn't about seeing everything clearly; it's about seeing the right things for your goal.
Just like a human doesn't notice the dust on the shelf when they are hungry for a sandwich, these AI models have learned to ignore the "boring" geometric details and focus entirely on the "useful" functional details.
The Takeaway for Robotics:
If we want robots to be smart, we shouldn't teach them to build a perfect picture of the world. We should teach them to ask, "What am I trying to do?" and then instantly reshape their understanding of the world to fit that goal. The world isn't a fixed stage; it's a set of tools that changes shape depending on who is holding the hammer.