Imagine you have a very smart, but slightly inexperienced, robot driver. You want it to understand the road just like a human does: "Is there a pedestrian?" "How many cars are there?" "Is that truck turning left or right?"
This paper is like a mechanic's diagnostic tool for that robot's brain. The researchers wanted to figure out why the robot sometimes fails at these simple tasks, even though it's supposed to be "smart."
Here is the breakdown of their investigation using simple analogies:
1. The Setup: The "Three-Part Brain"
Think of the robot's brain (the Vision-Language Model) as a three-person team passing a message down a line:
- The Eyes (Vision Encoder): Takes a photo and turns it into a list of visual features.
- The Translator (Projector): Converts those visual features into a language the brain can understand.
- The Thinker (LLM): Reads the message and decides on the answer.
The problem is, when the robot gets an answer wrong, you don't know who messed up. Did the Eyes miss it? Did the Translator mess up the translation? Or did the Thinker just ignore the facts?
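The three-person team above is easiest to see as a pipeline in code. The sketch below is a toy illustration of that structure only, not the paper's actual model; every class and function here is a hypothetical stand-in for a real vision encoder, projector, and LLM.

```python
import numpy as np

rng = np.random.default_rng(0)

def vision_encoder(image):
    """The 'Eyes': turn pixels into a grid of visual feature vectors."""
    # A real encoder (e.g. a ViT) outputs one vector per image patch;
    # here we fake that with random features.
    num_patches, vision_dim = 16, 64
    return rng.standard_normal((num_patches, vision_dim))

class Projector:
    """The 'Translator': map visual features into the LLM's token space."""
    def __init__(self, vision_dim=64, llm_dim=128):
        self.W = rng.standard_normal((vision_dim, llm_dim)) * 0.1

    def __call__(self, features):
        return features @ self.W  # one 'visual token' per patch

def llm_answer(visual_tokens, question):
    """The 'Thinker': read the visual tokens plus the question, answer."""
    # Placeholder decision rule; a real LLM would attend over the tokens.
    return "yes" if visual_tokens.mean() > 0 else "no"

image = None  # stands in for a camera frame
features = vision_encoder(image)                       # Eyes
tokens = Projector()(features)                         # Translator
answer = llm_answer(tokens, "Is there a pedestrian?")  # Thinker
```

The point of spelling it out is the hand-off: an error at any stage silently flows into the next one, which is exactly why a wrong final answer doesn't tell you who messed up.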
2. The Experiment: The "Magic Mirror" (Counterfactuals)
To test this, the researchers created a special kind of "Magic Mirror." They took two pictures that were identical in every single way, except for one tiny detail.
- Example: Picture A has a pedestrian. Picture B is the exact same street, but the pedestrian is gone.
- Example: Picture A shows a truck with the left blinker on. Picture B is the same truck, but the right blinker is on.
They fed these "Magic Mirror" pairs into the robot's brain and watched the electrical signals (activations) as the image passed through the three team members.
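"Watching the electrical signals" amounts to running both images of a pair through the model and comparing activations stage by stage. A minimal sketch of that comparison, assuming a hypothetical `get_activations` hook that returns one activation vector per stage (here filled with random numbers in place of a real model):

```python
import numpy as np

rng = np.random.default_rng(1)

def get_activations(image_path):
    """Hypothetical hook returning one activation vector per stage."""
    return {
        "vision_encoder": rng.standard_normal(64),
        "projector": rng.standard_normal(64),
        "llm": rng.standard_normal(64),
    }

# A counterfactual pair: the same scene, with one detail changed.
acts_with = get_activations("street_with_pedestrian.png")
acts_without = get_activations("street_without_pedestrian.png")

# At which stage do the two images diverge, and by how much?
for stage in acts_with:
    diff = np.linalg.norm(acts_with[stage] - acts_without[stage])
    print(f"{stage}: activation difference = {diff:.2f}")
```

Because the two images differ in exactly one detail, any difference in the activations must be the model's encoding of that detail.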
3. The Detective Work: The "Linear Probe"
The researchers used a simple tool called a Linear Probe. Think of this as a metal detector.
- They asked the metal detector: "Can you find the signal for 'Pedestrian' in this pile of electrical noise?"
- If the detector beeps loudly (high accuracy), it means the concept is clearly stored in that part of the brain.
- If the detector is silent, the concept is lost or hidden.
4. The Big Discoveries
A. What the Robot Sees Clearly vs. What it Misses
- The "Is it there?" Test (Presence): The robot is great at this. If a person is standing there, the "Eyes" and the "Thinker" both know it. It's like a bright, loud alarm.
- The "How many?" Test (Count): The robot is okay at this, but gets a bit fuzzy if the objects are far away.
- The "Which way?" Test (Orientation/Direction): This is where it breaks. The robot often fails to tell if a person is walking left or right.
- The Analogy: Imagine looking at a blurry photo of a person walking. You can see the person (Presence), but you can't tell if they are facing left or right. The "Eyes" see the shape, but the "Thinker" can't figure out the direction.
B. The Distance Problem
The researchers found that distance is the enemy.
- At 5 meters (close up), the robot sees things clearly.
- At 50 meters (far away), the "Eyes" get confused. The signal gets so weak that even the "Thinker" can't make sense of it. It's like trying to read a street sign from a mile away; the letters just blur together.
5. The Two Types of Failure (The Most Important Part)
The researchers realized there are two different ways the robot can fail, and they need different fixes.
Type 1: Perceptual Failure (The "Blind" Robot)
- What happens: The robot literally doesn't see the information. The "metal detector" finds nothing.
- Analogy: You are wearing sunglasses that are too dark. You can't see the red traffic light, so you don't stop.
- The Fix: You need better "Eyes" (a better camera or vision encoder).
Type 2: Cognitive Failure (The "Distracted" Robot)
- What happens: The robot does see the information. The "metal detector" beeps loudly, proving the data is there. But when it has to give an answer, it guesses wrong anyway.
- Analogy: You see the red traffic light clearly. You know it means "Stop." But your brain is so distracted by a song in your head that you accidentally step on the gas. The information was there, but you didn't use it correctly.
- The Fix: You need better training for the "Thinker" to learn how to connect what it sees with the right words and actions.
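The two failure types can be told apart mechanically by comparing the probe's accuracy with the model's answer accuracy on the same attribute. A minimal sketch of that decision rule; the threshold values are illustrative assumptions, not numbers from the paper:

```python
def diagnose_failure(probe_accuracy, answer_accuracy,
                     probe_threshold=0.8, answer_threshold=0.8):
    """Classify a failing attribute as perceptual vs cognitive.

    probe_accuracy:  how well a linear probe reads the concept out of
                     the model's internal activations.
    answer_accuracy: how often the model answers questions about the
                     concept correctly.
    """
    if answer_accuracy >= answer_threshold:
        return "no failure"          # sees it AND says it
    if probe_accuracy >= probe_threshold:
        return "cognitive failure"   # information present, answer wrong
    return "perceptual failure"      # information never made it inside

# Orientation: readable by the probe, yet answered poorly.
print(diagnose_failure(probe_accuracy=0.92, answer_accuracy=0.55))
# prints "cognitive failure"

# Distant objects: missing even from the activations.
print(diagnose_failure(probe_accuracy=0.52, answer_accuracy=0.50))
# prints "perceptual failure"
```

The same observable symptom, a wrong answer, routes to two different fixes depending on which branch fires.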
6. Why This Matters for Self-Driving Cars
Self-driving cars need to handle "long-tail" scenarios—rare, weird situations that don't happen often.
- If a car fails because it's blind (Perceptual), we need better cameras.
- If a car fails because it's confused (Cognitive), we need better software training.
The paper concludes that we can't just blame the whole system; we need to know exactly which part of the brain is failing so we can fix that specific problem. Currently, small, lightweight models (the kind real cars need, because big models are too large and slow to run onboard) are great at seeing that "stuff" is there, but they struggle with "where" and "which way" things are, especially when those things are far away.
In short: The robot isn't just "dumb"; sometimes it's blind, and sometimes it's just not paying attention. We need to figure out which one it is to make our self-driving cars safer.