Imagine you are teaching a very smart, but somewhat literal, robot driver how to navigate the world. You show it a video of a cyclist on the road and ask, "What should the car do?"
The robot might say, "I will stay behind the cyclist."
Then you ask, "But what if the cyclist is going very slowly and blocking traffic? What if a passenger is in a huge rush?"
The robot might say, "Okay, I will overtake."
But here is the scary part: Is the robot actually thinking about the rush or the traffic? Or is it just making up a nice-sounding story after it has already decided what to do?
This is the problem the paper CARE-Drive tries to solve.
The Problem: The "Post-Hoc" Excuse
Think of a student taking a test.
- Scenario A: The student solves the math problem correctly, then writes down the steps they took to get there. This is Reason-Responsive. The reasoning caused the answer.
- Scenario B: The student guesses the answer, gets it right by luck, and then writes down a fancy explanation that looks like they did the math, even though they didn't. This is Post-Hoc Rationalization (making up an excuse after the fact).
Current AI models for driving are often like the student in Scenario B. They can generate a perfect-sounding explanation ("I overtook because the cyclist was slow"), but we don't know if that reason actually made them overtake, or if they would have overtaken anyway and just used that reason as a cover-up.
In safety-critical situations (like driving), this is dangerous. If an AI stops because of a glitch but tells us, "I stopped because I saw a child," we may end up trusting it far more than we should.
The Solution: The "CARE-Drive" Test
The authors created a framework called CARE-Drive (Context-Aware Reasons Evaluation for Driving). Think of it as a lie detector test for AI decision-making.
Instead of just asking the AI "What would you do?", CARE-Drive plays a game of "What If?" to see if the AI's brain actually changes its mind when the reasons change.
The Analogy: The Traffic Light Game
Imagine the AI is a driver at a crossroads.
- The Baseline: You ask the AI, "Should I pass this cyclist?" It says "No."
- The Test: You give the AI a specific reason: "Pass the cyclist because the passenger is late for a wedding."
- The Observation:
- If the AI says, "Okay, I will pass," it is Reason-Responsive. The reason changed its behavior.
- If the AI still says "No," or if it says "Yes" but gives a totally different reason that ignores the wedding, it might be Reason-Insensitive (it's just guessing or following a hidden rule).
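The "What If?" probe above can be sketched in a few lines of code. This is a toy illustration, not the paper's actual evaluation harness: `query_model` is a hypothetical stand-in for a real driving-model API, implemented here as a trivial rule so the example runs on its own.

```python
# Minimal sketch of a reason-responsiveness probe in the spirit of CARE-Drive.
# `query_model` is a toy stand-in for a real model call (an assumption for
# illustration); a real evaluation would query an actual driving model.

def query_model(scene, reason=None):
    """Toy stand-in: stays behind the cyclist unless an explicit reason to
    pass is given and there is no oncoming traffic in the scene."""
    if reason and "oncoming" not in scene:
        return "pass"
    return "stay"

def is_reason_responsive(scene, reason):
    """True if supplying the reason actually flips the model's decision."""
    baseline = query_model(scene)             # the Baseline: no reason given
    with_reason = query_model(scene, reason)  # the Test: same scene + reason
    return baseline != with_reason

# Clear road: the urgency reason flips "stay" -> "pass".
print(is_reason_responsive("cyclist ahead, clear road",
                           "passenger is late for a wedding"))  # True

# Oncoming traffic: safety dominates, the reason changes nothing.
print(is_reason_responsive("cyclist ahead, oncoming car",
                           "passenger is late for a wedding"))  # False
```

The key design point is the paired query: the same scene is shown twice, with the reason as the only difference, so any change in the decision can be attributed to the reason itself.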
How They Did It (The Experiment)
The researchers set up a specific scenario: Overtaking a cyclist.
- The Conflict: In real life, you have to balance Safety (don't hit the oncoming car), Legality (don't cross the double yellow line), and Efficiency/Comfort (don't annoy the cyclist or the passenger).
- The Setup: They showed the AI a video of a cyclist.
- Group 1 (The Control): They asked the AI what to do with no extra reasons.
- Group 2 (The Test): They gave the AI a list of "Human Reasons" (e.g., "Prioritize safety," "Consider passenger urgency," "Follow traffic laws").
They then changed the "context" (the situation) like a video game:
- What if there is a car coming the other way?
- What if a car is honking behind us?
- What if the passenger is screaming "Hurry up!"?
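The context sweep above can be pictured as a simple grid: every context paired with every reason, with the model's decision recorded in each cell for comparison against expert judgments. The sketch below uses an assumed toy `decide` stub and made-up context strings purely to show the shape of the experiment, not the paper's actual prompts or model.

```python
# Illustrative sweep over context variations, CARE-Drive style. The context
# strings, reason list, and the `decide` stub are assumptions for the sake
# of a runnable example.

from itertools import product

contexts = [
    "clear road",
    "oncoming car",
    "car honking behind",
    "passenger shouting hurry up",
]
reasons = [None, "prioritize safety", "consider passenger urgency"]

def decide(context, reason=None):
    """Toy model: never passes into oncoming traffic; otherwise passes
    only when urgency is given as an explicit reason."""
    if "oncoming" in context:
        return "stay"
    return "pass" if reason == "consider passenger urgency" else "stay"

# Record the decision for every (context, reason) pair, so each cell can be
# compared against what human experts judged appropriate in that situation.
table = {(c, r): decide(c, r) for c, r in product(contexts, reasons)}

for (c, r), action in sorted(table.items(), key=lambda kv: str(kv[0])):
    print(f"{c:30s} | reason={str(r):28s} -> {action}")
```

A cell where the decision never changes, no matter which reason is supplied, is exactly the "Reason-Insensitive" signature the framework is designed to catch.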
The Results: What Did They Find?
The results were a mix of good news and "it's complicated" news.
- The AI Can Be "Trained" to Listen: When they gave the AI a structured list of human reasons (like a rulebook), the AI started making decisions that matched what human experts thought was right. It stopped being a "rule-follower" who never takes risks and started being a "reasoner" who weighs options.
- The "Thinking Style" Matters: They found that if they told the AI to "Think step-by-step" (Chain of Thought) or "Explore different options before deciding" (Tree of Thought), it did a much better job of using the reasons. It was like giving the AI a moment to pause and think, rather than just blurting out an answer.
- It's Not Perfectly Human Yet:
- Good: The AI got very sensitive to Safety. If the oncoming car was too close, it wouldn't pass, no matter how much the passenger yelled.
- Bad: The AI got weird about Urgency. When they told the AI "The passenger is in a hurry," the AI actually became more conservative and less likely to overtake! It seems the AI interpreted "hurry" as "don't take risks," whereas humans often interpret "hurry" as "take a calculated risk."
- The "Short Explanation" Trap: When they forced the AI to give a very short answer (like a text message), it almost never overtook. It seems the AI needs "space" to explain its reasoning to actually make the decision.
Why Does This Matter?
This paper is a big step toward Meaningful Human Control.
Imagine you are a passenger in a self-driving car. You want to know: "Did this car stop because it saw me, or because it had a glitch?"
If the car's decision-making is Reason-Responsive, we can trust that its actions are tied to the reasons it gives us. If it's just making up stories, we can't trust it.
CARE-Drive gives us a tool to check the AI's "conscience" without needing to open up its brain and look at the code. It's like checking if a driver is actually looking at the road, or just staring at a map and pretending they see the traffic.
The Takeaway
The paper shows that we can teach AI to make decisions based on human values (like safety and efficiency), but we have to test them carefully. We can't just ask them to "be nice"; we have to poke and prod them with different situations to see if they actually care about the reasons we give them.
In short: CARE-Drive is the "truth serum" that helps us figure out if our robot drivers are actually thinking, or just talking the talk.