Imagine you are the captain of a high-tech spaceship (the Autonomous Agent), and you have a very smart, very chatty co-pilot named LLM (the Large Language Model). Your goal is to work together as a team to complete a mission.
The problem? You speak "Human," and the spaceship speaks "Robot Math." The LLM is the translator between you two. Here's the catch: LLMs are great at chatting, but they sometimes make up impossible instructions or miss safety rules, like telling the ship to fly through a mountain because it sounds cool in a story.
This paper proposes a new system to make sure your team doesn't crash. They call it a Safety Filter with a "Reality Check."
Here is how it works, broken down into simple steps:
1. The Translator (The LLM)
You give a command in plain English, like: "Fly to the red zone, then the blue zone, but never go near the black hole, and do it all before lunch."
The LLM translates this into STL (Signal Temporal Logic). Think of STL as a strict, mathematical recipe that the robot can actually follow.
- The Risk: The LLM might write a recipe that looks perfect grammatically but is physically impossible (e.g., "Fly to the moon in 5 seconds").
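To make "mathematical recipe" concrete, here is a toy sketch of the kind of STL object the LLM might emit, together with the standard quantitative semantics: a positive "robustness" score means the trajectory satisfies the spec, negative means it violates it. This is an illustration, not the paper's actual grammar or tooling.

```python
def robustness(formula, traj, t=0):
    """Evaluate an STL formula on a discrete-time trajectory.

    Formulas are nested tuples:
      ("pred", fn)              signed margin fn(state); > 0 means "holds"
      ("not", f)
      ("and", f, g)
      ("eventually", a, b, f)   best margin over steps t+a .. t+b
      ("always", a, b, f)       worst margin over steps t+a .. t+b
    """
    op = formula[0]
    if op == "pred":
        return formula[1](traj[t])
    if op == "not":
        return -robustness(formula[1], traj, t)
    if op == "and":
        return min(robustness(formula[1], traj, t),
                   robustness(formula[2], traj, t))
    if op in ("eventually", "always"):
        _, a, b, f = formula
        window = range(t + a, min(t + b, len(traj) - 1) + 1)
        margins = [robustness(f, traj, k) for k in window]
        return max(margins) if op == "eventually" else min(margins)
    raise ValueError(f"unknown operator: {op}")

# A 1-D stand-in for "reach the red zone (x >= 4) within 5 steps,
# and always stay out of the forbidden region (x < 6)."
spec = ("and",
        ("eventually", 0, 5, ("pred", lambda x: x - 4)),
        ("always", 0, 5, ("pred", lambda x: 6 - x)))
```

On the trajectory `[0, 1, 2, 3, 4, 5]` this spec gets robustness 1 (satisfied, with a little margin to spare); a trajectory that never moves gets a negative score.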
2. The "Reality Check" Filter (The SFF)
This is the star of the show. Before the robot tries to follow the recipe, the recipe passes through a safety filter (SFF).
Instead of just saying "Yes, this works" or "No, this fails," the filter acts like a chef tasting a soup ingredient by ingredient, not just the finished dish.
- The Old Way: If the whole recipe was bad, the filter would just say "Bad Recipe" and stop. You wouldn't know why.
- The New Way: The filter breaks the big recipe down into tiny, individual steps (subformulas). It checks each step one by one.
- Step 1: "Fly to the red zone." -> Pass.
- Step 2: "Fly to the blue zone." -> Pass.
- Step 3: "Do both before lunch while steering clear of the black hole." -> FAIL.
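The step-by-step checking above can be sketched as a simple loop: split the top-level recipe into labeled subformulas, run each through a feasibility oracle, and report exactly which ones fail. The `feasible` oracle is a stand-in for the paper's actual per-subformula check; the labels are illustrative.

```python
def diagnose(conjuncts, feasible):
    """Check each top-level subformula of a mission spec.

    conjuncts: list of (label, subformula) pairs.
    feasible:  oracle returning True if a subformula is achievable
               (a stand-in for the paper's reachability-based check).
    Returns (all_ok, report), where report pairs each label with its verdict.
    """
    report = [(label, feasible(spec)) for label, spec in conjuncts]
    all_ok = all(passed for _, passed in report)
    return all_ok, report

# Toy mission: the third subformula is the one the filter rejects.
mission = [
    ("fly to red zone", "red"),
    ("fly to blue zone", "blue"),
    ("meet the lunch deadline around the black hole", "deadline"),
]
toy_oracle = lambda spec: spec != "deadline"
```

Running `diagnose(mission, toy_oracle)` pinpoints the failing step instead of just declaring the whole recipe bad, which is exactly the diagnostic the feedback loop below needs.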
3. The "Reachability" Test (The Map Check)
How does the filter know a step is impossible? It uses a concept called Reachability Analysis.
Imagine you are standing in a room with a locked door.
- Reachability asks: "If I run as fast as I can, can I actually reach that door before the time runs out?"
- The filter draws an invisible map (a "Tube") of every single place the robot could possibly go given its speed and the obstacles.
- If the instruction asks the robot to go to a spot outside that invisible tube, the filter knows immediately: "That's impossible. The robot can't get there."
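The real reachability analysis in the paper propagates the robot's dynamics forward to build the "tube" of reachable states. As a crude surrogate, here is the simplest possible version of the same question, assuming only a top speed and no obstacles: can the robot cover the straight-line distance before the deadline? The function name and interface are made up for illustration.

```python
import math

def reachable(start, goal, v_max, deadline):
    """Crude reachability check: with top speed v_max and no obstacles,
    can the robot cover the straight-line distance within the deadline?

    A stand-in for a full reachable-set ("tube") computation, which would
    also account for dynamics, acceleration limits, and obstacles.
    """
    distance = math.dist(start, goal)  # Euclidean distance
    return distance <= v_max * deadline
```

With a top speed of 1 m/s, a goal 5 m away is reachable in 5 s but not in 4 s; the real tube computation can only shrink this optimistic answer, never grow it.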
4. The Feedback Loop (The "Why" Explanation)
This is the most important part for the human.
- Old System: "Error: Mission Failed." (You are left guessing what went wrong).
- New System: The filter tells the LLM, "Hey, Step 3 is impossible: dodging the black hole forces a detour, and the robot can't fly fast enough to finish before lunch."
- The LLM then talks to you: "Captain, I tried to translate your order, but the robot can't reach the blue zone before lunch because of the black hole. How about we skip the black hole or give the robot more time?"
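The whole loop, translate, filter, explain, retry, can be sketched in a few lines. Here `ask_llm` and `check` are hypothetical stand-ins for the LLM call and the safety filter; the paper's actual interfaces are not shown in this summary.

```python
def repair_loop(ask_llm, check, request, max_rounds=3):
    """Feedback loop between the LLM translator and the safety filter.

    ask_llm(prompt) -> candidate spec   (stand-in for the LLM)
    check(spec)     -> (ok, diagnosis)  (stand-in for the safety filter)

    On failure, the filter's diagnosis is appended to the prompt so the
    LLM can revise its translation. Returns a passing spec, or None.
    """
    prompt = request
    for _ in range(max_rounds):
        spec = ask_llm(prompt)
        ok, diagnosis = check(spec)
        if ok:
            return spec
        prompt = f"{request}\nPrevious attempt failed: {diagnosis}. Please revise."
    return None
```

The key design choice is that the diagnosis names the failing subformula, so the LLM's second attempt can repair that one step instead of guessing at the whole mission.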
The Big Benefits
- Safety First: It catches dangerous or impossible orders before the robot even starts moving.
- Smart Communication: Instead of a silent failure, the human gets a clear explanation of why a command didn't work.
- Speed: By breaking big, complex instructions into small, simple chunks, the computer can check them much faster (like checking a grocery list item-by-item instead of trying to read the whole list at once).
The Real-World Test
The authors tested this with a tiny flying drone (a Crazyflie).
- Scenario 1: They asked the drone to fly through a "school zone" that was closed. The LLM thought it was fine. The Safety Filter said, "Nope, that zone is closed," and stopped the mission.
- Scenario 2: They asked the drone to do something physically impossible. The filter found the exact impossible step and told the LLM to explain it to the human.
In a Nutshell
This paper builds a smart translator and a strict safety inspector for human-robot teams. It ensures that when a human gives a command, the robot doesn't just blindly try to do it and crash; instead, it checks if the command is physically possible, breaks it down to find the problem, and politely tells the human how to fix it. It turns a "Robot vs. Human" misunderstanding into a helpful conversation.