Imagine you have built a super-smart robot chef (a Neural Network) that can cook amazing meals. You want to make sure this chef never burns the food or serves something poisonous. To do this, you hire a safety inspector (a Verification Tool).
The Problem: The Language Barrier
The problem is that the safety inspector speaks only one language: Math. They can only understand instructions like, "If you change the temperature of the oven by exactly 5 degrees, the food must still be safe."
But you, the human owner, think in stories and concepts. You want to say things like:
- "The chef shouldn't burn the meal if the spicy pepper is hidden under the sauce."
- "The credit approval shouldn't change if the applicant is under 50 years old."
Currently, you have to be the translator. You have to manually figure out exactly which pixels in a photo represent the "spicy pepper," or which column in a spreadsheet represents "age," and then write a formal mathematical constraint for the inspector. If you get the coordinates wrong, the inspector can't help you. This is slow, tedious, and error-prone.
The Solution: The "Talking with Verifiers" Bridge
This paper introduces a new team member: an Automatic Translator (an AI-powered pipeline). This team member sits between you (the human) and the safety inspector (the math tool).
Here is how the new process works, using a simple analogy:
1. The Translator (The Parser)
You tell the translator your worry in plain English: "What if the bird's beak is covered?"
The translator (using a Large Language Model) understands your intent. It doesn't just hear words; it understands the concept of a "beak" and the action "covering."
2. The Detective (The Detector)
The translator then asks a Detective (a Vision AI model) to look at the specific picture of the bird you are worried about.
- Old way: You had to tell the detective, "Look at pixels 100 to 200."
- New way: The translator tells the detective, "Find the beak."
The detective scans the image, finds the beak, and says, "Ah, the beak is right here, in this specific square box."
3. The Architect (The Specification Generator)
Now, the translator takes the detective's findings and builds a math instruction that the safety inspector can understand.
- It says to the inspector: "Okay, take this specific box where the beak is, pretend it's covered in black paint, and check if the robot chef still thinks it's a bird."
4. The Inspector (The Verifier)
The safety inspector receives this new, clear math instruction. It runs its calculations and gives you a definitive answer: "SAFE" (the chef is still smart even with a covered beak) or "UNSAFE" (here is a picture where the chef gets confused).
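The four steps above can be sketched in code. This is a minimal illustration, not the paper's implementation: the parser and detector are stubs standing in for a language model and a vision model, the bounding box and "classifier" are invented, and the "verifier" merely samples a few points inside the input box, whereas a real verifier proves the property for every input in the box.

```python
import numpy as np

# Stage 1 -- the Translator (parser stub). A real system would use a
# large language model; here we just pull the concept out of the query.
def parse_query(query: str) -> str:
    # e.g. "What if the bird's beak is covered?" -> "beak"
    return query.split("'s ")[1].split(" ")[0]

# Stage 2 -- the Detective (detector stub). A real system would run a
# vision model; here we pretend the concept sits at a fixed location.
def detect_region(image: np.ndarray, concept: str) -> tuple:
    """Return a (row, col, height, width) bounding box for the concept."""
    return (2, 2, 3, 3)

# Stage 3 -- the Architect. Turn the box into interval bounds: pixels
# inside the box are forced to 0 ("painted black"), all others are
# pinned to their original values.
def build_spec(image: np.ndarray, box: tuple):
    lo, hi = image.copy(), image.copy()
    r, c, h, w = box
    lo[r:r + h, c:c + w] = 0.0
    hi[r:r + h, c:c + w] = 0.0
    return lo, hi

# Stage 4 -- the Inspector (toy stand-in). We only sample three points
# from the box [lo, hi]; a genuine verifier (e.g. one based on bound
# propagation) would certify the property for ALL inputs in the box.
def verify(classify, lo, hi, expected: str) -> bool:
    for sample in (lo, hi, (lo + hi) / 2):
        if classify(sample) != expected:
            return False
    return True

# End-to-end run with a toy image and a toy classifier.
image = np.ones((8, 8))
concept = parse_query("What if the bird's beak is covered?")
box = detect_region(image, concept)
lo, hi = build_spec(image, box)
classify = lambda x: "bird" if x.sum() > 10 else "not bird"
result = verify(classify, lo, hi, expected="bird")  # SAFE -> True
```

The key point the sketch shows: the human only supplies the plain-English question; every downstream stage, from the box coordinates to the interval bounds, is derived automatically.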
Why is this a big deal?
1. It speaks your language.
You don't need to be a math genius or a programmer. You can just ask the system questions in natural language, just like you would ask a colleague.
2. It handles the "Unseen" stuff.
Imagine a photo of a bird. Every bird looks different. One has its beak on the left, another on the right. Old tools couldn't handle this because they needed a fixed rule for every single picture. This new system is like a smart spotlight. It finds the beak wherever it is in the picture, then checks that specific spot.
3. It works for many things.
The paper shows this works for:
- Images: "Is the car safe if the stop sign is covered by a tree?"
- Audio: "Will the alarm still ring if the drilling noise gets louder?"
- Spreadsheets: "Does the loan get rejected if the age is under 50?"
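The spreadsheet case works the same way, just without the detector: the concept ("age") maps directly to a column. Below is a hedged sketch, with invented feature names and a toy model, of how "the decision shouldn't change if the applicant is under 50" becomes an interval over one feature while the rest stay fixed. The sampling check again stands in for a real verifier, which would cover all ages in the range.

```python
# Hypothetical tabular specification: let "age" range over [18, 49]
# while every other feature is pinned, and require the decision to be
# constant. Feature names and the model are illustrative only.

def age_invariance_spec(applicant: dict) -> tuple:
    """Build lower/upper feature vectors for the under-50 age interval."""
    lo = dict(applicant, age=18)
    hi = dict(applicant, age=49)
    return lo, hi

def toy_verify(model, lo: dict, hi: dict) -> bool:
    """Sample every integer age in the interval; a real verifier would
    prove invariance over the continuous range."""
    decisions = {model(dict(lo, age=a))
                 for a in range(lo["age"], hi["age"] + 1)}
    return len(decisions) == 1

# A model that ignores age satisfies the property ...
income_model = lambda x: "approve" if x["income"] > 40_000 else "reject"
# ... while a model that gates on age violates it.
age_model = lambda x: "approve" if x["age"] >= 40 else "reject"

applicant = {"age": 30, "income": 50_000}
lo, hi = age_invariance_spec(applicant)
```

Here `toy_verify(income_model, lo, hi)` holds, while `toy_verify(age_model, lo, hi)` fails and a real verifier would hand back the offending age as a counterexample.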
The Bottom Line
Think of this paper as building a universal remote control for safety checks. Before, you had to manually wire every single button to the machine's internal circuits. Now, you just press a button labeled "Check the beak," and the system automatically figures out the wiring, finds the beak, and runs the test.
It doesn't change how the safety inspector works (the math is still the same); it just makes it possible for anyone to ask the inspector the right questions. This makes AI safety much more accessible and practical for the real world.