VeriStruct: AI-assisted Automated Verification of Data-Structure Modules in Verus

Imagine you are building a massive, complex city out of LEGO bricks. You want to be sure that every bridge, tunnel, and skyscraper is structurally sound before anyone lives there. In the world of software, establishing this "structural soundness" is called formal verification: a mathematical way of proving that code does exactly what its specification says it should.

However, doing this manually is like hiring a team of architects to check every single brick by hand. It takes forever, is incredibly expensive, and requires a PhD in math.

Recently, we gave the architects an AI assistant (a Large Language Model or LLM) to help. The AI is great at writing code, but when it comes to checking the math, it often gets confused. It's like giving a brilliant but inexperienced intern a set of blueprints written in a secret language they've never seen before. They might design a beautiful building but forget to check whether the foundation can actually hold the weight.

Enter VeriStruct.

VeriStruct is a new "super-intern manager" designed to help AI assistants verify complex software modules (like data structures) using a tool called Verus (which speaks the language of Rust, a popular programming language).

Here is how VeriStruct works, using some everyday analogies:

1. The Problem: The "Translation" Gap

Imagine you ask an AI to verify a Ring Buffer (a type of data storage that works like a circular conveyor belt).

The AI's Mistake: The AI might try to describe the conveyor belt by listing every single brick's position. This is technically true, but it's so messy that the math checker gets overwhelmed and gives up.
The Syntax Trap: Verus has very strict rules (like "you can't call ordinary executable code from inside a specification"). The AI, having read a lot of general code but little Verus code, often breaks these rules, like trying to use a hammer to drive a screw.

2. The Solution: The "Project Manager" (The Planner)

Instead of just asking the AI, "Fix this code," VeriStruct acts like a Project Manager.

The Planner Module: Before the AI writes a single line, the Planner looks at the code and asks: "What do we actually need here?"
- Do we need a View? (Think of this as a "simplified map" of the data. Instead of showing every brick, the map just says, "Here is the list of items on the belt.")
- Do we need a Type Invariant? (This is a "safety rule" that must always be true, like "The belt must always have at least one empty spot.")
- Do we need Proof Blocks? (These are little hints to the math checker on how to solve a tricky puzzle.)
The Planner decides which tools to use and in what order, so the AI doesn't waste time trying to build a bridge when it only needs to lay a sidewalk.

3. The "Drafting" Phase (Generation)

Once the plan is set, VeriStruct asks the AI to write the annotations (the safety rules and maps).

The Prompt: The AI isn't just told to "do it." It's given a cheat sheet (syntax guidelines) and examples of how other experts solved similar problems. This stops the AI from using the "secret language" incorrectly.
The Refinement: Sometimes the AI draws a map that is too detailed (too many bricks). VeriStruct has a "Refinement Step" that tells the AI: "Hey, simplify this. We don't need to see the individual bricks; just show us the flow of traffic." This makes the math much easier to check.

4. The "Inspector" Phase (Repair)

Even with a great plan, the AI will make mistakes. The code might fail the math check.

The Repair Loop: Instead of giving up, VeriStruct acts like a Quality Control Inspector.
- The math checker says, "Error: You tried to use a hammer on a screw!"
- VeriStruct catches this specific error, routes it to a specialized "Repair Module" (a helper dedicated to fixing that specific type of mistake), and has the AI try again.
- It does this over and over, like a game of "Hot and Cold," until the code passes every check.

The Results: A Supercharged Team

The researchers tested VeriStruct on 11 different complex data structures (like ring buffers, trees, and locks).

Without VeriStruct: A standard AI, called repeatedly without any planning, could verify only 4 out of 11 modules.
With VeriStruct: The system successfully verified 10 out of 11 modules, and checked 99.2% of all the individual functions inside them.

The Big Picture

Think of VeriStruct as the difference between throwing a raw, untrained intern into a construction site and giving them a Project Manager, a Safety Manual, a specialized Tool Kit, and a Quality Control Team.

It doesn't replace the human or the AI; it orchestrates them. It takes the raw creativity of AI and channels it through a rigorous, step-by-step process that understands the strict rules of formal verification. This brings us one giant step closer to a future where our critical software (like self-driving cars or banking systems) can be automatically proven safe, without needing a team of human mathematicians to check every single line of code.

1. Problem Statement

The paper addresses the critical challenge of scaling formal program verification to complex software components, specifically data structure modules, using Large Language Models (LLMs).

Context: While generative AI improves coding productivity, it introduces security risks and correctness errors. Formal verification (using tools like SMT solvers) can mathematically prove code correctness but requires extensive, expert-written logical annotations (preconditions, postconditions, invariants).
Limitations of Current AI-Assisted Verification: Existing LLM-based approaches (e.g., AutoVerus) focus primarily on verifying single functions or textbook algorithms. They struggle with data structure modules because:
1. Complexity: Verifying a module requires generating multiple interdependent artifacts: mathematical abstractions (Views), Type Invariants (properties preserved across all operations), and specifications for multiple methods that must be verified jointly.
2. Domain Specificity: LLMs often lack deep understanding of specialized verification languages like Verus (a Rust verification extension). They frequently misuse syntax (e.g., calling executable functions inside specification contexts) or fail to grasp verification-specific semantics.
Goal: To develop a framework that automates the generation of correct verification annotations for entire data structure modules, reducing the manual burden on developers while maintaining rigorous correctness guarantees.

2. Methodology: The VeriStruct Framework

VeriStruct is an AI-assisted workflow for Verus that orchestrates the generation and repair of verification annotations. It operates as a two-stage pipeline:

Stage 1: Systematic Generation

Instead of asking an LLM to generate all annotations at once, VeriStruct decomposes the task into specialized modules managed by a Planner.

Planner Module: Analyzes the input code and test suite to determine which specific annotation components are necessary (e.g., does this data structure need a View? Does it have complex field relationships requiring Type Invariants?). It outputs an execution plan (e.g., $\langle M_1, M_2, M_3, M_4 \rangle$).
Specialized Generation Modules:
1. View Module: Generates a mathematical abstraction (e.g., mapping a circular buffer to a logical sequence) to hide implementation details.
2. Type Invariant Module: Generates logical formulas that must hold for all instances of the data structure (e.g., head < capacity).
3. Specification Module: Generates preconditions (requires), postconditions (ensures), and specification functions.
4. Proof Block Module: Generates proof hints and loop invariants.
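The shape of the Planner's output can be pictured with a small sketch in plain Rust. However the real Planner reaches its decision, the result is an ordered list of modules to run. The `Features` struct, its three flags, and the rule that the Specification module always runs are assumptions made for illustration; none of them come from the paper.

```rust
/// The four generation modules named in the text (M1..M4).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Module {
    View,
    TypeInvariant,
    Specification,
    ProofBlock,
}

/// Hypothetical summary of what a planner might observe about the input code.
struct Features {
    hides_representation: bool, // e.g. circular indices behind a logical sequence
    has_field_relations: bool,  // e.g. head < capacity
    has_loops: bool,            // loops tend to need invariants or proof hints
}

/// Build an ordered plan: abstraction first, then invariants, specs, proofs.
fn plan(f: &Features) -> Vec<Module> {
    let mut steps = Vec::new();
    if f.hides_representation {
        steps.push(Module::View);
    }
    if f.has_field_relations {
        steps.push(Module::TypeInvariant);
    }
    // Assumption of this sketch: specifications are always generated.
    steps.push(Module::Specification);
    if f.has_loops {
        steps.push(Module::ProofBlock);
    }
    steps
}

fn main() {
    let ring_buffer = Features {
        hides_representation: true,
        has_field_relations: true,
        has_loops: false,
    };
    println!("{:?}", plan(&ring_buffer));
}
```

The ordering mirrors the dependency between artifacts: specifications are written in terms of the View, so the View has to exist first.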
Prompt Engineering: To mitigate LLM hallucinations regarding Verus syntax, prompts include:
- Syntax Guidelines: Extracted from Verus tutorials and standard libraries.
- Step-by-Step Instructions: Explicit procedural guides.
- In-Context Learning: Examples of verified data structures (e.g., doubly-linked lists).
View Refinement: A specific sub-step encourages the LLM to abstract away concrete implementation details (like circular indices) rather than producing trivial Cartesian-product views, which simplifies subsequent proofs.
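To make the View and Type Invariant ideas concrete, here is a minimal sketch in plain Rust rather than Verus syntax. The names `RingBuffer`, `inv`, and `view` are illustrative, and the invariant and view are ordinary runtime functions here; in Verus they would be specification-level definitions checked statically.

```rust
/// A fixed-capacity ring buffer. One slot always stays empty so that
/// `head == tail` unambiguously means "empty".
struct RingBuffer {
    ring: Vec<i32>,
    head: usize, // index of the oldest element
    tail: usize, // index where the next element will be written
}

impl RingBuffer {
    fn new(slots: usize) -> Self {
        assert!(slots > 0);
        RingBuffer { ring: vec![0; slots], head: 0, tail: 0 }
    }

    /// Type-invariant analogue: should hold before and after every method.
    fn inv(&self) -> bool {
        !self.ring.is_empty() && self.head < self.ring.len() && self.tail < self.ring.len()
    }

    /// View analogue: the logical sequence, oldest first, with the
    /// wrap-around indices abstracted away.
    fn view(&self) -> Vec<i32> {
        let n = self.ring.len();
        let len = (self.tail + n - self.head) % n;
        (0..len).map(|i| self.ring[(self.head + i) % n]).collect()
    }

    /// Appends `v`; returns false (and changes nothing) when the buffer is full.
    fn enqueue(&mut self, v: i32) -> bool {
        let n = self.ring.len();
        if (self.tail + 1) % n == self.head {
            return false;
        }
        self.ring[self.tail] = v;
        self.tail = (self.tail + 1) % n;
        true
    }

    /// Removes and returns the oldest element, if any.
    fn dequeue(&mut self) -> Option<i32> {
        if self.head == self.tail {
            return None;
        }
        let v = self.ring[self.head];
        self.head = (self.head + 1) % self.ring.len();
        Some(v)
    }
}

fn main() {
    let mut rb = RingBuffer::new(3);
    rb.enqueue(1);
    rb.enqueue(2);
    rb.dequeue();

    // A postcondition phrased over the view: enqueue appends to the sequence.
    let mut expected = rb.view();
    expected.push(3);
    assert!(rb.enqueue(3)); // this write wraps around the end of `ring`
    assert!(rb.inv());
    assert_eq!(rb.view(), expected);
}
```

A view that simply returned the tuple `(ring, head, tail)` would be equally "correct" but would push all the modular arithmetic into every postcondition. Stating `enqueue` as "the sequence grows by one element" is the kind of simplification the refinement step aims for.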

Stage 2: Iterative Repair

Since a single generation pass rarely yields a fully verified module, VeriStruct employs an iterative repair loop.

Verification Loop: The generated code is passed to the Verus verifier. If errors occur, the error message is analyzed.
Error Routing: A lightweight pattern-matching system routes specific error types to dedicated Repair Modules:
- Mode Misuse: Fixing calls to executable functions within specification contexts.
- Mutability Issues: Correcting Rust ownership/mutability violations.
- Test Failures: Performing interprocedural analysis to strengthen postconditions of methods invoked before a failing test assertion.
- Arithmetic/Type Mismatches: Fixing overflow or type logic errors.
Fallback: If no specific pattern matches, a default module attempts a repair based on the raw error message.
Sampling: Each generation/repair step produces $n$ samples, selecting the one that maximizes the number of successfully verified functions.
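The routing and sampling steps above can be sketched in plain Rust as follows. This is an illustration under stated assumptions, not VeriStruct's implementation: the substring patterns are placeholders rather than actual Verus diagnostics, and `route` and `best_sample` are hypothetical names.

```rust
/// Repair categories listed in the text, plus the default fallback.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Repair {
    ModeMisuse,
    Mutability,
    TestFailure,
    Arithmetic,
    Fallback,
}

/// Route a verifier message to a repair category by substring matching.
/// The patterns are illustrative placeholders only.
fn route(msg: &str) -> Repair {
    let m = msg.to_lowercase();
    if m.contains("exec") && m.contains("spec") {
        Repair::ModeMisuse
    } else if m.contains("mutable") || m.contains("borrow") {
        Repair::Mutability
    } else if m.contains("assertion") {
        Repair::TestFailure
    } else if m.contains("overflow") || m.contains("mismatched types") {
        Repair::Arithmetic
    } else {
        Repair::Fallback
    }
}

/// Among n candidates, each paired with its count of verified functions,
/// keep the one with the highest count (the last such one on ties).
fn best_sample<'a>(samples: &[(&'a str, usize)]) -> Option<&'a str> {
    samples.iter().max_by_key(|s| s.1).map(|s| s.0)
}

fn main() {
    println!("{:?}", route("cannot call exec function from spec code"));
    println!("{:?}", route("something the router has never seen"));

    let samples = [("candidate A", 7), ("candidate B", 11), ("candidate C", 9)];
    println!("{:?}", best_sample(&samples));
}
```

Keeping the router as a cheap pattern match means the LLM is only invoked by the repair module itself, with a prompt tailored to the error category.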

3. Key Contributions

Novel Workflow for Data Structures: Extends AI-assisted verification from single functions to complex modules, introducing the synthesis of Views and Type Invariants as distinct, coordinated tasks.
The VeriStruct Tool: A fully implemented framework integrating a Planner, specialized generation modules, and a multi-module repair system.
Syntax and Semantics Guidance: A robust strategy combining structured prompt guidelines and automated repair loops to overcome LLM limitations in specialized verification languages.
Comprehensive Evaluation: A benchmark suite of 11 diverse Rust data structures (including concurrent locks, trees, and bitmaps) used to validate the approach.

4. Experimental Results

The authors evaluated VeriStruct on 11 data structure benchmarks containing a total of 129 functions.

Success Rate: VeriStruct successfully verified 10 out of 11 benchmarks.
Function Coverage: It verified 128 out of 129 functions (99.2%).
Comparison with Baselines:
- Simple Baseline (Iterative LLM calls without planning): Solved only 4/11 benchmarks and verified 52/129 functions.
- Claude Code (Agentic approach): Solved 8/11 benchmarks and verified 102/129 functions.
- VeriStruct: Outperformed both, verifying significantly more functions with comparable or lower token consumption (22k tokens vs. 24k for Claude Code).
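For comparison on a common scale, the function counts reported above convert to the following coverage rates (each over the same 129 functions, rounded to one decimal place):

$$
\frac{52}{129} \approx 40.3\%, \qquad \frac{102}{129} \approx 79.1\%, \qquad \frac{128}{129} \approx 99.2\%
$$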
Efficiency: The framework required at most 13 LLM invocations per benchmark, suggesting that the structured workflow converges in fewer steps than unstructured agentic loops.
Case Study (Bitmap): The LLM generated a solution for the Bitmap benchmark that was fundamentally different from the human-written ground truth (using a simpler 1D array abstraction instead of a 2D one), showing that the system can find valid verification strategies that differ from the human-written one.

5. Significance

Scalability of Formal Verification: VeriStruct demonstrates that AI can move beyond verifying trivial algorithms to handling the complex, interconnected logic of real-world data structure libraries.
Reducing the Annotation Burden: By automating the creation of Views, invariants, and proofs, the framework reduces the manual effort required to adopt formal verification, making it more accessible to Rust engineers.
Robustness against LLM Hallucinations: The combination of syntax-guided prompts and specialized repair modules mitigates LLM errors in Verus syntax and semantics, and because every candidate is checked by the Verus verifier, only annotations that actually verify are accepted.
Foundation for Verified Libraries: Successfully verifying reusable data structure modules allows client code to rely on these components, lowering the barrier for verifying larger, more complex systems built upon them.

In conclusion, VeriStruct represents a significant step toward automatic, AI-assisted formal verification at scale, bridging the gap between the theoretical power of formal methods and the practical constraints of modern software development.