CircuitSense: A Hierarchical MLLM Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process

The paper introduces CircuitSense, a hierarchical benchmark of over 8,000 circuit problems that evaluates Multi-modal Large Language Models across perception, analysis, and design tasks. It reveals a critical performance gap: models excel at visual recognition but struggle to derive the symbolic equations and perform the mathematical reasoning essential for engineering design.

Arman Akbari, Jian Gao, Yifei Zou, Mei Yang, Jinru Duan, Dmitrii Torbunov, Yanzhi Wang, Yihui Ren, Xuan Zhang

Published 2026-03-03

Imagine you are trying to teach a super-smart robot how to be an electrical engineer. You show it pictures of circuit diagrams (the blueprints of electronic devices) and ask it to do three things:

  1. See: "What parts are in this picture?" (Is that a resistor or a capacitor?)
  2. Think: "If I turn this on, what will happen mathematically?" (Can you write the formula that describes how the electricity flows?)
  3. Build: "Design a new circuit that meets these specific rules."

This paper, CircuitSense, is a giant test designed to see if today's most advanced AI robots (called Multi-modal Large Language Models, or MLLMs) can actually do these things, or if they are just very good at guessing.

The "CircuitSense" Exam

The researchers built a massive exam with over 8,000 questions. They didn't just use old textbook problems; they invented a "synthetic generator" (like a video game engine) to create brand-new, never-before-seen circuit puzzles. This ensures the AI can't just cheat by memorizing answers from the internet.

The exam is organized like a ladder of difficulty:

  • Level 1 (The Bottom): Simple resistor networks (like a basic ladder).
  • Levels 2-4 (The Middle): Increasingly complex circuits that add transistors (the tiny electronic switches inside every chip).
  • Level 5 (The Top): System-level blueprints, like looking at a whole radio or a computer chip as a set of black boxes.
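
To give a flavor of what the bottom rung looks like, here is a minimal sketch (with made-up resistor values) of reducing a small resistor ladder by the usual series/parallel rules; this is the kind of Level-1 reasoning the benchmark starts with:

```python
# Level-1 flavor: collapsing a hypothetical resistor ladder step by step
# using the two basic combination rules. All values are illustrative.

def series(*rs):
    """Resistors in series simply add."""
    return sum(rs)

def parallel(*rs):
    """Resistors in parallel combine by reciprocal sums."""
    return 1 / sum(1 / r for r in rs)

# A two-rung ladder: 100 ohms in series with
# (200 ohms in parallel with a 50 + 150 ohm branch).
r_total = series(100, parallel(200, series(50, 150)))
print(r_total)  # 100 + (200 * 200) / 400 = 200.0
```

The point is that even the "easy" level already requires applying rules in the right order, not just naming parts.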

The test covers three main skills:

  • Perception: Spotting the parts (easy for AI).
  • Analysis: Deriving the math equations from the picture (the hard part).
  • Design: Creating a working circuit from scratch (the hardest part).
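
To make the Perception/Analysis split concrete, here is a toy sketch of the Analysis skill: turning a hypothetical two-resistor voltage divider into its symbolic output equation with SymPy. The circuit and variable names are illustrative, not taken from the benchmark:

```python
# Toy illustration of the "Analysis" tier: given a circuit's topology
# (here, a hypothetical two-resistor voltage divider), derive the
# symbolic output equation rather than merely naming the components.
import sympy as sp

Vin, Vout, R1, R2 = sp.symbols('V_in V_out R1 R2', positive=True)

# Kirchhoff's current law at the output node: the current flowing in
# through R1 equals the current flowing out through R2 to ground.
kcl = sp.Eq((Vin - Vout) / R1, Vout / R2)

solution = sp.solve(kcl, Vout)[0]
print(solution)  # V_out = V_in * R2 / (R1 + R2)
```

This is exactly the kind of picture-to-formula step the paper reports models failing at: the equation is trivial for a student, but it cannot be produced by pattern-matching alone.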

The Big Surprise: The "Eye" vs. The "Brain"

The results were shocking, like finding out a student who aced the reading comprehension test failed the math test.

  • The Eyes are Great: When asked to simply identify parts (e.g., "Point to the capacitor"), the top AI models got it right 85% to 100% of the time. They can "see" the circuit perfectly.
  • The Brain is Broken: When asked to derive the math equation (e.g., "Write the formula for how this circuit amplifies sound"), the same models crashed. Their accuracy dropped to below 19%.

The Analogy:
Imagine showing a human a picture of a car engine and asking, "What is this?" They say, "It's a V8 engine." (Perfect score!).
Then you ask, "If I turn the key, how much torque will the wheels produce at 3,000 RPM?"
The model tries to answer but just guesses random numbers or makes up formulas that look like math but don't work. It's like a person who can recognize a piano but can't play a single note.

Why Does This Matter?

The paper argues that for AI to be a true "engineer's assistant," it needs to do more than just recognize patterns. It needs Symbolic Reasoning.

  • Pattern Matching: "I've seen this shape before; it's usually a resistor." (AI is good at this).
  • True Understanding: "Because this is a resistor connected to a capacitor in this specific way, the voltage will drop by 50% at this frequency." (AI is bad at this).
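
The resistor-capacitor claim in that second bullet can actually be checked with a few lines of numeric code. The sketch below (with assumed, made-up component values) computes the gain of a simple RC low-pass filter and finds the frequency where the output really does drop to half the input:

```python
# Numeric sketch of "true understanding": for an RC low-pass filter,
# the output/input voltage ratio at frequency f is |1 / (1 + j*w*R*C)|
# with w = 2*pi*f. Component values below are assumed for illustration.
import cmath

R = 1_000.0  # resistance in ohms (assumed)
C = 1e-6     # capacitance in farads (assumed)

def gain(freq_hz: float) -> float:
    """Magnitude of the RC low-pass transfer function at freq_hz."""
    w = 2 * cmath.pi * freq_hz
    return abs(1 / (1 + 1j * w * R * C))

# Solve |H| = 1/2 analytically: w*R*C = sqrt(3),
# so f = sqrt(3) / (2 * pi * R * C).
f_half = 3 ** 0.5 / (2 * cmath.pi * R * C)
print(round(gain(f_half), 3))  # close to 0.5
```

A model that truly understands the circuit can derive `f_half` symbolically before plugging in numbers; a pattern-matcher can only recognize that the picture "looks like a filter."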

The researchers found that the AI models that were slightly better at doing the math (deriving equations) were also the only ones that could successfully design new circuits. This proves that math is the bridge between seeing a picture and building a machine. Without the math, the AI is just a sophisticated photo album, not an engineer.

The Takeaway

CircuitSense is a wake-up call. It tells us that while AI is amazing at looking at pictures and chatting, it is still terrible at the core of engineering: translating a visual blueprint into a working mathematical model.

Until AI can reliably do the math behind the picture, it cannot be trusted to design the critical systems that run our world (like the power grid or medical devices). The paper suggests that future AI research needs to focus less on "seeing better" and more on "thinking mathematically."