Imagine you are an archaeologist, but instead of digging for pottery shards, you are digging through a cloud of 3D data points to find the hidden "recipe" that created them.
This paper introduces SurfaceBench, a new, extremely difficult test designed to see if Artificial Intelligence (AI) can actually "think" like a scientist when trying to figure out the mathematical laws that govern 3D shapes.
Here is the breakdown of what they did, why it matters, and what they found, using simple analogies.
1. The Problem: The "Flat" vs. The "Round" World
For a long time, AI researchers have tested machines on 2D curves.
- The Old Way: Imagine drawing a line on a piece of paper. The AI looks at the dots and guesses the equation (like y = 2x + 3). This is easy because it's just one line.
- The New Challenge: Real science isn't flat lines; it's 3D surfaces. Think of a sphere, a twisted ribbon, or a complex wave. These shapes live in 3D space, described by coordinates (x, y, z).
The authors realized that current AI tests are like asking a student to solve a math problem on a flat piece of paper, but then expecting them to build a skyscraper. The skills are different. A 3D surface can be described in many different ways (like describing a ball as "round," "a sphere," or "a set of points equidistant from a center"), and current AI tests don't know how to grade that.
2. The Solution: SurfaceBench (The "Gym" for AI)
The researchers built a massive gym called SurfaceBench with 183 different 3D puzzles.
- The Puzzles: Each puzzle is a 3D shape (like a torus, a sphere, or a complex wave) generated by a real scientific formula.
- The Twist: They didn't just give the AI the formula. They gave the AI a cloud of 3D dots (data) and asked, "What is the secret equation that makes these dots form this shape?"
- The Variety: The puzzles come in three flavors:
- Explicit: "Here is the height z for every spot (x, y)" — a rule of the form z = f(x, y).
- Implicit: "Here is a rule that says which points belong inside the shape and which are outside."
- Parametric: "Here is a set of instructions to draw the shape step-by-step."
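To make the three flavors concrete, here is a minimal sketch (an illustration of the idea, not code from the paper) of how the same kind of surface can be written in each form:

```python
import numpy as np

# Explicit: height z as a function of (x, y), e.g. a paraboloid cap.
def explicit_z(x, y):
    return 1.0 - x**2 - y**2  # z = f(x, y)

# Implicit: F(x, y, z) = 0 defines the surface; the sign of F
# tells you whether a point is inside or outside the shape.
def implicit_sphere(x, y, z, r=1.0):
    return x**2 + y**2 + z**2 - r**2  # zero exactly on the sphere

# Parametric: step-by-step drawing instructions — (x, y, z) traced
# out as two parameters (u, v) sweep over their ranges.
def parametric_sphere(u, v, r=1.0):
    x = r * np.sin(v) * np.cos(u)
    y = r * np.sin(v) * np.sin(u)
    z = r * np.cos(v)
    return x, y, z

# Sanity check: a point produced by the parametric form should
# satisfy the implicit form (F is ~0 there).
x, y, z = parametric_sphere(0.0, 0.0)  # the "north pole" (0, 0, 1)
print(round(implicit_sphere(x, y, z), 6))  # → 0.0
```

The benchmark's difficulty comes partly from this variety: a model that only ever learned the explicit form has no obvious way to express a full sphere, which needs the implicit or parametric form.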
3. The New Grading System: "Does it Look Right?"
This is the most clever part of the paper.
In the past, if an AI guessed an equation like x² + y² = 1 and the answer key said u² + v² = 1, the computer would say, "Wrong! The letters are different."
But in the real world, those two equations describe the exact same circle.
- The Old Grader: A strict teacher who only checks if the spelling matches.
- The SurfaceBench Grader: A sculptor. The AI generates a shape based on its guess. The grader then compares the AI's shape to the real shape.
- If the AI's shape is a perfect sphere, even if the math looks weird, the AI gets a high score.
- They use two specific tools to measure this:
- Chamfer Distance: Measures the average gap between the two shapes. (Is the whole thing slightly too big?)
- Hausdorff Distance: Measures the worst gap. (Is there a giant hole or a spike sticking out where it shouldn't be?)
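Both measures can be computed directly on point clouds. The sketch below (a brute-force illustration, not the paper's implementation) compares two clouds A and B by nearest-neighbour distances:

```python
import numpy as np

def pairwise_nearest(A, B):
    # For each point in A, the distance to its nearest neighbour in B.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return d.min(axis=1)

def chamfer(A, B):
    # Average gap, symmetrized: "is the whole thing slightly off?"
    return pairwise_nearest(A, B).mean() + pairwise_nearest(B, A).mean()

def hausdorff(A, B):
    # Worst-case gap: "is there a spike or hole somewhere?"
    return max(pairwise_nearest(A, B).max(), pairwise_nearest(B, A).max())

# Two identical clouds score zero under both metrics.
A = np.random.rand(100, 3)
print(chamfer(A, A), hausdorff(A, A))  # → 0.0 0.0
```

Because the comparison happens between rendered shapes rather than equation strings, any algebraically different but geometrically identical formula gets full credit.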
4. The Results: The AI is "Good at Guessing, Bad at Fine-Tuning"
The researchers tested many different AI models, including the newest "Large Language Models" (LLMs) that are famous for writing code and solving math.
The Findings:
- The "Memorization" Trap: Many AIs tried to cheat by memorizing famous formulas they saw during training, rather than actually figuring out the shape from the dots.
- The "Structure vs. Numbers" Gap: The AI was surprisingly good at guessing the type of shape (e.g., "It's a sine wave!"). But it was terrible at getting the numbers right (e.g., "It's a sine wave, but the height is 5.2, not 5.0").
- Analogy: Imagine the AI correctly identifies a song as "Beethoven's 5th," but when it tries to play it, it hits the wrong notes. The melody is right, but the performance is off.
- The 3D Struggle: The AI struggled the most with complex 3D shapes that required multiple equations working together (like a parametric surface). It's like asking a chef to bake a cake, but the recipe requires three different ovens to be set at different temperatures simultaneously. The AI kept forgetting to turn on one of the ovens.
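The "structure vs. numbers" gap can be illustrated with a toy example (hypothetical, not drawn from the paper): suppose the model correctly guesses the form z = a·sin(x) but proposes a = 5.0, while the hidden recipe uses a = 5.2. Because the structure is right, a single least-squares refit of the constant closes the numeric gap:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
z_true = 5.2 * np.sin(x)   # hidden ground-truth "recipe"

guess_a = 5.0              # right melody, wrong notes
err_before = np.abs(guess_a * np.sin(x) - z_true).max()

# One-parameter linear least squares:
# a = <sin(x), z> / <sin(x), sin(x)>
basis = np.sin(x)
fit_a = basis @ z_true / (basis @ basis)
err_after = np.abs(fit_a * np.sin(x) - z_true).max()

print(round(fit_a, 6))  # → 5.2
print(err_after < err_before)  # → True
```

This is why hybrid pipelines (LLM proposes the structure, a numeric optimizer tunes the constants) are a natural response to this failure mode.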
5. Why This Matters
This paper is a wake-up call for the scientific AI community.
- Current AI is fragile: If you give it noisy data (like a sensor with a glitch), it falls apart.
- We need better tools: We can't just rely on AI to "guess" the math. We need systems that can reason about geometry, not just text.
- The Future: SurfaceBench is now a public tool. It's like a standardized driving test for AI. Before, we only tested if AI could drive in a straight line on a sunny day. Now, we are testing if it can drive a race car through a storm on a winding mountain road.
In a nutshell: The authors built a tough new test to see if AI can truly understand the geometry of the universe. The results show that while AI is getting smarter, it still struggles to turn a rough sketch of a 3D shape into a perfect mathematical blueprint. There is a lot of work left to do before AI can truly replace the human scientist in discovering new laws of physics.