Pneuma-Seeker: A Relational Reification Mechanism to Align AI Agents with Human Work over Relational Data

Imagine you are a detective trying to solve a complex case, but instead of a crime scene, you have a massive library filled with millions of messy, unorganized filing cabinets. You know you need to find a specific piece of evidence to solve the mystery, but you can't quite describe exactly what you're looking for yet.

In the past, if you asked a smart computer assistant (like an AI) to "find the evidence," it might guess, pull out a random file, and confidently tell you, "Here is the answer!" But often, the AI would be wrong, hallucinating facts or missing crucial details because your question was too vague.

Pneuma-Seeker is a new system designed to fix this. Instead of just guessing the answer, it acts like a collaborative architect who helps you draw up a blueprint before any construction starts.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Vague Wish"

Imagine you tell a chef, "I want a delicious dinner."

The Old Way: The chef guesses you want pizza, makes it, and serves it. You realize you actually wanted a salad, but the pizza is already gone. The chef is frustrated because you didn't give clear instructions.
The Pneuma-Seeker Way: The chef says, "Okay, let's write down a recipe together first. Do you want it spicy? Do you want meat or veggies? What's the budget?"

In the world of data, users often have a "vague wish" (e.g., "Why are we losing money on shipping?"). Pneuma-Seeker doesn't jump to the answer. Instead, it helps you turn that vague wish into a concrete blueprint called a Relational Schema. Think of this schema as a detailed shopping list and a map of exactly which ingredients (data columns) you need and how they should be mixed.

2. The Core Idea: "Blueprinting" Before Building

The paper calls this "Relational Reification." That's a fancy way of saying: "Let's turn your idea into a concrete object we can both look at."

The Blueprint (The Schema): The system proposes a table structure: "Okay, to answer your question, we need a table with columns for 'Item Type,' 'Shipping Date,' and 'Return Rate.' Does that look right?"
The Conversation: You look at the blueprint and say, "Wait, 'Item Type' is too broad. I only care about 'Radioactive Materials,' not all hazardous stuff."
The Correction: The system updates the blueprint. Now, both you and the AI agree on exactly what the final data table should look like before the AI goes hunting for the data.

This stops the AI from guessing. It's like agreeing on the house design before the construction crew starts laying bricks.

3. The Team of Agents: The "Conductor" and the "Workers"

Pneuma-Seeker isn't just one AI talking to you; it's a team of specialized workers managed by a Conductor (like an orchestra leader).

The Conductor: This is the project manager. It listens to your vague idea, helps build the blueprint, and then assigns tasks. It doesn't do the heavy lifting itself; it just keeps the plan on track.
The Retriever (The Scout): This agent runs to the library (the database) to find the right filing cabinets. It doesn't bring everything back; it brings only the files relevant to your blueprint.
The Materializer (The Builder): This agent takes the files the Scout found and actually builds the table (the blueprint) using the rules you agreed on.
The Micro-Context Manager (The Detective's Magnifying Glass): This is less a separate worker than a tool the workers share. Sometimes the files are huge (millions of rows), and the AI can't read the whole book at once. So the AI is allowed to ask the database specific questions like, "Hey, do you have any rows from 2025?" or "Show me the top 10 values in this column." It lets the AI "peek" at the data instead of making assumptions.

4. Why This Matters: Trust and Transparency

In the old days, if an AI gave you a wrong answer, you had to take its word for it. It was a "black box."

With Pneuma-Seeker, because you built the blueprint together, you can see exactly how the answer was made.

The "Why" is Visible: The system shows you the path it took: "I found these three tables, I joined them on this column, and I filtered out these rows."
No Magic: If the answer is wrong, you can look at the blueprint and say, "Ah, you used the wrong definition for 'hazardous materials'." You can fix the blueprint, and the system re-runs the calculation.
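
The visible path described above can be pictured as a small dependency graph. The following is a minimal sketch, assuming the path is stored as a plain dictionary from each derived step to its inputs. The table and step names (`shipments`, `items`, `returns`, `joined`, `filtered`, `answer`) are made up for illustration and are not the system's actual data structures.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical provenance graph: each derived node maps to the inputs it was built from.
PROVENANCE = {
    "joined": {"shipments", "items", "returns"},  # three source tables joined
    "filtered": {"joined"},                       # unwanted rows filtered out
    "answer": {"filtered"},                       # final number shown to the user
}

def lineage(node, graph):
    """Return every upstream node that `node` depends on, directly or indirectly."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in graph.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def stale_after_change(node, graph):
    """Return the derived nodes that must be recomputed when `node` changes."""
    return {n for n in graph if node in lineage(n, graph)}

def replay_order(graph):
    """Return an order in which every input is processed before the steps using it."""
    return list(TopologicalSorter(graph).static_order())
```

Asking for `lineage("answer", PROVENANCE)` gives the "receipts" behind the final number, and `stale_after_change("joined", PROVENANCE)` lists what has to be re-run after the join is corrected.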

5. The Real-World Test

The researchers tested the system on a public benchmark and on real procurement data from the University of Chicago. They found that:

Accuracy: It got the right answers much more often than other AI systems because it didn't guess; it followed the agreed-upon blueprint.
Trust: Users felt more confident in the results because they could see the "receipts" (the data lineage) showing where the numbers came from.
Efficiency: It saved time. Instead of spending weeks manually hunting for data and arguing about definitions, the team could iterate on the blueprint in minutes.

Summary Analogy

Think of Pneuma-Seeker as a smart GPS for data.

Old AI: You say "Go to the beach," and it drives you to a random parking lot, gets lost, and says, "Here is the beach!" (You are confused and wet).
Pneuma-Seeker: You say "I want to go to the beach." The GPS says, "Okay, let's define 'beach.' Do you mean the sandy one or the rocky one? Do you want to swim or sunbathe?" Once you agree on the destination and the route (the blueprint), it drives you there, showing you the map every step of the way so you know you're on the right track.

It turns the chaotic, confusing process of finding data into a structured, collaborative conversation where the human and the machine build the solution together, step-by-step.

Here is a detailed technical summary of the paper "Pneuma-Seeker: Relational Reification of Information Needs for Agentic Data Discovery and Preparation."

1. Problem Statement

The paper addresses the persistent bottleneck in data management: data discovery and preparation. While Large Language Models (LLMs) can interpret natural language, they struggle when user information needs are vague, evolving, or under-specified.

The Gap: Users often cannot articulate precise data requirements upfront. Their needs evolve through iteration (e.g., realizing "hazardous materials" needs to be split into "radioactive" vs. "toxic").
LLM Limitations: Directly prompting LLMs to answer questions leads to brittleness, hallucinations (inventing fields), incorrect join assumptions, and ungrounded reasoning, especially when dealing with heterogeneous, large-scale tabular corpora.
Current Solutions: Traditional tools (BI dashboards, catalogs) require manual iteration and expert knowledge to translate high-level questions into concrete data queries, which is time-consuming and error-prone.

2. Core Methodology: Relational Reification

The central innovation of Pneuma-Seeker is Relational Reification. Instead of treating a user's question as a prompt to be answered directly, the system treats it as a specification that must be negotiated and made concrete.

The Concept: The system reifies the user's evolving information need ($I^{+}$) into an explicit Relational Schema defined as a pair $(T, S)$:
- $T$ (Target Tables): A set of derived views or tables representing the data model required to answer the question.
- $S$ (Transformation): An executable program (SQL or Python) that transforms source data into $T$ and computes the final answer.
The Workflow:
1. Iterative Refinement: The system proposes a schema $(T, S)$. The user inspects it, critiques assumptions (e.g., "hazardous" is too broad), and refines the schema.
2. Discovery & Preparation: Once the schema is agreed upon, the system discovers relevant source tables, materializes $T$ by joining/filtering sources, and executes $S$ to produce the final document $D$.
3. Provenance: The system records the derivation as a Directed Acyclic Graph (DAG), allowing users to trace exactly how data was transformed, ensuring trust and inspectability.
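
To make the pair $(T, S)$ and the refinement step more tangible, here is a minimal sketch using Python's built-in `sqlite3` module. Everything in it is an assumption made for illustration: the `Reification` class, the `shipments` source table, its columns, and the split of $S$ into a build query and an answer query are hypothetical, not the paper's implementation.

```python
import sqlite3
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Reification:
    """Hypothetical container for the pair (T, S)."""
    targets: dict      # T: target table name -> expected column names
    build_sql: dict    # S, part 1: SELECT deriving each target from source tables
    answer_sql: str    # S, part 2: query over T that computes the answer

def demo_connection():
    """Build a tiny in-memory source table (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE shipments (item TEXT, hazard_class TEXT, returned INTEGER)")
    conn.executemany(
        "INSERT INTO shipments VALUES (?, ?, ?)",
        [
            ("uranium sample", "radioactive", 1),
            ("cesium source", "radioactive", 0),
            ("bleach", "toxic", 0),
            ("paper", None, 0),
        ],
    )
    return conn

def materialize_and_answer(conn, spec):
    """Materialize every table in T, check its columns, then run the answer query."""
    for name, columns in spec.targets.items():
        conn.execute(f"DROP TABLE IF EXISTS {name}")
        conn.execute(f"CREATE TABLE {name} AS {spec.build_sql[name]}")
        # PRAGMA table_info yields one row per column; index 1 holds the column name.
        actual = [row[1] for row in conn.execute(f"PRAGMA table_info({name})")]
        if actual != columns:
            raise ValueError(f"{name}: expected columns {columns}, got {actual}")
    return conn.execute(spec.answer_sql).fetchall()

# First proposal: "hazardous" means any shipment with a hazard class.
BROAD = Reification(
    targets={"hazmat_shipments": ["item", "returned"]},
    build_sql={
        "hazmat_shipments":
            "SELECT item, returned FROM shipments WHERE hazard_class IS NOT NULL"
    },
    answer_sql="SELECT AVG(returned) FROM hazmat_shipments",
)

# After the user's critique: only radioactive materials count.
REFINED = replace(
    BROAD,
    build_sql={
        "hazmat_shipments":
            "SELECT item, returned FROM shipments WHERE hazard_class = 'radioactive'"
    },
)
```

Running `materialize_and_answer(demo_connection(), BROAD)` and then the same call with `REFINED` shows how a single change to the shared specification changes the answer, while the target schema the user agreed to stays fixed.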

3. System Architecture

Pneuma-Seeker is an LLM-powered agentic system designed with two layers of context management to handle large, heterogeneous datasets efficiently.

A. Macro-Level Context Management (Orchestration)

To mitigate LLM context window limits and performance degradation, the workflow is decomposed into specialized agents:

Conductor (Orchestrator): The planner. It interacts with the user, manages the state of $(T, S)$, and decides the next steps (e.g., retrieve more tables, refine schema, execute). It loops until the user is satisfied or a max iteration count is reached.
Retriever: Searches the data catalog (and web) to find relevant tables based on the current schema requirements. It uses a hybrid approach combining semantic search (Pneuma) with content-aware scanning (regex matching for specific values).
Materializer: Responsible for constructing the tables in $T$. It prefers structured relational operators (joins, unions) over free-form code generation to reduce errors, falling back to Python only when necessary.
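
The division of labor among these agents can be sketched as a bounded control loop. In the sketch below the Conductor's decision is a fixed rule rather than an LLM call, and the `State` class, the `CATALOG` contents, and both stand-in agents are hypothetical simplifications. It only illustrates the loop-until-done-or-max-iterations pattern described above.

```python
from dataclasses import dataclass, field

# Hypothetical data catalog: table name -> column names.
CATALOG = {
    "shipments": ["item", "ship_date", "hazard_class"],
    "returns": ["item", "return_date"],
    "payroll": ["employee", "salary"],
}

@dataclass
class State:
    """Shared state the Conductor tracks between steps (illustrative only)."""
    needed_columns: set
    found_tables: dict = field(default_factory=dict)
    materialized: bool = False
    log: list = field(default_factory=list)

def retriever(state):
    """Stand-in Retriever: keep catalog tables that share a column with the need."""
    for table, columns in CATALOG.items():
        if state.needed_columns & set(columns):
            state.found_tables[table] = columns

def materializer(state):
    """Stand-in Materializer: a real one would join and filter the found tables."""
    state.materialized = True

def conductor(state, max_iterations=5):
    """Pick the next action until the work is done or the iteration cap is hit."""
    for _ in range(max_iterations):
        covered = {c for columns in state.found_tables.values() for c in columns}
        if not state.needed_columns <= covered:
            state.log.append("retrieve")
            retriever(state)
        elif not state.materialized:
            state.log.append("materialize")
            materializer(state)
        else:
            state.log.append("done")
            break
    return state
```

The cap matters when the need cannot be met: a request for a column that no catalog table has keeps triggering retrieval until `max_iterations` runs out, instead of looping forever.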

B. Micro-Level Context Management (Interaction)

Even with retrieval, loading full tables into the LLM context is infeasible (e.g., millions of rows).

Strategy: The system does not passively feed data to the LLM. Instead, it empowers the LLM to actively probe the database.
Mechanism: The LLM can invoke executable scripts via the DBService to run targeted queries (e.g., "Check if 2025 records exist," "Summarize distribution of column X"). This allows the LLM to ground its reasoning in empirical data distributions rather than hallucinating based on schema names alone.
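
The two example probes mentioned above might look like the following sketch, which uses Python's built-in `sqlite3` module as a stand-in for the database service. The function names, the `orders` table, and its data are hypothetical, and the paper's DBService interface is not reproduced here. Table and column names are interpolated into the SQL directly, so the sketch assumes they come from a trusted schema rather than from user input.

```python
import sqlite3

def probe_year_exists(conn, table, date_column, year):
    """Probe: does `table` hold any row whose ISO-formatted date falls in `year`?"""
    sql = (
        f"SELECT EXISTS(SELECT 1 FROM {table} "
        f"WHERE substr({date_column}, 1, 4) = ?)"
    )
    return bool(conn.execute(sql, (str(year),)).fetchone()[0])

def probe_top_values(conn, table, column, k=10):
    """Probe: the k most frequent values of `column`, with their row counts."""
    sql = (
        f"SELECT {column}, COUNT(*) AS n FROM {table} "
        f"GROUP BY {column} ORDER BY n DESC, {column} LIMIT ?"
    )
    return conn.execute(sql, (k,)).fetchall()

def demo_orders():
    """Build a tiny in-memory table (made-up data) to probe."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_date TEXT, category TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [
            ("2023-01-05", "toxic"),
            ("2024-11-02", "toxic"),
            ("2024-12-15", "radioactive"),
            ("2024-12-20", "toxic"),
        ],
    )
    return conn
```

Each probe returns a handful of values regardless of table size, which is the point: the LLM sees a small, factual summary instead of the rows themselves.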

4. Key Contributions

Relational Reification of Intent: Formalizing the idea that evolving information needs should be represented as a concrete, inspectable relational schema $(T, S)$ rather than a natural language prompt. This bridges the gap between human intent and machine execution.
Context Management Techniques:
- Macro: Decomposing tasks into specialized agents (Conductor, Materializer, Retriever) to manage workflow complexity.
- Micro: Introducing an action space for LLMs to execute database probes, enabling fine-grained evidence acquisition without overwhelming context windows.
The Pneuma-Seeker System: A full-stack implementation featuring dynamic planning, schema refinement, and provenance tracking (DAG) for trustworthy data discovery.
Empirical Validation: Extensive evaluation showing that reification improves answer accuracy, trust, and the ability to handle complex, multi-table integration tasks compared to state-of-the-art baselines.

5. Evaluation Results

The system was evaluated on KramaBench (a benchmark of 6 real-world datasets across scientific, legal, and public domains) and a real-world procurement dataset from the University of Chicago.

Answer Quality (RQ1): Pneuma-Seeker significantly outperformed baselines (DS-Guru, smolagents).
- On the Biomedical dataset, it achieved 94.44% accuracy, beating DS-Guru by ~28% and smolagents by ~39%.
- Ablation Studies: Removing Context Extraction (micro-management) or $(T, S)$ (reification) caused significant drops in accuracy, indicating that both matter for handling complex data distributions and ensuring correct materialization.
Cost & Scalability (RQ2):
- Pneuma-Seeker occupies the Pareto-optimal frontier for cost vs. quality.
- Memory Efficiency: Unlike baselines that load entire tables into memory (causing high RAM usage), Pneuma-Seeker uses database-backed execution, keeping memory usage low (e.g., ~135 MB at the 1 GB scale vs. 4.4 GB for DS-Guru).
- Runtime: While slightly slower than direct generation due to the structured processing overhead, it remains competitive and scales better with data size.
Trust & Iteration (Qualitative): In user studies, exposing $(T, S)$ allowed users to catch logical misalignments (e.g., incorrect definitions of "friction" or "hazardous") that natural language answers alone would hide, facilitating faster convergence to the correct information need.
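
For readers unfamiliar with the term used under Cost & Scalability, a system is on the Pareto-optimal frontier when no other system is at least as cheap and at least as accurate while being strictly better on one of the two. The sketch below computes that set for a handful of made-up (cost, quality) points. The labels and numbers are purely illustrative and are not results from the paper.

```python
def pareto_frontier(points):
    """Return names of non-dominated points; lower cost and higher quality are better.

    `points` maps a name to a (cost, quality) tuple.
    """
    frontier = []
    for name, (cost, quality) in points.items():
        dominated = any(
            other_cost <= cost
            and other_quality >= quality
            and (other_cost < cost or other_quality > quality)
            for other, (other_cost, other_quality) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

# Made-up systems: C costs more than B and is less accurate, so B dominates it.
EXAMPLE_POINTS = {
    "A": (1.0, 0.60),
    "B": (2.0, 0.90),
    "C": (3.0, 0.85),
    "D": (4.0, 0.95),
}
```

In this toy set, `pareto_frontier(EXAMPLE_POINTS)` keeps A, B, and D: each represents a different trade-off, whereas C is never the right choice.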

6. Significance

Pneuma-Seeker represents a paradigm shift in LLM-mediated data systems:

From "Black Box" to "Glass Box": By forcing the system to commit to a schema and a transformation plan before answering, it makes the reasoning process inspectable and verifiable.
Human-in-the-Loop Convergence: It acknowledges that data needs are rarely static. The system acts as a catalyst, helping users refine vague questions into precise data specifications through a shared, concrete artifact.
Scalable Agentic Data Work: It demonstrates that agentic systems can effectively manage large-scale, heterogeneous tabular data by combining structured database execution with LLM reasoning, moving beyond simple single-table QA to complex, multi-source data preparation.

In summary, Pneuma-Seeker makes the case that relational reification is a practical foundation for building trustworthy, accurate, and scalable AI agents for data discovery and preparation.