Multi-View Encoders for Performance Prediction in LLM-Based Agentic Workflows

Imagine you are trying to build the perfect team of robots to solve a complex puzzle, like writing a computer program, solving a math problem, or answering a tricky question.

In the world of Artificial Intelligence, these robots are called LLM-based Agents. They don't work alone; they talk to each other, use tools, and follow specific instructions (prompts) to get the job done.

The problem is: There are too many ways to build these teams.

You could have a team of 3 robots or 10. They could talk in a circle, a line, or a web. They could use different "languages" (prompts) or different tools. Trying to find the best team by building them, testing them, and seeing if they fail is like trying to find a needle in a haystack by building a million haystacks first. It's slow, expensive, and frustrating.

This paper introduces a new tool called Agentic Predictor. Think of it as a "Crystal Ball" or a "Talent Scout" for robot teams.

The Problem: The "Trial-and-Error" Trap

Currently, to see if a robot team will work, you have to actually let them run the task.

Analogy: Imagine you are a chef trying to find the perfect recipe for a new soup. Instead of tasting a spoonful, you have to cook the entire pot of soup, serve it to a panel of judges, and wait for their score. If it tastes bad, you throw it away and start over.
The Cost: Doing this thousands of times costs a fortune in time and money (computing power).

The Solution: The "Crystal Ball" (Agentic Predictor)

The authors built a lightweight AI that can look at a robot team's blueprint and predict if it will succeed before you even run it.

Analogy: Instead of cooking the whole soup, the Crystal Ball looks at the recipe card, the list of ingredients, and the chef's notes. It says, "I'm 90% sure this recipe will be delicious," or "This one is going to be a disaster." You only cook the ones the Crystal Ball says are promising.

How Does the Crystal Ball Work? (The "Multi-View" Magic)

The secret sauce is that the Crystal Ball doesn't just look at one thing. It looks at the robot team from three different angles (views) at the same time:

The Map (Graph View): It looks at the structure. Who talks to whom? Is the team organized like a pyramid or a chaotic circle?
The Manual (Code View): It reads the actual instructions and tools the robots are using. Are they using the right tools for the job?
The Personality (Prompt View): It reads the "personality" and instructions given to the robots. Are they being told to be creative, strict, or helpful?

Analogy: Imagine hiring a new employee.

A bad manager only looks at their Resume (Code).
A better manager looks at their Resume and their Job History (Graph).
The best manager (our Crystal Ball) looks at the Resume, the Job History, AND listens to how they talk in an interview (Prompt). By combining all three, they can predict if the person will be a star employee much better than anyone else.

The "Training" Trick: Learning Without a Teacher

Usually, to teach a Crystal Ball, you need thousands of examples of "Good Team" vs. "Bad Team." But getting those examples is expensive (because you have to run the tests!).

The authors used a clever trick called Cross-Domain Unsupervised Pretraining.

Analogy: Imagine you want to teach a student how to be a great detective. You don't have enough solved crime cases (labeled data) to teach them. So, you first let them read millions of mystery novels and police reports (unlabeled data) from different genres. They learn the patterns of how detectives think, how clues connect, and how stories unfold.
Once they understand the patterns, you only need to show them a few actual crime cases to teach them how to solve your specific mystery.
This allows the Crystal Ball to learn the "language" of robot teams without needing to run expensive tests first.

Why Does This Matter?

Speed: It's nearly instant. The Crystal Ball predicts the result in a fraction of a millisecond, while running the actual robot team takes seconds or minutes.
Money: It saves a huge amount of money. You stop wasting resources on bad ideas.
Quality: Because you can test more ideas quickly, you end up finding better robot teams than before.

The Bottom Line

This paper is about stopping the waste. Instead of blindly guessing which robot team configuration will work and paying a high price to find out, we now have a smart, fast, and cheap "Talent Scout" that can tell us which teams are winners before we even hire them. It turns a slow, expensive guessing game into a fast, efficient science.

1. Problem Statement

The optimization of Large Language Model (LLM)-based agentic systems is currently hindered by the vast search space of agent configurations, prompting strategies, and communication patterns. Existing automated design methods (e.g., G-Designer, AFlow) rely on execution-based evaluation, where candidate workflows are generated and then rigorously tested via repeated LLM API calls. This approach suffers from two critical issues:

High Computational Cost: Exhaustive evaluation is prohibitively expensive and time-consuming due to the latency and financial cost of LLM inference.
Data Scarcity & Heterogeneity: Training effective predictors requires large amounts of labeled data (workflow configuration $\to$ success/failure), which is scarce because generating labels is expensive. Furthermore, agentic workflows are highly heterogeneous, varying in structure (topology), semantics (prompts), and logic (code), making it difficult for single-view models to generalize.

The paper proposes a predictive approach to replace costly execution-based evaluation with a lightweight, learned performance predictor, thereby accelerating the search for optimal agentic workflows.

2. Methodology: Agentic Predictor

The authors introduce Agentic Predictor, a framework designed to estimate workflow performance without executing the workflow. The methodology consists of three core components:

A. Multi-View Workflow Encoding

To address workflow heterogeneity, the framework encodes agentic workflows into a unified latent representation using three complementary views:

Graph View: Models the structural dependencies and communication channels between agents as a Directed Acyclic Graph (DAG). It uses a multi-graph approach: Graph Neural Networks (GNNs) with cross-view self-attention aggregate node features derived from three sub-graphs:
- Prompt Graph: Embeddings of system/instruction prompts.
- Code Graph: Embeddings of function-call code.
- Operator Graph: Embeddings of operator types and definitions.
Code View: Encodes the holistic program-level semantics, control flow, and tool usage patterns of the entire workflow code using an MLP.
Prompt View: Encodes the global context, agent roles, and behavioral specifications of the system prompts using a separate MLP.

These views are aggregated via a learnable fusion layer to produce a unified workflow embedding $Z$.
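
As a rough illustration of the fusion step, the sketch below combines three view embeddings with a softmax-weighted sum. The summary only says that the fusion layer is learnable; the gating form, the names fuse_views and gate_logits, and the toy 3-dimensional vectors are assumptions made for this example, not the authors' implementation.

```python
import math
from typing import Dict, List

Vector = List[float]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scalars."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(views: Dict[str, Vector], gate_logits: Dict[str, float]) -> Vector:
    """Weighted sum of same-sized view embeddings.

    The weights are the softmax of one gate logit per view (an assumed
    fusion form; in a real model the logits would be trained parameters).
    """
    names = sorted(views)
    dim = len(views[names[0]])
    if any(len(views[n]) != dim for n in names):
        raise ValueError("all view embeddings must share one dimension")
    weights = softmax([gate_logits[n] for n in names])
    return [sum(w * views[n][i] for w, n in zip(weights, names)) for i in range(dim)]


# Toy embeddings standing in for the outputs of the three view encoders.
views = {
    "graph": [1.0, 0.0, 0.0],
    "code": [0.0, 1.0, 0.0],
    "prompt": [0.0, 0.0, 1.0],
}
gate_logits = {"graph": 0.0, "code": 0.0, "prompt": 0.0}

Z = fuse_views(views, gate_logits)  # unified workflow embedding
```

With equal gate logits each view contributes one third of the fused vector; training would shift the logits so that more informative views receive more weight.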

B. Cross-Domain Unsupervised Pretraining (Agentic Predictor+)

To mitigate the scarcity of labeled performance data, the authors introduce a pretraining phase:

Objectives: The multi-view encoder is pretrained on a large corpus of unlabeled workflows from diverse domains (Code, Math, Reasoning) with two loss terms:
1. Reconstruction Loss: Reconstructing the input graph, code, and prompt embeddings from the latent representation.
2. Contrastive Loss: Aligning representations of the same workflow across different views (e.g., matching the graph view of a workflow with its code view) while pushing apart different workflows.
Benefit: This allows the model to learn robust, transferable structural and semantic priors without needing performance labels, significantly improving sample efficiency during the supervised fine-tuning stage.
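
The two pretraining terms can be sketched as follows. The summary does not give the exact loss forms, so this example assumes mean squared error for reconstruction and an InfoNCE-style loss with cosine similarity for the cross-view contrastive term. The function names, the temperature value, and the toy embeddings are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between an input embedding and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(view_a: List[Vector], view_b: List[Vector], temperature: float = 0.1) -> float:
    """Row i of view_a and row i of view_b describe the same workflow (the
    positive pair); every other row of view_b acts as a negative."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def pretraining_loss(inputs: List[Vector], reconstructions: List[Vector],
                     view_a: List[Vector], view_b: List[Vector],
                     contrastive_weight: float = 1.0) -> float:
    """Reconstruction term plus a weighted contrastive term (assumed weighting)."""
    recon = sum(mse(x, r) for x, r in zip(inputs, reconstructions)) / len(inputs)
    return recon + contrastive_weight * info_nce(view_a, view_b)


# Two toy workflows: graph-view and code-view embeddings of each.
graph_view = [[1.0, 0.0], [0.0, 1.0]]
code_view_aligned = [[0.9, 0.1], [0.1, 0.9]]
code_view_shuffled = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = info_nce(graph_view, code_view_aligned)
shuffled_loss = info_nce(graph_view, code_view_shuffled)
total = pretraining_loss(graph_view, graph_view, graph_view, code_view_aligned)
```

The contrastive term is small when the two views of the same workflow agree (aligned_loss) and large when they are mismatched (shuffled_loss), which is the pressure that pulls views of one workflow together and pushes different workflows apart. No performance labels appear anywhere in the computation.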

C. Performance Prediction & Search

Task Integration: A Task Encoder maps the task description to a task embedding $T$. The final input to the predictor is the concatenation of the workflow embedding $Z$ and the task embedding $T$.
Prediction Head: A lightweight MLP predicts the performance metric (e.g., binary Pass/Fail or a scalar score).
Search Strategy: The trained predictor acts as a ranker. Instead of executing thousands of candidates, the system samples candidates, scores them using the predictor, and selects the top-$k$ for actual execution. This transforms the search from an expensive execution loop to a label-efficient guided procedure.
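
The predict-then-rank loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: a single logistic layer stands in for the lightweight MLP head, and the candidate names, embeddings, weights, and bias are made-up values rather than anything from the paper.

```python
import math
from typing import Dict, List

Vector = List[float]


def predict_success(z: Vector, t: Vector, weights: Vector, bias: float) -> float:
    """Score one workflow for one task from the concatenation [Z ; T].

    A single logistic layer is used here as a stand-in for the MLP head.
    """
    features = z + t  # list concatenation of workflow and task embeddings
    if len(features) != len(weights):
        raise ValueError("weights must match the concatenated feature size")
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def select_top_k(candidates: Dict[str, Vector], t: Vector,
                 weights: Vector, bias: float, k: int) -> List[str]:
    """Rank candidate workflows by predicted success and keep the best k."""
    ranked = sorted(
        candidates,
        key=lambda name: predict_success(candidates[name], t, weights, bias),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical candidate workflows with 2-dimensional embeddings Z.
candidates = {
    "chain": [0.2, 0.1],
    "debate": [0.9, 0.4],
    "star": [0.5, 0.8],
}
task_embedding = [1.0]      # hypothetical 1-dimensional task embedding T
weights = [2.0, 1.0, 0.5]   # hypothetical trained parameters
bias = -1.0

shortlist = select_top_k(candidates, task_embedding, weights, bias, k=2)
```

Only the workflows in shortlist would then be executed against the real benchmark, so the number of expensive LLM runs is bounded by k rather than by the number of sampled candidates.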

3. Key Contributions

Multi-View Encoding Framework: Proposes a novel architecture that jointly models structural (graph), logical (code), and semantic (prompt) aspects of agentic workflows, outperforming single-view graph baselines.
Cross-Domain Unsupervised Pretraining: Introduces a strategy to leverage abundant unlabeled workflows across domains to pretrain encoders, mitigating the data scarcity problem for agentic workflow prediction.
Agentic Predictor System: Unifies these components into a search-agnostic framework that significantly reduces the cost of workflow optimization while maintaining high accuracy.
Comprehensive Benchmarking: Provides extensive evaluation across three domains (Code Generation, Math, Reasoning) using FLORA-Bench, demonstrating state-of-the-art performance.

4. Experimental Results

The framework was evaluated on FLORA-Bench (spanning HumanEval, MBPP, GSM8K, MATH, MMLU, etc.) against strong baselines including MLPs, GCNs, GATs, Graph Transformers, and few-shot LLM predictors.

Predictive Accuracy: Agentic Predictor achieved an average accuracy of 79.97%, outperforming the best baseline (Graph Transformer) by 2.05%. In specific domains, improvements reached up to 6.90%.
Workflow Utility: Measured by the ability to rank the top-$k$ workflows correctly, the model achieved an average utility of 76.33%, a 3.79% to 5.87% improvement over baselines.
Label Efficiency (Pretraining): In low-label regimes (0.1 ratio), the pretrained variant (Agentic Predictor+) maintained accuracy above 73%, whereas baselines dropped near 70%. This confirms the efficacy of unsupervised pretraining.
Out-of-Distribution (OOD) Generalization: The model demonstrated strong generalization when trained on one framework (e.g., AFlow) and tested on another (e.g., G-Designer), and across different task domains.
Comparison with LLM Predictors: Few-shot LLM classifiers (GPT-4.1, Claude) achieved significantly lower accuracy (about 62.86%) at much higher latency and cost. Agentic Predictor was orders of magnitude faster and cheaper.
Resource Efficiency: Inference time is 0.054 ms per sample with 0.49 GB of memory, compared to ~2 seconds and high monetary cost for LLM-based predictors. The training cost is amortized after only ~110 evaluations.
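
One plausible reading of the amortization claim is a break-even count: the number of evaluations after which the per-evaluation saving from predicting instead of executing has repaid the one-off training cost. The sketch below shows that arithmetic. The function name and all three cost values are placeholders in arbitrary units, not figures from the paper, and the paper may define amortization differently.

```python
import math


def break_even_evaluations(training_cost: float,
                           cost_per_execution: float,
                           cost_per_prediction: float) -> int:
    """Smallest N such that N * (execution cost - prediction cost) >= training cost."""
    saving = cost_per_execution - cost_per_prediction
    if saving <= 0:
        raise ValueError("prediction must be cheaper than execution to break even")
    return math.ceil(training_cost / saving)


# Placeholder costs in arbitrary units (NOT taken from the paper).
n = break_even_evaluations(training_cost=600, cost_per_execution=5, cost_per_prediction=1)
```

With these placeholder values each prediction saves 4 units, so the 600-unit training cost is recovered after 150 evaluations. Substituting measured costs for a real deployment would give the corresponding break-even point.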

5. Significance and Impact

Paradigm Shift: Moves the field from "execute-to-evaluate" to "predict-to-rank," making the automated design of complex agentic systems economically feasible.
Scalability: Enables the exploration of massive design spaces that were previously too costly to search, facilitating the discovery of more robust and efficient agentic workflows.
Generalization: The multi-view and pretraining approach offers a blueprint for handling heterogeneous, data-scarce problems in AI system design beyond just agentic workflows.
Practicality: The framework is lightweight, search-agnostic (can be paired with any search algorithm), and significantly reduces the financial barrier to entry for developing advanced agentic systems.

In conclusion, Agentic Predictor establishes a new standard for performance prediction in LLM-based agentic systems, showing that rich, multi-view representations combined with unsupervised pretraining can replace most of the expensive trial-and-error evaluation.