Lexara: A User-Centered Toolkit for Evaluating Large Language Models for Conversational Visual Analytics

This paper presents Lexara, a user-centered toolkit for evaluating Large Language Models for Conversational Visual Analytics. Lexara provides real-world test cases, interpretable multi-format metrics, and an interactive interface that lets both developers and end-users assess model performance without programming expertise.

Srishti Palani, Vidya Setlur

Published Mon, 09 Ma

Imagine you have a very smart, chatty robot assistant named Lexi. You can talk to Lexi in plain English, asking it to show you data, like "Show me which products sold best last month." Lexi then tries to draw a chart and explain what it found.

The problem? Sometimes Lexi gets it right, sometimes it gets it mostly right, and sometimes it draws a chart that looks okay but is actually lying about the numbers.

Before this paper, checking whether Lexi was doing a good job was a nightmare for the people who built it. They had to be expert programmers to write tests, they relied on "one-size-fits-all" benchmarks that didn't match real life, and they had no clear way to say, "Well, the chart is right, but the explanation is a bit confusing."

Enter Lexara: The "Car Test Drive" for Data Robots.

The authors (Srishti Palani and Vidya Setlur) built a toolkit called Lexara. Think of it as a specialized test track and a dashboard for evaluating these data robots. Here is how it works, broken down simply:

1. The Real-World Driving Test (The Test Cases)

Old tests were like asking a robot to drive in a perfect, empty parking lot with a single, straight line. Real life is messy.

  • The Old Way: "Drive forward 10 feet."
  • The Lexara Way: "Drive to the grocery store, but remember I asked for milk earlier, and oh, it's raining, so slow down."
  • How they did it: They interviewed 22 experts and watched 16 real people use these tools. They found that real conversations are messy. People ask follow-up questions, use vague words, and expect the robot to remember what they said five minutes ago. Lexara uses these real-life scenarios as its test questions, not fake ones made up by computers.

2. The Multi-Part Report Card (The Metrics)

When a robot gives an answer, it usually gives two things: a Chart (the picture) and a Story (the text explanation).

  • The Old Way: The old tests were like a teacher grading a student's essay but ignoring the math problems, or grading the math but ignoring the handwriting. They used simple "yes/no" or "how many words match" scores.
  • The Lexara Way: Lexara gives a detailed report card with two main sections:
    • The Chart Score: Did it pick the right type of graph? (e.g., A line for trends, a bar for comparisons). Did it get the numbers right? Did it label the axes correctly?
    • The Story Score: Did the robot explain the chart clearly? Did it make up facts? Did it remember the context from the previous question?
    • The "Partial Credit" System: This is the best part. In the real world, a robot might get the right chart but forget to sort the data. Old tests would give it a 0. Lexara gives it a 70%. It understands that "mostly right" is different from "completely wrong."

3. The Interactive Dashboard (The Tool)

Imagine trying to compare 10 different robots by looking at 1,000 spreadsheets. It's impossible.

  • The Old Way: You had to be a coder to run the tests and read the results. If you were a designer or a manager, you were locked out.
  • The Lexara Way: It's a visual dashboard. You can upload your own data, pick which robots you want to test, and hit "Go."
    • It shows you the robot's chart right next to the "perfect" chart.
    • It highlights exactly where they differ (e.g., "This robot used a Pie Chart, but a Bar Chart would be better").
    • It lets you click on a score to see why the robot got that score.
    • No Coding Required: You don't need to know Python or SQL. You just click and drag.

Why Does This Matter?

Think of it like buying a car.

  • Before Lexara: You could only test drive cars on a closed track with a professional racer. You didn't know how the car would handle in the rain, with kids in the back, or on a bumpy road.
  • With Lexara: You can take the car for a spin on the actual streets, with a dashboard that tells you exactly how the brakes, the engine, and the GPS performed in real traffic.

The Result:
The researchers tested Lexara with real developers. They found that Lexara helped them spot exactly why a robot was failing. It helped them choose the right robot for the job and fix the "prompts" (the instructions they give the robot) to make the robot smarter.

In a nutshell: Lexara is a user-friendly, real-world testing kit that helps humans judge if their AI data assistants are actually helpful, or if they are just confidently making things up. It turns a confusing, technical headache into a clear, visual report card.