Lexara: A User-Centered Toolkit for Evaluating Large Language Models for Conversational Visual Analytics

This paper presents Lexara, a user-centered toolkit for evaluating Large Language Models for Conversational Visual Analytics. Lexara provides real-world test cases, interpretable multi-format metrics, and an interactive interface that lets both developers and end-users assess model performance without programming expertise.

Srishti Palani, Vidya Setlur

Published Mon, 09 Ma

Imagine you have a very smart, chatty robot assistant named Lexi. You can talk to Lexi in plain English, asking it to show you data, like "Show me which products sold best last month." Lexi then tries to draw a chart and explain what it found.

The problem? Sometimes Lexi gets it right, sometimes it gets it mostly right, and sometimes it draws a chart that looks okay but is actually lying about the numbers.

Before this paper, checking whether Lexi was doing a good job was a nightmare for the people who built it. They had to be expert programmers to write tests, they relied on "one-size-fits-all" benchmarks that didn't match real life, and they had no clear way to say, "Well, the chart is right, but the explanation is a bit confusing."

Enter Lexara: The "Car Test Drive" for Data Robots.

The authors (Srishti Palani and Vidya Setlur) built a toolkit called Lexara. Think of it as a specialized test track and a dashboard for evaluating these data robots. Here is how it works, broken down simply:

1. The Real-World Driving Test (The Test Cases)

Old tests were like asking a robot to drive in a perfect, empty parking lot with a single, straight line. Real life is messy.

  • The Old Way: "Drive forward 10 feet."
  • The Lexara Way: "Drive to the grocery store, but remember I asked for milk earlier, and oh, it's raining, so slow down."
  • How they did it: They interviewed 22 experts and watched 16 real people use these tools. They found that real conversations are messy. People ask follow-up questions, use vague words, and expect the robot to remember what they said five minutes ago. Lexara uses these real-life scenarios as its test questions, not fake ones made up by computers.

2. The Multi-Part Report Card (The Metrics)

When a robot gives an answer, it usually gives two things: a Chart (the picture) and a Story (the text explanation).

  • The Old Way: The old tests were like a teacher grading a student's essay but ignoring the math problems, or grading the math but ignoring the handwriting. They used simple "yes/no" or "how many words match" scores.
  • The Lexara Way: Lexara gives a detailed report card with two main sections:
    • The Chart Score: Did it pick the right type of graph? (e.g., A line for trends, a bar for comparisons). Did it get the numbers right? Did it label the axes correctly?
    • The Story Score: Did the robot explain the chart clearly? Did it make up facts? Did it remember the context from the previous question?
    • The "Partial Credit" System: This is the best part. In the real world, a robot might get the right chart but forget to sort the data. Old tests would give it a 0. Lexara gives it a 70%. It understands that "mostly right" is different from "completely wrong."

3. The Interactive Dashboard (The Tool)

Imagine trying to compare 10 different robots by looking at 1,000 spreadsheets. It's impossible.

  • The Old Way: You had to be a coder to run the tests and read the results. If you were a designer or a manager, you were locked out.
  • The Lexara Way: It's a visual dashboard. You can upload your own data, pick which robots you want to test, and hit "Go."
    • It shows you the robot's chart right next to the "perfect" chart.
    • It highlights exactly where they differ (e.g., "This robot used a Pie Chart, but a Bar Chart would be better").
    • It lets you click on a score to see why the robot got that score.
    • No Coding Required: You don't need to know Python or SQL. You just click and drag.

Why Does This Matter?

Think of it like buying a car.

  • Before Lexara: You could only test drive cars on a closed track with a professional racer. You didn't know how the car would handle in the rain, with kids in the back, or on a bumpy road.
  • With Lexara: You can take the car for a spin on the actual streets, with a dashboard that tells you exactly how the brakes, the engine, and the GPS performed in real traffic.

The Result:
The researchers tested Lexara with real developers. They found that Lexara helped them spot exactly why a robot was failing. It helped them choose the right robot for the job and fix the "prompts" (the instructions they give the robot) to make the robot smarter.

In a nutshell: Lexara is a user-friendly, real-world testing kit that helps humans judge if their AI data assistants are actually helpful, or if they are just confidently making things up. It turns a confusing, technical headache into a clear, visual report card.