ConCISE: A Reference-Free Conciseness Evaluation Metric for LLM-Generated Answers

Imagine you're ordering a sandwich at a deli. You ask for a simple ham and cheese. Instead of just handing you the sandwich, the chef gives you a 20-page essay on the history of wheat, the migration patterns of pigs, and a philosophical debate about the ethics of cheese, then finally hands you the sandwich.

It's technically correct, but it's exhausting, confusing, and you're paying for every single word the chef wrote.

This is exactly what happens with Large Language Models (LLMs) today. They are brilliant at answering questions, but they often suffer from "wordiness." They ramble, repeat themselves, and fill their answers with fluff. This is bad for users (who want quick answers) and bad for companies (who pay by the word/token).

The paper you shared introduces a new tool called ConCISE (Conciseness Evaluation Metric) to solve this problem. Here is how it works, explained simply.

The Problem: How do we measure "Too Much"?

Usually, to grade an essay, you need a "perfect" answer (a gold standard) to compare it against. But in the real world, we don't always have a perfect answer. We just have the AI's answer.

So, how do you tell if an AI is being too chatty without a teacher's key? ConCISE is a "reference-free" judge. It doesn't need a perfect answer to know if the current one is too long. It acts like a smart editor that knows exactly what to cut.

The Solution: The "Three-Cut" Method

ConCISE doesn't just guess; it runs the AI's answer through three different "filters" to see how much fluff can be removed while keeping the meaning intact. Think of it like a sculptor trying to find the statue inside the stone.

The "Rewrite" (Abstractive Summary):
Imagine asking a different, very smart AI to rewrite your long answer in its own words, but making it much shorter. If the original answer was 500 words and the rewrite is 100 words, that's a big clue: the original had a lot of extra stuff.
The "Highlighter" (Extractive Summary):
Imagine asking an AI to just highlight the most important sentences in the original text and ignore the rest. If the original was a 10-page document and the "highlighted" version is only 2 pages, the original was bloated.
The "Scissors" (Word Removal):
This is the most direct test. The AI is asked to take a pair of scissors and cut out every single word that isn't absolutely necessary to keep the meaning. If you can cut out 80% of the words and the sentence still makes sense, the original was very verbose.

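The Score: ConCISE measures how much of the original survives each of these three tests, averages the results, and gives you a score. The higher the score, the less "fluff" there was to remove, meaning the original answer was already concise. A low score means most of the answer could be cut, so it was too wordy.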

The Experiment: Did it Work?

The researchers tested this on a dataset of questions and answers (based on Wikipedia). They created two versions of answers:

The Good One: Short and sweet.
The Bad One: The same facts, but rewritten to be incredibly long, repetitive, and boring (like the chef's 20-page essay).

They then asked humans to rate how concise each answer was. Afterward, they let ConCISE score the same answers.

The Results:

ConCISE tracked human judgment closely. When humans said, "This answer is too long," ConCISE usually agreed.
Old Methods: Simply asking an LLM for a conciseness score out of 10, or to pick the more concise of two answers, did poorly. These direct approaches often disagreed with the human raters, sometimes favoring the long, rambling answers, perhaps because they sounded more "confident" or detailed. ConCISE correctly identified them as wasteful.

Why This Matters

Think of ConCISE as a bouncer at a club.

Old AI graders were like bouncers who let anyone in as long as they had a ticket, even if they were screaming and dancing on tables.
ConCISE is the bouncer who checks the list, sees who is actually needed, and kicks out the people who are just taking up space and wasting the DJ's time.

The Bottom Line

This paper gives us a practical way to automatically check if an AI is being too chatty without needing a human to read every single answer. It helps developers build AI that is efficient, clear, and respectful of the user's time, and it saves money on computing costs by not generating unnecessary words.

In short: ConCISE helps AI learn to say more with less.

1. Problem Statement

Large Language Models (LLMs) frequently generate responses that are overly verbose, containing redundant or unnecessary details. This verbosity leads to:

Reduced User Satisfaction: Long-winded answers can overwhelm users and obscure clarity.
Increased Costs: Proprietary models often charge based on output token count; unnecessary verbosity directly increases operational costs.
Evaluation Gaps: Traditional metrics (e.g., BLEU, ROUGE) rely on gold-standard references and focus on lexical overlap, failing to capture verbosity. Existing reference-free metrics often lack specific mechanisms to quantify "non-essential" content without human annotations.

The core challenge is to develop an automated, reference-free metric that can accurately quantify the conciseness of an LLM's output by detecting and penalizing redundancy without requiring ground-truth human answers.

2. Methodology: The ConCISE Metric

The authors propose ConCISE, a novel metric that leverages the generative capabilities of LLMs to simulate human judgments of brevity. The metric operates by calculating the average of three distinct compression ratios derived from a single LLM call.

Core Mechanism

Given a question $q$ and an LLM-generated answer $a(q)$ , the system prompts an LLM to generate three derivative versions of the answer:

Abstractive Summary ( $S_a$ ): A paraphrased summary capturing main ideas using new phrasing.
Extractive Summary ( $S_e$ ): Selection of the most relevant sentences directly from the original text.
Pruned Text ( $W_r$ ): A minimalist version where all non-essential words are removed while preserving the core meaning and named entities.

Validation Step:
Before calculation, the LLM is asked to verify that these three derivative texts maintain semantic equivalence and named entity preservation (dates, locations, etc.) relative to the original answer. If the meaning is altered, the compression is invalid.
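
The single-call mechanism and its validation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the JSON reply format, the key names, and the `judge` callable are all assumptions made for the example, and the paper's actual prompt is not reproduced here.

```python
import json
from typing import Callable, Dict

# Hypothetical prompt wording (an assumption, not the paper's prompt).
PROMPT_TEMPLATE = """Question: {question}
Answer: {answer}

Return a JSON object with exactly these keys:
  "abstractive": a paraphrased summary of the answer,
  "extractive": the most relevant sentences copied from the answer,
  "pruned": the answer with every non-essential word removed,
  "meaning_preserved": true only if all three texts keep the answer's
                       meaning and its named entities (dates, places, ...).
"""

REQUIRED_KEYS = ("abstractive", "extractive", "pruned", "meaning_preserved")


def compress_answer(question: str, answer: str,
                    judge: Callable[[str], str]) -> Dict[str, object]:
    """Ask the judge for the three derivative texts in a single call.

    `judge` is any function that maps a prompt string to the model's raw
    text reply; wiring it to a real LLM client is left to the caller.
    """
    reply = judge(PROMPT_TEMPLATE.format(question=question, answer=answer))
    result = json.loads(reply)
    missing = [k for k in REQUIRED_KEYS if k not in result]
    if missing:
        raise ValueError(f"judge reply is missing keys: {missing}")
    if result["meaning_preserved"] is not True:
        # The validation step failed, so the compression is treated as invalid.
        raise ValueError("judge reported that meaning was not preserved")
    return result
```

Passing the judge in as a plain function keeps the sketch independent of any particular LLM client and makes it easy to swap judge models, as the experiments do.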

Calculation Formula

The final ConCISE score is the average of three terms, each equal to one minus the compression ratio of a derivative text:

$\text{ConCISE} = \frac{1}{3} \left[ \left(1 - \frac{|A| - |S_a|}{|A|}\right) + \left(1 - \frac{|A| - |S_e|}{|A|}\right) + \left(1 - \frac{|A| - |W_r|}{|A|}\right) \right]$

Where:

$|A|$ is the word length of the original answer.
$|S_a|$ , $|S_e|$ , and $|W_r|$ represent the word lengths of the abstractive summary, extractive summary, and pruned text, respectively.
Note: If a derivative text is longer than the original, its compression ratio $\frac{|A| - |S|}{|A|}$ is negative and is treated as zero, so the corresponding term equals 1.
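
Each bracketed term simplifies whenever the derivative text is no longer than the original, which makes the direction of the score easier to read. Here $|S|$ stands for any of the three derivative lengths:

$1 - \frac{|A| - |S|}{|A|} = \frac{|S|}{|A|} \quad\Longrightarrow\quad \text{ConCISE} = \frac{1}{3} \left[ \frac{|S_a|}{|A|} + \frac{|S_e|}{|A|} + \frac{|W_r|}{|A|} \right]$

As a worked example with made-up word counts (not taken from the paper):

$|A| = 100,\ |S_a| = 40,\ |S_e| = 50,\ |W_r| = 60 \quad\Longrightarrow\quad \text{ConCISE} = \frac{0.40 + 0.50 + 0.60}{3} = 0.50$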

Interpretation: A higher ConCISE score indicates that the LLM could remove little content while preserving meaning, signifying that the original answer was already concise. A lower score means much of the answer was non-essential.
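
The formula above translates directly into a few lines of Python. This is a sketch under stated assumptions: word length is taken to be a whitespace-separated word count (the paper's exact tokenisation may differ), and the function names are made up for illustration.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words; the paper's tokenisation may differ."""
    return len(text.split())


def retained_fraction(original: str, derivative: str) -> float:
    """One bracketed term of the formula: 1 - (|A| - |S|) / |A|."""
    a = word_count(original)
    if a == 0:
        raise ValueError("original answer is empty")
    compression = (a - word_count(derivative)) / a
    # Negative compression (derivative longer than the original) counts as zero.
    compression = max(compression, 0.0)
    return 1.0 - compression


def concise_score(original: str, abstractive: str,
                  extractive: str, pruned: str) -> float:
    """Average the three terms, as in the formula above."""
    terms = [retained_fraction(original, d)
             for d in (abstractive, extractive, pruned)]
    return sum(terms) / len(terms)
```

An answer that cannot be shortened at all scores 1.0, and the score falls toward 0 as more of the answer turns out to be removable.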

3. Experimental Design

Dataset: The WikiEval dataset, containing human-annotated question-context-answer triples from Wikipedia.
Data Augmentation: To create a testbed for verbosity, the authors used GPT-4o to rewrite original answers into "verbose" versions by adding redundancy and filler while keeping facts intact.
Human Ground Truth: Three human annotators provided:
1. Likert Scale Ratings: 1 (very verbose) to 5 (very concise).
2. Pairwise Comparisons: Ranking which of two answers was more concise.
Baselines:
1. GPT Score: An LLM assigning a 0–10 numeric score for conciseness.
2. GPT Ranking: An LLM selecting the more concise answer in a pair.
Judges: Multiple LLMs (GPT-4o, Claude-4, Gemini-2.0, Mistral-Large-2) were used as evaluators to ensure robustness against model-specific bias.

4. Key Results

The study evaluated ConCISE against human judgments using Spearman's rank correlation ( $r_s$ ) and Kendall's Tau ( $\tau$ ), as well as pairwise accuracy.
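
For readers unfamiliar with the two rank correlations, here is a small standard-library sketch of how they can be computed between metric scores and human Likert ratings. It assumes average ranks for ties in Spearman's $r_s$ and the tie-corrected tau-b variant of Kendall's $\tau$; the source does not say which tie handling the authors used, and the function names are illustrative.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's r_s: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b, which corrects for ties in either variable."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every count
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + ties_x) * (pairs + ties_y))
```

Calling `spearman(metric_scores, likert_ratings)` and `kendall_tau_b(metric_scores, likert_ratings)` on two equal-length lists gives values in [-1, 1]; a positive value means the metric tends to rank answers the same way the annotators do.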

Correlation with Human Ratings:
- ConCISE (GPT-4o Judge): Achieved the strongest correlation with human Likert ratings ( $r_s = 0.628$ , $\tau = 0.523$ ), both statistically significant ( $p < 0.001$ ).
- Other LLM Judges: ConCISE remained effective across different judge models (Claude-4, Gemini-2.0, Mistral), with $r_s$ ranging from 0.47 to 0.54.
- Baseline Failure: The "GPT Score" baseline showed a weak negative correlation ( $r_s = -0.108$ ), suggesting direct numeric scoring prompts are unreliable for conciseness.
Pairwise Accuracy:
- ConCISE: Achieved 94% accuracy in aligning with human preferences when choosing the more concise answer between two options.
- GPT Ranking Baseline: Achieved only 39% accuracy, which is below the 50% expected from random guessing between two answers.
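
Pairwise accuracy can be computed as in the sketch below. It assumes that the metric "chooses" whichever answer in a pair has the higher score and that a tie counts as a miss; the source does not describe how ties were handled, and the tuple layout is made up for the example.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(pairs: Sequence[Tuple[float, float, int]]) -> float:
    """Fraction of pairs where the metric agrees with the human choice.

    Each item is (score_a, score_b, human_choice), where human_choice is
    0 if annotators judged answer A more concise and 1 if they chose B.
    Assumes a higher score means a more concise answer.
    """
    if not pairs:
        raise ValueError("no pairs to evaluate")
    correct = 0
    for score_a, score_b, human_choice in pairs:
        if score_a == score_b:
            continue  # a tie picks neither answer, so it counts as a miss
        metric_choice = 0 if score_a > score_b else 1
        correct += metric_choice == human_choice
    return correct / len(pairs)
```

With two answers per pair, a method that guesses at random would land near 0.5, which is the reference point for reading the accuracies above.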

5. Key Contributions

Novel Reference-Free Metric: Introduction of ConCISE, the first mechanism to assess LLM output length and redundancy without requiring gold-standard reference answers.
Three-Pronged Compression Approach: A unique methodology combining abstractive summarization, extractive summarization, and word-removal pruning to quantify non-essential content.
Empirical Validation: Demonstration that ConCISE significantly outperforms direct LLM scoring and ranking baselines, showing high alignment with human judgment in both rating and ranking tasks.

6. Significance and Limitations

Significance:

Cost & Efficiency: Provides a practical tool for developers to automatically detect and penalize verbose outputs, potentially reducing token costs and improving user experience in conversational AI.
Scalability: Eliminates the need for expensive and time-consuming human annotation of reference texts, making it suitable for large-scale evaluation.
Reliability: Suggests that LLMs can be used effectively as "judges" for brevity if the evaluation is structured around compression rather than direct scoring.

Limitations & Future Work:

Context Dependency: The definition of "non-essential" varies by domain (e.g., regulatory details in finance are verbose but necessary). The current metric may struggle with domain-specific nuances.
Prompt Bias: Using a single prompt to generate all three compression types might introduce cross-technique bias. Future work suggests using separate prompts for each technique.
Generalizability: While effective on WikiEval, broader validation across diverse datasets and industries is required to establish universal robustness.