Imagine you are trying to pack a massive library of books into a single suitcase for a trip.
Some people are terrible packers: they stack the books haphazardly, leave huge gaps, and end up needing three suitcases. Others are master packers: they arrange everything tightly, squeeze out every bit of wasted space, and fit the whole library into one bag.
In the world of Artificial Intelligence (AI), specifically Large Language Models (LLMs) like the ones powering chatbots, there is a similar problem. These models are getting incredibly smart, but they are also becoming giant, heavy, and expensive to run. They eat up massive amounts of electricity and computer power just to write a single sentence.
The authors of this paper, from China Telecom's AI institute, asked a simple question: "How can we measure how 'efficient' an AI is, not just how 'smart' it is?"
They realized that being smart and being efficient are two different things. A model might be a genius at writing code but terrible at saving space (and energy). To solve this, they invented a new score called Information Capacity.
Here is how it works, broken down with some everyday analogies:
1. The Core Idea: Compression is Intelligence
Think of an AI model as a super-forecaster. If you show it the first half of a story, a smart AI can guess the next word with high accuracy.
- The Compression Trick: In the world of data, if you can predict the next word perfectly, you don't need to write it down. You can just say, "I know what comes next!" and save space.
- The Analogy: Imagine a game of "20 Questions." If your friend knows you so well that they can guess your answer before you speak, you don't need to say the word out loud. You save "bits" of information.
- The Paper's Insight: The better an AI is at predicting the next word (compression), the "smarter" it is. But, if it takes a massive amount of energy to make that prediction, it's an inefficient genius.
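The "saving bits" idea above is just Shannon's coding rule: a word predicted with probability p costs about -log2(p) bits to write down. A toy sketch (my own illustration, not the paper's code) makes the gap between a confident model and a clueless one concrete:

```python
import math

def bits_to_encode(prob_of_next_word: float) -> float:
    """Shannon's rule: a word predicted with probability p
    needs about -log2(p) bits to write down."""
    return -math.log2(prob_of_next_word)

# A confident model (p = 0.5) spends 1 bit on the word;
# a clueless one guessing uniformly over a 50,000-word
# vocabulary spends about 15.6 bits on the same word.
print(bits_to_encode(0.5))        # 1.0
print(bits_to_encode(1 / 50000))  # ~15.6
```

Summed over a whole text, those per-word bit counts are exactly the "compressed size" the paper uses to score how smart a model is.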
2. The New Score: "Information Capacity"
The authors created a formula to measure efficiency: roughly, how much information a model compresses relative to the compute (and therefore energy) it burns doing so.
- High Score: The model saved a lot of space (was very smart) but didn't use much energy. This is the gold standard.
- Low Score: The model saved a little space but used a ton of energy. This is wasteful.
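The two bullets above boil down to a ratio. The paper's exact formula isn't reproduced here, but a minimal sketch of the idea — space saved divided by compute spent — captures why the "frugal" model can outscore the "genius":

```python
def information_capacity(original_bits: float,
                         compressed_bits: float,
                         compute_cost: float) -> float:
    """Toy score (illustrative, not the paper's exact formula):
    how many bits of space the model saved per unit of compute.
    Higher = more intelligence per drop of energy."""
    bits_saved = original_bits - compressed_bits
    return bits_saved / compute_cost

# "Genius but wasteful": saved 900 bits, but burned 500 units of compute.
print(information_capacity(1000, 100, 500))  # 1.8
# "Modest but frugal": saved only 400 bits, but burned just 100 units.
print(information_capacity(1000, 600, 100))  # 4.0
```

Notice the second model compresses worse in absolute terms yet wins on the ratio — exactly the kind of trade-off a raw benchmark score would hide.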
3. The Hidden Villain: The "Tokenizer"
Here is a secret most people don't know: Before an AI reads a sentence, it breaks it down into tiny chunks called tokens.
- The Analogy: Imagine you are translating a book.
- Model A breaks the book into individual letters: "T-h-e- -q-u-i-c-k..." (This is inefficient; you have to process thousands of tiny pieces).
- Model B breaks the book into whole words or phrases: "The", "quick", "brown", "fox". (This is efficient; fewer pieces to process).
- The Paper's Discovery: The authors found that the way a model breaks up words (the tokenizer) matters more than people thought. A model with a "lazy" tokenizer that splits words into tiny pieces has to do way more work, even if the AI brain itself is smart. This is like trying to carry a suitcase full of loose sand instead of a suitcase full of bricks.
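The Model A vs. Model B contrast is easy to see in code. A tiny sketch (using a plain character split and whitespace split as stand-ins for real tokenizers) shows how much extra work a fine-grained tokenizer creates:

```python
sentence = "The quick brown fox jumps over the lazy dog"

# Model A: a character-level "tokenizer" — many tiny pieces.
char_tokens = list(sentence)

# Model B: a word-level "tokenizer" — few big pieces.
word_tokens = sentence.split()

print(len(char_tokens))  # 43 pieces to process
print(len(word_tokens))  # 9 pieces to process
```

Every token costs a full pass through the model, so Model A does roughly five times the work here to read the exact same sentence — which is why the paper measures cost in the text's own bytes, not in tokens.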
4. What They Found (The Plot Twist)
The team tested 56 different AI models. Here are their big discoveries:
- Size Doesn't Always Matter: Usually, bigger models are smarter. But the authors found that within a family of models (like the "Qwen" family or "Llama" family), the "Information Capacity" stays roughly the same whether the model is small or huge. It's like a family of runners: the parent covers ground faster than the child, but both burn roughly the same energy per mile.
- The Language Bias: Some models are great at English but terrible at Chinese or code. It's like a chef who is a master at making Italian pasta but burns the rice every time they try to make sushi. The "Information Capacity" score revealed these biases clearly.
- The "MoE" Secret: Some models use a "Mixture of Experts" (MoE) architecture. Imagine a team of 100 specialists, but for every question, only 5 of them wake up to answer. This makes the model super efficient. The paper showed these models often have the highest "Information Capacity" because they get the job done with less energy.
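The "only 5 of 100 specialists wake up" trick can be sketched in a few lines. This toy uses random scores where a real MoE uses a learned gating network, but the cost accounting is the point:

```python
import random

# Toy "Mixture of Experts": 100 specialists on the team,
# but only the top 5 scorers wake up for any given question.
NUM_EXPERTS = 100
ACTIVE_PER_QUESTION = 5

def experts_used(question: str) -> int:
    # A gate scores every expert for this question (random here,
    # learned in a real model); only the top few do real work.
    scores = [(random.random(), expert_id) for expert_id in range(NUM_EXPERTS)]
    chosen = sorted(scores, reverse=True)[:ACTIVE_PER_QUESTION]
    return len(chosen)

print(experts_used("What is 2 + 2?"))  # 5 experts worked; 95 stayed asleep
```

So the model "owns" 100 experts' worth of knowledge but only pays 5 experts' worth of compute per question — which is why MoE models score so well on a per-energy metric like Information Capacity.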
5. Why Should You Care?
Currently, companies are building AI models that are getting bigger and bigger, burning more electricity and costing more money.
- The Old Way: "Is this model smarter than the last one?" (Yes/No).
- The New Way (Information Capacity): "Is this model getting smarter without getting too heavy and expensive?"
This new metric helps developers build AI that is leaner and greener. It tells them: "Hey, stop just making the model bigger; fix your tokenizer or change your architecture to be more efficient."
Summary
Think of Information Capacity as the Miles Per Gallon (MPG) rating for AI cars.
- Some cars (models) are fast (smart) but drink a gallon of gas every mile (high cost).
- Some cars are slow but get 50 MPG (efficient).
- This paper gives us a new dashboard gauge to measure exactly how many miles of "intelligence" we get for every drop of "computational fuel."
It's a wake-up call for the AI industry to stop just building bigger engines and start building better, more efficient vehicles.