Here is an explanation of the COMI paper, translated into simple, everyday language using analogies.
The Big Problem: The "Too Much Information" Traffic Jam
Imagine you are a Detective (the AI) trying to solve a mystery (answer a question). You have a massive evidence board with 10,000 sticky notes (the long text context).
- The Issue: Most of those sticky notes are useless. Some are just repeats of the same fact written in different ways. Some are completely irrelevant.
- The Bottleneck: Your brain (the computer) can only look at a few notes at a time. If you try to read all 10,000 notes, you get overwhelmed, slow down, or miss the crucial clue because it's buried under the noise.
- The Current Solution: Previous methods tried to shrink the evidence board by keeping only the notes that seemed most related to the mystery. But they made a mistake: they judged each note only by how related it was, so they ended up picking 50 notes that all said the exact same thing. That's 50 copies of one fact, which is a waste of space.
The New Solution: COMI (The Smart Editor)
The authors propose COMI, a new way to shrink that evidence board. They call it "Coarse-to-Fine Context Compression via Marginal Information Gain."
That's a mouthful, so let's break it down with a Library Analogy.
1. The Core Concept: "Marginal Information Gain" (MIG)
Imagine you are curating a "Best Of" playlist for a friend.
- Relevance: You pick songs the friend likes.
- Redundancy: But if you pick 50 different versions of the same song, that's annoying. It adds no new value.
MIG is a score that asks two questions for every piece of information:
- "How much does this help answer the question?" (Relevance)
- "How much is this just repeating what I already have?" (Redundancy)
The Rule: If a piece of info is highly relevant and totally unique, give it a high score. If it's relevant but just a copy of something you already picked, give it a low score.
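In spirit, a score like this is "relevance minus a redundancy penalty." Here is a minimal sketch of that idea using cosine similarity between embedding vectors; the exact formula, the penalty weight `lam`, and the use of a max over selected items are illustrative assumptions, not the paper's precise definition.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def marginal_info_gain(candidate, query, selected, lam=0.5):
    """Illustrative MIG: how relevant the candidate is to the query,
    minus a penalty for how much it duplicates what we already kept.
    `lam` (hypothetical) trades off relevance against redundancy."""
    relevance = cosine(candidate, query)
    redundancy = max((cosine(candidate, s) for s in selected), default=0.0)
    return relevance - lam * redundancy
```

With this scoring, a note identical to one already on the board scores much lower than a fresh note that is equally relevant, which is exactly the playlist intuition above.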
2. The Two-Step Process
COMI doesn't just delete things randomly; it uses a Coarse-to-Fine strategy (Big picture first, then details).
Step A: The Coarse-Grained "Group Reallocation" (The Big Picture)
Imagine the 10,000 sticky notes are divided into 100 piles (groups).
- Old Way: You give every pile the same amount of space on the final board (e.g., 10 notes per pile).
- COMI Way: You look at each pile.
- Pile #1 has the smoking gun evidence. It gets more space (maybe 30 notes).
- Pile #50 is just about the weather. It gets less space (maybe 2 notes).
- Pile #20 has 10 notes that all say the same thing. It gets less space because the info is redundant.
- Result: You allocate your limited "board space" to the piles that actually matter.
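Mechanically, this step amounts to splitting a fixed compression budget across groups in proportion to how informative each group is, instead of giving every group an equal share. The sketch below shows one simple proportional scheme; the scoring inputs and the minimum-slots guarantee are assumptions for illustration, not the paper's exact allocation rule.

```python
def allocate_budget(group_scores, total_slots, min_slots=1):
    """Split `total_slots` across groups proportionally to their
    information scores, guaranteeing each group at least `min_slots`.
    Purely illustrative; COMI's actual reallocation may differ."""
    spare = total_slots - min_slots * len(group_scores)
    total = sum(group_scores)
    raw = [min_slots + s / total * spare for s in group_scores]
    slots = [int(r) for r in raw]
    # Hand leftover slots to the groups with the largest remainders.
    leftover = total_slots - sum(slots)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - slots[i], reverse=True)
    for i in order[:leftover]:
        slots[i] += 1
    return slots
```

A "smoking gun" pile with a score of 9.0 next to a "weather" pile with a score of 1.0 would end up with most of a 10-slot budget, matching the example above.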
Step B: The Fine-Grained "Token Merging" (The Details)
Now, inside the important piles, you have to shrink the notes themselves.
- Old Way: You just average the notes together, treating every note equally. Five near-identical copies of "The butler did it" drown out the one note that carries a unique detail.
- COMI Way: You look at the notes inside the pile.
- Note A says "The butler did it with a candlestick."
- Note B says "The butler did it."
- Note C says "The butler did it with a candlestick" (again).
- COMI realizes Note A is the most unique and informative. It merges the group into a single, super-dense note that keeps the "candlestick" detail but drops the repetitive fluff.
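One way to picture this merge is a weighted pooling over the group's token vectors, where a vector that is dissimilar to the others (unique) gets more weight than near-duplicates, instead of a plain mean. The weighting scheme below is an illustrative stand-in, not the paper's exact merge rule.

```python
import numpy as np

def merge_tokens(vectors):
    """Merge a group of token vectors into one dense vector.
    Instead of a plain mean, weight each vector by how dissimilar it
    is to the rest of the group, so a unique, detail-carrying vector
    (Note A) dominates its near-duplicates (Notes B and C).
    Illustrative only; COMI's actual merging may differ."""
    X = np.stack(vectors)
    n = len(vectors)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit.T                          # pairwise cosine similarity
    avg_sim = (sims.sum(axis=1) - 1.0) / (n - 1)  # mean similarity to the others
    weights = np.maximum(1.0 - avg_sim, 1e-6)     # unique vectors weigh more
    weights /= weights.sum()
    return weights @ X
```

For a group with two identical vectors and one distinct vector, the distinct one contributes twice the weight of either duplicate, so its "candlestick" detail survives the merge.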
Why is this better?
Think of it like packing for a trip.
- Old Methods: You pack 10 identical t-shirts because they are all "good for summer." You run out of suitcase space and can't fit your shoes.
- COMI: You pack 1 t-shirt (because 10 is redundant), 1 pair of shoes, and a swimsuit. You fit everything you actually need into a tiny bag.
The Results
The paper tested this on long-document tasks: question answering over huge contexts (like "Who killed the victim in this 500-page novel?") and summarization.
- The Win: Even when they forced the AI to shrink the text by 32 times (keeping only 1/32nd of the original content), COMI still outperformed the other compression methods.
- The Score: On a test called "NaturalQuestions," COMI improved the accuracy by 25 points compared to the next best method. That's a massive jump.
Summary in One Sentence
COMI is a smart AI editor that doesn't just cut out the boring parts of a long story; it also deletes the parts that are just repeats of the good parts, ensuring the final summary is short, unique, and packed with the most important clues.