InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context

This paper proposes InfoFlow KV, an information-flow-aware method that uses attention-norm signals and global positional reordering to selectively recompute key-value caches, thereby improving the efficiency and accuracy of retrieval-augmented generation for long-context tasks.

Xin Teng, Canyu Zhang, Shaoyi Zheng, Danyang Zhuo, Tianyi Zhou, Shengjie Wang

Published 2026-03-06

The Big Problem: The "Library Overload"

Imagine you are a brilliant detective (the AI) trying to solve a mystery. To do this, you need to read a library containing 100,000 books (the long context).

  • The Old Way (Full Context): Every time you get a new question, you walk into the library, pull out every single book, and read the first few pages of all of them to get your bearings. This takes forever. If you have to answer 100 questions, you are walking through the library 100 times. It's exhausting and slow.
  • The "Smart" Way (Pre-computing): To save time, you decide to read the first few pages of every book once and write a summary note (a KV Cache) for each one. You put these notes on a shelf. Now, when a question comes in, you just grab the relevant notes instead of re-reading the whole books.
  • The Glitch: The problem is that these notes were written when the books were sitting on the shelf individually. But when you answer a question, you need to see how Book A connects to Book B. If you just grab the notes, you miss the connections between the books. The detective gets confused because the notes don't tell the whole story.
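The "glitch" above can be made concrete with a toy sketch (not the paper's code): a single numpy attention layer, where the hidden states that feed the next layer's KV cache are computed either over the full concatenated context or per chunk in isolation. The names (`causal_attention`, `chunk_a`, `chunk_b`) and the tiny dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def causal_attention(x):
    # Single-head causal self-attention over token embeddings x of shape (n, d).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((len(x), len(x)), dtype=bool))
    scores = np.where(mask, scores, -np.inf)          # causal: no looking ahead
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v  # hidden states that a deeper layer's KV cache is built from

# Two "books" (context chunks) of 4 tokens each.
chunk_a, chunk_b = rng.normal(size=(4, d)), rng.normal(size=(4, d))

# Full-context path: chunk B's tokens can attend back to chunk A.
h_full = causal_attention(np.concatenate([chunk_a, chunk_b]))

# Pre-computed path: each chunk is encoded alone, then the caches are stitched.
h_cached = np.concatenate([causal_attention(chunk_a), causal_attention(chunk_b)])

# Chunk A's cache is exact (causality never let it see chunk B anyway),
# but chunk B's cache is missing all cross-chunk information flow.
print(np.allclose(h_full[:4], h_cached[:4]))   # True
print(np.allclose(h_full[4:], h_cached[4:]))   # False
```

This is exactly the detective's problem: Book A's notes are fine, but Book B's notes were written as if Book A did not exist.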

The Current "Fix" (And Why It's Flawed)

Other researchers tried to fix this by saying, "Okay, let's re-read a few pages of the books that seem important."

  • Method A (CacheBlend): They guess which pages matter by checking where the summary notes deviate from what a full re-read would have produced. But they only run this check in the "shallow" layers of the brain, so they miss the deep connections.
  • Method B (EPIC): They just re-read the first page of every book, no matter what. It's like re-reading the introduction of a cookbook just because you are trying to solve a murder mystery. It's a waste of time.

Neither method really asks: "Does this specific sentence actually help me solve the puzzle right now?"

The InfoFlow Solution: The "Traffic Controller"

The authors of this paper propose a new way to think about the problem. They call it Information Flow.

Imagine the library is a busy city, and the books are neighborhoods. The "Question" is a delivery truck that needs to drop off a package (the answer).

  1. The Traffic Signal (Attention Norms): The authors realized that the "Question" naturally sends out a signal (like a traffic light) that says, "Hey, I need to talk to this specific street corner in Book A and that specific alley in Book B."
  2. The Map (RoPE Geometry): The tricky part is that the library has a weird map system (called RoPE). If you look at the map from the wrong angle, the street corners look like they are in different cities, even if they are next to each other.
    • The Paper's Insight: You must look at the map from the exact same angle the delivery truck will use when it drives. If you look at the map from a different angle, you might pick the wrong street corners to re-read.
  3. The Strategy:
    • Step 1: Look at the "Traffic Signal" (the attention norm) that the Question sends toward the books.
    • Step 2: Only re-read the specific sentences that the signal is pointing at. These are the sentences that actually carry the "information flow" needed to solve the problem.
    • Step 3 (The Bonus): If the books are independent (like a stack of random documents), the paper suggests reordering the stack. Put the most important books closest to the delivery truck so the signal reaches them faster.
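The strategy above can be sketched in a few lines, under loud assumptions: this toy uses plain softmax attention weights as the "traffic signal" (the paper's attention-norm formulation is more refined), a single head, and hypothetical function names (`rope`, `select_tokens_to_recompute`). The key point it illustrates is the Step-2 insight: cached keys are rotated to their global positions before scoring, so the map is read from the same angle the query will use.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # Rotary position embedding: rotate consecutive feature pairs of x (n, d)
    # by an angle proportional to each token's absolute position.
    n, d = x.shape
    freqs = base ** (-np.arange(0, d, 2) / d)        # (d/2,) rotation frequencies
    ang = positions[:, None] * freqs[None, :]        # (n, d/2) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def select_tokens_to_recompute(query, cached_keys, global_positions, query_pos, k=4):
    # Score every cached token from the query's own vantage point: rotate the
    # cached keys to their *global* positions (the angle the query actually
    # "sees" at decode time), then rank tokens by attention weight.
    q = rope(query[None, :], np.array([query_pos]))   # (1, d) rotated query
    keys = rope(cached_keys, global_positions)        # (n, d) globally rotated keys
    scores = (q @ keys.T).ravel() / np.sqrt(len(query))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return np.argsort(w)[::-1][:k]                    # indices worth recomputing

rng = np.random.default_rng(1)
d, n = 8, 16
idx = select_tokens_to_recompute(rng.normal(size=d), rng.normal(size=(n, d)),
                                 np.arange(n), query_pos=n, k=4)
print(sorted(idx.tolist()))  # the 4 tokens the "traffic signal" points at
```

Step 3's reordering would follow the same signal: aggregate the per-token scores within each chunk and sort the independent chunks so the highest-scoring ones sit closest to the query.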

The Result: Faster and Smarter

By using this "Traffic Controller" method:

  • Speed: The AI doesn't waste time re-reading irrelevant pages. It only re-reads the critical few sentences that connect the dots.
  • Accuracy: Because it re-reads the right sentences using the right map, it understands the story much better than the old methods.
  • Versatility: It works for text (LLMs) and even for images and text mixed together (VLMs), like reading a chart or an infographic.

Summary Analogy

Think of the AI as a chef making a soup.

  • Old Way: The chef tastes the whole pot of soup every time a customer orders, even though they already tasted the ingredients earlier.
  • Bad Fix: The chef tastes the first spoonful of every ingredient jar, even if the customer asked for a spicy dish.
  • InfoFlow Way: The chef looks at the customer's order (the question), sees exactly which spices (tokens) are needed to make the flavor work, and only tastes those specific spices again to make sure they mix well. It's faster, and the soup tastes perfect.

In short: This paper teaches AI how to be a better librarian by figuring out exactly which pages to re-read to connect the dots, saving massive amounts of time while keeping the answers accurate.
