OSCAR: Online Soft Compression And Reranking

Imagine you are a brilliant detective (the Large Language Model or LLM) trying to solve a mystery. To do your job, you need to read a stack of case files (the retrieved documents) before you can write your report (the answer).

The problem? In the modern world, the stack of case files is getting huge. Sometimes it's a few pages; sometimes it's a library. Reading every single word of every single file takes a long time, costs a fortune in energy, and slows down your investigation.

This is where OSCAR comes in.

The Old Ways: The "Hard" and "Soft" Problems

Before OSCAR, detectives had two main ways to handle this mountain of paperwork, and both had flaws:

The "Hard" Cut (Hard Compression): Imagine a strict editor who grabs a red pen and physically cuts out sentences from the documents, leaving only the "most important" bits.
- The Good: It's fast because the text is shorter.
- The Bad: You can only cut so much before you lose crucial clues. It's like trying to summarize a whole novel into a single sentence; you lose the nuance.
The "Soft" Summary (Soft Compression): Imagine a super-smart assistant who reads the files before you even get the case, writes a perfect summary, and hands you a tiny, magical note.
- The Good: You get a huge amount of information packed into a tiny space.
- The Bad: The assistant is slow and expensive. Also, they usually write the summary without knowing what your specific question is. They might summarize the whole file, including the parts you don't care about.

Enter OSCAR: The "Smart, On-the-Fly" Assistant

OSCAR (Online Soft Compression And Reranking) is a new kind of assistant that solves both problems. Think of it as a specialized, lightning-fast librarian who works while you are asking your question.

Here is how it works, using a simple analogy:

1. The "Query-Dependent" Magic

In the past, assistants summarized documents blindly. OSCAR is different. It waits until you ask your question (e.g., "Who won the Palme d'Or?").

The Analogy: Imagine you are looking for a specific needle in a haystack. The old assistants would summarize the entire haystack. OSCAR looks at your question, realizes you only care about the "needle," and instantly compresses the haystack down to just the straw that holds the needle. It ignores everything else.

2. The "Online" Speed

Old "soft" compression methods were like hiring a team of scholars to write summaries days in advance. If you changed your question, the summaries were useless.

The Analogy: OSCAR is a live translator. As soon as you ask a question, it instantly translates the relevant parts of the documents into a "compressed code" (a few special tokens) that your detective brain understands perfectly. It happens in real-time, so you don't have to wait.

3. The "Double Duty" (Reranking)

Usually, after finding documents, you have to hire a second person to decide which documents are actually useful (this is called reranking).

The Analogy: OSCAR is a two-in-one tool. While it is compressing the documents into its "magic code," it is also whispering to you, "Hey, this document is super relevant, but that one is junk." It does the compression and the sorting at the exact same time, saving you the cost of hiring a second person.

Why is this a Big Deal?

The paper shows that OSCAR is a game-changer for three reasons:

It's Blazing Fast: By compressing the documents on the fly, the detective (the AI) has to read way less text. The paper says this makes the whole process 2 to 5 times faster. It's like switching from reading a novel to reading a perfectly written, 3-sentence cheat sheet.
It's Still Accurate: Even though it's skipping most of the text, it doesn't lose the important clues. The AI still gets the right answer almost as often as if it had read every single word.
It Scales: OSCAR works across AI models from small (1 billion parameters) to large (24 billion parameters), and it makes even the big ones more efficient. It's like giving a Ferrari a turbocharger without adding extra weight.

The Bottom Line

OSCAR is like having a super-intelligent, instant filter for your AI's reading list. Instead of forcing the AI to read a whole library, OSCAR instantly distills the library down to the exact few sentences the AI needs to solve your specific problem, and it tells the AI which books to ignore—all in the blink of an eye.

It's the difference between reading a 500-page manual to fix a toaster and having a technician who instantly tells you, "Just flip this one switch, ignore the rest."

Here is a detailed technical summary of the paper "OSCAR: Online Soft Compression And Reranking" (March 2026).

1. Problem Statement

Retrieval-Augmented Generation (RAG) significantly improves Large Language Models (LLMs) by integrating external knowledge, but scaling RAG pipelines is computationally expensive. As the number of retrieved documents grows, the prompt gets longer, and both computational cost (FLOPs) and latency rise with it; the self-attention component in particular grows quadratically with context length.
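As a rough illustration of this scaling argument, the sketch below counts prompt tokens and pairwise attention scores for a growing number of retrieved documents. The 128-token document length matches the example used later in this summary; the 32-token query and both function names are assumptions made for illustration. The count covers a single attention layer and ignores feed-forward and projection costs, which grow only linearly.

```python
def prompt_tokens(num_docs: int, doc_tokens: int = 128, query_tokens: int = 32) -> int:
    """Total prompt length when every retrieved document is passed as raw text."""
    return query_tokens + num_docs * doc_tokens


def attention_pairs(seq_len: int) -> int:
    """Number of (query, key) pairs scored by causal self-attention in one layer."""
    return seq_len * (seq_len + 1) // 2


for k in (5, 10, 50):
    n = prompt_tokens(k)
    print(f"{k:>2} docs -> {n:>5} tokens, {attention_pairs(n):>10} attention pairs")
```

Doubling the documents from 5 to 10 roughly doubles the token count but nearly quadruples the number of attention pairs, which is the growth that compression targets.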

Existing solutions face a trade-off:

Hard Compression (e.g., Pruning/Summarization): Methods like Provence or RECOMP operate online and are query-dependent but achieve low compression rates (approx. 2×) because they must retain readable text.
Soft Compression: Methods like PISCO or xRAG map documents to continuous embeddings, achieving high compression (approx. 16×). However, they are typically offline (query-independent), requiring heavy pre-computation, or suffer from significant accuracy degradation when compressed too aggressively. They often fail to utilize the specific query during the compression step, leading to information loss.

The Gap: There is a lack of a method that combines high compression rates (like soft compression) with online, query-dependent operation (like hard compression) without sacrificing accuracy or requiring offline pre-computation.

2. Methodology: OSCAR

OSCAR (Online Soft Compression And Reranking) addresses this by introducing a novel query-dependent online soft compression framework. It compresses retrieved documents into a small set of embedding tokens at inference time, conditioned on the user query.

Core Architecture

The pipeline consists of two main components:

Compressor LLM: A smaller, efficient model that takes the query ( $q$ ), a retrieved document ( $d_i$ ), and a set of learnable memory tokens ( $[MEM]$ ) as input.
- It outputs a compressed representation ( $c_i$ ) consisting of $l$ embedding vectors (e.g., 8 vectors for a 128-token document).
- Architectural Variants:
  - OSCAR-N-Layers: Uses the first $N$ layers of the generator's backbone (e.g., Mistral-7B) without the final projection head. No pre-training is required; the hidden states naturally align with the generator.
  - OSCAR-llama: Uses a distinct, smaller LLM (e.g., Llama-1B) as the compressor. This requires a pre-training phase (auto-encoding/text continuation) to align the compressor's hidden space with the generator's embedding space, followed by fine-tuning.
Generator LLM: The main LLM receives the query and the compressed embeddings ( $c_1, \dots, c_k$ ) instead of the raw text. It generates the answer based on these compressed representations.
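The data flow described above can be traced at the level of shapes with a toy sketch. Nothing here is the authors' code: `toy_compress` is a hypothetical stand-in that returns seeded pseudo-random vectors where a real compressor would return the hidden states at the memory-token positions, and the hidden size of 16 is an arbitrary toy value. The point is only to show what the generator receives: k × l embedding vectors instead of k raw documents.

```python
import random
from typing import List

Vector = List[float]
HIDDEN_DIM = 16  # toy size (assumption); real generators use thousands of dimensions
NUM_MEM = 8      # l: memory tokens per document, as in the 8-per-128-token example


def toy_compress(query: str, doc_tokens: List[str], num_mem: int = NUM_MEM) -> List[Vector]:
    """Hypothetical stand-in for the compressor LLM: num_mem vectors per document.

    A real compressor would run [query; document; MEM tokens] through a
    transformer and keep the hidden states at the MEM positions. Here the
    vectors are seeded pseudo-random numbers, used only to track shapes.
    """
    rng = random.Random(f"{query}|{' '.join(doc_tokens)}")
    return [[rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)] for _ in range(num_mem)]


def generator_inputs(query: str, docs: List[List[str]]) -> List[Vector]:
    """Concatenate the compressed embeddings c_1..c_k that replace the raw documents."""
    context: List[Vector] = []
    for doc in docs:
        context.extend(toy_compress(query, doc))
    return context


docs = [["tok"] * 128 for _ in range(10)]  # k = 10 documents of 128 tokens each
context = generator_inputs("Who won the Palme d'Or?", docs)
raw_len = sum(len(d) for d in docs)
print(len(context), raw_len, raw_len / len(context))  # 80 1280 16.0
```

With 8 memory tokens per 128-token document, the generator sees 80 embedding positions in place of 1,280 text tokens, which is the 16× compression rate quoted for soft compression.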

Training Strategy

Distillation: The system is trained end-to-end using sequence-level distillation. A "Teacher" LLM (Mistral-7B) generates answers using full, uncompressed documents. The OSCAR pipeline (Compressor + Generator) is trained to mimic these answers using only the compressed embeddings.
Loss Function: The loss is calculated based on the cross-entropy between the generator's output and the teacher's output.
Simultaneous Reranking: OSCAR adds a [RR] (Rerank) token to the compressor's prompt. A dense layer maps this token's hidden state to a relevance score. This allows the model to perform document reranking in the same forward pass as compression, effectively making the compression cost "free" in a RAG pipeline that already requires reranking.
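The two training signals above can be sketched in a few lines of plain Python. This is a minimal illustration with hypothetical toy numbers, not the paper's implementation: `token_cross_entropy` scores the teacher's tokens under the student's logits, and `rerank_score` is the dense layer applied to the [RR] hidden state. How the relevance score itself is supervised is not specified in this summary, so no reranking loss is shown.

```python
import math
from typing import Sequence


def token_cross_entropy(logits: Sequence[Sequence[float]], teacher_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of the teacher's tokens under the student's logits."""
    total = 0.0
    for step_logits, target in zip(logits, teacher_ids):
        m = max(step_logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        total += log_z - step_logits[target]
    return total / len(teacher_ids)


def rerank_score(rr_hidden: Sequence[float], weight: Sequence[float], bias: float) -> float:
    """Dense layer mapping the [RR] token's hidden state to a scalar relevance score."""
    return sum(h * w for h, w in zip(rr_hidden, weight)) + bias


# Toy values, all hypothetical.
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # student logits over a 3-token vocabulary
teacher_ids = [0, 1]                         # tokens the teacher generated
loss = token_cross_entropy(logits, teacher_ids)

weight, bias = [0.5, -0.25, 1.0], 0.1
rr_states = {"doc_a": [1.0, 0.0, 2.0], "doc_b": [0.0, 1.0, 0.0]}
scores = {name: rerank_score(h, weight, bias) for name, h in rr_states.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(round(loss, 4), ranking)
```

Because the score comes from a hidden state the compressor already computes, ranking the documents adds only one small matrix-vector product per document on top of the compression pass.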

3. Key Contributions

First Online Soft Compression: OSCAR is the first method to achieve high compression rates (up to 16×) in an online, query-dependent manner, eliminating the need for offline pre-computation.
Dual Architecture: It introduces two flexible compressor backbones:
- N-Layers: Lightweight, no pre-training, easy to deploy.
- OSCAR-llama: Uses a dedicated small LLM, offering superior performance and flexibility.
Integrated Reranking: By unifying compression and reranking into a single forward pass, OSCAR removes the computational overhead typically associated with adding a reranker to a RAG pipeline.
Scalability: The method is backbone-agnostic and has been validated on LLMs ranging from 1B to 24B parameters.

4. Experimental Results

The authors evaluated OSCAR on multiple benchmarks (Natural Questions, TriviaQA, HotpotQA, ASQA, PopQA, BioASQ) using various backbones (Mistral-7B/24B, Llama-1B, Qwen-7B).

Efficiency:
- Speed-up: OSCAR achieves a 2× to 5× speed-up in end-to-end inference compared to uncompressed RAG.
- FLOPs: For the Mistral-24B backbone, OSCAR-llama reduces computational cost (FLOPs) by 4.8× while improving overall results.
- Comparison: It outperforms hard compression baselines (Provence, RECOMP) in efficiency while matching or exceeding their accuracy. It also surpasses offline soft compression (PISCO) in online settings.
Accuracy:
- Minimal Loss: OSCAR models show little to no loss in accuracy compared to uncompressed baselines. In some cases (e.g., Mistral-24B), accuracy slightly improves.
- Query Dependence: Ablation studies confirm that removing the query from the compression step causes a significant performance drop (up to 6%), validating the necessity of query-dependent compression.
Robustness:
- Noisy Retrieval: OSCAR maintains performance even when retrieval quality degrades (e.g., using BM25 without reranking), performing similarly to its uncompressed backbone.
- Long Context: When tested with up to 50 retrieved documents (approx. 7k tokens), OSCAR remains robust, offering 5× fewer FLOPs than the uncompressed baseline.
Reranking: Models trained with the reranking head achieve reranking performance (nDCG@10 on BEIR) nearly identical to the teacher model (DeBERTa-v3) without additional inference cost.
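For readers unfamiliar with the reranking metric, here is a minimal sketch of nDCG@k. It uses the linear-gain variant (relevance divided by a log2 rank discount) and assumes the ranked list contains every judged document, so the ideal ordering can be obtained by sorting the same list. Benchmark toolkits may differ in these details, and the function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    # Assumption: the list holds all judged documents, so sorting it gives the ideal order.
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in the order a reranker returned the documents (toy data).
print(ndcg_at_k([1, 1, 0, 0]))  # perfect ordering
print(ndcg_at_k([1, 0, 1, 0]))  # one relevant document ranked too low
```

A perfect ordering scores 1.0, and pushing a relevant document further down the list lowers the score, so "nearly identical nDCG@10" means the compressor orders documents almost as well as the dedicated reranker that taught it.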

5. Significance

OSCAR bridges the gap between soft compression (high efficiency) and online query-dependence (high relevance), easing a major scalability bottleneck of RAG pipelines.

Practical Impact: It enables the deployment of RAG systems on larger models (e.g., 24B parameters) with latency and cost profiles comparable to smaller models, making high-quality, knowledge-intensive AI applications more feasible for real-time, large-scale production.
Future Directions: The paper suggests that other long-context optimizations (like KV-cache compression) are orthogonal to OSCAR, so combining them is a promising direction. It also highlights that the "compression" operation can be viewed as a powerful, task-specific reranker, unifying two critical steps of the RAG pipeline.

In summary, OSCAR is faster and more efficient than the hard compression baselines it is compared against while matching or exceeding their accuracy, and it surpasses offline soft compression (PISCO) in online settings, making it a significant step toward scalable Retrieval-Augmented Generation.