Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking

This paper introduces EDJE, an efficient discriminative joint encoder that precomputes and compresses visual tokens offline. By moving the heavy image-encoding step out of inference, EDJE overcomes the high computational and storage costs of existing vision-language rerankers, enabling high-throughput reranking with minimal disk usage while maintaining state-of-the-art retrieval performance.

Mitchell Keren Taraday, Shahaf Wagner, Chaim Baskin

Published 2026-02-24

The Big Problem: The "Slow Chef" in a Fast Food Kitchen

Imagine you run a massive library (a database of millions of images) and people are asking for books (images) based on descriptions (text).

  1. The Old Way (Embedding Models): You have a super-fast librarian who can quickly scan the spine of every book and guess what's inside. This is fast, but sometimes the guess is wrong because they didn't read the actual pages.
  2. The "Perfect" Way (Joint Encoders): You have a genius scholar who reads the entire book and the entire description, then compares them word-for-word. This is incredibly accurate, but it takes forever. If you have to do this for 50,000 books, your customer waits hours.

The Bottleneck: The paper points out that the "genius scholar" (like the famous BLIP model) is too slow because they spend 90% of their time just looking at the pictures to understand them before they even start reading the text. It's like a chef spending 45 minutes chopping vegetables before they even start cooking the meal.

The Solution: EDJE (The "Pre-Cooked" Chef)

The authors introduce EDJE (Efficient Discriminative Joint Encoder). Their big idea is to change when the chef does the hard work.

1. The "Pre-Cooked" Strategy (Offline Precomputation)

Instead of chopping vegetables (extracting image features) every time a customer orders, EDJE says: "Let's chop all the vegetables once, store them in the fridge, and just grab them when needed."

  • How it works: They take all the images in the database once, process them into a compact "summary" (tokens), and save them on the hard drive.
  • The Benefit: When a user types a query, the system doesn't need to "look" at the raw image again. It just grabs the pre-made summary. This saves a massive amount of time.
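The offline pass can be sketched in a few lines. This is a minimal toy, not the paper's code: the encoder is a random stand-in for a real vision backbone (a ViT-style model that turns each 16×16 patch into one token), and all shapes are illustrative assumptions.

```python
import numpy as np

# Hypothetical stand-in for a ViT-style image encoder: one token per 16x16
# patch. A real system would run a pretrained backbone here; we emit random
# tokens just to show the shapes and the offline caching pattern.
def encode_image(image: np.ndarray, dim: int = 384) -> np.ndarray:
    rng = np.random.default_rng(0)
    num_patches = (image.shape[0] // 16) * (image.shape[1] // 16)
    return rng.standard_normal((num_patches, dim)).astype(np.float16)

def precompute_database(images: dict, store: dict) -> None:
    """Run the vision encoder once per image and cache the tokens.

    This loop runs offline, once. At query time the system only reads
    from `store` and never touches the raw pixels again.
    """
    for image_id, image in images.items():
        store[image_id] = encode_image(image)

store = {}
images = {"img0": np.zeros((224, 224, 3)), "img1": np.zeros((224, 224, 3))}
precompute_database(images, store)
print(store["img0"].shape)  # a 224x224 image yields 196 patch tokens
```

In a real deployment `store` would be a file on disk (e.g. a memory-mapped array) rather than an in-memory dict, but the key point is the same: the expensive encoder runs exactly once per image.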

2. The "Condensed Summary" (Token Compression)

Here is the tricky part. Even the "summary" of an image is huge. If you have 1 million images, and each summary is 100MB, your hard drive will explode.

  • The Metaphor: Imagine the image summary is a 500-page novel. Storing 1 million novels is impossible.
  • The Fix: EDJE uses a special Adapter (a smart filter). It reads the 500-page novel and writes a 64-word summary that captures the most important parts.
  • The Magic: This 64-word summary is tiny (only 49 kilobytes per image!), but it still contains the "soul" of the image. It's like turning a whole movie into a perfect 1-minute trailer that still tells you exactly what the movie is about.
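One common way to build such an adapter is cross-attention pooling: a small fixed set of learned query vectors attends over all patch tokens and distills them into a short sequence. The sketch below is an assumption about the mechanism, not the paper's exact architecture, and the 384-dimensional width is a guess chosen because 64 tokens × 384 dims in float16 comes out to 49,152 bytes, consistent with the ~49 KB per image quoted above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_tokens(patch_tokens: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Cross-attention pooling: each of the 64 learned queries attends over
    all patch tokens, producing one compressed output token per query."""
    d = queries.shape[-1]
    attn = softmax(queries @ patch_tokens.T / np.sqrt(d))  # (64, num_patches)
    return attn @ patch_tokens                              # (64, d)

rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((196, 384)).astype(np.float32)   # full "novel"
learned_queries = rng.standard_normal((64, 384)).astype(np.float32) # trained offline

compact = compress_tokens(patch_tokens, learned_queries).astype(np.float16)
print(compact.shape, compact.nbytes)  # (64, 384), 49152 bytes ~ 49 KB
```

In practice the queries (and the projection layers a full attention block would add) are trained jointly with the reranker, so the adapter learns which parts of the "novel" matter for retrieval.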

3. The "Fast Matchmaker" (The Joint Encoder)

Now, when a user asks, "Show me a picture of a dog playing in the snow," the system:

  1. Takes the text.
  2. Grabs the tiny 64-word summaries of the top 50,000 candidate images from the fridge.
  3. Feeds the text and the summaries into a small, fast language model (like a MiniLM).
  4. This model acts as a matchmaker, comparing the text to the summaries instantly.

Because the heavy lifting (looking at the raw image) was done offline, and the data is tiny, this matchmaker can process 50,000 pairs per second. That's like checking 50,000 books in the time it takes to blink.
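The query-time flow above can be sketched as follows. The scorer here is a deliberately trivial stand-in (mean-pool the concatenated tokens and project to a scalar); EDJE's actual matchmaker is a small transformer run over the joint sequence, and all names and dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_pair(text_tokens: np.ndarray, image_tokens: np.ndarray,
               w: np.ndarray) -> float:
    """Toy joint scorer: concatenate text and image tokens, mean-pool,
    project to a scalar. A real reranker would run a small transformer
    over the concatenated sequence instead."""
    joint = np.concatenate([text_tokens, image_tokens], axis=0)
    return float(joint.mean(axis=0) @ w)

def rerank(text_tokens: np.ndarray, candidate_store: dict,
           w: np.ndarray, top_k: int = 2) -> list:
    """Score every candidate's cached tokens against the query text."""
    scores = {img_id: score_pair(text_tokens, toks.astype(np.float32), w)
              for img_id, toks in candidate_store.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Cached 64-token summaries, grabbed "from the fridge" (shapes illustrative).
candidate_store = {f"img{i}": rng.standard_normal((64, 384)).astype(np.float16)
                   for i in range(5)}
text_tokens = rng.standard_normal((16, 384))  # tokenized + embedded query
w = rng.standard_normal(384)

ranking = rerank(text_tokens, candidate_store, w)
print(ranking)  # ids of the best-matching candidates
```

Because each candidate contributes only 64 small tokens, thousands of these pairs can be batched through the small model at once, which is where the high pairs-per-second throughput comes from.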

Why This Matters (The Results)

  • Speed: It is up to 53 times faster than previous "perfect" models.
  • Storage: It uses 49KB of space per image (compressed) instead of megabytes. You could store millions of these on a standard laptop.
  • Accuracy: Even though it's fast and uses tiny summaries, it is just as good at finding the right picture as the slow, heavy models. In fact, it beats the old "fast but dumb" models significantly.

A Real-World Analogy: The Dating App

  • Old Fast Models (CLIP): You swipe through photos based on a quick glance. "Looks like a dog." Fast, but you might miss the specific breed or the fact that the dog is wearing a hat.
  • Old Slow Models (BLIP): You read a full biography of every person before deciding to swipe. Accurate, but you'd never find a date because it takes too long.
  • EDJE: You have a database of "Pre-written Bios" (the compressed tokens). When you search for "Dog with a hat," the system instantly matches your text against these pre-written bios. It's fast enough to handle millions of users, but smart enough to know the difference between a dog in a hat and a dog in a tuxedo.

Summary

The paper solves the problem of "How do we make AI look at pictures and read text together really fast?" by saying: "Don't look at the pictures in real-time. Summarize them beforehand, shrink the summaries down to their absolute essentials, and then match them quickly."

This allows us to build massive, smart image search engines that are both incredibly accurate and lightning-fast.
