Decoupling Vision and Language: Codebook Anchored Visual Adaptation

Imagine you have a brilliant Translator (the Language Model) who speaks perfect English and can explain anything in the world. However, this translator relies on a Camera (the Vision Encoder) to see the world and describe it to them.

The problem? The camera is a general-purpose one. It's great at taking photos of cats, cars, and sunsets, but if you point it at a medical X-ray or a rare flower, it gets confused. It might say, "I see a hole," when it's actually seeing fluid, or "I see a red dot," when it's actually a specific disease.

When the camera gives a bad description, the brilliant translator gets misled and gives a wrong answer, even though the translator is smart.

The Old Way: The "Re-Training" Nightmare

Previously, if you wanted to fix the camera for medical X-rays, you had to:

Tweak the camera to see better.
Re-teach the translator how to understand the camera's new, weird way of speaking.

This is like hiring a new camera operator, then forcing your translator to go to a whole new school to learn their new dialect. If you want to use a different translator later, you have to start the whole re-teaching process again. It's expensive, slow, and breaks the translator's ability to speak naturally.

The New Way: CRAFT (The "Universal Dictionary")

The paper's method, CRAFT, takes a different approach. The camera and the translator don't need to communicate in a continuous, fluid language. Instead, they can communicate using a fixed set of building blocks (a "Codebook").

Think of the Codebook as a Universal Dictionary or a Lego set with 16,000 specific, pre-defined blocks.

Block #11745 always means "white background."
Block #5825 always means "a dog's ear."
Block #3918 always means "a flower petal."

How CRAFT Works:

The Camera Learns the Dictionary: Instead of trying to describe an image with a million tiny, fluid details, the camera learns to look at an image and say, "This part is Block #5825, and that part is Block #3918."
The Translator is Frozen: The brilliant translator already knows this dictionary perfectly. It doesn't need to be re-taught. It just reads the blocks and builds a sentence.
The Magic Trick: To make the camera good at X-rays, you only train the camera to pick the right blocks for medical images. You don't touch the translator at all.
- Analogy: Imagine you have a translator who knows the dictionary. You hire a new camera operator and say, "Just point to the right dictionary words for this X-ray." The translator instantly understands because the words haven't changed.

Why This is a Game-Changer

1. Plug-and-Play Compatibility
Because everyone uses the same "Universal Dictionary" (Codebook), you can train a camera on a small computer (using a small "surrogate" translator) and then plug that camera into a super-powerful, massive translator later. They speak the same language immediately. No re-training needed!

2. No "Amnesia"
When you try to re-teach a translator to understand a new camera, it often forgets how to speak normally (a problem called "catastrophic forgetting"). It might start giving one-word answers like "Yes" or "No" instead of explaining why.

CRAFT's Result: The translator keeps its full personality and ability to explain things. It can still say, "Yes, there is fluid, because I see a bright circle with a dark center," just like a human doctor would.

3. It's Efficient
Training the whole system (Camera + Translator) is like trying to move a mountain. CRAFT is like moving a few pebbles. You only train the camera.

The Pruning Bonus: The paper also adds a "pruning" step. Imagine the camera takes a photo and generates 100 blocks, but 80 of them are just "sky" or "grass" (boring background). CRAFT automatically throws away the boring blocks and only sends the interesting ones (the flower, the tumor) to the translator. This makes the system faster and cheaper to run.

The Real-World Impact

In the paper, they tested this on:

Medical Scans: Identifying fluid in brains.
Plant Diseases: Spotting bacterial spots on leaves.
Abstract Diagrams: Solving logic puzzles with shapes.

The Result: CRAFT improved accuracy by about 13.5% on average over the unadapted (zero-shot) baseline, while keeping the AI's ability to explain its reasoning intact.

Summary

CRAFT is like giving a camera a universal vocabulary so it can talk to any smart AI translator without needing to re-teach the translator. It's cheaper, faster, and ensures the AI doesn't forget how to be smart and helpful while learning to see new things.

1. Problem Statement

Large Vision-Language Models (LVLMs) typically struggle in domain-specific tasks (e.g., medical diagnosis, fine-grained plant classification) because their pre-trained vision encoders lack specialized visual grounding.

The Bottleneck: Existing adaptation methods (e.g., LoRA, projector tuning) modify the continuous feature interface between the vision encoder and the Language Model (LLM).
The Consequence: These methods couple the vision and language components. When the vision encoder is fine-tuned for a new domain, its feature distribution shifts, necessitating costly re-alignment of the LLM. Furthermore, fine-tuning the LLM itself often leads to catastrophic forgetting of general instruction-following capabilities and reasoning abilities.
The Goal: Can we adapt an LVLM to a new domain by modifying only the vision encoder, while keeping the LLM frozen and preserving its original reasoning capabilities?

2. Methodology: CRAFT

The authors propose CRAFT (Codebook RegulAted Fine-Tuning), a lightweight framework that decouples vision adaptation from language processing using a discrete codebook interface.

Core Concept

Instead of passing continuous vectors to the LLM, CRAFT quantizes visual features into a shared, frozen discrete codebook. The vision encoder learns to select and arrange existing codebook entries (tokens) that best represent domain-specific visual cues. Because the codebook is fixed, the "visual vocabulary" remains stable, allowing a fine-tuned encoder to plug into any LLM that shares the same codebook without re-training the LLM.

Key Components

Discrete Visual Interface:
- The vision encoder outputs continuous features $z$.
- These are quantized to the nearest entry in a frozen codebook $C$: $\tilde{z} = q(z)$.
- The resulting discrete tokens are projected into the LLM's embedding space.
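The quantize-then-project step can be illustrated with a small NumPy sketch. This is a toy illustration, not the paper's implementation: it assumes Euclidean nearest-neighbor lookup, a single linear projector, and made-up sizes (16 codes, 4-dimensional features, 8-dimensional LLM embeddings). The names `quantize`, `codebook`, and `projector` are hypothetical.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z to its nearest codebook entry (Euclidean distance).

    z:        (num_patches, dim) continuous encoder outputs
    codebook: (num_codes, dim) frozen codebook C
    Returns (ids, z_q, residuals).
    """
    # Squared distances via ||z||^2 - 2 z.c + ||c||^2, shape (num_patches, num_codes)
    d2 = (
        (z ** 2).sum(axis=1, keepdims=True)
        - 2.0 * z @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    ids = d2.argmin(axis=1)                       # discrete token IDs
    z_q = codebook[ids]                           # quantized features, q(z)
    residuals = np.linalg.norm(z - z_q, axis=1)   # how far each patch was from its entry
    return ids, z_q, residuals

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))    # toy frozen codebook (assumed sizes)
projector = rng.normal(size=(4, 8))    # assumed linear map into an LLM's embedding space

# Toy patches: slightly perturbed copies of codebook entries 3, 3 and 7
z = codebook[[3, 3, 7]] + 0.01 * rng.normal(size=(3, 4))
ids, z_q, residuals = quantize(z, codebook)
llm_inputs = z_q @ projector           # one embedding per visual token, shape (3, 8)
```

Because the LLM only ever sees rows of the frozen codebook (after projection), changing the encoder changes which IDs are selected, not what each ID means.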
Composite Training Objective:
CRAFT fine-tunes only the vision encoder using three loss functions:
- Surrogate Alignment Loss ($L_{SAL}$): A small, frozen "surrogate" LLM predicts the next token based on the image and text. Gradients are backpropagated through the surrogate to the vision encoder. This teaches the encoder to produce tokens that are interpretable and useful for reasoning, rather than merely faithful to the pixels.
- Commitment Loss ($L_{commit}$): Keeps the encoder's continuous outputs close to their assigned codebook entries, preventing feature drift that would break the quantization process.
- Contrastive Loss ($L_{con}$): Preserves the semantic structure learned during pre-training by aligning image embeddings with text captions (ground truth and generated), preventing the model from forgetting general visual knowledge.
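As a rough sketch of how the three terms might be combined, the snippet below computes a VQ-VAE-style commitment term and a CLIP-style InfoNCE contrastive term, then adds them to a surrogate loss value. The exact loss forms, the weights `w_commit` and `w_con`, and the temperature are assumptions made for illustration; the summary above does not specify them.

```python
import numpy as np

def commitment_loss(z, z_q):
    """Mean squared distance between encoder outputs and their assigned codebook entries.

    In an autograd framework z_q would be treated as a constant (stop-gradient),
    so only the encoder is pulled toward the frozen codebook.
    """
    return float(np.mean(np.sum((z - z_q) ** 2, axis=1)))

def info_nce(image_emb, text_emb, temperature=0.07):
    """Image-to-text InfoNCE: row i of image_emb should match row i of text_emb."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def total_loss(l_sal, l_commit, l_con, w_commit=0.25, w_con=1.0):
    """Weighted sum of the three terms; the weights are placeholders, not paper values."""
    return l_sal + w_commit * l_commit + w_con * l_con

z = np.array([[1.0, 0.0], [0.0, 1.0]])
z_q = np.array([[1.0, 0.5], [0.0, 1.0]])
l_commit = commitment_loss(z, z_q)                     # (0.25 + 0.0) / 2 = 0.125

aligned = info_nce(np.eye(3), np.eye(3))               # matching pairs: low loss
shuffled = info_nce(np.eye(3), np.eye(3)[[1, 2, 0]])   # mismatched pairs: high loss

# l_sal would come from the frozen surrogate LLM's next-token loss; 1.0 is a stand-in
loss = total_loss(l_sal=1.0, l_commit=l_commit, l_con=aligned)
```

Only the vision encoder's parameters would receive updates from this combined loss; the surrogate LLM and the codebook stay frozen.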
Test-Time Token Pruning:
- Discrete encoders often produce redundant tokens (e.g., background patches mapping to the same frequent codebook ID).
- CRAFT employs a rarity-weighted pruning strategy:
  - It calculates a "rarity weight" for each token ID based on its frequency in the training set (rare tokens are kept; frequent background tokens are pruned).
  - Among tokens that share the same ID, it uses quantization residuals (how hard a patch was to quantize) and spatial isolation to select the most informative ones.
- This reduces computational load (FLOPs) and focuses the LLM on salient visual details.
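A simplified version of the pruning idea can be sketched as follows. The scoring rule here is an assumption chosen for illustration: rarity is taken as the negative log of the ID's training frequency, and the quantization residual breaks ties within an ID. The spatial-isolation criterion is omitted, and `keep_ratio` is a hypothetical parameter.

```python
import numpy as np

def rarity_weights(train_counts, eps=1.0):
    """Higher weight for IDs that are rare in the training set (assumed form: -log frequency)."""
    smoothed = train_counts + eps
    return -np.log(smoothed / smoothed.sum())

def prune_tokens(ids, residuals, weights, keep_ratio=0.5):
    """Return the indices of the tokens to keep, in their original order.

    score = rarity weight of the token's ID * (1 + quantization residual),
    so within one ID the harder-to-quantize patches rank first.
    """
    scores = weights[ids] * (1.0 + residuals)
    n_keep = max(1, int(round(keep_ratio * len(ids))))
    top = np.argsort(-scores, kind="stable")[:n_keep]
    return np.sort(top)

# Toy setup: ID 0 is a very frequent "background" code, IDs 1 and 2 are rare
train_counts = np.array([900, 50, 50])
weights = rarity_weights(train_counts)

ids = np.array([0, 0, 0, 0, 1, 2])
residuals = np.array([0.1, 0.9, 0.1, 0.1, 0.2, 0.3])
kept = prune_tokens(ids, residuals, weights, keep_ratio=0.5)
```

In this toy case the two rare tokens survive along with the one background patch that was hardest to quantize, while the three near-identical background patches are dropped.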

3. Key Contributions

Decoupled Adaptation: Introduces a method to adapt LVLMs to new domains by updating only the vision encoder, leaving the LLM completely frozen.
Codebook Anchoring: Demonstrates that a shared discrete codebook acts as a stable "visual language," enabling cross-architecture transfer. An encoder fine-tuned with a small surrogate model (e.g., 0.5B) can be directly used with a larger LLM (e.g., 7B) without re-alignment.
Surrogate-Guided Training: Uses a lightweight surrogate LLM to guide the vision encoder, ensuring the generated tokens facilitate reasoning rather than just feature reconstruction.
Efficiency: Achieves domain adaptation with significantly lower compute costs (training on small surrogates) and inference costs (token pruning).

4. Experimental Results

The authors evaluated CRAFT on 10 benchmarks, including medical imaging (VQARAD, Kvasir), fine-grained classification (Flowers, Cars, Dogs, PlantVillage), and abstract reasoning (IconQA, ScienceQA).

Performance Gains: CRAFT achieved an average improvement of 13.51% across benchmarks compared to zero-shot baselines.
Comparison to SOTA: It outperformed continuous-feature adaptation methods (Vision FT, Projector FT, LDIFS) and LLM LoRA fine-tuning.
- Example: On PlantVillage, CRAFT improved accuracy by ~26% over the baseline, whereas continuous methods often failed or degraded performance.
Preservation of Reasoning: Unlike LLM LoRA or Projector FT, which often caused the model to "collapse" into short, non-explanatory answers, CRAFT preserved the LLM's instruction-following and explanatory capabilities.
- In the paper's Table 2, CRAFT maintained high scores in "Presence" (providing explanations) and "Faithfulness" (grounding in the image), whereas LoRA-tuned models scored near zero on these metrics.
Cross-LLM Transfer: An encoder trained with a 0.5B surrogate successfully boosted the performance of 3B and 7B LLM backbones, supporting the modularity of the approach.
Efficiency:
- Training: Using a 0.5B surrogate reduced VRAM usage by ~61% and training time by ~73% compared to fine-tuning a 7B model.
- Inference: Token pruning reduced FLOPs by 16% and inference latency by 7% without sacrificing accuracy.

5. Significance

CRAFT addresses a critical bottleneck in the deployment of LVLMs: the high cost and instability of adapting models to specialized domains.

Resource Efficiency: It democratizes domain adaptation, allowing organizations to specialize models using small surrogates and limited data, rather than requiring massive compute resources to retrain full multimodal stacks.
Stability: By decoupling vision and language, it prevents the "catastrophic forgetting" of general reasoning skills that plagues current fine-tuning methods.
Modularity: The shared codebook concept suggests a future where vision encoders and LLMs can evolve independently, provided they adhere to a common discrete visual vocabulary. This offers a practical solution for resource-constrained settings and specialized industries like healthcare.