TextBoost: Boosting Scene Text Fidelity in Ultra-low Bitrate Image Compression

The Big Problem: The "Blurry Sign" Dilemma

Imagine you are trying to send a photo of a busy city street to a friend, but you are on a very slow, expensive internet connection. You have to shrink the photo down to a tiny size (ultra-low bitrate) so it can be sent quickly.

When you shrink a photo this much, the computer has to throw away a lot of details to save space. Usually, it keeps the big, obvious things (like the sky or a car) but throws away the tiny, hard-to-see things.

The problem: In a city scene, the most important tiny details are often small signs, street names, or license plates. When the photo is compressed, these small words turn into a blurry, unreadable mess.

The old solution (ROI): The traditional way to fix this was to tell the computer, "Hey, don't throw away the text! Give the text more space in the file."

The Catch: It's like trying to fit a large suitcase and a small jewelry box into a tiny backpack. If you give the jewelry box (the text) more room, you have to squeeze the suitcase (the rest of the image) until it bursts. You get clear text, but the rest of the photo looks terrible.

The New Solution: TextBoost (The "Ghost Writer" Approach)

The authors of the TextBoost paper came up with a clever trick. Instead of fighting over space in the backpack, they bring in a Ghost Writer.

Here is how it works, step-by-step:

1. The "Ghost Writer" (OCR)

Before sending the photo, the system uses a smart tool called OCR (Optical Character Recognition) to read the signs in the picture.

The Magic: Instead of sending the picture of the sign (which takes up a lot of space), the computer just sends the words and their location (e.g., "The word 'STOP' is at the top right").
Why it's great: Sending the word "STOP" takes up almost no space at all compared to sending the actual blurry image of the sign. It's like sending a text message instead of a photo of a sign.

2. The "Blueprint" (Guidance Map)

The computer takes those words and draws a simple, clean "blueprint" or a map of where the letters should go. It doesn't try to draw the whole picture; it just draws the skeleton of the text.

Analogy: Imagine an architect drawing the outline of a house on a piece of paper. They aren't painting the walls yet; they are just showing where the walls should be.

3. The "Smart Builder" (Fusion Block)

Now, the photo arrives at the receiver (your friend's phone). The photo is blurry and missing details.

The Smart Builder (the AI decoder) looks at the blurry photo and the "Blueprint" (the text map) at the same time.
It says, "Okay, the photo is blurry here, but the Blueprint tells me there should be a sharp 'STOP' sign right here."
The builder uses the Blueprint to sharpen the blurry letters in the photo, making them crisp and readable, while leaving the rest of the photo (the sky, the trees) exactly as it was.

4. The "Safety Net" (Loss Function)

During training, the system learns a rule: "Don't just paste the words on top like a sticker." The words need to look like they belong in the scene (same lighting, same angle). The training objective pushes the new text to blend in naturally with the surrounding image, so it doesn't look fake.

Why is this a Game-Changer?

No Forced Trade-off: In the old method, you had to choose between clear text and a clear background. With TextBoost, the text becomes sharp while the background stays about as good as before.
Super Efficient: Because the "Ghost Writer" only sends the text data (which is tiny), it costs almost no extra internet data.
Works Everywhere: They tested this on thousands of images with street signs, billboards, and license plates. The results showed that the text detection score improved by about 60% relative to the best existing method, without making the rest of the image worse.

The Bottom Line

TextBoost is like hiring a specialized editor to fix a blurry photo. Instead of trying to save more space for the whole picture, the editor reads the text, writes it down on a tiny note, and then uses that note to perfectly reconstruct the letters in the photo.

It solves the problem of "blurry signs" in compressed images by using smart hints rather than extra space.

1. Problem Statement

In ultra-low bitrate image compression (e.g., for satellite communication or surveillance), preserving small-font scene text is a critical challenge.

The Trade-off: Traditional methods often use Region-of-Interest (ROI) coding, which allocates more bits to text regions. However, this creates a structural limitation: increasing local fidelity for text inevitably degrades the global perceptual quality of the rest of the image.
The Limitation of Generative Models: While diffusion models can generate perceptually pleasing images, they often fail to preserve the precise pixel fidelity and fine-grained details required for machine-readable text, leading to blurry or hallucinated characters.
The Core Challenge: How to enhance the recognizability of small text in reconstructed images without sacrificing global image quality or significantly increasing the bitrate.

2. Methodology: TextBoost

The authors propose TextBoost, a framework that shifts the paradigm from "bit reallocation" to "semantic guidance." Instead of treating text as a region to be protected by extra bits, TextBoost treats OCR (Optical Character Recognition) output as a lightweight, auxiliary semantic prior to guide the decoder.

The pipeline consists of three strategic modules:

A. Adaptive OCR Information Processing (Rendering-and-Alignment)

Selective Transmission: Instead of transmitting all detected text, the system filters for small-font text (which suffers most from compression artifacts) based on average character area. Large text is ignored as it remains legible even at low bitrates.
Auxiliary Stream: The filtered OCR data (text content and bounding boxes) is compressed (e.g., using gzip) and transmitted as an auxiliary stream with negligible overhead.
Rendering: At the decoder, the text is rendered into a visual guidance map. Crucially, the system:
- Normalizes text orientation (rotating slanted/vertical text to horizontal for rendering).
- Adapts font sizes to fit the bounding boxes.
- Renders text on a black background to create clear spatial masks.
- Rotates the map back to the original scene orientation.
Graceful Degradation: If no OCR data is available, the module outputs a zero tensor, so the system reverts to standard compression without introducing artifacts.
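
The selective-transmission and packing steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `OcrItem` structure, the 400-pixel-area threshold, and the JSON layout are all assumptions made for the example. The source only states that small text is selected by average character area and that the payload is compressed with something like gzip.

```python
import gzip
import json
from dataclasses import dataclass, asdict


@dataclass
class OcrItem:
    """One OCR detection: recognized string plus an axis-aligned box (hypothetical layout)."""
    text: str
    x: int
    y: int
    w: int
    h: int


def avg_char_area(item: OcrItem) -> float:
    """Box area divided by the number of characters (0 for empty strings)."""
    return (item.w * item.h) / len(item.text) if item.text else 0.0


def select_small_text(items, max_avg_char_area=400.0):
    """Keep detections whose average character area is below the threshold.

    The threshold value is an assumption chosen for illustration.
    """
    return [it for it in items if 0.0 < avg_char_area(it) < max_avg_char_area]


def pack(items) -> bytes:
    """Serialize the kept detections to compact JSON and gzip the result."""
    payload = json.dumps([asdict(it) for it in items], separators=(",", ":"))
    return gzip.compress(payload.encode("utf-8"))


def unpack(blob: bytes):
    """Inverse of pack(); an empty blob is treated as 'no OCR data'."""
    if not blob:
        return []
    return [OcrItem(**d) for d in json.loads(gzip.decompress(blob).decode("utf-8"))]


def overhead_bpp(blob: bytes, width: int, height: int) -> float:
    """Cost of the auxiliary stream in bits per pixel."""
    return 8 * len(blob) / (width * height)


detections = [
    OcrItem("STOP", 40, 30, 48, 16),           # 768 px area / 4 chars = 192 per char: kept
    OcrItem("MEGASTORE", 100, 200, 900, 120),  # 108000 / 9 chars = 12000 per char: dropped
]
kept = select_small_text(detections)
blob = pack(kept)
```

For a payload this small, gzip's fixed header and trailer can outweigh any savings; the compression is more likely to pay off when an image contains many detections. Calling `overhead_bpp(blob, width, height)` gives a quick way to check how the auxiliary stream compares with the image bitrate.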

B. Attention-Guided Feature Fusion

The auxiliary guidance map is integrated into the reconstruction stream via a Fusion Block:

Modulation: The guidance map is multiplied element-wise (Hadamard product) with the decoder's RGB output, allowing glyph pixels to inherit color information from the learned image prior.
Channel Expansion: The decoder output is expanded from 3 to 13 channels via a $1\times1$ convolution and concatenated with the 3-channel modulated guidance map, creating a 16-channel representation.
Attention Mechanism: A stacked convolutional attention module (adapted from prior work) processes these features. It learns to spatially emphasize small-text regions while suppressing irrelevant responses, ensuring the text is sharpened without disrupting the global scene statistics.
Projection: A final $1\times1$ convolution projects the features back to 3-channel RGB.
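
The channel bookkeeping of the Fusion Block (3 to 13 channels, concatenation to 16, projection back to 3) can be traced with a small pure-Python sketch. It is written under several assumptions: tensors are nested lists in channel, row, column order, the weights are random rather than learned, and `sigmoid_gate` is a simple placeholder that stands in for the stacked convolutional attention module, whose internals the summary does not describe.

```python
import math
import random


def conv1x1(x, weight):
    """1x1 convolution: each output channel is a per-pixel weighted sum of input channels.

    x: list of C_in channels, each an H x W list of floats.
    weight: C_out x C_in list of floats.
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wk * x[c][i][j] for c, wk in enumerate(row)) for j in range(w)]
         for i in range(h)]
        for row in weight
    ]


def hadamard(a, b):
    """Element-wise product of two tensors with identical C x H x W shape."""
    return [
        [[pa * pb for pa, pb in zip(ra, rb)] for ra, rb in zip(ca, cb)]
        for ca, cb in zip(a, b)
    ]


def sigmoid_gate(x):
    """Placeholder attention: scale each pixel by a sigmoid of its channel mean.

    This is NOT the attention module from the paper, only a stand-in.
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    gate = [
        [1.0 / (1.0 + math.exp(-sum(x[k][i][j] for k in range(c)) / c)) for j in range(w)]
        for i in range(h)
    ]
    return [[[x[k][i][j] * gate[i][j] for j in range(w)] for i in range(h)] for k in range(c)]


def random_weight(c_out, c_in, rng):
    """Random stand-in for learned 1x1 convolution weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(c_in)] for _ in range(c_out)]


def fusion_block(decoded_rgb, guidance, w_expand, w_project):
    modulated = hadamard(guidance, decoded_rgb)  # guidance inherits decoder colors, 3 channels
    expanded = conv1x1(decoded_rgb, w_expand)    # 3 -> 13 channels
    features = expanded + modulated              # channel concatenation: 13 + 3 = 16
    attended = sigmoid_gate(features)            # placeholder for the attention module
    return conv1x1(attended, w_project)          # 16 -> 3 channels (RGB)


rng = random.Random(0)
H, W = 4, 6
decoded = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(3)]
guidance = [[[0.0] * W for _ in range(H)] for _ in range(3)]  # all-zero map: no OCR data
out = fusion_block(decoded, guidance, random_weight(13, 3, rng), random_weight(3, 16, rng))
```

With an all-zero guidance map the modulated branch contributes only zeros, which is consistent with the graceful-degradation behaviour described in section A. How a trained block behaves in that case depends on its learned weights, which this sketch does not model.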

C. Guidance-Consistent Loss

To prevent the network from simply "copying" the auxiliary text (which would cause artifacts and poor blending), a specialized loss function is introduced:

Two-Stage Training:
1. Stage 1: Standard Rate-Distortion optimization to train the backbone encoder/decoder.
2. Stage 2: Fine-tuning where the encoder and base decoder are frozen. Only the fusion block is updated.
Loss Function: A Guidance-Consistent Loss ( $L_{gc}$ ) is applied using a binary mask derived from OCR boxes. It minimizes the Mean Squared Error (MSE) specifically within text regions, ensuring the reconstructed text matches the guidance map's geometry while maintaining consistency with the global image quality. This decouples text enhancement from bitrate allocation.
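
The summary does not give the exact formula for $L_{gc}$. One plausible reading, assuming the target is the original image $x$ and $M$ is the binary mask built from the OCR boxes, is $L_{gc} = \frac{1}{|M|}\sum_{p} M_p\,(\hat{x}_p - x_p)^2$, where $\hat{x}$ is the reconstruction and $|M|$ counts the masked pixels. The sketch below implements that reading for a single-channel image; the function names and the `(x, y, w, h)` box format are hypothetical.

```python
def boxes_to_mask(boxes, height, width):
    """Binary mask: 1 inside any OCR box given as (x, y, w, h), 0 elsewhere."""
    mask = [[0] * width for _ in range(height)]
    for x, y, w, h in boxes:
        for i in range(max(0, y), min(height, y + h)):
            for j in range(max(0, x), min(width, x + w)):
                mask[i][j] = 1
    return mask


def masked_mse(recon, target, mask):
    """Mean squared error over masked pixels only; 0.0 when the mask is empty."""
    total, count = 0.0, 0
    for r_row, t_row, m_row in zip(recon, target, mask):
        for r, t, m in zip(r_row, t_row, m_row):
            if m:
                total += (r - t) ** 2
                count += 1
    return total / count if count else 0.0


target = [[0.0] * 4 for _ in range(3)]
recon = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [9.0, 0.0, 0.0, 9.0],  # large errors outside the text box are ignored by the mask
]
mask = boxes_to_mask([(1, 0, 2, 2)], height=3, width=4)
loss = masked_mse(recon, target, mask)
```

Because the mask restricts the error to text regions, errors elsewhere do not influence this term. A complete objective would presumably pair it with whatever keeps the result consistent with global image quality, which the summary mentions but does not specify.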

3. Key Contributions

Paradigm Shift: Moves away from ROI-based bit allocation (which trades global quality for local accuracy) to an auxiliary semantic guidance approach that decouples text enhancement from rate-distortion optimization.
Novel Architecture: Introduces a Rendering-and-Alignment module that converts discrete OCR strings into geometrically aligned visual guidance maps, and an Attention-Guided Fusion Block that seamlessly integrates this guidance into the decoder.
Training Strategy: Proposes a two-stage training protocol with a guidance-consistent loss that freezes the compression backbone, ensuring improvements come from better feature integration rather than increased bit consumption.
Robustness: The method degrades gracefully to standard compression when no OCR data is available, so the auxiliary path adds no artifacts in that case.

4. Experimental Results

The method was evaluated on TextOCR, ICDAR 2015, and Kodak datasets against state-of-the-art (SOTA) learned codecs (ELIC, LIC-TCM, TACO) and traditional standards (JPEG, VTM).

Text Recognition Performance:
- On TextOCR, TextBoost achieved a 60.6% relative improvement in text detection (DET) F1 score compared to the best baseline (ELIC) at comparable bitrates (~0.033 bpp).
- On ICDAR 2015, it showed a 90% improvement in End-to-End (E2E) recognition scores over ELIC at 0.0225 bpp.
Global Image Quality:
- Unlike ROI methods, TextBoost maintained competitive global fidelity (PSNR, MS-SSIM) and even improved perceptual quality (lower LPIPS) compared to baselines.
- It successfully preserved global scene structures without introducing artifacts in non-text regions.
Efficiency:
- Achieved superior text fidelity at lower average bitrates (0.025 bpp) compared to baselines (0.027–0.029 bpp).
- The auxiliary stream (OCR data) adds negligible overhead.
Generalization:
- The method also performs well on the general-purpose Kodak dataset, which is not text-centric, indicating that it does not degrade performance on such scenes.
- The strategy appears to be model-agnostic: it also boosts performance when applied to the LIC-TCM backbone.
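
To put the reported bitrates in concrete terms, the short sketch below converts bits per pixel into bytes and shows how a relative improvement is computed. The 768×512 frame (the size of a landscape Kodak image) is used here only as an illustrative resolution; the summary does not say which image sizes the average bitrates refer to, and both helper names are hypothetical.

```python
def bpp_to_bytes(bpp: float, width: int, height: int) -> float:
    """Compressed size in bytes for a given bits-per-pixel rate."""
    return bpp * width * height / 8


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline` as a fraction (0.606 means 60.6%)."""
    return (new - baseline) / baseline


FRAME_W, FRAME_H = 768, 512  # illustrative resolution, not taken from the results above
sizes = {bpp: bpp_to_bytes(bpp, FRAME_W, FRAME_H) for bpp in (0.025, 0.027, 0.029)}
```

At this resolution, 0.025 bpp corresponds to roughly 1.2 KB for the whole image, against about 1.3 to 1.4 KB at 0.027–0.029 bpp, which gives a sense of how little room there is for fine text detail at these rates.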

5. Significance

TextBoost represents a significant advancement in content-aware image compression.

Practical Impact: It solves a critical bottleneck for applications like search and rescue, surveillance, and satellite imaging, where small text often carries vital information that is lost in ultra-low bitrate transmission.
Theoretical Insight: It demonstrates that semantic priors (like OCR) can be effectively leveraged to guide generative reconstruction without the stochasticity or hallucination issues of pure generative models.
Future Direction: The paper suggests this "auxiliary guidance" paradigm could be extended to other critical visual elements (e.g., faces, objects) and potentially adapted for handwritten text, though that presents unique stylistic challenges.

In summary, TextBoost successfully breaks the traditional trade-off between local text accuracy and global image quality in ultra-low bitrate compression by using lightweight semantic guidance to "boost" the decoder's ability to reconstruct fine-grained text details.