Imagine you are trying to translate a giant, messy billboard in a foreign city. The billboard is huge, covered in tiny handwritten notes, big bold headlines, and decorative flowers. If you try to look at the whole thing at once from far away, you can't read the tiny words. But if you zoom in too close to read one word, you lose the context of the whole sign and might translate a word wrong because you don't know what the sentence is about.
This is exactly the problem computer scientists face when trying to translate text inside images (like menus, posters, or screenshots) using Artificial Intelligence.
Here is a simple breakdown of the paper "GLoTran" and how it solves this problem:
1. The Problem: The "Zoom Dilemma"
Today's multimodal large language models (MLLMs) are like students who are great at reading books but terrible at reading giant, messy posters.
- If they zoom out: They see the big picture but miss the small text. They might skip a sentence or miss a word entirely (called omission).
- If they zoom in: They get the words right but lose the story. They might translate a word correctly but put it in the wrong place, or invent words that aren't there (called hallucination).
- The Result: The translation is either incomplete or makes no sense.
2. The Solution: The "Sherlock Holmes" Approach (GLoTran)
The researchers created a new system called GLoTran. Instead of forcing the AI to look at the whole image at once, they teach it to use a "Global-Local" strategy.
Think of it like a detective solving a crime scene:
- The Global View (The Wide Shot): The AI first takes a quick, low-resolution look at the entire image. It's like looking at the crime scene from a helicopter. It sees the layout: "Oh, this is a menu, and the prices are on the right." This gives the AI the context.
- The Local View (The Magnifying Glass): Then, the AI cuts the image into small slices (like taking photos of just the "Appetizers" section or just the "Drinks" section). It zooms in tight on these slices to read the tiny, messy handwriting perfectly.
- The Magic Connection: The AI doesn't just look at the slices in isolation. It constantly checks the "Helicopter View" (Global) while reading the "Magnifying Glass View" (Local). This ensures it knows where it is in the document and keeps the story consistent.
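The global-local split above can be sketched in a few lines. This is only an illustration of the idea, not the paper's actual code: the specific sizes (a 448-pixel global view, 336-pixel tiles, 32-pixel overlap) are assumptions chosen for the example.

```python
# Illustrative sketch of a "global-local" split: one downscaled view of the
# whole image for context, plus overlapping full-resolution tiles for reading.
# All sizes here are assumed for the example, not taken from the paper.

def global_local_views(width, height, global_max=448, tile=336, overlap=32):
    # Global view: shrink the whole image so its longest side fits global_max.
    scale = min(1.0, global_max / max(width, height))
    global_view = (round(width * scale), round(height * scale))

    # Local views: tile the full-resolution image with a small overlap,
    # so text sitting on a tile boundary is not cut in half.
    step = tile - overlap
    tiles = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            right = min(left + tile, width)
            bottom = min(top + tile, height)
            tiles.append((left, top, right, bottom))
    return global_view, tiles

gv, tiles = global_local_views(1920, 1080)
```

The overlap is the key design choice: without it, a word straddling two tiles would be unreadable in both.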
3. The "Replay" Mechanism: Keeping the Conversation Flowing
Imagine you are translating a long letter, one paragraph at a time. If you forget what you translated in the first paragraph, the second paragraph might not make sense.
GLoTran uses a "Replay Window." Before it translates the current slice of text, it looks back at the translations of the previous slices. It's like a translator whispering to themselves: "Okay, I just translated the title as 'Summer Sale,' so this next sentence about '50% off' must be part of that sale." This keeps the whole translation smooth and logical.
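The replay idea reduces to a sliding window over previous outputs. The sketch below is a toy illustration of that data flow, with an assumed window size and a stand-in `translate_fn`; it is not the paper's implementation.

```python
# Illustrative sketch of a "replay window": before translating the next
# slice, prepend the last few slice translations as context. Window size
# and the translate_fn interface are assumptions for this example.
from collections import deque

def translate_with_replay(slices, translate_fn, window=3):
    replay = deque(maxlen=window)   # holds the most recent translations
    results = []
    for text in slices:
        context = " ".join(replay)  # "what I just translated"
        results.append(translate_fn(text, context))
        replay.append(results[-1])
    return results

# Toy translate_fn that just tags its inputs, to make the flow visible.
out = translate_with_replay(
    ["a", "b", "c", "d", "e"],
    lambda text, ctx: f"[{ctx}]{text}",
    window=2,
)
```

Because `deque(maxlen=window)` silently drops the oldest entry, the context stays bounded no matter how long the document is.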
4. The New Training Ground (GLoD Dataset)
You can't teach a student to drive without a driving school. Similarly, the researchers realized that existing AI training data wasn't good enough for this specific task. Most data was just simple images with one translation.
So, they built a massive new dataset called GLoD (510,000 examples!).
- They took real-world images (menus, road signs, posters).
- They created "Global-Local" pairs for every single image (the whole picture + the zoomed-in slices).
- They had humans and AI work together to check that the translations were accurate and complete.
- Analogy: It's like giving the AI a library of 510,000 "Before and After" photo albums, where every photo is annotated with exactly how to translate the text at different zoom levels.
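One way to picture a global-local training pair is as a single record with one whole-image entry and one entry per slice. Every field name below is hypothetical, invented for illustration; the dataset's published schema may look quite different.

```python
# Hypothetical shape of one global-local training example.
# All field names and values are illustrative, not GLoD's actual schema.
example = {
    "image": "menu_0001.jpg",                 # the full source image
    "global_view": {
        "description": "A restaurant menu with prices on the right.",
        "translation": "Full-page translation of all visible text.",
    },
    "local_views": [
        {
            "box": (40, 120, 480, 360),       # crop (left, top, right, bottom)
            "source_text": "前菜",
            "translation": "Appetizers",
        },
        # ...one entry per zoomed-in slice of the image
    ],
}
```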
5. The Results: Smarter, Not Just Bigger
Usually, to make an AI model smarter, companies just make it "bigger" (more parameters, more compute). But this paper shows that being smarter about how you look at an image beats simply being bigger.
- Efficiency: GLoTran can translate high-resolution images using much less computer power than other models. It doesn't need to process millions of pixels at once; it just processes the important bits.
- Accuracy: In tests, GLoTran translated text more completely and accurately than even the most famous, expensive AI models (like GPT-4o or Qwen-VL). It made far fewer of the omission and hallucination errors described above: skipped words and invented sentences.
Summary
GLoTran is a new way of teaching AI to read text in images. Instead of staring at a giant, confusing wall of text, it teaches the AI to:
- Step back to understand the scene.
- Zoom in to read the details.
- Remember what it just read to keep the story straight.
It's a bit like giving the AI a pair of binoculars and a magnifying glass, and teaching it how to use both at the same time to get the perfect translation.