The Big Picture: The "Frankenstein" vs. The "Native Speaker"
Imagine you want a robot that can see a picture and talk about it.
The Old Way (Modular VLMs):
Think of this like building a robot by gluing two separate experts together. You have Expert A (a Vision Specialist who knows how to see) and Expert B (a Language Specialist who knows how to speak). To make them work together, you have to build a complicated "translator" desk between them.
- The Problem: Expert A speaks "Pixel," and Expert B speaks "Word." The translator often loses meaning, gets confused, or slows things down. It's like two people trying to hold a conversation through a wall.
The New Way (NEO - Native VLM):
The authors of this paper say, "Why glue them together? Let's build a single person who is born knowing both languages."
NEO is a "Native" model. It doesn't have a separate vision part and a language part. It is one single brain that learns to see and speak simultaneously from the very first day of training. It's like raising a child who learns to see a red apple and say "red apple" at the exact same moment, rather than teaching them to see first, then teaching them to speak later.
The Three Secret Ingredients (The "Primitives")
To build this single brain, the researchers created three special tools (called Primitives) that act like the brain's natural wiring:
1. The "Universal Translator" (Flexible Position Encoding)
- The Analogy: Imagine describing a map by fixed grid cells: "The tree is at row 1, column 1." If the map grows and a new row is added, every label shifts and the description breaks. Older models pin image positions to one fixed grid in the same brittle way.
- NEO's Solution: NEO uses a special coordinate system (called Native-RoPE) that understands space naturally. It knows that "left," "right," "up," and "down" exist regardless of how big the image is. It treats the image like a living landscape, not a rigid grid. This allows it to handle any size photo without getting lost.
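The core idea can be sketched in a few lines. This is not NEO's actual Native-RoPE implementation (the paper has its own exact formulation); it's a simplified, hypothetical illustration of why 2-D coordinates survive a change in image size, whereas flat indices do not: the relative offset between two patches stays the same no matter how big the grid is.

```python
# Simplified sketch of 2-D patch coordinates (illustrative, not NEO's code).
import numpy as np

def grid_positions(height, width):
    """Give each image patch a (row, col) coordinate instead of one flat index."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    return np.stack([rows.ravel(), cols.ravel()], axis=-1)  # shape (H*W, 2)

small = grid_positions(4, 4)
large = grid_positions(8, 8)

# "One step down" is always the offset (1, 0), regardless of grid size --
# but the *flat* index of that neighbor changes (4 vs 8):
assert tuple(small[4] - small[0]) == (1, 0)  # 4x4 grid: index 4 is one row down
assert tuple(large[8] - large[0]) == (1, 0)  # 8x8 grid: index 8 is one row down
```

A rotary encoding built on these (row, col) pairs depends only on relative offsets, which is what lets the model handle photos of any size without "breaking the map."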
2. The "Two-Way Street" (Multi-Head Native Attention)
- The Analogy: In a standard conversation, you usually listen to what the other person said before you speak (one-way). But when looking at a picture, you need to look at the whole scene at once to understand it.
- NEO's Solution: NEO has a special attention mechanism. When looking at an image, it can look at every patch simultaneously (like a wide-angle lens) to understand the whole picture. When speaking, it looks back at what it just said. It blends these two modes within the same attention layers, so the "eyes" and the "mouth" can exchange information instantly.
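A common way to express this "two-way street" is an attention mask: image tokens may attend to every other image token (bidirectional), while text tokens attend causally to earlier text but freely to all image tokens. The sketch below is an illustrative toy, not NEO's actual code, and the function name is hypothetical.

```python
# Illustrative mixed attention mask (True = "may attend"), not NEO's real code.
import numpy as np

def mixed_mask(num_image, num_text):
    n = num_image + num_text
    mask = np.zeros((n, n), dtype=bool)
    mask[:num_image, :num_image] = True   # image <-> image: fully bidirectional
    mask[num_image:, :num_image] = True   # text -> image: sees the whole picture
    causal = np.tril(np.ones((num_text, num_text), dtype=bool))
    mask[num_image:, num_image:] = causal # text -> text: causal (no peeking ahead)
    return mask

m = mixed_mask(3, 2)
# Row 3 (the first text token) sees all 3 image tokens and itself,
# but not the later text token.
```

In effect, the "wide-angle lens" and the "one-word-at-a-time speaker" live in the same attention operation, just with different visibility rules.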
3. The "Construction Phase" (Pre-Buffer & Post-LLM)
- The Analogy: Imagine building a skyscraper. You don't start by putting the roof on a finished house. You start with a strong foundation.
- NEO's Solution: The training happens in two phases:
- Phase 1 (Pre-Buffer): The model starts as a "sponge," soaking up millions of images and captions to learn what things look like. It's like a student taking notes in a library.
- Phase 2 (Post-LLM): Once the foundation is solid, the model merges into one giant brain. It stops being a "student" and becomes a "teacher," using its language skills to reason about what it saw.
- Why this matters: This prevents the model from forgetting how to speak while it's learning to see. It keeps the "language muscle" strong while building the "vision muscle."
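The two phases above amount to a freeze-then-unfreeze schedule. The toy below is a heavily simplified, hypothetical sketch of that idea (the class and function names are invented for illustration; the paper's actual modules, data, and schedule differ): in phase 1 the pretrained language layers are frozen so new visual learning cannot overwrite them, and in phase 2 everything trains jointly so the two skills merge into one set of weights.

```python
# Hypothetical toy illustrating the freeze-then-unfreeze training recipe.

class Layer:
    """Stand-in for a block of model weights."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

class ToyNativeVLM:
    def __init__(self):
        self.language_layers = [Layer("lang_0"), Layer("lang_1")]
        self.vision_layers = [Layer("vis_0"), Layer("vis_1")]

def phase1_pre_buffer(model):
    """Phase 1: train the vision-facing layers; freeze language layers
    so the model doesn't forget how to speak while learning to see."""
    for layer in model.language_layers:
        layer.trainable = False

def phase2_post_llm(model):
    """Phase 2: unfreeze everything and train jointly, merging
    vision and language into one brain."""
    for layer in model.language_layers + model.vision_layers:
        layer.trainable = True

model = ToyNativeVLM()
phase1_pre_buffer(model)
assert not any(l.trainable for l in model.language_layers)
phase2_post_llm(model)
assert all(l.trainable for l in model.language_layers + model.vision_layers)
```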
The Results: How Good is NEO?
The researchers tested NEO on a wide range of benchmark tasks, from reading text in photos to solving complex math problems with charts.
- The Competition: They compared NEO against the "Frankenstein" models (Modular VLMs) which are currently the industry leaders.
- The Outcome: Even though NEO was trained with less data and fewer computing resources than the giants, it performed nearly as well.
- Analogy: It's like a self-taught musician who, with a simple guitar and a few months of practice, plays a piece just as beautifully as a virtuoso who spent years at a conservatory.
Key Takeaway: NEO proves that you don't need to glue two separate systems together to get great results. A single, unified system that learns vision and language together is not only possible but highly efficient.
Why Should You Care?
- Cheaper & Faster: Because it's one model instead of two glued together, it's easier to run on smaller computers (like your phone or a laptop).
- Better Understanding: Since it learns vision and language together, it understands the relationship between them better. It doesn't just "see" a dog and "say" "dog"; it understands the concept of a dog in a way that feels more human.
- The Future: This paper suggests that the next generation of AI won't be built by stacking different tools on top of each other. Instead, the future is Native AI—systems that are born multimodal, seeing and speaking as one unified intelligence.
In a nutshell: The paper introduces NEO, a new type of AI that learns to see and speak at the same time, proving that a single, unified brain is often smarter and more efficient than two separate brains glued together.