OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis

OmniCT introduces a unified slice-volume Large Vision-Language Model (LVLM) that overcomes the fragmentation of existing approaches. By integrating spatial-consistency and organ-level semantic enhancements, it achieves comprehensive, high-precision CT analysis across both local (slice-level) and global (volume-level) clinical tasks.

Tianwei Lin, Zhongwei Qiu, Wenqiao Zhang, Jiang Liu, Yihan Xie, Mingjian Gao, Zhenxuan Fan, Zhaocheng Li, Sijing Li, Zhongle Xie, Peng LU, Yueting Zhuang, Ling Zhang, Beng Chin Ooi, Yingda Xia

Published 2026-03-03

Imagine you are trying to understand a complex story, but you have two different ways of looking at the book:

  1. The "Single Page" View: You look at one page at a time. You can read the words clearly and see the small details (like a typo or a specific drawing), but you can't see how the story flows from one page to the next.
  2. The "Whole Book" View: You hold the entire book in your hands. You can see the big picture, how Chapter 1 connects to Chapter 10, and the overall shape of the story. But if you try to read a single word on a specific page, it's too blurry and small to make out.

The Problem:
In the world of medical AI, specifically for CT scans (which are like 3D X-rays of the human body), doctors need both views.

  • They need to see a tiny, sub-centimeter nodule on a single slice (the "Single Page" view).
  • They also need to see how a tumor spreads through an organ or how one organ pushes against another (the "Whole Book" view).

Currently, AI models are stuck in one camp or the other. Some are great at reading single slices but lose track of the 3D structure; others handle 3D shapes well but miss the tiny, critical details. Either way, they are too unreliable for real-world clinical use.

The Solution: OmniCT
The paper introduces OmniCT, a new AI model that acts like a super-reader who can flip through pages and hold the whole book at the same time. It unifies these two perspectives into one powerful brain.

Here is how it works, using simple analogies:

1. The "Spatial Consistency" Trick (SCE)

  • The Analogy: Imagine you are trying to understand a 3D object, like a loaf of bread, but you only have a camera that takes flat photos.
  • How OmniCT does it: Instead of taking just one photo, it takes three adjacent slices of bread and stacks them together to make a tiny "mini-loaf." It then teaches the AI that these three slices belong together and have a specific order (top, middle, bottom).
  • The Magic: It also adds "GPS coordinates" to every pixel. Just like a map has North, South, East, and West, OmniCT gives every part of the image a 3D address (Up/Down, Left/Right, Front/Back). This helps the AI understand the shape of the body, not just the flat picture.
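The slice-stacking and "GPS coordinates" ideas above can be sketched in a few lines. This is a minimal illustrative sketch in NumPy, not the paper's actual implementation: the function names, the triplet grouping, and the choice of normalized [0, 1] coordinates are all assumptions made for clarity.

```python
import numpy as np

def stack_slice_triplets(volume):
    """Group a CT volume of shape (D, H, W) into overlapping triplets of
    adjacent slices, so each sample carries local 3D context (the
    'mini-loaf'). Illustrative sketch, not OmniCT's real API."""
    D, H, W = volume.shape
    triplets = np.stack([volume[i - 1:i + 2] for i in range(1, D - 1)])
    return triplets  # shape (D - 2, 3, H, W)

def add_3d_coordinates(triplet, z_index, depth):
    """Append normalized (z, y, x) coordinate channels -- the '3D
    address' of every voxel -- to a (3, H, W) slice triplet."""
    _, H, W = triplet.shape
    z = np.full((1, H, W), z_index / max(depth - 1, 1))          # front/back
    y = np.broadcast_to(np.linspace(0, 1, H)[:, None], (1, H, W))  # up/down
    x = np.broadcast_to(np.linspace(0, 1, W)[None, :], (1, H, W))  # left/right
    return np.concatenate([triplet, z, y, x])  # shape (6, H, W)
```

With the coordinate channels attached, two visually identical patches from different parts of the body get different inputs, which is exactly what lets the model reason about shape rather than flat appearance.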

2. The "Organ-Level" Focus (OSE)

  • The Analogy: Imagine a detective looking at a crime scene. If they look at the entire room equally, they might miss a tiny clue on the floor. But if they know exactly where the "important stuff" is (like the safe or the weapon), they can zoom in on those spots.
  • How OmniCT does it: The AI is taught to identify specific organs (like the liver or heart) first. It then creates a "highlight reel" of just those organs.
  • The Magic: It uses a smart compression technique. If an organ is huge (like the liver), it summarizes it efficiently. If an organ is tiny (like the pancreas), it "magnifies" the details so the AI doesn't miss anything. This ensures the AI pays attention to the right places without getting overwhelmed by too much data.
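The "summarize the big organs, magnify the small ones" behavior falls out naturally if every organ gets the same fixed token budget, regardless of its size. The sketch below illustrates that idea with a simple bounding-box crop and nearest-neighbour resampling; it is a stand-in for illustration only, and the function name and resampling scheme are assumptions, not OmniCT's actual mechanism.

```python
import numpy as np

def organ_token_pool(feature_map, mask, grid=8):
    """Crop an organ's bounding box from a (H, W) feature map and
    resample it to a fixed grid x grid patch of tokens. A large organ
    (big crop) gets compressed; a small organ (tiny crop) gets
    upsampled, i.e. 'magnified'. Illustrative sketch only."""
    ys, xs = np.nonzero(mask)  # pixels belonging to this organ
    crop = feature_map[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = crop.shape
    # Nearest-neighbour resample to the fixed token budget.
    ri = (np.arange(grid) * h / grid).astype(int)
    ci = (np.arange(grid) * w / grid).astype(int)
    return crop[np.ix_(ri, ci)]  # (grid, grid) for every organ
```

Because the output is always `grid × grid`, the downstream language model sees the liver and the pancreas at the same token cost, so no single organ can drown out the others.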

3. The "MedEval-CT" (The New Exam)

  • The Analogy: Before, if you wanted to test a student's math skills, you might give them a mix of algebra and geometry questions. For medical AI, though, the existing tests were messy: some covered only 2D slices, others only 3D volumes, and results were not comparable across models.
  • The Innovation: The authors built MedEval-CT, the world's largest and fairest "final exam" for medical CT AI.
    • It contains 1.7 million questions (like a massive question bank).
    • It tests the AI on everything: from simple "What is this?" questions to complex "What should the doctor do next?" reasoning.
    • It covers 13 different organs and various types of medical tasks.
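For concreteness, one entry in a question bank like this might look like the hypothetical record below. The field names and structure are invented for illustration and do not reflect MedEval-CT's released format.

```python
from dataclasses import dataclass

@dataclass
class MedEvalCTItem:
    """Hypothetical schema for one benchmark question; the real
    MedEval-CT format may differ."""
    scan_id: str        # which CT study the question is about
    organ: str          # one of the 13 covered organs, e.g. "liver"
    task: str           # e.g. "recognition" or "clinical reasoning"
    question: str
    choices: list[str]  # answer options for multiple-choice items
    answer: str
```

A record like this makes it easy to slice results by organ or by task type, which is what lets a benchmark report where a model is strong and where it fails.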

Why This Matters

Think of previous AI models as specialized tools: a hammer is great for nails but bad for screws. A screwdriver is great for screws but bad for nails.

OmniCT is the "Swiss Army Knife" of medical imaging.

  • It is smarter than current models at spotting tiny tumors (micro-level).
  • It is better at understanding how organs relate to each other (macro-level).
  • It is more reliable because it was tested on a massive, fair dataset that mimics real hospital scenarios.

The Bottom Line:
OmniCT bridges the gap between looking at a single photo and understanding the whole 3D body. By doing this, it moves medical AI one giant step closer to actually helping doctors diagnose diseases accurately and safely, rather than just being a cool tech demo. It's not just an upgrade; it's a new way of thinking about how computers see the human body.