Imagine you are trying to teach a computer to "read" a 3D medical scan (like a CT scan of a human body) and understand the doctor's written report about it. This is a bit like trying to teach a student to understand a 3D movie by only showing them a single, flat photograph, or by chopping the movie into tiny, disconnected frames.
Here is the story of SigVLP, a new AI method designed to solve this problem, explained through simple analogies.
The Problem: The "Cookie Cutter" Approach
Medical scans are tricky. One patient might have a scan with 50 slices (like a loaf of bread with 50 slices), while another has 200 slices. The thickness of the slices and the spacing between them can vary wildly depending on which hospital or machine took the picture.
The Old Way:
To train AI models, scientists used to force all these different scans into a "cookie cutter." They would chop the scans into fixed-size blocks or stretch/squish them to make them all the same size.
- The Analogy: Imagine trying to fit a long, winding river into a square box. You have to either cut off the ends or stretch the water until it fits. In doing so, you lose the natural flow and shape of the river. Similarly, the old AI methods lost important details about the body's 3D structure because they forced everything into a rigid grid.
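The cookie-cutter preprocessing can be sketched as a toy resampler. The target slice count, function name, and numbers here are illustrative assumptions, not values from the paper:

```python
# Sketch of the old "cookie cutter" preprocessing: force every scan
# to a fixed number of slices by uniform resampling. The target of 64
# and the function name are illustrative, not from the paper.

def resample_indices(num_slices: int, target: int = 64) -> list[int]:
    """Pick `target` evenly spaced slice indices from the scan."""
    return [round(i * (num_slices - 1) / (target - 1)) for i in range(target)]

# A 200-slice scan keeps fewer than a third of its slices:
print(len(set(resample_indices(200))))  # 64 distinct slices survive

# A 50-slice scan gets slices duplicated to pad it out:
print(len(set(resample_indices(50))))   # 50 distinct, 14 repeated
```

Either way, the scan's natural resolution is destroyed before the model ever sees it: long scans lose detail, short scans gain redundant copies.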
The Solution: The "Scrolling Video" Approach
The authors of the SigVLP paper decided to stop forcing the scans into a box. Instead, they treated the 3D scan like a video.
1. The "Chunk" Strategy
Instead of looking at the whole body at once, the AI looks at the scan in "chunks" (like taking a bite of a sandwich rather than eating the whole thing at once).
- The Analogy: Imagine reading a long novel. Instead of trying to memorize the whole book in one go, you read it page by page. SigVLP reads the CT scan "page by page" (slice by slice) but keeps the context of the story flowing.
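The page-by-page idea can be sketched in a few lines. The chunk size and function name are illustrative assumptions, not the paper's actual settings:

```python
# Hypothetical sketch: split a variable-length CT scan into fixed-size
# slice chunks instead of resizing the whole volume. The chunk size of
# 16 is an invented example value.

def chunk_slices(num_slices: int, chunk_size: int = 16) -> list[range]:
    """Return ranges of slice indices, one per chunk.

    The last chunk may be shorter; nothing is stretched or cropped.
    """
    return [range(start, min(start + chunk_size, num_slices))
            for start in range(0, num_slices, chunk_size)]

# A 50-slice scan and a 200-slice scan are handled the same way:
print(len(chunk_slices(50)))   # 4 chunks (16 + 16 + 16 + 2 slices)
print(len(chunk_slices(200)))  # 13 chunks
```

Every slice is kept at its native resolution; only the number of chunks changes with the length of the scan.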
2. The "Rotary Position" Compass
Old AI models used "absolute position" tags, like saying "This is slice #100." If the model was trained on scans of one length, a longer scan contained position numbers it had never seen, and it got confused.
- The Analogy: Think of a GPS. An old system says, "You are at Mile Marker 100." If you move to a different road, that number is useless. SigVLP uses a Rotary Position Embedding (RoPE), which is like a compass. It doesn't care about the specific number; it cares about the direction and distance relative to the previous slice. This allows the AI to handle a scan with 30 slices or 300 slices without getting lost. It understands that "Slice B is right next to Slice A," regardless of how long the whole book is.
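Here is a minimal 2-D sketch of why rotary embeddings only care about relative distance. Real models rotate many feature pairs at different frequencies; the angle step `theta` and the toy vectors below are made-up illustrations:

```python
import math

def rotate(vec, pos, theta=0.1):
    """Rotate a 2-D feature by an angle proportional to slice position."""
    angle = pos * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 0.0), (0.0, 1.0)

# The similarity between slice 5 and slice 7 (2 slices apart) ...
score_near_start = dot(rotate(q, 5), rotate(k, 7))
# ... equals the similarity between slice 105 and slice 107:
score_deep_in = dot(rotate(q, 105), rotate(k, 107))
assert math.isclose(score_near_start, score_deep_in)
```

Because rotations preserve the dot product up to the *difference* in angles, only the gap between two slice positions matters, so the same model works on scans of any length.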
3. The "Organ-Specific" Translator
Medical reports are long and messy. A report might say, "The heart looks good, but the liver has a spot."
- The Old Way: The AI would try to match the entire scan to the entire report. It's like trying to match a whole city map to a whole travel diary. It's too vague.
- The SigVLP Way: The AI uses a smart assistant (a large language model) to break the report down. It says, "Okay, for this specific chunk of the scan showing the liver, let's only look at the part of the report that talks about the liver."
- The Analogy: Instead of matching a whole library to a whole encyclopedia, SigVLP matches a single book chapter to the specific paragraph in the encyclopedia that talks about that chapter. This creates a much tighter, more accurate connection between the image and the text.
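A toy sketch of the pairing step, assuming each chunk's organs are already known. The report sentences and organ tags below are invented for illustration; in SigVLP, a large language model performs the report splitting:

```python
# Hypothetical sketch of organ-specific pairing: match each scan chunk
# only to the report sentences about the organs it contains. All data
# here is invented example content.

report = {
    "heart": "The heart looks good.",
    "liver": "The liver has a spot.",
}

# Which organs each chunk of the scan covers (assumed known here).
chunk_organs = [["heart"], ["heart", "liver"], ["liver"]]

pairs = [(i, report[organ])
         for i, organs in enumerate(chunk_organs)
         for organ in organs if organ in report]

for chunk_id, sentence in pairs:
    print(chunk_id, "->", sentence)
```

Each (chunk, sentence) pair is a much more specific training signal than one (whole scan, whole report) pair.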
Why This Matters (The Results)
By using this flexible, chunk-based approach, SigVLP learned to understand the 3D body much better than previous models.
- Better Precision: When asked to find a small tumor or a specific organ (like the stomach or aorta), SigVLP was much more accurate. It didn't just guess "it's somewhere in the middle"; it knew exactly where the boundaries were.
- Better Memory: It learned to connect the visual image with the medical text so well that it could find the right scan just by reading a description, even if it had never seen that specific scan before.
- Efficiency: It didn't need to waste computing power stretching and squishing images. It just read them naturally, like a human radiologist does.
The Bottom Line
SigVLP is like upgrading from a rigid, cookie-cutter robot to a flexible, intelligent reader. It respects the natural shape and size of medical scans, breaks them down into manageable pieces, and matches them with the right parts of the doctor's notes. This helps computers "see" the human body more clearly, which could eventually lead to faster and more accurate diagnoses for patients.