Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Intern-S1-Pro is a one-trillion-parameter open-source scientific multimodal foundation model. Built on efficient RL training infrastructure, it excels at both general reasoning and more than 100 specialized scientific tasks, outperforming proprietary models in domain-specific depth while maintaining top-tier general capabilities.

Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, Bowen Zhou, Zhanping Zhong, Zhijie Zhong, Haiteng Zhao, Penghao Zhao, Xia
Published 2026-03-27

Imagine you are trying to build the ultimate "Super-Scientist" robot.

In the past, we had two types of robots:

  1. The Generalist: A robot that knows a little bit about everything (history, math, cooking, movies) but isn't an expert in anything.
  2. The Specialist: A robot that knows everything about one thing (like chemistry) but can't talk about movies or do math.

The paper introduces Intern-S1-Pro, the world's first trillion-parameter "Super-Scientist." Think of it as a robot with a brain so massive (one trillion parameters!) that it doesn't have to choose between being a generalist and a specialist. It is both at the same time.

Here is a simple breakdown of how they built it and why it's special:

1. The Brain Expansion: "The Library of Experts"

Imagine a library. The old version (Intern-S1) had a few very smart librarians. The new version (Intern-S1-Pro) expanded the library to have thousands of specialized experts (chemists, biologists, geologists) all working together.

  • The Problem: If you just throw 1,000 experts into a room, they might argue, or some might do all the work while others sit idle. That imbalance creates a bottleneck (like a traffic jam) that can destabilize and slow down training.
  • The Solution (Group Routing): The team created a "traffic cop" system. They grouped the experts into teams. When a question comes in, the traffic cop sends it to the best team, and within that team, to the best expert. This keeps the workload balanced so training stays stable and fast, even with a brain this big.
  • The "Router" Upgrade: To make sure the traffic cop learns quickly, they used a trick called a "Straight-Through Estimator": during learning, the cop's hard yes/no routing decisions are treated as if they were smooth, so feedback can flow through every choice it weighed, not just the path it actually took. That speeds up the learning process.
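The "best team, then best expert" idea above can be sketched in a few lines. This is a toy illustration of grouped routing in general, not the paper's actual router — the function name, shapes, and the score-each-group-by-its-best-expert rule are all assumptions for the sake of the example:

```python
import numpy as np

def group_route(logits: np.ndarray, num_groups: int) -> np.ndarray:
    """Grouped top-1 routing: logits has shape (tokens, experts).

    Experts are partitioned into equal-size groups. Each token first picks
    the winning group, then the best expert inside that group. Returns the
    chosen expert index per token.
    """
    tokens, experts = logits.shape
    per_group = experts // num_groups
    grouped = logits.reshape(tokens, num_groups, per_group)
    # Score each group by its strongest expert, then pick the winning group.
    group_scores = grouped.max(axis=2)            # (tokens, num_groups)
    best_group = group_scores.argmax(axis=1)      # (tokens,)
    # Within the winning group, pick the best expert.
    within = grouped[np.arange(tokens), best_group].argmax(axis=1)
    # In real training, a Straight-Through Estimator would let gradients
    # flow through these hard argmax decisions as if they were smooth.
    return best_group * per_group + within

# Toy example: 4 tokens routed among 8 experts arranged in 2 groups of 4.
rng = np.random.default_rng(0)
print(group_route(rng.normal(size=(4, 8)), num_groups=2))
```

Because every token is confined to one group, no single expert can be swamped by traffic from the whole model, which is the load-balancing intuition the "traffic cop" analogy describes.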

2. Learning to "See" Science

Science isn't just text; it's pictures, graphs, and charts.

  • The Problem: Most AI models look at a scientific graph and see a blurry mess of lines. They can't read the tiny labels or understand the complex data.
  • The Solution: The team built a special "Caption Factory." Instead of just letting the AI guess what a picture is, they used a pipeline to turn scientific papers into high-quality descriptions.
    • Analogy: Imagine a human translator who doesn't just translate words, but explains why a graph looks the way it does. They fed the AI millions of these "super-explanations" so it learned to read scientific charts like a PhD student.

3. Listening to Time (The Time-Series Module)

Science often involves data that changes over time, like a heartbeat, weather patterns, or stock markets.

  • The Problem: Standard AI treats time like a string of beads (discrete steps). But real life is a flowing river.
  • The Solution: They added a "Time-Series Encoder."
    • Analogy: Instead of looking at a movie one frame at a time, this module watches the whole flow of the river. It can handle data that is super short (a few seconds) or super long (years of data) without getting confused. It's like having a time machine that understands the rhythm of the universe.
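One common way encoders handle signals of wildly different lengths is to slice the raw signal into fixed-size patches and embed each patch, so a few seconds of heartbeat and years of weather data both become a sequence of tokens. The sketch below shows that generic patching idea; it is an assumption-laden stand-in, not Intern-S1-Pro's actual time-series module:

```python
import numpy as np

def patchify(signal: np.ndarray, patch_len: int, stride: int) -> np.ndarray:
    """Slice a 1-D signal of shape (T,) into (num_patches, patch_len) windows."""
    num = 1 + max(0, len(signal) - patch_len) // stride
    return np.stack([signal[i * stride : i * stride + patch_len] for i in range(num)])

def embed(patches: np.ndarray, proj: np.ndarray) -> np.ndarray:
    # Project each patch to d_model dims: a token sequence the model can read.
    return patches @ proj

# Toy "heartbeat": 1000 samples of a sine wave, cut into 16-sample patches.
heartbeat = np.sin(np.linspace(0, 20 * np.pi, 1000))
proj = np.random.default_rng(0).normal(size=(16, 32))
tokens = embed(patchify(heartbeat, patch_len=16, stride=16), proj)
print(tokens.shape)  # (62, 32): 62 patch tokens, 32 dims each
```

The same code works whether the signal has a thousand samples or a billion; only the number of patch tokens changes, which is how a single module can cover both "a few seconds" and "years of data."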

4. The "Agent" Capability: Doing the Work

This isn't just a robot that answers questions; it's a robot that does things.

  • The Upgrade: Intern-S1-Pro has "Agent" skills.
    • Analogy: If you ask a normal AI, "How do I synthesize this chemical?" it gives you a recipe. If you ask Intern-S1-Pro, it can plan the experiment, search for the right tools, run the simulation, and check the results. It's like hiring a personal assistant who can actually go into the lab and do the work, not just write a memo about it.
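The plan → act → observe → repeat pattern described above is the standard "agent loop." Here is a minimal sketch of that generic pattern with a scripted planner and a stand-in lookup tool — every name here is hypothetical, not part of Intern-S1-Pro's real tool stack:

```python
def run_agent(goal, tools, planner, max_steps=5):
    """Generic agent loop: the planner picks an action, the tool runs it,
    and the observation is fed back until the planner says 'finish'."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        action, arg = planner(history)        # model decides the next step
        if action == "finish":
            return arg
        observation = tools[action](arg)      # actually execute the tool
        history.append((action, observation))
    return None

# Stand-in tool: a tiny chemical-formula lookup table.
tools = {"lookup": lambda q: {"aspirin": "C9H8O4"}.get(q, "unknown")}

def planner(history):
    # Scripted for illustration; a real agent would let the model decide.
    last_action, last_value = history[-1]
    if last_action == "goal":
        return "lookup", "aspirin"
    return "finish", f"Formula: {last_value}"

print(run_agent("What is aspirin's formula?", tools, planner))  # Formula: C9H8O4
```

The key design point is the feedback loop: the model does not just emit a recipe once, it sees each tool's result and can adjust its next step, which is what separates "doing the work" from "writing a memo about it."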

5. The Results: Why It Matters

The team tested this new robot against the smartest closed-source models (like the secret models from big tech companies) and other open-source models.

  • The Verdict: Intern-S1-Pro beat them all in science tasks.
    • In chemistry, biology, and materials science, it scored higher than the "black box" models that cost millions to run.
    • It also kept its general smarts (math, coding, logic), proving you don't have to sacrifice being "smart in general" to be "smart in science."

The Big Takeaway

The paper proves a counter-intuitive idea: You don't need a separate robot for every science.

If you build one giant, well-organized brain (a "Specializable Generalist") and feed it the right data, it can become the world's best scientist and the world's best general assistant. It's a massive leap forward for "AI for Science," meaning this tool could help researchers discover new medicines, design better batteries, and understand our planet faster than ever before.
