Phi-4-reasoning-vision-15B Technical Report

Jyoti Aneja, Michael Harrison, Neel Joshi, Tyler LaBonte, John Langford, Eduardo Salinas

Published 2026-03-05

` and starts "thinking out loud," breaking the problem down step-by-step before answering.

The Magic: It learns when to switch automatically. It's like a chef who chops vegetables quickly (direct action) but stops to carefully measure ingredients for a complex sauce (reasoning).

5. What Can It Actually Do?

Because of these design choices, this small model is surprisingly good at:

Math & Science: It can look at a diagram of a spring-mass system or a handwritten math equation and solve it correctly.
Computer Control: It can look at a screenshot of a Windows desktop or a website and figure out which button to click to get a job done.
Everyday Tasks: It can read a receipt, explain a chart, or describe what's happening in a photo.

6. Why Does This Matter?

This paper pushes the "Pareto Frontier." In simple terms, picture a graph of accuracy against compute cost: the model lands in a spot where no comparable model is both cheaper to run and more accurate.

For Users: At 15B parameters, it is small enough to target modest hardware such as a local laptop or an edge device, rather than a massive server farm.
For Developers: It shows that you don't need to build bigger models to get better results; you just need better data and smarter architecture.

The Bottom Line

Phi-4-reasoning-vision-15B is proof that you don't need to be the biggest to be the best. By being picky about its data, giving itself "high-definition eyes," and learning when to think hard versus when to act fast, this small model punches way above its weight class. It's a step toward making smart AI accessible, fast, and practical for everyone.

` tokens.

Non-Reasoning Mode: For perception tasks (captioning, OCR, simple VQA), the model uses a <nothink> token to trigger direct responses, avoiding unnecessary latency.
Training Mix: The dataset contains ~20% reasoning traces and ~80% direct-response data. The model learns to implicitly switch modes based on the task, though users can explicitly force a mode via prompting.
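The 20/80 split described above can be illustrated with a small sampling sketch. This is only a toy under stated assumptions: the function name `build_training_mix`, the two example pools, and sampling with replacement are all hypothetical, and the report's actual data pipeline is not described here.

```python
import random


def build_training_mix(reasoning_pool, direct_pool, total,
                       reasoning_fraction=0.2, seed=0):
    """Sample `total` examples, with `reasoning_fraction` of them reasoning traces.

    Hypothetical helper for illustration; not the report's pipeline.
    """
    rng = random.Random(seed)
    n_reasoning = round(total * reasoning_fraction)
    n_direct = total - n_reasoning
    # Sample with replacement so that small pools still work (an assumption).
    mix = [("reasoning", ex) for ex in rng.choices(reasoning_pool, k=n_reasoning)]
    mix += [("direct", ex) for ex in rng.choices(direct_pool, k=n_direct)]
    # Shuffle so the two response styles are interleaved during training.
    rng.shuffle(mix)
    return mix


# Placeholder examples; real data would be image/text pairs.
reasoning_pool = ["math trace A", "science trace B"]
direct_pool = ["caption C", "OCR D", "simple VQA E"]

mix = build_training_mix(reasoning_pool, direct_pool, total=1000)
n_reasoning = sum(1 for mode, _ in mix if mode == "reasoning")
print(n_reasoning, len(mix) - n_reasoning)  # 200 800
```

Because both response styles appear in one shuffled stream, a model trained this way sees the task type as the only cue for which style to produce, which is consistent with the implicit switching described above.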

3. Key Contributions

Pareto Frontier Optimization: Phi-4-reasoning-vision-15B achieves accuracy competitive with models that require 10x more compute and tokens, while outperforming similarly sized fast models in math and science reasoning.
Data-Centric Insights: Demonstrates that systematic filtering, error correction, and synthetic augmentation are more effective levers for performance than simply scaling model size or token count.
Dynamic Resolution Superiority: Shows that high-resolution, dynamic tokenization is a prerequisite for high-quality multimodal reasoning and grounding, particularly for UI interaction.
Unified Reasoning/Perception: Successfully trains a single model to handle both fast direct perception and slow, structured reasoning, optimizing inference costs for different task types.
Open-Weight Release: Releases the 15B model to the community with weights, code, and benchmark logs.

4. Results

The model was evaluated on a suite of benchmarks including MathVista, MMMU, ScreenSpot, ChartQA, and OCRBench.

Performance vs. Compute: In Figure 2 (Accuracy vs. Time/Compute), the model sits on the optimal Pareto frontier, offering higher accuracy than similarly fast models and comparable accuracy to much slower, larger models.
Benchmark Highlights:
- Math/Science: Achieves strong results on MathVista (75.2%) and MathVerse, outperforming many non-thinking open-weight models.
- Computer Use (CUA): Excels at ScreenSpot-v2 (88.2%), demonstrating superior ability to locate and interact with UI elements compared to other 15B-scale models.
- General VLM: Competitive on MMMU and AI2D.
Mode Switching: The default mixed-reasoning behavior generally outperforms forcing the model into a specific "thinking" or "non-thinking" mode, except in specific edge cases (e.g., forcing "thinking" on MathVerse).
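To make the Pareto-frontier claim under "Performance vs. Compute" concrete, here is a minimal sketch of how such a frontier can be computed from (cost, accuracy) pairs. The `ModelPoint` type, the `pareto_frontier` function, and every number in `candidates` are made up for illustration; none of them are results from the report.

```python
from typing import NamedTuple


class ModelPoint(NamedTuple):
    name: str
    cost: float      # e.g. seconds per query or tokens generated (lower is better)
    accuracy: float  # benchmark accuracy in percent (higher is better)


def pareto_frontier(points: list[ModelPoint]) -> list[ModelPoint]:
    """Return the points that no other point beats on both cost and accuracy."""
    # Sort by ascending cost; on equal cost, put the more accurate point first.
    ordered = sorted(points, key=lambda p: (p.cost, -p.accuracy))
    frontier = []
    best_accuracy = float("-inf")
    for p in ordered:
        # A point is kept only if it is more accurate than every cheaper point.
        if p.accuracy > best_accuracy:
            frontier.append(p)
            best_accuracy = p.accuracy
    return frontier


# Hypothetical numbers for illustration only; NOT taken from the report.
candidates = [
    ModelPoint("fast-small", cost=1.0, accuracy=60.0),
    ModelPoint("balanced", cost=2.0, accuracy=74.0),
    ModelPoint("slow-weak", cost=5.0, accuracy=70.0),
    ModelPoint("slow-large", cost=20.0, accuracy=78.0),
]

frontier_names = [p.name for p in pareto_frontier(candidates)]
print(frontier_names)  # ['fast-small', 'balanced', 'slow-large']
```

In this toy data, "slow-weak" drops out because "balanced" is both cheaper and more accurate. A model "sits on the frontier" when no competitor dominates it in that way.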

5. Significance

Democratization of Advanced Reasoning: By achieving high-level reasoning capabilities with a 15B parameter count and low token usage, this model makes advanced multimodal AI accessible on modest hardware (e.g., local laptops, edge devices).
Blueprint for Efficient VLMs: The report provides a concrete roadmap for the research community, emphasizing that data quality and architecture choices (like dynamic resolution) are more critical than brute-force scaling.
Agentic AI Foundation: The model's specific strength in GUI grounding and high-resolution perception makes it a foundational building block for Computer-Using Agents (CUA) capable of navigating desktop and mobile interfaces autonomously.
Safety Alignment: The model incorporates Responsible AI (RAI) safety training tailored specifically to multimodal inputs, addressing risks related to visual content interpretation and harmful requests.

In summary, Phi-4-reasoning-vision-15B represents a shift from "bigger is better" to "smarter data and architecture," delivering a highly efficient, open-weight model that excels in scientific reasoning and interactive agent tasks.
