← Back to Innovations
Creative & Generative Media Model Multimodal Experimental

Phi-4-Reasoning-Vision-15B

Compact Multimodal Reasoning Model

140,401 USERS
Try Phi-4-Reasoning-Vision-15B on Microsoft Foundry → Try on Microsoft Foundry →
Phi-4-Reasoning-Vision-15B

About Phi-4-Reasoning-Vision-15B

Phi-4-Reasoning-Vision-15B is a compact open-weight multimodal reasoning model that pairs the Phi-4-Reasoning language backbone with a SigLIP-2 vision encoder via a mid-fusion architecture. It supports dynamic-resolution input of up to 3,600 visual tokens for high-fidelity document and GUI analysis and introduces a hybrid reasoning design with THINK mode (chain-of-thought for complex math and scientific tasks) and NOTHINK mode (direct inference for perception tasks). Trained on 240 B200 GPUs over four days using carefully curated mixed datasets, the model reaches 84.8% on AI2D, 83.3% on ChartQA, 75.2% on MathVista-MINI, and 88.2% on ScreenSpot-V2 GUI localization — competitive with models roughly ten times its size.

The model addresses a critical gap in compact multimodal systems by engaging deliberate reasoning only when task complexity warrants it, cutting inference latency without sacrificing accuracy on demanding workloads. The mid-fusion design over pretrained components and its emphasis on data-centric training make it well suited for computer-use agents (GUI grounding), visual math problem solving, and OCR-intensive document workflows. Its strong performance on specialized reasoning benchmarks underscores Microsoft’s focus on practical multimodal intelligence that can run within typical enterprise and developer compute budgets.

Key capabilities

  • Hybrid reasoning (THINK/NOTHINK) within a single 15B model
  • Mid-fusion of Phi-4-Reasoning with the SigLIP-2 vision encoder
  • Up to 3,600 visual tokens for high-resolution perception
  • Open-weight checkpoint competitive with models 10× its size
  • Optimized for vLLM and Transformers inference
Technology Stack
PyTorch Transformers vLLM SigLIP-2
Technology Stack
PyTorch Transformers vLLM SigLIP-2