Phi-4-Reasoning-Vision-15B
Phi-4-Reasoning-Vision-15B is a compact open‑weight multimodal reasoning model that balances reasoning power, efficiency, and training data needs. It is a broadly capable model that supports natural interaction across a wide array of vision-language tasks, excelling at math and science reasoning and at understanding user interfaces.
The model delivers performance competitive with much larger, slower models that require ten times or more the compute and tokens, and higher accuracy than similarly fast models, particularly on math and science benchmarks. Built on the Phi-4 foundation (trained on 400 billion tokens) and the Phi-4-Reasoning backbone (16 billion tokens), the multimodal model was trained on 200 billion tokens of carefully curated multimodal data, far less than comparable models that rely on over one trillion tokens for their multimodal training.
The model also has strong capabilities that make it well-suited as a foundation for computer-use agent applications. With high-resolution perception and fine-grained grounding, it can identify and localize interactive elements such as buttons, menus, and text fields across desktop, web, and mobile interfaces. These capabilities, combined with low latency and a compact footprint, make it a compelling base model for training agentic systems that need to interact with graphical user interfaces.
Phi-4-Reasoning-Vision-15B is now available for experimental use on Microsoft Foundry, GitHub, and Hugging Face.