PUBLIC PREVIEW
Microsoft Phi-4
Explore the capabilities of Phi-4, the latest model in Microsoft’s Phi family of advanced AI technologies.
Phi-4-multimodal and Phi-4-mini, the newest models in Microsoft’s Phi family of small language models (SLMs), are now available. These models are designed to empower developers with advanced AI capabilities. Phi-4-multimodal, with its ability to process speech, vision, and text simultaneously, opens new possibilities for creating innovative and context-aware applications. Phi-4-mini, on the other hand, excels at text-based tasks, providing high accuracy and scalability in a compact form.
Phi-4-multimodal marks a new milestone in Microsoft’s AI development as our first multimodal language model. At the core of innovation lies continuous improvement, and that starts with listening to our customers. In direct response to customer feedback, we’ve developed Phi-4-multimodal, a 5.6B-parameter model that seamlessly integrates speech, vision, and text processing into a single, unified architecture. By leveraging advanced cross-modal learning techniques, this model enables more natural and context-aware interactions, allowing devices to understand and reason across multiple input modalities simultaneously. Whether interpreting spoken language, analyzing images, or processing textual information, it delivers highly efficient, low-latency inference, all while optimizing for on-device execution and reduced computational overhead.

Natively built for multimodal experiences
Phi-4-multimodal is a single model with mixture-of-LoRAs that includes speech, vision, and language, all processed simultaneously within the same representation space. The result is a single, unified model capable of handling text, audio, and visual inputs seamlessly, with no need for complex pipelines or separate models for different modalities.

Phi-4-multimodal is built on a new architecture that enhances efficiency and scalability. It incorporates a larger vocabulary for improved processing, supports multilingual capabilities, and integrates language reasoning with multimodal inputs. All of this is achieved within a powerful, compact, highly efficient model that is well suited for deployment on devices and edge computing platforms.

This breakthrough model represents a major leap forward in AI technology, offering unprecedented performance in a small package. Whether you’re looking for advanced AI capabilities on mobile devices or edge systems, Phi-4-multimodal provides a high-capability option that’s both efficient and versatile. With its impressive range of capabilities and flexibility, Phi-4-multimodal opens exciting new possibilities for app developers, businesses, and industries looking to harness the power of AI in innovative ways. The future of multimodal AI is here, and it’s ready to transform your applications.
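To make the single-model design concrete, here is a minimal sketch of a vision-plus-text query through Hugging Face transformers. The checkpoint name microsoft/Phi-4-multimodal-instruct, the <|image_1|> placeholder convention, and the example image URL are assumptions to verify against the model card; exact prompt tokens and generation settings may differ.

```python
# A minimal sketch of a multimodal query, assuming the Hugging Face
# checkpoint "microsoft/Phi-4-multimodal-instruct" and its processor
# conventions; prompt tokens and settings may differ from the model card.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto", torch_dtype="auto"
)

# The image placeholder token follows the Phi family convention (assumption).
prompt = "<|user|><|image_1|>Describe the chart in one sentence.<|end|><|assistant|>"
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

# The processor fuses the text prompt and the image into one input batch,
# reflecting the single shared representation space described above.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the echoed prompt.
response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)
```

Because one model consumes every modality, the same loading and generation code serves text-only, image, and audio prompts; only the placeholder tokens and processor inputs change.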
Phi-4-multimodal is capable of processing visual and audio inputs together. The following table shows the model quality when the input query for vision content is synthetic speech, on chart/table understanding and document reasoning tasks. Compared to other state-of-the-art omni models that accept both audio and visual signals as input, Phi-4-multimodal achieves much stronger performance on multiple benchmarks.

Phi-4-mini is a 3.8B-parameter, dense, decoder-only transformer featuring grouped-query attention, a 200,000-token vocabulary, and shared input-output embeddings, designed for speed and efficiency. Despite its compact size, it outperforms larger models on text-based tasks, including reasoning, math, coding, instruction following, and function calling. Supporting sequences of up to 128,000 tokens, it delivers high accuracy and scalability, making it a powerful solution for advanced AI applications.

Function calling, instruction following, long context, and reasoning are powerful capabilities that enable small language models like Phi-4-mini to access external knowledge and functionality despite their limited capacity. Through a standardized protocol, function calling allows the model to integrate seamlessly with structured programming interfaces. When a user makes a request, Phi-4-mini can reason through the query, identify and call relevant functions with appropriate parameters, receive the function outputs, and incorporate those results into its response, as the sketch below illustrates. This creates an extensible agentic system in which the model’s capabilities can be enhanced by connecting it to external tools, application programming interfaces (APIs), and data sources through well-defined function interfaces.
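The sketch below illustrates that request-reason-call-respond loop with the Hugging Face transformers API. The microsoft/Phi-4-mini-instruct checkpoint name, the get_stock_price tool, and the assumption that the model emits its call as parseable JSON are illustrative conventions to verify against the model card, not confirmed behavior.

```python
# A hedged sketch of the function-calling loop described above, assuming
# the "microsoft/Phi-4-mini-instruct" checkpoint and a chat template that
# accepts a `tools` list; the exact tool-call output format is model-specific.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

# A tool exposed to the model through a JSON schema (hypothetical example).
tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Return the latest price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

def get_stock_price(ticker: str) -> str:
    return json.dumps({"ticker": ticker, "price": 421.17})  # stubbed data source

messages = [{"role": "user", "content": "What is MSFT trading at right now?"}]
input_ids = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
reply = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)

# If the model emitted a JSON tool call, execute it and feed the result back
# so a second generation pass can phrase the final answer.
try:
    call = json.loads(reply)
    result = get_stock_price(**call["arguments"])
    messages += [
        {"role": "assistant", "content": reply},
        {"role": "tool", "content": result},
    ]
except (json.JSONDecodeError, KeyError, TypeError):
    print(reply)  # the model answered directly without calling a tool
```

In practice, each chat template defines its own tool-call delimiters, so production code would parse whatever markers the template emits rather than bare JSON.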
These models are designed to handle complex tasks efficiently, making them ideal for edge scenarios and compute-constrained environments. Given the new capabilities Phi-4-multimodal and Phi-4-mini bring, the uses of Phi are only expanding. Phi models are being embedded into AI ecosystems and used to explore various use cases across industries.

Embedded directly into your smart device: Integrating Phi-4-multimodal directly into a smartphone could enable the device to process and understand voice commands, recognize images, and interpret text seamlessly. Users could benefit from advanced features like real-time language translation, enhanced photo and video analysis, and intelligent personal assistants that understand and respond to complex queries. This would elevate the user experience by providing powerful AI capabilities directly on the device, ensuring low latency and high efficiency.

On the road: Imagine an automotive company integrating Phi-4-multimodal into its in-car assistant systems. The model could enable vehicles to understand and respond to voice commands, recognize driver gestures, and analyze visual inputs from cameras. For instance, it could enhance driver safety by detecting drowsiness through facial recognition and providing real-time alerts. Additionally, it could offer seamless navigation assistance, interpret road signs, and provide contextual information, creating a more intuitive and safer driving experience, both when connected to the cloud and offline when connectivity isn’t available.

Multilingual financial services: Imagine a financial services company integrating Phi-4-mini to automate complex financial calculations, generate detailed reports, and translate financial documents into multiple languages. For instance, the model can assist analysts by performing the intricate mathematical computations required for risk assessments, portfolio management, and financial forecasting. Additionally, it can translate financial statements, regulatory documents, and client communications into various languages, which could improve client relations globally.