Magma
Imagine an AI assistant that can book a meeting online and also set up the room for it – navigating software menus as effortlessly as it moves physical objects. Such seamless integration of digital and physical tasks has long been a sci-fi vision.

Microsoft researchers are bringing it closer to reality with Magma, a multimodal AI foundation model designed to both understand and act in digital and physical environments. Magma builds on the foundation model paradigm: pretraining on larger and more diverse datasets allows models to generalize better to new tasks and environments. Magma can perceive visual and textual inputs and generate actions, whether that means clicking a button in a user interface or grabbing a tool in the real world. This new model represents a significant step toward AI agents that can serve as general-purpose assistants.

Vision-Language-Action (VLA) models are typically pretrained on large vision-language-action datasets to acquire vision-language understanding (verbal intelligence) and the ability to perceive and interact with the visual-spatial world to perform a wide range of tasks (spatial intelligence). However, because digital and physical environments differ dramatically, separate VLA models are trained and used for each environment, and these models cannot easily generalize to new tasks and environments unseen in their training data. Moreover, most of these models do not leverage pretrained vision-language (VL) models or diverse vision-language datasets, so their vision-language understanding is often inferior to that of state-of-the-art VL models, which further limits their generalizability.
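To make the heterogeneity concrete, the sketch below contrasts a digital UI action with a physical robot action and serializes both into one shared textual format that a single model could be trained to emit. This is a purely illustrative Python example; the class names and the textual encoding are assumptions for exposition, not Magma's actual action representation.

```python
from dataclasses import dataclass
from typing import Union

# Two very different action spaces that separate VLA models are usually trained on.
@dataclass
class UIAction:            # digital environment (e.g., UI navigation)
    kind: str              # "click", "type", ...
    x: float               # normalized screen coordinates
    y: float
    text: str = ""

@dataclass
class RobotAction:         # physical environment (e.g., manipulation)
    dx: float              # end-effector position delta
    dy: float
    dz: float
    gripper: float         # open/close command

def to_action_text(action: Union[UIAction, RobotAction]) -> str:
    """Serialize either action type into one textual format that a single
    vision-language-action model could learn to generate."""
    if isinstance(action, UIAction):
        return f"ui.{action.kind}(x={action.x:.3f}, y={action.y:.3f}, text={action.text!r})"
    return (f"robot.move(dx={action.dx:.3f}, dy={action.dy:.3f}, "
            f"dz={action.dz:.3f}, gripper={action.gripper:.2f})")

print(to_action_text(UIAction("click", 0.42, 0.17)))
print(to_action_text(RobotAction(0.01, -0.02, 0.00, 1.0)))
```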
Magma is a VLA foundation model that can adapt to downstream (unseen) agentic tasks in both digital and physical environments. With Magma, researchers showed that it is beneficial to pretrain a single VLA model for AI agents across these environments: it achieves state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models tailored specifically to those tasks. On VL tasks, Magma also compares favorably to popular VL models trained on much larger datasets.
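For readers who want to experiment with the model, the following is a minimal, hypothetical inference sketch in the standard Hugging Face transformers style. The checkpoint ID, the processor call signature, and the decode step are assumptions based on common conventions; consult the official model card for the exact interface.

```python
# Hypothetical inference sketch (Hugging Face style); not verified against the release.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Magma-8B"  # assumed public checkpoint name
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval()

image = Image.open("screenshot_or_camera_frame.png")  # UI screenshot or robot camera view
prompt = "What action should the agent take to open the Settings menu?"

# Processor call follows the common multimodal convention (images + text).
inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```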
