Robotics & Physical AI Model Embodied & GUI Experimental

Magma

Multimodal Foundation Model for AI Agents

6,478 USERS

About Magma

Magma is a multimodal foundation model for AI agents that perceives text and visual input and generates grounded actions in both digital and physical environments — navigating user interfaces and manipulating real-world tools. It is pretrained on a heterogeneous mix of images, videos, and robotics demonstrations and introduces two novel supervision techniques: Set-of-Mark (SoM), which grounds actions in space using numeric markers on interactive elements, and Trace-of-Mark (ToM), which captures temporal action plans from unlabeled video. With a relatively modest pretraining budget, Magma reaches state-of-the-art results on UI navigation, robotic manipulation, and spatial reasoning benchmarks while remaining competitive on standard vision-language tasks.
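To make the Set-of-Mark idea concrete, here is a minimal sketch of how numeric markers could connect a model's choice to a screen location. It is an illustration only, not Magma's actual pipeline: the `Box` type, the `assign_marks` and `resolve_mark` helpers, and the reading-order numbering are all assumptions introduced for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> tuple[int, int]:
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Number candidate elements from 1, top-to-bottom then left-to-right.

    The ordering is an arbitrary choice for this sketch.
    """
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {mark: box for mark, box in enumerate(ordered, start=1)}


def resolve_mark(marks: dict[int, Box], chosen: int) -> tuple[int, int]:
    """Turn a chosen mark number into a click point (the box center)."""
    if chosen not in marks:
        raise KeyError(f"mark {chosen} is not on screen")
    return marks[chosen].center()


# Three hypothetical interactive elements detected on a screenshot.
boxes = [Box(200, 40, 280, 70), Box(20, 40, 120, 70), Box(20, 100, 300, 140)]
marks = assign_marks(boxes)

# Suppose the model answers "2": the action is grounded at that element.
click_xy = resolve_mark(marks, 2)
```

Under these assumptions, the model only has to name a marker rather than regress raw pixel coordinates; the mapping from marker to location is handled outside the model.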

Magma’s two core innovations target the central weaknesses of agentic AI: spatial grounding (where to act) and temporal reasoning (what sequence of actions to take). The shared SoM/ToM representation lets the model transfer skills across surfaces that look superficially different — insights from instructional video carry over into robotics control, and vice versa. As a foundation model, Magma reduces the need for expensive task-specific training and gives developers a single starting point for building assistive robots, GUI agents, and other embodied systems.
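The temporal side can be sketched in the same spirit. The snippet below assumes that mark positions have already been tracked across video frames and shows how future positions of each mark could be sliced out as a supervision target. The `future_traces` and `moving_only` helpers, the track format, and the displacement threshold are hypothetical and are not taken from Magma's implementation.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def future_traces(
    tracks: Dict[int, List[Point]], t: int, horizon: int
) -> Dict[int, List[Point]]:
    """For each mark, return its positions in frames t+1 .. t+horizon."""
    return {mark: pts[t + 1 : t + 1 + horizon] for mark, pts in tracks.items()}


def moving_only(
    tracks: Dict[int, List[Point]],
    traces: Dict[int, List[Point]],
    t: int,
    min_disp: float,
) -> Dict[int, List[Point]]:
    """Keep traces whose endpoint is at least min_disp pixels from frame t."""
    kept = {}
    for mark, trace in traces.items():
        if not trace:
            continue
        (x0, y0), (x1, y1) = tracks[mark][t], trace[-1]
        if math.hypot(x1 - x0, y1 - y0) >= min_disp:
            kept[mark] = trace
    return kept


# Hypothetical tracks: mark 1 moves across frames, mark 2 stays still.
tracks = {
    1: [(10, 10), (12, 10), (15, 11), (19, 13)],
    2: [(50, 50), (50, 50), (50, 50), (50, 50)],
}
traces = future_traces(tracks, t=0, horizon=2)
moving = moving_only(tracks, traces, t=0, min_disp=2.0)
```

The point of the sketch is that no action labels are needed: the future motion of the marks is itself the training signal, which is what makes unlabeled video usable.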

Key capabilities

  • Single vision-language-action (VLA) model achieving state-of-the-art (SOTA) results across UI navigation and robot manipulation
  • Set-of-Mark grounding for action selection
  • Trace-of-Mark training on unlabeled video at scale
  • Perceives text and visuals; emits digital and physical actions
  • Built on LLaMA-3 with CLIP-ConvNeXt-XXLarge vision

Technology Stack

  • PyTorch
  • LLaMA-3 backbone
  • CLIP-ConvNeXt-XXLarge