Magma
Multimodal Foundation Model for AI Agents
About Magma
Magma is a multimodal foundation model for AI agents that perceives text and visual input and generates grounded actions in both digital and physical environments, from navigating user interfaces to manipulating real-world tools. It is pretrained on a heterogeneous mix of images, videos, and robotics demonstrations and introduces two novel supervision techniques: Set-of-Mark (SoM), which grounds actions in space using numeric markers on interactive elements, and Trace-of-Mark (ToM), which captures temporal action plans from unlabeled video. With a relatively modest pretraining budget, Magma reaches state-of-the-art results on UI navigation, robotic manipulation, and spatial reasoning benchmarks while remaining competitive on standard vision-language tasks.
Magma’s two core innovations target the central weaknesses of agentic AI: spatial grounding (where to act) and temporal reasoning (what sequence of actions to take). Because SoM and ToM give digital and physical tasks a shared representation, the model can transfer skills across domains that look superficially different: what it learns from instructional video carries over into robotics control, and vice versa. As a foundation model, Magma reduces the need for expensive task-specific training and gives developers a single starting point for building assistive robots, GUI agents, and other embodied systems.
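To make the two ideas concrete, here is a toy Python sketch of the bookkeeping behind them. It is an illustration under stated assumptions, not Magma's code: the `Box` type and the functions `assign_marks`, `resolve_mark`, and `trace_of_mark` are hypothetical names, and the bounding boxes are made-up values standing in for detected interactive elements. The first half numbers candidate elements and turns a predicted mark back into a click point (the SoM idea); the second half follows one mark's position across video frames (the ToM idea).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixels (hypothetical helper type)."""
    x0: int
    y0: int
    x1: int
    y1: int

    def center(self) -> tuple[float, float]:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Give each candidate element a numeric mark, starting at 1."""
    return {mark: box for mark, box in enumerate(boxes, start=1)}


def resolve_mark(marks: dict[int, Box], predicted_mark: int) -> tuple[float, float]:
    """Map a predicted mark number back to a click point (the box center)."""
    return marks[predicted_mark].center()


def trace_of_mark(frames: list[dict[int, Box]], mark: int) -> list[tuple[float, float]]:
    """Collect one mark's center over consecutive frames, skipping frames without it."""
    return [frame[mark].center() for frame in frames if mark in frame]


# Toy UI with two clickable elements (assumed coordinates).
marks = assign_marks([Box(10, 10, 110, 50), Box(10, 70, 110, 110)])
click_point = resolve_mark(marks, 2)

# Toy video: mark 1 shifts right by 20 pixels per frame.
frames = [{1: Box(20 * t, 0, 20 * t + 10, 10)} for t in range(3)]
trace = trace_of_mark(frames, 1)
```

In this framing the model only has to choose among a few numbered candidates or predict how a mark moves, rather than regress raw pixel coordinates. That discrete, shared vocabulary is one way to picture how the same representation can serve both UI screenshots and robot video.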
Key capabilities
- Single vision-language-action (VLA) model achieving state-of-the-art results across UI navigation and robot manipulation
- Set-of-Mark grounding for action selection
- Trace-of-Mark training on unlabeled video at scale
- Perceives text and visuals; emits digital and physical actions
- Built on LLaMA-3 with a CLIP-ConvNeXt-XXLarge vision encoder