
OmniParser V2

Pure-Vision GUI Screen Parser

1,711 USERS

About OmniParser V2

OmniParser V2 is a pure-vision GUI screen-parsing module that converts UI screenshots into structured, actionable elements without invoking a large language model (LLM). It pairs a fine-tuned YOLOv8 icon detector with a Florence-2-based caption model: the detector localizes interactive regions and the captioner describes their function. V2 cuts inference latency by 60% relative to V1 and reaches 39.6 average accuracy on the ScreenSpot Pro benchmark, both meaningful improvements for real-time agent execution. The module handles both interactive and non-interactive elements and generalizes across arbitrary screen layouts without domain-specific training.
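
The detect-then-caption split described above can be illustrated with a minimal sketch. Everything named here (`ParsedElement`, `parse_screen`, `min_confidence`, and the two stand-in functions) is hypothetical and is not OmniParser's actual API; in a real setup the `detect` callable would wrap the YOLOv8 icon detector and `caption` would wrap the Florence-2-based captioner.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[int, int, int, int]


@dataclass
class ParsedElement:
    """One parsed UI element (hypothetical schema, for illustration only)."""
    bbox: BBox
    description: str
    confidence: float


def parse_screen(
    image: object,
    detect: Callable[[object], List[Tuple[BBox, float]]],
    caption: Callable[[object, BBox], str],
    min_confidence: float = 0.3,  # assumed threshold, not a documented default
) -> List[ParsedElement]:
    """Stage 1 localizes regions; stage 2 describes each region that is kept."""
    elements = []
    for bbox, score in detect(image):
        if score < min_confidence:
            continue  # drop low-confidence detections before captioning
        elements.append(ParsedElement(bbox, caption(image, bbox), score))
    # Sort top-to-bottom, then left-to-right, for a stable reading order.
    elements.sort(key=lambda e: (e.bbox[1], e.bbox[0]))
    return elements


# Stand-ins for the real models so the sketch runs without any weights.
def fake_detect(image: object) -> List[Tuple[BBox, float]]:
    return [
        ((200, 40, 260, 70), 0.92),
        ((10, 40, 60, 70), 0.88),
        ((300, 300, 320, 320), 0.10),  # filtered out by min_confidence
    ]


def fake_caption(image: object, bbox: BBox) -> str:
    return f"element at x={bbox[0]}"


elements = parse_screen(None, fake_detect, fake_caption)
for element in elements:
    print(element)
```

Passing the two stages in as callables keeps the sketch independent of any particular model library; only the orchestration logic is shown.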

Screen understanding is fundamental infrastructure for any agent that drives a GUI, but earlier approaches leaned on expensive vision-language models or hand-labeled UI schemas. By splitting detection from captioning into specialized components and skipping the LLM in the parsing loop, OmniParser keeps both accuracy and latency in the range required by computer-use agents. Its structured output — element coordinates plus descriptions — slots directly into action-prediction pipelines, and it now powers a number of Microsoft and third-party agent stacks, including Fara-7B.
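
As a rough illustration of how coordinates plus descriptions might feed an action-prediction step, the sketch below formats parsed elements as a numbered list for an LLM prompt and converts a chosen element's box into a click point. The dictionary layout, the normalized `(x_min, y_min, x_max, y_max)` box convention, and the helper names `to_prompt` and `click_point` are assumptions made for this example, not OmniParser's documented output schema.

```python
from typing import Dict, List, Tuple

# Assumed convention: boxes are normalized to [0, 1] as
# (x_min, y_min, x_max, y_max), relative to the screenshot size.
NormBBox = Tuple[float, float, float, float]


def to_prompt(elements: List[Dict]) -> str:
    """Render elements as an indexed list an LLM can refer to by number."""
    return "\n".join(
        f"[{i}] {element['description']}" for i, element in enumerate(elements)
    )


def click_point(bbox: NormBBox, width: int, height: int) -> Tuple[int, int]:
    """Return the pixel coordinates of the center of a normalized box."""
    x_min, y_min, x_max, y_max = bbox
    return (
        round((x_min + x_max) / 2 * width),
        round((y_min + y_max) / 2 * height),
    )


# Hypothetical parser output for a 1000x500 screenshot.
elements = [
    {"bbox": (0.10, 0.20, 0.30, 0.40), "description": "Save button"},
    {"bbox": (0.50, 0.50, 0.70, 0.60), "description": "Search field"},
]

print(to_prompt(elements))

# Suppose the downstream model answers with index 1.
chosen = elements[1]
print(click_point(chosen["bbox"], width=1000, height=500))
```

Having the model pick an element index rather than emit raw coordinates is one common design choice, since the parser then remains the only source of positions.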

Key capabilities

  • Average latency of 0.6 s/frame on an A100 GPU
  • Turns any LLM into a computer-use agent
  • 60% lower latency than V1
  • 39.6 average accuracy on ScreenSpot Pro
  • Fine-tuned YOLOv8 icon detector paired with Florence-2 captioning
  • Pure-vision GUI parsing without DOM access

Technology Stack

PyTorch, YOLOv8, Florence-2