Experiment
OmniParser V2
OmniParser is an advanced vision-based screen parsing module that converts user interface (UI) screenshots into structured elements, allowing agents to execute actions across a wide range of applications from visual data alone. By harnessing the capabilities of large vision-language models, OmniParser improves both the efficiency and accuracy of UI interactions.
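
As a rough illustration, the structured output can be thought of as a list of elements, each carrying a bounding box, an interactability flag, and a functional caption. The sketch below is a simplified assumption about that shape, not the library's exact output schema.

```python
# A simplified sketch of the kind of structured element list a screen parser
# such as OmniParser produces. The field names and layout here are
# illustrative assumptions, not the library's exact output schema.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class UIElement:
    bbox: Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)
    interactable: bool                       # can this region be clicked or typed into?
    caption: str                             # functional description of the element

def format_for_llm(elements: List[UIElement]) -> str:
    """Serialize parsed elements into a numbered text list an LLM can reason over."""
    lines = []
    for i, el in enumerate(elements):
        kind = "interactable" if el.interactable else "static"
        lines.append(f"[{i}] ({kind}) {el.caption} @ bbox={el.bbox}")
    return "\n".join(lines)

elements = [
    UIElement((0.81, 0.02, 0.95, 0.07), True, "Search button in the top toolbar"),
    UIElement((0.05, 0.10, 0.60, 0.14), True, "Text field for entering a query"),
]
print(format_for_llm(elements))
```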

Recent developments in large vision-language models (VLMs), such as GPT-4V and GPT-4o, showcase their potential in creating agent systems that integrate smoothly within user interfaces. However, the practical application of these multimodal models, especially as general agents across different operating systems, faces challenges. A significant barrier to progress has been the absence of reliable screen parsing techniques that can effectively identify interactable icons and link intended actions to specific screen regions.

OmniParser addresses this limitation through its compact and powerful architecture. It transforms UI screenshots into structured output elements, enabling the design of agents that can perform precise actions across various applications. When combined with models like GPT-4V, OmniParser markedly improves the agent’s capability to engage accurately with user interfaces.
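
The sketch below shows, under simplified assumptions, how such structured output might be handed to a vision-language model and how the model's reply could be mapped back to a concrete action. The prompt and reply formats are illustrative placeholders, and the actual model call is omitted since it depends on the agent's LLM backend.

```python
# A minimal sketch of how parsed elements might be combined with a
# vision-language model (e.g. GPT-4V/GPT-4o) to choose an action. The prompt
# and reply formats are assumptions for illustration; the actual model call
# is omitted and would depend on the agent's LLM backend.
import re
from typing import Optional

def build_action_prompt(task: str, element_list: str) -> str:
    """Combine the task and the parsed element list into a single prompt."""
    return (
        f"Task: {task}\n"
        f"The screen contains these elements:\n{element_list}\n"
        "Reply with exactly one line: CLICK <index> or TYPE <index> <text>."
    )

def parse_action(reply: str) -> Optional[dict]:
    """Map the model's reply back to an element index and an action."""
    m = re.match(r"(CLICK|TYPE)\s+(\d+)(?:\s+(.*))?", reply.strip())
    if not m:
        return None
    return {"action": m.group(1), "element": int(m.group(2)), "text": m.group(3)}

# Example: suppose the model replied "TYPE 1 weather in Seattle"
print(parse_action("TYPE 1 weather in Seattle"))
# -> {'action': 'TYPE', 'element': 1, 'text': 'weather in Seattle'}
```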

OmniParser V2 takes this capability to the next level. Compared with its predecessor, it achieves higher accuracy in detecting smaller interactable elements and faster inference, making it a practical tool for GUI automation. In particular, OmniParser V2 is trained on a larger set of interactive element detection data and icon functional caption data.

The creation of OmniParser involved the development of specialized datasets, including an interactable icon detection dataset that identifies actionable regions within popular web pages, and an icon description dataset that correlates UI elements with their functions. These resources are crucial for training the detection and captioning models utilized by OmniParser. The detection model, specifically fine-tuned on the interactable icon dataset, reliably locates actionable screen regions, while the captioning model provides contextually relevant descriptions for the detected elements.
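
Conceptually, the parsing pipeline composes these two models: the detector proposes candidate regions and the captioner describes each cropped region. The sketch below illustrates that composition with placeholder callables; it is an assumption about the flow, not the actual OmniParser API.

```python
# An illustrative sketch of the two-stage design described above: a detector
# proposes interactable regions, then a captioning model describes each
# cropped region. The callables are placeholders, not the actual OmniParser API.
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # pixel coordinates (left, upper, right, lower)

def parse_screenshot(
    image,                                   # e.g. a PIL.Image of the screen
    detect: Callable[[object], List[BBox]],  # fine-tuned interactable-icon detector
    caption: Callable[[object], str],        # icon-description (captioning) model
) -> List[dict]:
    """Run detection, crop each detected region, and caption it."""
    elements = []
    for bbox in detect(image):
        crop = image.crop(bbox)  # hand only the local region to the captioner
        elements.append({"bbox": bbox, "caption": caption(crop)})
    return elements
```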

OmniParser is designed to be modular and adaptable, enhancing interactions across both PC and mobile platforms.