Agent System | Multimodal | Experimental

MMCT-Agent

Multimodal Critical Thinking Agent

Explore MMCT-Agent on GitHub →

About MMCT-Agent

MMCT-Agent is a multimodal critical-thinking agent framework for complex visual reasoning over images and long-form video. It integrates audio, visual detail, and text, decomposes complex questions into manageable sub-questions, generates adaptive reasoning strategies, and employs a built-in critic to self-verify intermediate answers. Across multiple visual-reasoning benchmarks it outperforms vanilla multimodal LLMs and tool-augmented pipelines that lack structured deliberation.

MMCT-Agent addresses a real limitation in current multimodal models: they tend to process modalities sequentially or independently rather than reasoning over them jointly, and they have no mechanism to catch their own errors. The framework’s decomposition and adaptive-strategy approach lets the agent handle reasoning that requires relating visual elements across time and grounding them in textual context, while the self-critique mechanism reduces hallucinations. This makes it well suited to applications where answer reliability matters, such as document understanding, video analysis, and multimodal search.
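The decompose, answer, and self-verify loop described above can be illustrated with a minimal Python sketch. This is not MMCT-Agent's actual API: `run_with_critic`, `Step`, and the `toy_*` helpers are hypothetical names invented for illustration, and the toy functions stand in for the multimodal model calls, vision tooling, and critic that the real framework would use. The `attempt` argument is an assumed stand-in for switching strategy after a rejected answer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """Record of one sub-question and how the critic judged its answer."""
    sub_question: str
    answer: str
    accepted: bool
    attempts: int


def run_with_critic(
    question: str,
    decompose: Callable[[str], List[str]],
    answer: Callable[[str, int], str],
    critic: Callable[[str, str], bool],
    max_retries: int = 2,
) -> List[Step]:
    """Hypothetical loop: split the question, answer each part, retry on rejection."""
    steps: List[Step] = []
    for sub_q in decompose(question):
        attempt = 0
        while True:
            # The attempt index lets the answerer change strategy on a retry.
            candidate = answer(sub_q, attempt)
            ok = critic(sub_q, candidate)
            if ok or attempt >= max_retries:
                break
            attempt += 1
        steps.append(Step(sub_q, candidate, ok, attempt + 1))
    return steps


# Toy stand-ins (assumptions, not part of MMCT-Agent) so the sketch runs as-is.
def toy_decompose(question: str) -> List[str]:
    # Naive split on " and "; a real system would use a model to plan sub-questions.
    return [part.strip() for part in question.split(" and ") if part.strip()]


def toy_answer(sub_question: str, attempt: int) -> str:
    # The first attempt returns nothing, which forces one retry in the demo.
    return "" if attempt == 0 else f"draft answer for: {sub_question}"


def toy_critic(sub_question: str, candidate: str) -> bool:
    # Accept any non-empty answer; a real critic would check evidence and consistency.
    return bool(candidate.strip())


if __name__ == "__main__":
    question = "Who is speaking and what is on the slide?"
    for step in run_with_critic(question, toy_decompose, toy_answer, toy_critic):
        print(step)
```

In this sketch the critic gates each intermediate answer rather than only the final one, which matches the description of self-verifying intermediate answers. The retry limit is an assumption added to keep the loop bounded.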

Key capabilities

Critic-augmented multimodal reasoning over images and long videos
Decomposes complex questions and adapts strategies per step
Built-in critic self-verifies answers before responding
Outperforms vanilla multimodal LLMs (MLLMs) and tool-augmented pipelines on visual-reasoning benchmarks
Powered by Project Gecko vision tooling

Technology Stack

NLP | Computer Vision | Project Gecko

Ready to Explore?

Dive into platform integrations, source code, research papers, and announcements.

PLATFORM
Microsoft Foundry
Try MMCT-Agent in the Microsoft Foundry model catalog.
EXPLORE ON FOUNDRY

CODE
GitHub Repository
Browse the open-source codebase and contribute.
VIEW REPOSITORY