MMCTAgent
MMCTAgent is a novel multimodal critical-thinking agent framework designed to improve complex visual reasoning over images and long-form videos with multimodal large language models (MLLMs). MMCTAgent draws on multiple modalities, including audio, visual detail, and text, and decomposes questions into smaller sub-questions. It formulates strategies, adapts its reasoning as it goes, and verifies its own answers using a built-in “critic,” helping to ensure accuracy and relevance. In research testing across multiple benchmarks, MMCTAgent outperformed both existing MLLMs and other tool-augmented pipelines.
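To make the plan-and-critique loop concrete, here is a minimal sketch of how such a cycle might be wired together. It is written under stated assumptions: every name in it (Planner, Critic, Evidence, solve, the tool callables) is illustrative, not MMCTAgent's actual API, and the LLM calls are replaced with stubs.

```python
# A minimal sketch of a plan-and-critique loop in the spirit of
# MMCTAgent. Every name here (Planner, Critic, Evidence, solve) is an
# illustrative assumption, not the framework's actual API.
from dataclasses import dataclass, field


@dataclass
class Evidence:
    modality: str  # e.g. "visual", "audio", or "text"
    content: str


@dataclass
class Answer:
    text: str
    evidence: list[Evidence] = field(default_factory=list)


class Planner:
    """Decomposes a question and drafts an answer from gathered evidence."""

    def decompose(self, question: str) -> list[str]:
        # A real planner would prompt an MLLM for sub-questions;
        # this stub treats the question as a single step.
        return [question]

    def draft(self, question: str, evidence: list[Evidence]) -> Answer:
        summary = "; ".join(e.content for e in evidence)
        return Answer(f"Answer to {question!r} based on: {summary}", evidence)


class Critic:
    """Checks a draft answer against its evidence, accepting or rejecting it."""

    def verify(self, answer: Answer) -> bool:
        # A real critic would judge grounding and relevance with an MLLM;
        # this stub only requires that some evidence was collected.
        return bool(answer.evidence)


def solve(question: str, tools, max_rounds: int = 3) -> Answer:
    """Iteratively gather evidence, draft an answer, and verify it."""
    planner, critic = Planner(), Critic()
    evidence: list[Evidence] = []
    draft = Answer("no answer")
    for _ in range(max_rounds):
        for sub_question in planner.decompose(question):
            evidence.extend(tool(sub_question) for tool in tools)
        draft = planner.draft(question, evidence)
        if critic.verify(draft):  # stop once the critic accepts the draft
            break
    return draft


# Hypothetical usage: a stand-in "tool" that inspects video frames.
frame_tool = lambda q: Evidence("visual", "a red car enters the frame")
print(solve("What color is the car?", tools=[frame_tool]).text)
```

The key design point the sketch illustrates is that the critic gates acceptance: a draft answer is only returned once it passes verification, otherwise the loop gathers more evidence and tries again.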
MMCTAgent uses natural language processing (NLP), ethnographic design, and computer vision techniques to help chatbots better understand source videos and supporting transcripts, making them more accessible through search and Q&A. The resulting multimodal answers are culturally and linguistically relevant because they are grounded in video and information created by people in their own communities. Field studies in Kenya and India showed improvements in response quality, usability, and user trust, offering early signals for how community-grounded, multilingual copilots might perform in similar contexts.
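As a rough illustration of what grounding answers in a transcript can mean, the following toy sketch retrieves the transcript passage most relevant to a query by simple word overlap. This is an assumption-laden simplification: a production system would use multilingual embeddings and timestamps, and every function name here is hypothetical.

```python
# Toy sketch of grounding a question in a video transcript by lexical
# overlap. A real system would use multilingual embeddings and
# timestamped segments; every name here is hypothetical.
def chunk_transcript(transcript: str, size: int = 40) -> list[str]:
    """Split a transcript into overlapping word windows."""
    words = transcript.split()
    return [" ".join(words[i:i + size])
            for i in range(0, len(words), size // 2)]


def best_passage(question: str, transcript: str) -> str:
    """Return the chunk sharing the most words with the question."""
    q_tokens = set(question.lower().split())
    chunks = chunk_transcript(transcript)
    return max(chunks, key=lambda c: len(q_tokens & set(c.lower().split())))


transcript = ("The farmer explains how to prepare the soil before the rains. "
              "She mixes compost with topsoil and plants maize in rows.")
print(best_passage("How should I prepare the soil?", transcript))
```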
MMCTAgent was created as part of Project Gecko, a Microsoft Research initiative to build cost-effective, tailorable AI systems that deliver vital expertise to the global majority through local languages, culturally sensitive content, and multimodal engagement. Developing globally equitable generative AI that reflects the culturally nuanced lived experiences of the communities it serves helps advance AI in a more responsible and inclusive way.