SocialReasoning-Bench
Benchmark for Agent Social Reasoning
Try SocialReasoning-Bench on Microsoft Foundry →
About SocialReasoning-Bench
SocialReasoning-Bench is an open-source benchmark from Microsoft Research AI Frontiers that measures whether AI agents can negotiate competently and act in the user’s best interest in multi-party settings. It evaluates agents in two realistic domains: Calendar Coordination (scheduling meetings on behalf of a user) and Marketplace Negotiation (purchasing products). It introduces two metrics: Outcome Optimality (the value captured for the principal) and Due Diligence (the quality of the agent's process, judged against the standard of a competent decision-maker). Experiments with GPT-4.1, GPT-5.4, Claude Sonnet 4.6, and Gemini 3 Flash show agents completing tasks at near-perfect rates while frequently leaving substantial value on the table for users.
The benchmark shows that frontier models struggle with social reasoning even with defensive prompting: in Marketplace Negotiation, most of them settle at or near zero Outcome Optimality, ceding nearly all surplus to counterparties. Decomposing results into outcome and process metrics reveals distinct failure modes: some agents reach reasonable outcomes through fragile, lucky processes, while others negotiate diligently but ineffectively. Under adversarial counterparties, agents prove vulnerable to authority appeals, social proof, loss aversion, and prompt-injection attacks, highlighting real gaps in their ability to serve as trustworthy delegates.
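To make "ceding nearly all surplus" concrete, here is a minimal sketch of one way a surplus-share score could be computed for a single purchase. This page does not state the benchmark's actual formula, so the function name `surplus_share`, its parameters, and the normalization are illustrative assumptions, not the SocialReasoning-Bench definition of Outcome Optimality.

```python
def surplus_share(deal_price: float, buyer_max: float, seller_min: float) -> float:
    """Fraction of the negotiable surplus kept by the buyer (the principal).

    Hypothetical illustration only: assumes the buyer's agent knows the
    buyer's maximum price and that the seller's minimum price is known
    to the evaluator. Not the benchmark's official metric.
    """
    surplus = buyer_max - seller_min
    if surplus <= 0:
        raise ValueError("no price exists that both sides would accept")
    share = (buyer_max - deal_price) / surplus
    # Clamp so deals outside the acceptable range still score in [0, 1].
    return max(0.0, min(1.0, share))


# A deal struck just under the buyer's limit completes the task
# but leaves almost all of the surplus with the seller.
near_limit = surplus_share(deal_price=98, buyer_max=100, seller_min=60)
well_negotiated = surplus_share(deal_price=70, buyer_max=100, seller_min=60)
```

Under these assumptions, both deals count as completed purchases, yet the first keeps only a small fraction of the available value for the user while the second keeps most of it, which is the gap between task completion and outcome quality described above.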
Key capabilities
- Two principal-agent domains: Calendar Coordination and Marketplace Negotiation
- Scores agents on both Outcome Optimality and Due Diligence
- Multi-party negotiation evaluation in realistic settings
- Open-source benchmark from Microsoft Research AI Frontiers
- Reproducible Python LLM-eval harness across model providers
Ready to Explore?
Dive into platform integrations, source code, research papers, and announcements.