Benchmark & Environment | Language | Experimental

SocialReasoning-Bench

Benchmark for Agent Social Reasoning


About SocialReasoning-Bench

SocialReasoning-Bench is an open-source benchmark from Microsoft Research AI Frontiers that measures whether AI agents can negotiate competently and act in the user’s best interest in multi-party settings. It evaluates agents in two realistic domains — Calendar Coordination (scheduling meetings on behalf of a user) and Marketplace Negotiation (purchasing products) — and introduces two metrics, Outcome Optimality (value captured for the principal) and Due Diligence (process quality versus a competent decision-making standard). Experiments with GPT-4.1, GPT-5.4, Claude Sonnet 4.6, and Gemini 3 Flash show agents completing tasks at near-perfect rates while frequently leaving substantial value on the table for users.

The benchmark shows that frontier models struggle with social reasoning even with defensive prompting: in Marketplace Negotiation, most models settle at or near zero Outcome Optimality, ceding nearly all surplus to counterparties. Decomposing results into outcome and process metrics exposes distinct failure modes: some agents reach reasonable outcomes through fragile, lucky processes, while others negotiate diligently but ineffectively. Under adversarial counterparties, agents prove vulnerable to authority appeals, social proof, loss aversion, and prompt-injection attacks, highlighting real gaps in their ability to serve as trustworthy delegates for users.
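To make "ceding nearly all surplus" concrete, here is a minimal sketch of a surplus-share score for a buyer-side agent. This page does not give the benchmark's actual Outcome Optimality formula, so the function name `surplus_share`, its parameters, and the normalization are all assumptions chosen for illustration.

```python
def surplus_share(deal_price: float, buyer_limit: float, seller_floor: float) -> float:
    """Fraction of the bargaining surplus kept by the buyer (hypothetical formula).

    buyer_limit  -- the most the principal is willing to pay (assumed known)
    seller_floor -- the least the counterparty would accept (assumed known)
    """
    surplus = buyer_limit - seller_floor
    if surplus <= 0:
        raise ValueError("no zone of agreement: buyer_limit must exceed seller_floor")
    share = (buyer_limit - deal_price) / surplus
    # Clamp so deals outside the agreement zone still map into [0, 1].
    return max(0.0, min(1.0, share))


if __name__ == "__main__":
    # Illustrative numbers only, not benchmark data.
    print(surplus_share(deal_price=100.0, buyer_limit=100.0, seller_floor=60.0))
    print(surplus_share(deal_price=80.0, buyer_limit=100.0, seller_floor=60.0))
```

Under these assumptions, an agent that pays the principal's full limit scores 0 even though the purchase technically completes, which is the pattern described above: the task succeeds while the value goes to the counterparty.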

Key capabilities

  • Two principal-agent domains: Calendar Coordination and Marketplace Negotiation
  • Scores agents on both Outcome Optimality and Due Diligence
  • Multi-party negotiation evaluation in realistic settings
  • Open-source benchmark from Microsoft Research AI Frontiers
  • Reproducible Python LLM-eval harness across model providers

Technology Stack

  • Python LLM eval harness