Code & Software Engineering Benchmark & Environment Language Experimental

Debug-gym

Interactive Debugging Environment for LLM Agents


About Debug-gym

Debug-gym is an open-source, text-based interactive debugging environment designed to teach AI coding agents to debug the way programmers do: through iterative tool use. It exposes Python’s pdb, bash shells, code viewers, grep, edit, and breakpoint management, so agents can gather information and form hypotheses before proposing a fix. The environment follows the Gymnasium paradigm and supports Docker and Kubernetes backends for isolated execution. Agents can also dynamically import and customize tools to fit specific workflows.
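
To make the Gymnasium-style loop concrete, here is a minimal sketch of a text-based environment in which an agent issues tool commands and reads back observations. It is a toy written for illustration, not Debug-gym's actual API: the class `ToyDebugEnv`, the tool names `view`, `rewrite`, and `eval`, and the one-function "repository" are all assumptions. Only the `reset()` / `step()` return shapes follow the Gymnasium convention.

```python
# Hypothetical toy environment; none of these names come from Debug-gym.
BUGGY_SOURCE = [
    "def add(a, b):",
    "    return a - b",  # the planted bug
]


def run_tests(source_lines):
    """Return True when the candidate source passes the single toy test."""
    namespace = {}
    try:
        exec("\n".join(source_lines), namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


class ToyDebugEnv:
    """Text-in, text-out environment with a Gymnasium-shaped interface."""

    def reset(self):
        self.lines = list(BUGGY_SOURCE)
        return "Task: make add(2, 3) return 5.", {}

    def step(self, action):
        tool, _, arg = action.partition(" ")
        solved = False
        if tool == "view":
            obs = "\n".join(f"{i}: {line}" for i, line in enumerate(self.lines, 1))
        elif tool == "rewrite":
            number, _, new_line = arg.partition(" ")
            self.lines[int(number) - 1] = new_line
            obs = f"Rewrote line {number}."
        elif tool == "eval":
            solved = run_tests(self.lines)
            obs = "tests passed" if solved else "tests failed"
        else:
            obs = f"Unknown tool: {tool}"
        # (observation, reward, terminated, truncated, info)
        return obs, float(solved), solved, False, {}


# A scripted "agent": check the tests, inspect the code, patch it, re-check.
env = ToyDebugEnv()
obs, info = env.reset()
transcript = []
for action in ["eval", "view", "rewrite 2 " + "    return a + b", "eval"]:
    obs, reward, terminated, truncated, info = env.step(action)
    transcript.append(obs)
    if terminated:
        break

print("\n".join(transcript))
```

The point of the sketch is the order of actions: the scripted agent looks at the failing test and the source before editing, which is the information-gathering behavior the environment is meant to encourage.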

Debug-gym integrates widely used software-engineering benchmarks (SWE-bench, SWE-Smith, Aider, Mini-nightmare) with specialized swebench-debug configurations. Experiments show that agents with access to debugging tools significantly outperform those without on code-repair tasks. The environment addresses a structural limitation of current LLM-based coding agents: when an initial fix fails, they cannot seek additional context through tool interaction. It also gives researchers a standard testbed for developing the next generation of debugging agents, and it naturally complements BugPilot’s bug-generation pipeline.

Key capabilities

  • Agents access pdb to set breakpoints and inspect program state
  • Interactive debugging environment for LLM coding agents
  • Includes Aider, Mini-nightmare, and SWE-bench benchmarks
  • Integrates with SWE-Smith for scalable task generation
  • Open-source research playground from Microsoft Research
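
As an illustration of what inspecting program state through pdb looks like when driven by text commands rather than a human at a terminal, the sketch below feeds a short command script to Python's standard `pdb.Pdb` class and captures its output. It uses only the standard library and is not how Debug-gym wires pdb internally; the function `buggy_mean` and the command script are made up for the example.

```python
import io
import pdb


def buggy_mean(values):
    total = sum(values)
    return total / (len(values) - 1)  # bug: divisor should be len(values)


# Scripted debugger commands standing in for an agent's tool calls.
commands = io.StringIO("p values\np len(values) - 1\ncontinue\n")
transcript = io.StringIO()

# runcall() pauses at the first line of the function, reads the commands
# from `commands`, and writes the debugger's replies to `transcript`.
debugger = pdb.Pdb(stdin=commands, stdout=transcript, nosigint=True, readrc=False)
result = debugger.runcall(buggy_mean, [2, 4, 6])

print(transcript.getvalue())
print("returned:", result)
```

The captured transcript shows the argument list and the suspicious divisor, which is the kind of evidence an agent can use to form a hypothesis before editing code.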

Technology Stack

Python, pdb, SWE-bench