MonitoringBench is a new benchmark of 2,644 successful attack trajectories for evaluating how well monitors detect malicious coding agents. The researchers generated the trajectories with a semi-automated red-teaming pipeline, which outperformed simple prompt elicitation. The benchmark grades attacks on a difficulty scale, so safety researchers and developers can identify the specific points at which agentic oversight systems fail.
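
To make the idea of a graded difficulty scale concrete, the sketch below shows one way a monitor's detection rate could be broken down by difficulty tier. It is only an illustration under stated assumptions: the `Trajectory` record, its `difficulty` and `suspicion` fields, the 0.5 threshold, and the sample data are all hypothetical and are not taken from the benchmark itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record for one attack trajectory (not the benchmark's schema)."""
    difficulty: int    # assumed difficulty tier; higher means harder to detect
    suspicion: float   # assumed monitor score in [0, 1]


def detection_rate_by_difficulty(trajectories, threshold=0.5):
    """Return the fraction of attacks flagged by the monitor in each tier.

    An attack counts as detected when its suspicion score is at or above
    the threshold. The threshold value is an assumption for illustration.
    """
    caught = defaultdict(int)
    total = defaultdict(int)
    for t in trajectories:
        total[t.difficulty] += 1
        if t.suspicion >= threshold:
            caught[t.difficulty] += 1
    return {tier: caught[tier] / total[tier] for tier in sorted(total)}


# Made-up sample data, purely for demonstration.
sample = [
    Trajectory(difficulty=1, suspicion=0.9),
    Trajectory(difficulty=1, suspicion=0.7),
    Trajectory(difficulty=2, suspicion=0.6),
    Trajectory(difficulty=2, suspicion=0.2),
    Trajectory(difficulty=3, suspicion=0.1),
]
rates = detection_rate_by_difficulty(sample)
print(rates)
```

Reporting a rate per tier rather than a single aggregate number is what lets a reader see where a monitor starts to fail, which is the kind of failure-point analysis the summary describes.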