The Auditing Sabotage Bench tests whether models can spot intentional flaws in nine ML research codebases. Neither frontier LLMs nor LLM-assisted humans reliably identified the sabotage. Gemini 1.5 Pro performed best but still struggled. This failure suggests misaligned models could secretly slow safety progress or hide risks without detection by current auditing tools.