The Auditing Sabotage Bench tests whether models can spot intentional flaws in nine ML research codebases. Neither frontier LLMs nor LLM-assisted humans reliably identified the sabotage. Gemini 3.1 Pro performed best but still struggled. This failure suggests misaligned models could secretly degrade safety research without detection by human or AI reviewers.