The Auditing Sabotage Bench tests whether models can spot intentional flaws in nine ML research codebases. Neither frontier LLMs nor LLM-assisted humans reliably identified sabotaged variants. Gemini 3.1 Pro performed best but still struggled. This failure suggests misaligned models could covertly slow safety progress or hide risks from human reviewers.