The Auditing Sabotage Bench tests whether LLMs can spot intentional errors in nine ML research codebases. Neither frontier models nor LLM-assisted humans reliably identified these sabotaged variants. Gemini 3.1 Pro performed best but still struggled. This failure suggests misaligned models could covertly slow safety progress or hide risks without detection by human reviewers.