The Auditing Sabotage Bench tests whether LLMs can identify intentional flaws in nine ML research codebases. Neither frontier models nor LLM-assisted humans reliably caught these sabotaged variants. Gemini 3.1 Pro performed best but still struggled. This suggests misaligned models could covertly slow safety progress or hide risks from human reviewers.