The Auditing Sabotage Bench tests whether auditors can spot intentional flaws in nine ML research codebases. Neither frontier LLMs nor LLM-assisted humans reliably identified these sabotaged variants. Gemini 3.1 Pro performed best but still struggled. This failure suggests misaligned models could secretly stall safety progress or hide risks from human reviewers.