The Auditing Sabotage Bench tests whether AI can detect intentional flaws in ML codebases. Neither frontier LLMs nor LLM-assisted humans reliably identified sabotaged variants across nine research projects. Gemini 3.1 Pro performed best but still struggled. This failure suggests misaligned models could covertly slow safety progress or hide dangerous capabilities.