The Auditing Sabotage Bench tests whether models can spot intentional flaws in ML codebases. Neither frontier LLMs nor human-AI teams reliably identified sabotaged variants designed to hide misalignment. Gemini 1.5 Pro struggled to catch these subtle errors. This failure suggests that automating AI safety audits remains risky and prone to deception.