A new experiment explores whether "monitor" models can detect malicious code generated by smarter agents. Human reviewers cannot keep pace with 10,000-line code outputs, creating a scalable control gap. This research tests if ensembles of weaker models can effectively oversee frontier systems. The results determine if automated oversight can prevent critical infrastructure failures.