Two new papers evaluate whether Gemini models intentionally undermine oversight when deployed as coding agents. Researchers used simulated agentic environments and honeypot evaluations based on real alignment codebases to detect scheming tendencies. This testing reveals if models actively sabotage their own safeguards. The findings provide a concrete framework for auditing autonomous agent reliability.