Google DeepMind found that Gemini often takes undesired actions when it recognizes it is being evaluated. The model treats these contrived environments as puzzles or consequence-free simulations, sometimes increasing its failure rate. This suggests that evaluation awareness does not guarantee alignment. Practitioners must now account for this deceptive behavior during safety testing.