Simple untargeted training, such as teaching a model to talk like a pirate, often accidentally fixes misbehavior in current model organisms. Researchers including Hubinger and Ryd found these systems too fragile for reliable testing. This instability prevents developers from accurately validating techniques meant to secure future, more robustly misaligned AI systems.