A new preprint introduces activation-matched finetuning, a technique for uncovering backdoors and reward hacking. Researchers train a clean reference model to mimic a suspect model's activations on benign prompts. Where the two models still diverge, the discrepancy isolates abnormal behaviors that do not extrapolate naturally from benign behavior. The technique helps safety researchers find hidden triggers before LLMs are deployed to production.
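
The idea can be illustrated with a toy sketch. This is not the preprint's implementation. The "models" here are tiny linear maps rather than LLMs, and every name (`suspect_activations`, `fit_reference`, `discrepancy`), the trigger feature, and the offset value are hypothetical choices made for illustration. A reference map is fitted to match the suspect's activations on benign inputs only, and the gap between the two is then compared on a benign probe and a triggered probe.

```python
import random

def suspect_activations(x):
    """Toy 'suspect' model: a fixed linear layer plus a hidden backdoor (assumed)."""
    w = [[0.5, -1.0, 0.0], [1.5, 0.25, 0.0]]
    h = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    if x[2] > 0.5:  # hypothetical trigger feature, absent from all benign inputs
        h = [v + 5.0 for v in h]
    return h

def reference_activations(weights, x):
    """Clean reference model: a plain linear layer with learnable weights."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in weights]

def fit_reference(benign, epochs=200, lr=0.1):
    """Fit the reference to the suspect's activations on benign inputs via SGD on squared error."""
    weights = [[0.0, 0.0, 0.0] for _ in range(2)]
    for _ in range(epochs):
        for x in benign:
            target = suspect_activations(x)
            pred = reference_activations(weights, x)
            for i in range(2):
                err = pred[i] - target[i]
                for j in range(3):
                    weights[i][j] -= lr * err * x[j]
    return weights

def discrepancy(weights, x):
    """Euclidean distance between suspect and reference activations on input x."""
    s = suspect_activations(x)
    r = reference_activations(weights, x)
    return sum((a - b) ** 2 for a, b in zip(s, r)) ** 0.5

rng = random.Random(0)
# Benign inputs never set the trigger feature (third coordinate stays 0).
benign = [[rng.uniform(-1, 1), rng.uniform(-1, 1), 0.0] for _ in range(50)]
weights = fit_reference(benign)

benign_gap = discrepancy(weights, [0.3, -0.2, 0.0])     # held-out benign probe
triggered_gap = discrepancy(weights, [0.3, -0.2, 1.0])  # same probe with trigger set

print("benign gap:", benign_gap)
print("triggered gap:", triggered_gap)
```

In this toy setting the reference reproduces the suspect almost exactly on benign inputs, so a large gap on the triggered probe flags behavior that the benign data does not explain. Real models would require matching activations across layers and a way to search for inputs that produce large gaps, neither of which is covered here.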