A new preprint introduces activation-matched finetuning, a method for uncovering backdoors and reward hacking. The researchers train a clean reference model to imitate a suspect model's residual-stream activations on benign prompts. The discrepancy that remains between the two models' activations isolates abnormal behaviors that do not extrapolate naturally from the benign data. The technique is meant to let practitioners locate hidden triggers even while the malicious mechanism remains dormant.
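
The matching idea can be illustrated with a toy sketch. This is not the preprint's implementation: it swaps finetuning for an ordinary least-squares fit, replaces the language model with a linear map, and every name in it (`fit_reference`, `discrepancy`, `demo`, the trigger rule on the last input feature) is a hypothetical stand-in. It also only shows the discrepancy signal on probe inputs that actually contain the trigger, so it does not reproduce the dormant-mechanism case.

```python
import numpy as np


def fit_reference(benign_inputs, suspect_acts):
    """Fit a linear 'reference model' W so that benign_inputs @ W ~ suspect_acts.

    Least squares stands in for finetuning here (an assumption of this sketch).
    """
    W, *_ = np.linalg.lstsq(benign_inputs, suspect_acts, rcond=None)
    return W


def discrepancy(W, inputs, suspect_acts):
    """Per-example L2 distance between reference and suspect activations."""
    return np.linalg.norm(inputs @ W - suspect_acts, axis=1)


def demo(seed=0):
    rng = np.random.default_rng(seed)
    d_in, d_act = 8, 16
    W_clean = rng.normal(size=(d_in, d_act))
    backdoor = rng.normal(size=d_act)  # hypothetical backdoor direction

    def suspect(x):
        # Toy suspect model: a clean linear map, plus a fixed shift that
        # fires only when the last input feature exceeds 3 (the "trigger").
        fired = (x[:, -1] > 3.0).astype(float)
        return x @ W_clean + fired[:, None] * backdoor

    # Benign data: the trigger feature is clipped so the backdoor never fires.
    benign = rng.normal(size=(200, d_in))
    benign[:, -1] = np.clip(benign[:, -1], -2.0, 2.0)
    W_ref = fit_reference(benign, suspect(benign))

    # Probe set: the first 5 rows carry the trigger, the rest are benign.
    probe = rng.normal(size=(50, d_in))
    probe[:, -1] = np.clip(probe[:, -1], -2.0, 2.0)
    probe[:5, -1] = 5.0
    return discrepancy(W_ref, probe, suspect(probe))


if __name__ == "__main__":
    scores = demo()
    print("triggered rows, min discrepancy:", scores[:5].min())
    print("benign rows, max discrepancy:   ", scores[5:].max())
```

In this toy setting the reference model matches the suspect model almost exactly on benign-like inputs, so the per-example discrepancy is near zero there and large only on the triggered rows. A real application would compare residual-stream activations of two neural networks, where the fit is never exact and a threshold on the discrepancy would have to be chosen empirically.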