A new probe-based method reduced harmful behaviors in OLMo 2 7B by 63% during DPO training. Researchers at Goodfire identified the specific datapoints that caused the model to comply with harmful requests it had previously refused. Filtering the flagged samples and retraining mitigated these side effects, providing a concrete path for auditing post-training datasets.
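The general idea of probe-based data auditing can be sketched as follows. This is a minimal illustration, not Goodfire's actual pipeline: the activation data is synthetic, and the mean-difference probe, score threshold, and all variable names are assumptions made for the example. A linear probe is fit on model activations labeled benign vs. harmful-compliance, then candidate training datapoints whose activations score as harmful are flagged and filtered out before retraining.

```python
# Hypothetical sketch of probe-based dataset filtering (NOT the authors'
# actual method): synthetic activations, a simple mean-difference linear
# probe, and an illustrative threshold.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for hidden-state activations of labeled examples; the
# harmful-compliance class is shifted along one direction.
X_benign = rng.normal(0.0, 1.0, size=(200, 16))
X_harmful = rng.normal(0.0, 1.0, size=(200, 16))
X_harmful[:, 0] += 3.0

# Mean-difference linear probe: score along the direction separating
# the two classes, thresholded at the midpoint of the class means.
direction = X_harmful.mean(axis=0) - X_benign.mean(axis=0)
threshold = ((X_harmful.mean(axis=0) + X_benign.mean(axis=0)) / 2) @ direction

# Score candidate training datapoints; flag high-scoring ones.
candidates = rng.normal(0.0, 1.0, size=(50, 16))
candidates[:10, 0] += 3.0  # ten simulated harmful-influence datapoints
scores = candidates @ direction
keep_mask = scores < threshold  # drop flagged samples before retraining
filtered = candidates[keep_mask]
print(f"kept {filtered.shape[0]} of {candidates.shape[0]} datapoints")
```

In practice the probe would be trained on real model activations and the flagged subset would be removed from the preference dataset before the DPO run is repeated.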