Filtering data flagged by a new probe-based method reduced harmful behaviors in OLMo 2 7B by 63%. Researchers from Goodfire and SPAR used the technique to trace undesirable side effects back to specific post-training datapoints. Because the flagged datapoints can be removed surgically, the approach gives developers a concrete path to mitigating emergent model risks.
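
To make the flag-then-filter idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the researchers' implementation: the summary does not say how probe scores are computed, so the sketch assumes a linear probe (a single direction in activation space) and a fixed score threshold. The names `flag_datapoints`, `filter_dataset`, `probe_direction`, and `threshold`, along with the toy data, are all hypothetical.

```python
import numpy as np


def flag_datapoints(activations, probe_direction, threshold):
    """Return a boolean mask of datapoints whose probe score exceeds `threshold`.

    Assumption: the probe is linear, so a datapoint's score is the projection
    of its activation vector onto the unit-normalized probe direction.
    """
    direction = probe_direction / np.linalg.norm(probe_direction)
    scores = activations @ direction
    return scores > threshold


def filter_dataset(examples, flagged):
    """Keep only the examples that the probe did not flag."""
    return [ex for ex, is_flagged in zip(examples, flagged) if not is_flagged]


# Toy stand-ins: four post-training examples with 2-dimensional activations.
examples = ["a", "b", "c", "d"]
activations = np.array([
    [0.1, 0.0],
    [2.0, 0.1],
    [-0.5, 0.3],
    [1.5, -0.2],
])
probe_direction = np.array([1.0, 0.0])  # hypothetical "harmful behavior" direction

flagged = flag_datapoints(activations, probe_direction, threshold=1.0)
kept = filter_dataset(examples, flagged)
```

In this toy run, examples `"b"` and `"d"` project strongly onto the assumed direction and are dropped, leaving `"a"` and `"c"` for retraining. In practice the threshold (or a top-k cutoff) would need tuning against a held-out measure of the behavior being reduced.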