A new arXiv paper proposes using interpretability techniques to diagnose why human annotators disagree on safety labels. The researchers separate operational failures (annotation mistakes) from policy ambiguity (unclear policy wording) and value pluralism (genuine normative differences). This distinction lets developers route each case to the right fix: quality control for operational failures, clearer wording for ambiguous policies. It replaces annotators' unreliable self-reports with technical analysis when refining RLHF training data.
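The paper's exact method isn't detailed here, but a minimal sketch of the triage idea might look like the following. Everything in it is a hypothetical stand-in rather than the authors' algorithm or API: `DisagreementCase`, the `policy_ambiguity_score` (imagined as an interpretability-derived signal that an item sits on a policy boundary), the per-annotator error rates, and both thresholds are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class DisagreementCase:
    """One item on which annotators split, plus hypothetical diagnostic signals."""
    item_id: str
    labels: list[int]                    # per-annotator safety labels (0 = safe, 1 = unsafe)
    annotator_error_rates: list[float]   # each annotator's historical error rate on gold items
    policy_ambiguity_score: float        # hypothetical interpretability signal (0-1) that the
                                         # item triggers borderline policy language

def triage(case: DisagreementCase,
           error_threshold: float = 0.2,
           ambiguity_threshold: float = 0.6) -> str:
    """Heuristically bucket a disagreement into one of three causes.

    This is NOT the paper's algorithm; it only illustrates the routing idea:
    operational failures -> quality control, policy ambiguity -> rewording,
    value pluralism -> accepted as genuine normative disagreement.
    """
    # Operational failure: a dissenting annotator has a poor track record on gold items.
    majority = round(sum(case.labels) / len(case.labels))
    dissenter_errors = [err for lbl, err in zip(case.labels, case.annotator_error_rates)
                        if lbl != majority]
    if dissenter_errors and max(dissenter_errors) > error_threshold:
        return "operational_failure"   # route to annotator quality control

    # Policy ambiguity: the interpretability signal flags a policy boundary case.
    if case.policy_ambiguity_score > ambiguity_threshold:
        return "policy_ambiguity"      # route to policy-wording revision

    # Otherwise treat it as genuine value pluralism among careful annotators.
    return "value_pluralism"

# Usage: reliable annotators split evenly on an item the signal marks as borderline.
case = DisagreementCase(
    item_id="ex-001",
    labels=[1, 0, 1, 0],
    annotator_error_rates=[0.05, 0.04, 0.06, 0.03],
    policy_ambiguity_score=0.72,
)
print(triage(case))  # -> "policy_ambiguity"
```

The routing itself is the point of the sketch: only the middle bucket feeds back into policy wording, while the first feeds annotator quality control and the third is kept as legitimate signal in the RLHF data rather than noise to be averaged away.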