The paper, published on arXiv, argues that current AI benchmarks lack item‑level data, a gap that undermines the validity of conclusions drawn from aggregate scores. By dissecting over 30 reported failures, the authors show how fine‑grained diagnostics can expose misaligned metrics. They propose a principled framework for collecting granular evidence. Practitioners should adopt item‑level analysis to strengthen deployment decisions, validating systematically before a model reaches production; a minimal sketch of what such an analysis might look like follows.
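The sketch below is an assumption-laden illustration, not the paper's framework: the record schema, category names, and the 0.2 flagging threshold are all hypothetical. It shows the core idea of item‑level analysis, that a single aggregate accuracy can hide a failure cluster that a per‑category breakdown reveals.

```python
from collections import defaultdict

# Hypothetical per-item records; the paper does not specify a schema.
# Each record: (item_id, category, model_correct)
records = [
    ("q001", "arithmetic", True),
    ("q002", "arithmetic", True),
    ("q003", "negation", False),
    ("q004", "negation", False),
    ("q005", "reading", True),
    ("q006", "reading", True),
    ("q007", "negation", False),
    ("q008", "arithmetic", True),
]

# Aggregate accuracy: the single number most leaderboards report.
aggregate = sum(correct for _, _, correct in records) / len(records)

# Item-level breakdown: group outcomes by category to expose
# failure clusters that the aggregate score hides.
by_category: dict[str, list[bool]] = defaultdict(list)
for _, category, correct in records:
    by_category[category].append(correct)

print(f"aggregate accuracy: {aggregate:.2f}")
for category, outcomes in sorted(by_category.items()):
    acc = sum(outcomes) / len(outcomes)
    # Flag categories far below the aggregate (threshold is illustrative).
    flag = "  <-- well below aggregate" if acc < aggregate - 0.2 else ""
    print(f"{category:<12} n={len(outcomes)}  acc={acc:.2f}{flag}")
```

In this toy example the aggregate accuracy looks respectable, while the per‑category view shows the model failing every negation item, the kind of misalignment between a headline metric and deployment‑relevant behavior that item‑level evidence is meant to surface.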