This paper, published on arXiv, argues that item‑level benchmark data is essential for rigorous AI evaluation. Current evaluation paradigms rely on coarse, aggregate metrics, which lead to systematic validity failures and misaligned design choices. By dissecting these failures, the authors propose a principled framework that diagnoses them at the item level. Practitioners should adopt item‑level data collection to validate benchmarks and improve deployment safety.
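As a rough illustration of the contrast between aggregate scoring and item‑level diagnosis (the record fields, category tags, and function names below are hypothetical, not taken from the paper), the following sketch keeps per‑item outcomes alongside the benchmark‑level score so that failures can be traced back to individual items:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class ItemResult:
    item_id: str      # hypothetical identifier for a benchmark item
    category: str     # e.g. a skill or topic tag attached to the item
    correct: bool     # whether the model answered this item correctly

def aggregate_accuracy(results: list[ItemResult]) -> float:
    """Coarse, benchmark-level score: hides which items fail."""
    return sum(r.correct for r in results) / len(results)

def per_category_accuracy(results: list[ItemResult]) -> dict[str, float]:
    """Item-level diagnosis: accuracy broken out by item category."""
    buckets: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        buckets[r.category].append(r.correct)
    return {cat: sum(v) / len(v) for cat, v in buckets.items()}

if __name__ == "__main__":
    # Toy results; real item-level data would come from logged evaluations.
    results = [
        ItemResult("q1", "arithmetic", True),
        ItemResult("q2", "arithmetic", True),
        ItemResult("q3", "commonsense", False),
        ItemResult("q4", "commonsense", False),
    ]
    print(f"aggregate accuracy: {aggregate_accuracy(results):.2f}")  # 0.50
    print(f"per-category:       {per_category_accuracy(results)}")
```

The point of the sketch is that the single aggregate number (0.50) looks unremarkable, while the item‑level breakdown immediately localizes the failure to one category; the paper's framework pursues this kind of diagnosis systematically.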