The authors identify 12 key validity failures in current AI benchmarks and argue that item-level data (per-item responses and scores, rather than aggregate metrics alone) can expose these flaws and enable fine-grained diagnosis of model behavior. The paper calls for a principled framework for collecting such data. Researchers and developers should adopt item-level evaluation to improve model reliability and to curb the misaligned aggregate metrics that currently undermine high-stakes deployments.
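To make the contrast concrete, the sketch below records a full per-item result for each benchmark question instead of reporting only an aggregate score. This is a minimal illustration, not the paper's framework: the `ItemRecord` fields, the `evaluate` harness, the exact-match scorer, and the toy dataset are all illustrative assumptions.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ItemRecord:
    # Hypothetical per-item record; field names are illustrative, not from the paper.
    item_id: str
    prompt: str
    reference: str
    prediction: str
    correct: bool

def evaluate(model_fn, dataset):
    """Score each item individually and keep the full per-item records.

    `model_fn` is any callable mapping a prompt string to a prediction string;
    `dataset` is an iterable of (item_id, prompt, reference) triples.
    """
    records = []
    for item_id, prompt, reference in dataset:
        prediction = model_fn(prompt)
        records.append(ItemRecord(
            item_id=item_id,
            prompt=prompt,
            reference=reference,
            prediction=prediction,
            # Exact-match scoring for illustration; real benchmarks vary.
            correct=prediction.strip() == reference.strip(),
        ))
    accuracy = sum(r.correct for r in records) / len(records)
    return accuracy, records

if __name__ == "__main__":
    # Toy two-item benchmark with a stubbed model, purely for demonstration.
    toy = [("q1", "2+2=?", "4"), ("q2", "Capital of France?", "Paris")]
    acc, recs = evaluate(lambda p: "4" if "2+2" in p else "Paris", toy)
    # Persist item-level records so later analyses remain possible.
    with open("item_records.jsonl", "w") as f:
        for r in recs:
            f.write(json.dumps(asdict(r)) + "\n")
    print(f"accuracy={acc:.2f}, items={len(recs)}")
```

The design point is that retaining the per-item records, rather than only the final `accuracy` number, is what enables the fine-grained diagnostics the authors call for, such as per-slice error rates, item-difficulty analysis, or contamination checks.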