Evaluation benchmarks often reduce complex model capabilities to a single, misleading number. Nathan Lambert argues that such aggregate scores obscure data contamination and hide how models perform on specific tasks, and that this disconnect between leaderboard position and real capability persists even as open-source models improve rapidly. Practitioners should therefore shift toward targeted evaluations that measure real-world utility rather than chasing leaderboard rankings.
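As a hypothetical illustration of how an aggregate score can mask task-level weaknesses (the task names and scores below are invented for this sketch, not drawn from Lambert's argument or any real benchmark run), compare a single averaged number against a per-task breakdown:

```python
from statistics import mean

# Invented per-task accuracies for one model, used only to illustrate
# the difference between an aggregate score and a targeted evaluation.
task_scores = {
    "grade_school_math": 0.82,
    "code_generation": 0.74,
    "long_context_qa": 0.35,   # a weakness the aggregate hides
    "commonsense_qa": 0.88,
}

# Leaderboard-style aggregate: one number, no nuance.
aggregate = mean(task_scores.values())
print(f"aggregate score: {aggregate:.2f}")

# Targeted view: report each task separately and flag weak areas
# against a threshold chosen for the intended deployment.
THRESHOLD = 0.60
for task, score in sorted(task_scores.items(), key=lambda kv: kv[1]):
    flag = "  <-- below target" if score < THRESHOLD else ""
    print(f"{task:20s} {score:.2f}{flag}")
```

The aggregate here lands around 0.70, which looks respectable, while the per-task view immediately surfaces the long-context weakness that matters for some real deployments.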