Evaluation benchmarks often obscure the nuanced differences between open and closed models. Nathan Lambert examines how data contamination and prompt sensitivity distort single-score rankings, and the upshot is that raw benchmark numbers rarely capture real-world utility. Practitioners should prioritize task-specific evaluations over aggregate leaderboards to avoid choosing a suboptimal model for production.
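To make the recommendation concrete, here is a minimal sketch of what a task-specific evaluation might look like, including a check for prompt sensitivity. Everything in it is an illustrative assumption rather than anything from Lambert's analysis: `run_model` is a hypothetical stand-in for your own inference call, and the toy arithmetic items would be replaced by examples sampled from your actual task.

```python
# A minimal sketch, assuming a run_model(prompt) -> str inference call
# that you would swap for your own stack's client. All names, templates,
# and task items here are hypothetical illustrations.

from typing import Callable

def run_model(prompt: str) -> str:
    # Hypothetical placeholder for a real model call.
    return "4"

# Two phrasings of the same task: scoring both surfaces the prompt
# sensitivity that a single aggregate leaderboard number hides.
TEMPLATES = [
    "Q: {question}\nA (number only):",
    "Answer with only a number. {question}",
]

# Task-specific items; in practice, sample these from your own
# production traffic rather than a public, possibly contaminated benchmark.
ITEMS = [
    ("What is 2 + 2?", "4"),
    ("What is 3 + 5?", "8"),
]

def accuracy(model: Callable[[str], str], template: str) -> float:
    # Score the model on your task under one prompt phrasing.
    correct = 0
    for question, expected in ITEMS:
        prediction = model(template.format(question=question))
        if prediction.strip() == expected:
            correct += 1
    return correct / len(ITEMS)

if __name__ == "__main__":
    # Reporting per-template scores, not one blended number, makes
    # prompt-induced variance visible before you commit to a model.
    for template in TEMPLATES:
        print(f"{template!r}: accuracy {accuracy(run_model, template):.2%}")
```

Reporting a score per prompt template, rather than one blended figure, is the small design choice that keeps the prompt-sensitivity problem visible instead of averaged away.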