Evaluation benchmarks often reduce complex model capabilities to a single, misleading number. Nathan Lambert examines how data contamination and benchmark-specific prompt engineering inflate these scores. His analysis indicates that the perceived gap between open and closed models is narrower than leaderboard scores suggest. Practitioners should prioritize task-specific evaluations over aggregate leaderboard rankings.
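
As context for the contamination point, one common way to probe for leakage is word-level n-gram overlap between benchmark items and the training corpus. The sketch below is illustrative rather than Lambert's method: the function names, the in-memory corpora, and the 13-gram window (the size the GPT-3 report used for its contamination analysis) are all assumptions.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Word-level n-grams of a text. A 13-gram window is a common
    heuristic for contamination checks (assumption: whitespace
    tokenization is good enough for a rough estimate)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(eval_examples: list, training_docs: list,
                       n: int = 13) -> float:
    """Fraction of eval examples sharing at least one n-gram with any
    training document -- a rough proxy for benchmark leakage."""
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    flagged = sum(1 for ex in eval_examples
                  if ngrams(ex, n) & train_ngrams)
    return flagged / len(eval_examples) if eval_examples else 0.0
```

In practice a training corpus is far too large to hold in memory, so production checks typically stream tokenized shards through Bloom filters or suffix arrays, but the underlying overlap test is the same.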