Evaluation benchmarks often reduce complex model capabilities to a single, misleading number. Nathan Lambert examines how data contamination and benchmark-specific prompting strategies inflate these scores. His analysis suggests that the perceived gap between open and closed models is narrower than leaderboard numbers imply. Practitioners should prioritize task-specific evaluations over generic leaderboard rankings to gauge actual utility.
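To make that recommendation concrete, here is a minimal sketch of a task-specific evaluation harness. The example task (extracting invoice totals), the `query_model` stub, and all data below are hypothetical illustrations, not Lambert's code; the point is that the scoring criterion reflects your actual use case rather than a generic benchmark metric.

```python
# Minimal task-specific evaluation sketch. Assumes a hypothetical use case:
# extracting invoice totals from free-form text. TASKS and query_model are
# placeholders to make the sketch self-contained and runnable.

def query_model(prompt: str) -> str:
    """Stub standing in for a real model call (API or local inference).
    Swap in the model you actually plan to deploy."""
    return "$1,042.50"  # canned answer so the sketch runs end to end


# Hand-written cases drawn from the real task, with the exact outputs needed.
TASKS = [
    {"prompt": "Extract the total from: 'Invoice #881, total due $1,042.50'",
     "expected": "$1,042.50"},
    {"prompt": "Extract the total from: 'Amount payable: $87.00 (Inv 12)'",
     "expected": "$87.00"},
]


def evaluate(tasks: list[dict]) -> float:
    """Score with the task's own success criterion (here: exact match),
    rather than a leaderboard's aggregate metric."""
    correct = sum(query_model(t["prompt"]).strip() == t["expected"]
                  for t in tasks)
    return correct / len(tasks)


if __name__ == "__main__":
    # A number like this, measured on your own tasks, says more about
    # deployment utility than a generic leaderboard ranking does.
    print(f"Task accuracy: {evaluate(TASKS):.0%}")
```

Even a few dozen hand-written cases like these can surface failure modes that contaminated or prompt-gamed benchmark scores conceal.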