Evaluation benchmarks often reduce complex model capabilities to a single number. Nathan Lambert examines how data contamination and benchmark-specific prompt engineering inflate these scores. His analysis suggests that the gap between open and closed models is often narrower than reported. Practitioners should prioritize task-specific evaluations over generic leaderboard rankings to gauge actual utility.
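As a minimal sketch of what a task-specific evaluation can look like in practice, the snippet below scores a model on examples drawn from your own task rather than a public leaderboard. The `model_fn` callable, the toy examples, and the exact-match metric are illustrative assumptions, not part of Lambert's analysis; in practice you would swap in your own inference call and a metric suited to your task.

```python
# Minimal sketch of a task-specific evaluation harness (illustrative only).
# `model_fn` and the exact-match metric are hypothetical stand-ins; replace
# them with your own inference call and task-appropriate scoring.

from typing import Callable, Iterable, Tuple


def evaluate_on_task(
    model_fn: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
) -> float:
    """Return exact-match accuracy of model_fn over (prompt, expected) pairs."""
    total = 0
    correct = 0
    for prompt, expected in examples:
        prediction = model_fn(prompt).strip().lower()
        correct += int(prediction == expected.strip().lower())
        total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Toy stand-in for a real model; replace with a call to your model of choice.
    def toy_model(prompt: str) -> str:
        return "paris" if "capital of France" in prompt else "unknown"

    task_examples = [
        ("What is the capital of France?", "Paris"),
        ("What is the capital of Japan?", "Tokyo"),
    ]
    print(f"task accuracy: {evaluate_on_task(toy_model, task_examples):.2f}")
```

The point of keeping the harness this small is that the examples come from the deployment task itself, so contamination and prompt tricks tuned to a public benchmark do not inflate the score.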