Evaluation benchmarks often reduce complex model capabilities to a single, misleading number. Nathan Lambert examines how data contamination and methodological choices in testing skew the perceived gap between open-weight and closed-weight models. His analysis suggests that closed models' performance leads are often narrower than leaderboards imply. Practitioners should therefore prioritize task-specific evaluations over generic leaderboard rankings, as the sketch below illustrates.
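
To make that recommendation concrete, here is a minimal sketch of a task-specific evaluation loop in Python. Everything in it is illustrative rather than drawn from Lambert's post: `query_model` is a hypothetical stand-in for whatever inference call you actually use, and the example data and exact-match scorer are placeholders for your own task set and metric.

```python
"""Minimal task-specific evaluation sketch (illustrative, not a
reference implementation). Replace the stubs with your own model
call, task examples, and scoring function."""


def query_model(prompt: str) -> str:
    # Hypothetical placeholder: swap in your real model call
    # (a local inference function, an API client, etc.).
    raise NotImplementedError("wire up your own model here")


# Examples drawn from *your* task and traffic, not a generic benchmark.
TASK_EXAMPLES = [
    {"prompt": "Summarize: <your domain text>", "expected": "<gold summary>"},
    # ... add enough examples to make the estimate meaningful
]


def exact_match(prediction: str, expected: str) -> bool:
    """Crude scorer; most real tasks need a task-specific metric."""
    return prediction.strip().lower() == expected.strip().lower()


def evaluate() -> float:
    """Return the fraction of task examples the model answers correctly."""
    correct = sum(
        exact_match(query_model(ex["prompt"]), ex["expected"])
        for ex in TASK_EXAMPLES
    )
    return correct / len(TASK_EXAMPLES)


if __name__ == "__main__":
    print(f"Task accuracy: {evaluate():.1%}")
```

A score from a loop like this, computed on held-out examples the model cannot have seen in training, sidesteps both the contamination problem and the single-number flattening that generic leaderboards suffer from.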