Evaluation benchmarks often reduce complex model capabilities to a single, misleading number. Nathan Lambert examines how data contamination and specific testing methodologies skew the perceived gap between open-weight and closed models. His analysis warns practitioners against relying on leaderboard rankings alone: accurate assessment requires deeper, task-specific probing rather than chasing a generic score.
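Data contamination, at least, is something practitioners can probe directly rather than take on faith. The sketch below illustrates one common heuristic for doing so: flagging benchmark test items that share long n-grams with a training corpus. It is a minimal example under stated assumptions, not Lambert's method; the function names, the 13-gram window, and the toy data are all illustrative.

```python
# Minimal sketch of an n-gram overlap contamination check.
# All names, the 13-gram window, and the sample data are
# illustrative assumptions, not taken from the source analysis.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_items: list[str],
                       train_corpus: list[str],
                       n: int = 13) -> float:
    """Fraction of test items sharing at least one n-gram with the corpus."""
    train_ngrams: set[tuple[str, ...]] = set()
    for doc in train_corpus:
        train_ngrams |= ngrams(doc, n)
    flagged = sum(1 for item in test_items if ngrams(item, n) & train_ngrams)
    return flagged / len(test_items) if test_items else 0.0

if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    test = [
        "the quick brown fox jumps over the lazy dog near the river bank today",
        "a completely different benchmark question about arithmetic",
    ]
    print(f"contaminated fraction: {contamination_rate(test, train):.2f}")
```

Even a check like this is only a starting point: exact-match n-grams miss paraphrased leakage, which is part of why a single leaderboard number can mislead.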