Evaluation benchmarks often reduce complex model capabilities to a single number. Nathan Lambert examines the specific factors driving the performance gap between proprietary and open-weights systems. The analysis shows how data contamination and evaluation methodology skew reported benchmark scores. Practitioners should prioritize task-specific testing over aggregate leaderboards to gauge actual utility.
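
To make the point about single-number scores concrete, here is a minimal sketch in Python. The model names, task names, and scores are hypothetical values invented for illustration, not results from the analysis, and the unweighted mean is only one assumed way a leaderboard might aggregate.

```python
from statistics import mean

# Hypothetical per-task scores (percent correct), invented for illustration only.
scores = {
    "model_a": {"math": 90, "coding": 50, "summarization": 70},
    "model_b": {"math": 60, "coding": 80, "summarization": 70},
}


def aggregate(task_scores):
    """Collapse per-task scores into a single unweighted mean."""
    return mean(task_scores.values())


def best_model_for(task, all_scores):
    """Return the name of the model with the highest score on one task."""
    return max(all_scores, key=lambda name: all_scores[name][task])


if __name__ == "__main__":
    # Both models get the same aggregate, so a leaderboard would rank them as tied.
    for name, task_scores in scores.items():
        print(f"{name}: aggregate={aggregate(task_scores):.1f}")
    # A per-task view shows which model suits a specific workload.
    for task in ("math", "coding"):
        print(f"best for {task}: {best_model_for(task, scores)}")
```

In this toy setup the two models tie on the aggregate while differing by 30 points on individual tasks, which is the kind of difference a task-specific evaluation would surface and an aggregate leaderboard would hide.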