Evaluation benchmarks often reduce complex model capabilities to a single number. Nathan Lambert examines the specific variables that drive the performance delta between open-weight and proprietary systems. This analysis reveals how data contamination and evaluation methodology skew results. Practitioners should prioritize task-specific benchmarks over aggregate scores to gauge real-world utility.
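To make the last point concrete, here is a minimal sketch of how an aggregate score can hide task-level differences. The model names, task names, and scores are hypothetical placeholders chosen for illustration, not measured results, and the unweighted mean is an assumed aggregation rule rather than the method of any particular leaderboard.

```python
from statistics import mean

# Hypothetical per-task accuracies in percent.
# These are placeholder numbers for illustration, not real benchmark results.
scores = {
    "model_a": {"math": 90, "coding": 50, "summarization": 70},
    "model_b": {"math": 60, "coding": 80, "summarization": 70},
}


def aggregate(task_scores):
    """Collapse per-task scores into one number (unweighted mean)."""
    return mean(task_scores.values())


def best_for_task(all_scores, task):
    """Return the name of the model with the highest score on one task."""
    return max(all_scores, key=lambda name: all_scores[name][task])


# The single-number view: one aggregate per model.
aggregates = {name: aggregate(s) for name, s in scores.items()}
print(aggregates)

# The task-specific view: which model leads on each task of interest.
for task in ("math", "coding"):
    print(task, "->", best_for_task(scores, task))
```

In this toy setup both models tie on the aggregate, yet each one leads by a wide margin on a different task. A practitioner choosing a model for coding work would learn nothing from the aggregate and would need the per-task breakdown.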