Evaluation benchmarks often reduce complex model capabilities to a single number. Nathan Lambert examines the specific factors driving the performance gap between open-weight and closed-weight systems. This analysis reveals that current gaps are often narrower than headline scores suggest. Practitioners should prioritize task-specific benchmarks over aggregate scores to determine actual deployment utility.
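
To illustrate why a single aggregate number can mislead, here is a minimal sketch in Python. The model names, task names, and scores are all hypothetical, chosen only to show the arithmetic; they are not results from Lambert's analysis or from any real benchmark.

```python
from statistics import mean

# Hypothetical per-task scores (illustrative only, not real benchmark results).
scores = {
    "model_a": {"coding": 70.0, "math": 80.0, "summarization": 90.0},
    "model_b": {"coding": 86.0, "math": 84.0, "summarization": 76.0},
}


def aggregate(per_task):
    """Collapse per-task scores into one headline number (unweighted mean)."""
    return mean(per_task.values())


def best_model(all_scores, task=None):
    """Return the top model on `task`, or on the aggregate when task is None."""
    if task is None:
        return max(all_scores, key=lambda m: aggregate(all_scores[m]))
    return max(all_scores, key=lambda m: all_scores[m][task])


if __name__ == "__main__":
    for name, per_task in scores.items():
        print(name, "aggregate:", aggregate(per_task))
    print("best overall:", best_model(scores))
    print("best for summarization:", best_model(scores, "summarization"))
```

With these made-up numbers the aggregates differ by only two points, yet the model that leads on the aggregate trails by a wide margin on the summarization task. A team deploying a summarization product would pick the wrong model if it looked only at the headline score.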