Evaluation benchmarks often reduce complex model capabilities to a single number. Nathan Lambert examines the specific factors driving the performance gap between open-weight and closed-weight models. His analysis finds that current gaps are often narrower than marketing implies. Practitioners should prioritize task-specific benchmarks over general leaderboards when judging whether a model is fit for deployment.
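
To make the point about single-number scores concrete, here is a minimal Python sketch. The model names, task names, and scores are all made up for illustration and are not real benchmark results; the helper functions are hypothetical and not taken from any evaluation library. It shows how two models can tie on an averaged, leaderboard-style score while differing on the task that matters for a given deployment.

```python
from statistics import mean

# Hypothetical scores (0-100), invented for illustration only.
# These are NOT real benchmark results for any actual model.
SCORES = {
    "model_a": {"coding": 70.0, "math": 60.0, "summarization": 80.0},
    "model_b": {"coding": 60.0, "math": 62.0, "summarization": 88.0},
}


def aggregate(scores):
    """Collapse per-task scores into one leaderboard-style mean per model."""
    return {model: mean(tasks.values()) for model, tasks in scores.items()}


def per_task_gap(scores, a, b):
    """Score of model `a` minus score of model `b`, for each task."""
    return {task: scores[a][task] - scores[b][task] for task in scores[a]}


def best_for_task(scores, task):
    """Return the model with the highest score on one specific task."""
    return max(scores, key=lambda model: scores[model][task])


if __name__ == "__main__":
    print("aggregate:", aggregate(SCORES))
    print("per-task gap (a - b):", per_task_gap(SCORES, "model_a", "model_b"))
    for task in ("coding", "summarization"):
        print(f"best for {task}:", best_for_task(SCORES, task))
```

With these invented numbers the two aggregates are identical, yet the per-task view picks a different model for coding than for summarization, which is the information a deployment decision actually needs.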