Evaluation benchmarks often reduce complex model capabilities to a single number. Nathan Lambert examines the specific factors driving the performance gap between open-weight and closed-weight models. His analysis reveals that raw scores frequently mask nuanced architectural trade-offs. Practitioners should prioritize task-specific metrics over aggregate leaderboards to avoid misinformed deployment decisions.
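
A minimal sketch of how an aggregate score can hide task-level differences. The model names, task names, and scores below are hypothetical, chosen only to illustrate the arithmetic; they are not taken from Lambert's analysis or from any real leaderboard.

```python
# Hypothetical per-task scores in percentage points (illustrative only,
# not real benchmark results).
scores = {
    "model_a": {"coding": 80, "math": 50, "summarization": 65},
    "model_b": {"coding": 55, "math": 70, "summarization": 70},
}


def aggregate(per_task):
    """Unweighted mean across tasks, as a simple leaderboard might report."""
    return sum(per_task.values()) / len(per_task)


def best_for_task(all_scores, task):
    """Return the name of the model with the highest score on one task."""
    return max(all_scores, key=lambda model: all_scores[model][task])


aggregates = {model: aggregate(tasks) for model, tasks in scores.items()}

for model, value in aggregates.items():
    print(f"{model}: aggregate = {value:.1f}")

for task in ("coding", "math", "summarization"):
    print(f"best on {task}: {best_for_task(scores, task)}")
```

In this toy setup both models have the same aggregate, yet they differ by 25 points on coding and 20 points on math, so the single number says nothing about which one suits a coding-heavy deployment.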