Evaluation benchmarks often reduce complex model capabilities to a single number. Nathan Lambert examines the specific variables driving the performance gap between proprietary and open-weight models, and his analysis shows how current metrics can mislead developers. To predict deployment success accurately, practitioners should prioritize task-specific evaluations over generic leaderboards.
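
The idea of favoring a task-specific evaluation over a generic leaderboard score can be sketched in a few lines. Everything here is hypothetical (the `toy_model` callable, the `task_suite` test set, and exact-match scoring are illustrative assumptions, not the article's method):

```python
# Minimal sketch of a task-specific evaluation harness.
# All names and data below are hypothetical stand-ins.

def exact_match_accuracy(model_fn, test_cases):
    """Score a model callable on (prompt, expected) pairs for one task."""
    correct = sum(
        1 for prompt, expected in test_cases
        if model_fn(prompt).strip() == expected
    )
    return correct / len(test_cases)

# A toy "model" standing in for a real API call or local inference.
def toy_model(prompt):
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")

# A small suite drawn from *your* deployment task, not a public benchmark.
task_suite = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
]

score = exact_match_accuracy(toy_model, task_suite)
print(f"task-specific accuracy: {score:.2f}")  # prints "task-specific accuracy: 0.67"
```

The point of the sketch is that the test set is drawn from the actual deployment task, so the resulting number tracks the behavior you care about, unlike an aggregate leaderboard score.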