Evaluation benchmarks often reduce complex model capabilities to a single number. Nathan Lambert examines the specific variables driving the performance gap between open-weight and closed-weight models. His analysis shows how fixating on a single metric can mask nuanced architectural differences. Practitioners should prioritize task-specific evaluations over aggregate scores, which can give a misleading picture of how a model will perform on a particular task.
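A small arithmetic sketch shows how an aggregate score can hide task-level differences. Everything in it is an assumption made for illustration: the model names (`model_a`, `model_b`), the task names, and the scores are invented and do not come from Lambert's analysis or from any real benchmark. The aggregate is taken here as an unweighted mean, which is only one of several ways a leaderboard might combine tasks.

```python
from statistics import mean

# Hypothetical, made-up per-task scores in percent (NOT real benchmark results).
scores = {
    "model_a": {"math": 90, "coding": 50, "summarization": 70},
    "model_b": {"math": 60, "coding": 80, "summarization": 70},
}


def aggregate(task_scores):
    """Unweighted mean across tasks: the 'single number' view."""
    return mean(task_scores.values())


def best_per_task(all_scores):
    """Map each task to the model with the highest score on that task.

    On a tie, the model listed first in `all_scores` is returned.
    """
    tasks = next(iter(all_scores.values())).keys()
    return {
        task: max(all_scores, key=lambda model: all_scores[model][task])
        for task in tasks
    }


# Aggregate view: one number per model.
aggregates = {model: aggregate(s) for model, s in scores.items()}

# Task-specific view: which model leads on each task.
winners = best_per_task(scores)

print(aggregates)
print(winners)
```

In this invented case both models have the same aggregate, yet they differ by 30 points on math and on coding in opposite directions. A practitioner choosing a model for coding would learn nothing from the aggregate alone, which is the motivation for reading per-task results first.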