Evaluation benchmarks often reduce complex model capabilities to a single number. Nathan Lambert examines the specific variables driving the performance gap between open-weight and closed-weight models. His analysis suggests that current gaps are often artifacts of testing methodology rather than differences in raw capability. Practitioners should prioritize task-specific evaluations over aggregate scores to avoid being misled by headline benchmark numbers.
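A minimal sketch of why aggregate scores can mislead, using entirely hypothetical numbers (the model names and per-task accuracies below are illustrative, not real benchmark results): two models can share an identical average while differing sharply on the task a practitioner actually cares about.

```python
# Hypothetical per-task accuracy scores for two models.
# These numbers are invented for illustration, not real benchmark results.
scores = {
    "open_model":   {"code": 0.62, "math": 0.48, "summarization": 0.81},
    "closed_model": {"code": 0.55, "math": 0.70, "summarization": 0.66},
}

def aggregate(task_scores):
    """Single-number average of the kind leaderboards typically report."""
    return sum(task_scores.values()) / len(task_scores)

for name, task_scores in scores.items():
    print(name, round(aggregate(task_scores), 3))  # both print 0.637

# The aggregates are identical, yet on coding tasks the open model is
# clearly stronger and on math it is clearly weaker: the single number
# hides exactly the task-level delta that matters when choosing a model.
```

The point is not the specific numbers but the structure: any weighted average discards the per-task variance, so two models with very different strengths can look interchangeable on a leaderboard.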