Evaluation benchmarks often mask the complex variables driving the performance gap between open and closed models. Nathan Lambert examines how specific data curation choices and training recipes shape these single-number scores, challenging the field's reliance on static leaderboards as a proxy for capability. Practitioners should prioritize task-specific evaluations over aggregate scores to judge a model's actual utility in their own deployments.
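
To make the recommendation concrete, here is a minimal sketch of what a task-specific evaluation harness might look like, as opposed to reading a single aggregate number off a leaderboard. All names here (`EVAL_CASES`, `query_model`, the example tasks and grading rule) are hypothetical illustrations, not part of Lambert's analysis; a real harness would swap in an actual model call and graders suited to each task.

```python
from collections import defaultdict

# Hypothetical task-specific test cases: prompts paired with expected answers,
# grouped by the deployment task they are meant to represent.
EVAL_CASES = [
    {"task": "sql_generation", "prompt": "List users who signed up in 2024.",
     "expected": "SELECT * FROM users WHERE signup_year = 2024;"},
    {"task": "sql_generation", "prompt": "Count orders per customer.",
     "expected": "SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;"},
    {"task": "support_triage", "prompt": "Classify: 'My invoice is wrong.'",
     "expected": "billing"},
    {"task": "support_triage", "prompt": "Classify: 'App crashes on login.'",
     "expected": "bug"},
]


def query_model(prompt: str) -> str:
    """Placeholder for a real model call (API client, local inference, etc.)."""
    return ""  # stub so the harness runs end to end


def is_correct(output: str, expected: str) -> bool:
    """Naive exact-match grading; real tasks usually need task-specific graders."""
    return output.strip().lower() == expected.strip().lower()


def evaluate(cases):
    """Score each case and accumulate correct/total counts per task."""
    per_task = defaultdict(lambda: {"correct": 0, "total": 0})
    for case in cases:
        output = query_model(case["prompt"])
        per_task[case["task"]]["total"] += 1
        if is_correct(output, case["expected"]):
            per_task[case["task"]]["correct"] += 1
    return per_task


if __name__ == "__main__":
    results = evaluate(EVAL_CASES)
    overall_correct = sum(r["correct"] for r in results.values())
    overall_total = sum(r["total"] for r in results.values())
    # A single aggregate score hides which tasks a model actually handles well.
    print(f"aggregate: {overall_correct}/{overall_total}")
    for task, r in sorted(results.items()):
        print(f"{task}: {r['correct']}/{r['total']}")
```

The point of reporting per-task counts alongside the aggregate is that two models with identical overall scores can have very different profiles across the tasks that matter for a given deployment.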