Single evaluation scores often mask the variables driving the gap between open and closed models. Nathan Lambert examines how specific benchmarking choices skew perceived progress, showing that raw numbers rarely tell the full story of model capability. Practitioners should prioritize task-specific evaluations over aggregate leaderboard rankings to gauge actual utility; the sketch below illustrates how an aggregate average can hide large per-task differences.
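
A minimal sketch of the point: two models can share an identical leaderboard average while diverging sharply on individual tasks. The model names, tasks, and scores here are invented for illustration and are not taken from any real benchmark.

```python
# Hypothetical per-task scores for two models; all numbers are invented
# for illustration only, not drawn from any real evaluation.
open_model = {"math": 0.42, "coding": 0.55, "summarization": 0.81, "qa": 0.78}
closed_model = {"math": 0.71, "coding": 0.74, "summarization": 0.62, "qa": 0.49}

def aggregate(scores: dict[str, float]) -> float:
    """Unweighted mean: the single number a leaderboard typically reports."""
    return sum(scores.values()) / len(scores)

# Both aggregates come out to 0.64, so a leaderboard ranks them as equals.
print(f"open aggregate:   {aggregate(open_model):.3f}")
print(f"closed aggregate: {aggregate(closed_model):.3f}")

# Per-task deltas tell a different story than the identical aggregates.
for task in open_model:
    delta = open_model[task] - closed_model[task]
    print(f"{task:>14}: open - closed = {delta:+.2f}")
```

Running this prints equal aggregates (0.640 for both) alongside per-task gaps of up to 0.29 in either direction, which is why a practitioner choosing a model for, say, summarization should look at that row rather than the headline number.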