A single evaluation score often masks the factors driving the gap between open and closed models. Nathan Lambert examines how these scores shift depending on the testing methodology used, and his analysis suggests that current benchmarks can misrepresent actual capability differences. Practitioners should therefore weight task-specific evaluations over aggregate leaderboard rankings when judging whether a model is reliable for their use case.
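To make that recommendation concrete, here is a minimal, self-contained sketch with hypothetical per-task scores (not figures from the article): two models with nearly identical aggregate averages can differ sharply on individual tasks, which is exactly what a single leaderboard number hides.

```python
# Hypothetical accuracy per benchmark task (illustrative only, not real results).
from statistics import mean

scores = {
    "open_model":   {"math": 0.42, "coding": 0.55, "summarization": 0.81, "qa": 0.74},
    "closed_model": {"math": 0.68, "coding": 0.71, "summarization": 0.63, "qa": 0.52},
}

# Aggregate view: the two models look nearly equivalent (~0.63 vs ~0.64).
for name, per_task in scores.items():
    print(f"{name}: aggregate = {mean(per_task.values()):.2f}")

# Task-level view: large gaps in both directions that the average conceals.
for task in scores["open_model"]:
    delta = scores["open_model"][task] - scores["closed_model"][task]
    print(f"{task:>14}: open - closed = {delta:+.2f}")
```

A practitioner choosing a model for summarization or QA would reach the opposite conclusion from one choosing it for math or coding, even though the aggregate scores are indistinguishable.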