Single evaluation scores often mask the complex variables driving the performance gap between open and closed models. Nathan Lambert examines how data contamination and the choice of evaluation benchmarks distort these metrics, showing that raw leaderboard numbers can mislead developers. Practitioners should prioritize task-specific benchmarks over aggregate scores to gauge a model's true utility, as the sketch below illustrates.
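A minimal sketch of why aggregates mislead, using entirely hypothetical model names and scores: a macro-average across benchmarks can make two models look interchangeable even when they diverge sharply on the one task a practitioner actually cares about.

```python
# Hypothetical per-task scores for an "open" and a "closed" model (illustrative only).
scores = {
    "open_model":   {"mmlu": 0.72, "gsm8k": 0.60, "humaneval": 0.35, "summarization": 0.85},
    "closed_model": {"mmlu": 0.70, "gsm8k": 0.58, "humaneval": 0.68, "summarization": 0.60},
}

def aggregate(task_scores: dict[str, float]) -> float:
    """Macro-average across tasks -- the kind of single number a leaderboard reports."""
    return sum(task_scores.values()) / len(task_scores)

for name, task_scores in scores.items():
    print(f"{name}: aggregate = {aggregate(task_scores):.2f}")

# The aggregates land within about one point of each other (0.63 vs 0.64),
# yet the models are 33 points apart on code generation. A practitioner
# whose workload is coding should weight that task, not the average.
target_task = "humaneval"
gap = scores["closed_model"][target_task] - scores["open_model"][target_task]
print(f"Gap on {target_task}: {gap:.2f}")
```

The numbers are invented for illustration, but the structure of the argument is the point: an aggregate collapses task-level variance, so two models with nearly identical leaderboard scores can behave very differently on the workload that matters.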