Evaluation benchmarks often mask the many variables driving the performance gap between open and closed models. Nathan Lambert examines how factors such as data contamination and differing training objectives skew these single-number metrics, an analysis that challenges the reliability of current leaderboards. Researchers should prioritize nuanced, task-specific testing over aggregate scores to gauge a model's true utility.