Evaluation benchmarks often obscure the complex factors behind the performance gap between open and closed models. Nathan Lambert examines how data contamination and the choice of evaluation metrics skew these single-number results, and warns practitioners against over-relying on static leaderboards. Accurate model comparison requires a closer look at training data and evaluation methodology.
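
To make the "closer look" concrete, here is a minimal sketch of one such check: flagging benchmark items whose n-grams also appear in a training corpus, a common heuristic for detecting data contamination. This is an illustrative assumption, not Lambert's method; the function names, the `NGRAM_SIZE` choice, and word-level tokenization are all hypothetical simplifications.

```python
# Sketch of an n-gram overlap contamination check (illustrative only).
NGRAM_SIZE = 8  # longer n-grams reduce false positives from common phrases


def ngrams(text: str, n: int = NGRAM_SIZE) -> set:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark: list, training_corpus: list) -> float:
    """Fraction of benchmark items sharing any n-gram with the training corpus."""
    train_ngrams = set()
    for doc in training_corpus:
        train_ngrams |= ngrams(doc)
    flagged = sum(1 for item in benchmark if ngrams(item) & train_ngrams)
    return flagged / len(benchmark) if benchmark else 0.0
```

Even a crude check like this illustrates the point: two models with identical leaderboard scores can differ sharply in how much of the benchmark leaked into their training data, which is exactly the kind of nuance a single number hides.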