A single evaluation score often masks the many variables driving the gap between open and closed models. Nathan Lambert examines how dataset contamination and evaluation shortcuts can inflate benchmark numbers, an analysis that pushes researchers to look past leaderboard chasing. Practitioners should prioritize task-specific validation over generic aggregate scores when gauging a model's actual utility.
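As one concrete illustration of what "dataset contamination" can look like in practice, the sketch below flags benchmark items whose word n-grams also appear in a training corpus. This is a generic overlap heuristic in the spirit of the decontamination checks reported in several LLM papers, not Lambert's specific method; the 13-gram window, the function names, and the toy corpora are all illustrative assumptions.

```python
# Minimal sketch of an n-gram overlap contamination check, assuming the
# benchmark items and the training corpus are available as plain strings.
# The 13-gram window and all names here are illustrative, not a standard API.
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams of a document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_train_index(train_docs: Iterable[str], n: int = 13) -> Set[Tuple[str, ...]]:
    """Union of all n-grams seen anywhere in the training corpus."""
    index: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index


def flag_contaminated(eval_items: Iterable[str],
                      train_index: Set[Tuple[str, ...]],
                      n: int = 13) -> List[str]:
    """Return eval items sharing at least one n-gram with the training set."""
    return [item for item in eval_items if ngrams(item, n) & train_index]


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog near the old barn today"]
    evals = [
        "the quick brown fox jumps over the lazy dog near the old barn today",
        "an entirely different question about arithmetic",
    ]
    index = build_train_index(train)
    print(flag_contaminated(evals, index))  # flags only the leaked first item
```

Even a crude check like this shows why an aggregate leaderboard score can overstate capability: items the model has effectively seen should be excluded, or at minimum scored separately, before a benchmark number is used to compare models.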