A single aggregate evaluation score often masks the variables that actually drive the gap between open and closed models. Nathan Lambert examines how specific benchmarks fail to capture these nuanced capability shifts, and his analysis warns practitioners against relying on one-dimensional metrics. Better evaluation frameworks must prioritize task-specific utility over aggregate leaderboard rankings when guiding model selection.
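As a concrete illustration of the point, the minimal sketch below (with hypothetical model names and made-up scores, not figures from the analysis) shows how two models with nearly identical aggregate averages can diverge sharply on individual tasks, which is exactly the signal a leaderboard ranking discards:

```python
# Minimal sketch: aggregate averages can hide large task-level differences.
# All model names and scores here are hypothetical.
from statistics import mean

scores = {
    "open_model":   {"reasoning": 0.62, "coding": 0.81, "extraction": 0.74, "summarization": 0.88},
    "closed_model": {"reasoning": 0.85, "coding": 0.70, "extraction": 0.79, "summarization": 0.70},
}

# Aggregate view: the two models look interchangeable.
for name, per_task in scores.items():
    print(f"{name}: aggregate = {mean(per_task.values()):.3f}")

# Task-level view: the deltas reveal what the aggregate masks.
for task in scores["open_model"]:
    delta = scores["open_model"][task] - scores["closed_model"][task]
    print(f"{task}: open - closed = {delta:+.2f}")
```

Here the aggregates differ by well under a point, so choosing by leaderboard rank is effectively a coin flip, while choosing by the task that matters for the deployment (say, coding versus reasoning) is decisive.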