A single evaluation score often masks the complex variables driving the performance gap between open and closed models. Nathan Lambert examines how data contamination and benchmark saturation skew these metrics, warning practitioners that raw leaderboard rankings frequently misrepresent real-world utility. Accurate model selection now requires deeper, task-specific validation rather than reliance on generic scores.
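To make that last point concrete, here is a minimal sketch of what task-specific validation can look like in practice: scoring candidate models against a handful of examples drawn from your own workload instead of a public benchmark. The `TaskExample` type, `task_accuracy` helper, and `dummy_model` stub are hypothetical illustrations under this assumption, not tooling from the original analysis.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskExample:
    prompt: str
    expected: str  # reference answer for exact-match scoring


def task_accuracy(model_fn: Callable[[str], str],
                  examples: list[TaskExample]) -> float:
    """Fraction of examples where the model output matches the reference.

    Exact match is deliberately strict and cheap; swap in a scorer that
    fits your task (rubric grading, unit tests, etc.) as needed.
    """
    hits = sum(
        model_fn(ex.prompt).strip().lower() == ex.expected.strip().lower()
        for ex in examples
    )
    return hits / len(examples)


if __name__ == "__main__":
    # A few examples pulled from your own task, not a leaderboard suite.
    examples = [
        TaskExample("Extract the total: 'Total due: $42.10'", "$42.10"),
        TaskExample("Extract the total: 'Amount owed is $7.00'", "$7.00"),
    ]

    def dummy_model(prompt: str) -> str:
        # Placeholder: always answers "$42.10". Replace with a real call
        # to whichever open or closed model you are comparing.
        return "$42.10"

    print(f"task accuracy: {task_accuracy(dummy_model, examples):.2f}")
```

Running the same small harness against each candidate model yields a directly comparable number grounded in your use case, which is the kind of signal a saturated or contaminated benchmark cannot provide.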