A single evaluation score often masks the many factors behind the gap between open and closed models. Nathan Lambert examines how specific benchmarks fail to capture real-world utility, and his analysis suggests that marginal gains in leaderboard rankings rarely translate into practical advantages. To guide development, researchers should prioritize nuanced, per-category evaluation over monolithic scores.
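To make the masking effect concrete, here is a minimal sketch with made-up category names and scores (purely illustrative, not taken from Lambert's analysis) showing how two models with identical aggregate scores can differ sharply once the evaluation is broken out by category.

```python
from statistics import mean

# Hypothetical per-category accuracies for an open and a closed model.
# The numbers are invented solely to illustrate the aggregation problem.
results = {
    "open_model": {
        "reasoning": 0.62,
        "coding": 0.55,
        "instruction_following": 0.81,
        "safety": 0.90,
    },
    "closed_model": {
        "reasoning": 0.70,
        "coding": 0.74,
        "instruction_following": 0.78,
        "safety": 0.66,
    },
}

for model, scores in results.items():
    # The single "leaderboard" number: an unweighted mean over categories.
    aggregate = mean(scores.values())
    print(f"{model}: aggregate={aggregate:.2f}")
    # The per-category view that the aggregate hides.
    for category, score in sorted(scores.items()):
        print(f"  {category:>22}: {score:.2f}")
```

In this toy setup both models average 0.72, yet they trade places on individual categories by 8 to 24 points, which is exactly the kind of difference a monolithic score cannot surface.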