Evaluation benchmarks often reduce complex model capabilities to a single number. Nathan Lambert argues that these metrics mask the nuanced trade-offs between open-weights and proprietary systems. This analysis examines how data contamination and differing training objectives skew perceived performance. Practitioners must prioritize task-specific validation over generic leaderboard rankings to avoid deployment failures.
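
To make the closing recommendation concrete, here is a minimal sketch of what task-specific validation can look like in practice; the `predict` stub, the example prompts and labels, and the leaderboard figure are all hypothetical stand-ins for illustration, not details from the original argument.

```python
# Minimal sketch of task-specific validation, assuming a hypothetical
# `predict` callable that wraps whichever model is being considered.
# All example data and the leaderboard figure below are illustrative only.

from typing import Callable, List, Tuple


def task_accuracy(predict: Callable[[str], str],
                  examples: List[Tuple[str, str]]) -> float:
    """Exact-match accuracy on a practitioner's own held-out task examples."""
    correct = sum(
        1 for prompt, expected in examples
        if predict(prompt).strip() == expected.strip()
    )
    return correct / len(examples)


if __name__ == "__main__":
    # Stand-in for a real model call (API client, local inference, etc.).
    def predict(prompt: str) -> str:
        return "REFUND_APPROVED"  # placeholder output

    # A handful of examples drawn from the actual deployment task.
    examples = [
        ("Customer asks for a refund on a damaged item -> label?", "REFUND_APPROVED"),
        ("Customer asks to change the shipping address -> label?", "ADDRESS_UPDATE"),
    ]

    acc = task_accuracy(predict, examples)
    leaderboard_score = 0.88  # generic benchmark number, shown only for contrast
    print(f"task-specific accuracy: {acc:.2f} vs leaderboard score: {leaderboard_score:.2f}")
```

The point of the sketch is the comparison in the last line: a generic leaderboard number and accuracy on one's own deployment-relevant examples can diverge sharply, and only the latter reflects the risk of a deployment failure.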