Evaluation benchmarks often obscure the nuanced trade-offs between open-weight and closed-weight models. Nathan Lambert examines how data contamination and prompt sensitivity skew these single-number metrics, and his analysis suggests that the perceived gap is often narrower than leaderboard rankings imply. Practitioners should prioritize task-specific evaluations over generic benchmarks to gauge real-world utility; a sketch of what that can look like follows.
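
As a rough illustration of both points, here is a minimal sketch of a task-specific evaluation that also measures prompt sensitivity: the same questions are asked under several prompt paraphrases, and the spread across templates is reported alongside the mean. Everything here is a hypothetical placeholder, not anyone's actual harness: `query_model` stands in for whatever API or local inference call you use, and the templates and examples are toy data you would replace with items from your own domain (which, as a side benefit, sidesteps contamination from public benchmark sets).

```python
from statistics import mean, pstdev

def query_model(prompt: str) -> str:
    """Hypothetical model call; swap in your API or local inference."""
    raise NotImplementedError("plug in your model here")

# Several paraphrases of the same task: if accuracy swings widely
# across them, a single leaderboard number is hiding that sensitivity.
TEMPLATES = [
    "Question: {q}\nAnswer:",
    "{q}\nThe answer is",
    "Please answer the following question.\n{q}",
]

# Task-specific examples drawn from your own use case, not a public
# benchmark, so training-data contamination is less of a concern.
EXAMPLES = [
    {"q": "What is 12 * 7?", "a": "84"},
    {"q": "What is the capital of France?", "a": "Paris"},
]

def accuracy(template: str) -> float:
    """Fraction of examples answered correctly under one template."""
    hits = 0
    for ex in EXAMPLES:
        pred = query_model(template.format(q=ex["q"]))
        hits += ex["a"].lower() in pred.lower()  # lenient string match
    return hits / len(EXAMPLES)

def prompt_sensitivity_report() -> None:
    scores = [accuracy(t) for t in TEMPLATES]
    # The mean is the "headline" number; the spread across templates
    # is the part a single leaderboard score never shows.
    print(f"mean={mean(scores):.2f}  "
          f"spread={pstdev(scores):.2f}  per-template={scores}")
```

Even a toy harness like this makes the trade-off concrete: two models with identical headline means can have very different spreads, and for a fixed deployment prompt it is the per-template score, not the average, that predicts utility.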