Evaluation benchmarks often collapse nuanced model behavior into a single performance score. Nathan Lambert examines the specific factors driving the gap between open-source and proprietary models, showing how data contamination and evaluation leakage skew reported results. Practitioners should therefore prioritize task-specific benchmarks over aggregate leaderboard scores when gauging a model's actual utility.
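To make the contamination point concrete, here is a minimal Python sketch of an n-gram overlap check, the kind of rough test-set leakage audit used in practice (GPT-3's report, for instance, used 13-gram overlap). The function names and the 13-token window are illustrative assumptions, not Lambert's methodology, and real audits normalize text and tokenization far more carefully.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Return the set of word-level n-grams in a text.
    A ~13-token window is a common choice for contamination checks."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(eval_examples: list, training_corpus: list, n: int = 13) -> float:
    """Fraction of eval examples sharing at least one n-gram with the
    training corpus -- a crude proxy for test-set leakage."""
    train_grams = set()
    for doc in training_corpus:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for ex in eval_examples if ngrams(ex, n) & train_grams)
    return flagged / max(len(eval_examples), 1)

# Hypothetical toy data: at a short window (n=5), the eval example is
# flagged because a 5-gram from it appears verbatim in the training text.
evals = ["What is the capital of France? Paris is the capital of France."]
train = ["... Paris is the capital of France and its largest city ..."]
print(contamination_rate(evals, train, n=5))  # prints 1.0
```

A model scoring well on flagged examples may be reciting memorized training text rather than demonstrating capability, which is exactly why a single aggregate score can mislead.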