A single evaluation score often masks the many variables behind the apparent gap between open and closed models. Nathan Lambert examines how data contamination and benchmark saturation skew these metrics: leaked test data inflates scores, and saturated benchmarks stop separating models in a meaningful way. The upshot is that raw leaderboard numbers frequently mislead developers; practitioners should prioritize diverse, private test sets over public leaderboards to gauge true model utility.
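
To make the recommendation concrete, here is a minimal, hypothetical Python sketch of what "prefer a private test set" and "check for contamination" can look like in practice. Everything in it is a stand-in: the toy model, the private test cases, the training snippet, and the n-gram threshold are illustrative assumptions, not a description of any particular evaluation harness.

```python
"""Sketch: score a model on a private test set and flag possible benchmark
contamination via word n-gram overlap. All names and data are hypothetical."""

from typing import Callable, Iterable


def ngram_overlap(sample: str, corpus: str, n: int = 8) -> float:
    """Fraction of the sample's word n-grams that also appear in the corpus.

    A high value suggests the evaluation item may have leaked into training data.
    """
    words = sample.split()
    grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    if not grams:
        return 0.0
    corpus_words = corpus.split()
    corpus_grams = {tuple(corpus_words[i:i + n])
                    for i in range(len(corpus_words) - n + 1)}
    return len(grams & corpus_grams) / len(grams)


def evaluate(model: Callable[[str], str],
             cases: Iterable[tuple[str, str]]) -> float:
    """Exact-match accuracy on (prompt, expected) pairs from a private test set."""
    cases = list(cases)
    correct = sum(model(prompt).strip() == expected for prompt, expected in cases)
    return correct / len(cases)


if __name__ == "__main__":
    # Hypothetical private test cases; the point is that these never appear
    # on a public leaderboard or in scraped training data.
    private_cases = [
        ("What is 17 * 3?", "51"),
        ("Name the capital of France.", "Paris"),
    ]

    # Stand-in for a real model call (e.g. an API request to the model under test).
    def toy_model(prompt: str) -> str:
        return "51" if "17 * 3" in prompt else "Paris"

    print(f"private-set accuracy: {evaluate(toy_model, private_cases):.2f}")

    # Contamination check: compare a public benchmark item against a training snippet.
    benchmark_item = "The quick brown fox jumps over the lazy dog near the river bank today"
    training_snippet = "... the quick brown fox jumps over the lazy dog near the river bank today ..."
    overlap = ngram_overlap(benchmark_item.lower(), training_snippet.lower())
    print(f"8-gram overlap with training corpus: {overlap:.2f}")
```

The design point is simply that the private cases stay out of any public corpus, so a score on them is harder to inflate; the overlap check is one rough, commonly used heuristic for spotting items that may have leaked, not a definitive contamination test.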