Evaluation benchmarks often condense complex model behaviors into a single, misleading number. Nathan Lambert argues that this narrow focus obscures the nuanced trade-offs between open-weight and proprietary systems: two models can post identical aggregate scores while differing sharply on the tasks a given user actually cares about. Researchers should therefore prioritize diverse, task-specific metrics over aggregate scores, a shift that also guards against overreliance on saturated synthetic benchmarks that no longer discriminate between modern LLMs.
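
As a concrete illustration of how aggregation can mislead, here is a minimal Python sketch. The model names, task names, and scores are invented for illustration, not drawn from any real leaderboard; the point is only the arithmetic of averaging.

```python
# Hypothetical per-task scores for two fictional models.
# The numbers are chosen so both models share the same average.
scores = {
    "model_a": {"math": 0.92, "coding": 0.48, "summarization": 0.85, "safety": 0.75},
    "model_b": {"math": 0.70, "coding": 0.80, "summarization": 0.70, "safety": 0.80},
}

for name, tasks in scores.items():
    # A single aggregate score collapses the per-task profile.
    aggregate = sum(tasks.values()) / len(tasks)
    print(f"{name}: aggregate={aggregate:.2f}  per-task={tasks}")
```

Both models average 0.75, so a leaderboard sorted by aggregate score calls them a tie, yet a reader choosing a model for coding work would clearly prefer model_b. Only the per-task breakdown surfaces that trade-off.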