By early 2025, GPQA had been saturated within roughly a year of its release, underscoring how quickly AI benchmarks reach their limits. The pattern is consistent: even the toughest tests introduced in 2024 were being eclipsed in 2025, forcing researchers to look for evaluation frameworks that don't expire so fast. One alternative is METR's Time Horizon method, which sidesteps static question sets by measuring the length of human tasks an agent can complete at a given success rate, giving practitioners a more realistic and longer-lived gauge of capability.
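The core idea behind a time-horizon metric can be illustrated simply: fit the probability of task success against (log) task length, then read off the duration at which success drops to 50%. The sketch below is illustrative only — the logistic form, the fitting procedure, and the sample data are assumptions, not METR's actual pipeline.

```python
import math

def fit_horizon(tasks, lr=0.05, steps=20000):
    """Fit p(success) = sigmoid(a - b * log2(minutes)) by gradient
    ascent on the log-likelihood, then return the 50% time horizon
    2**(a/b) in minutes. `tasks` is a list of (minutes, succeeded).
    A toy stand-in for a proper logistic-regression fit."""
    a, b = 0.0, 1.0
    for _ in range(steps):
        ga = gb = 0.0
        for minutes, ok in tasks:
            x = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a - b * x)))
            err = (1.0 if ok else 0.0) - p  # gradient of log-likelihood w.r.t. logit
            ga += err
            gb += -err * x
        a += lr * ga / len(tasks)
        b += lr * gb / len(tasks)
    return 2 ** (a / b)

# Hypothetical agent results: reliable on tasks under ~40 minutes,
# failing on longer ones (invented data for illustration).
data = [(5, True), (10, True), (20, True), (40, True),
        (60, False), (90, False), (120, False), (240, False)]
print(f"estimated 50% horizon: ~{fit_horizon(data):.0f} minutes")
```

The key design choice is reporting a *duration* rather than a score on a fixed question set: as agents improve, the horizon simply grows, so the metric doesn't saturate the way a static benchmark does.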