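GPQA, once a tough benchmark, was effectively saturated in 2024, just a year after its release. In 2025, METR introduced its Time Horizon method, which gauges agent skill by the length of tasks, measured in human working time, that an agent can complete reliably. The shift suggests that fixed tests can no longer serve as an upper bound on AI progress: once models saturate a static benchmark, it stops distinguishing stronger systems from weaker ones. Practitioners should adopt dynamic, scenario-based metrics to keep pace.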