A METR analysis finds that the time horizon of coding tasks LLMs can complete successfully drops from about 50 minutes to 8 minutes when success is judged by project maintainers rather than by automated tests. The gap points to a disconnect between passing tests and producing code of mergeable quality, and the data suggest that actual utility may be plateauing. Practitioners evaluating coding agents should therefore weight human review more heavily than test-driven benchmarks alone.
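To make the tests-pass-versus-mergeable distinction concrete, here is a minimal, hypothetical sketch (not drawn from the METR analysis; the function, test, and scenario are invented for illustration). The patch satisfies its automated check, yet a maintainer reviewing it would likely block the merge.

```python
# Hypothetical illustration: a patch that passes its automated test
# but would not survive maintainer review.

def parse_duration(value: str) -> int:
    """Return a duration string like '5m' as a number of seconds."""
    # Passes the test below, but a reviewer would flag: no support for hours
    # or seconds suffixes, no whitespace trimming, no validation, and a raw
    # ValueError on malformed input such as "5" or "".
    return int(value[:-1]) * 60


def test_parse_duration() -> None:
    # The automated benchmark only exercises the happy path, so the patch "succeeds".
    assert parse_duration("5m") == 300


if __name__ == "__main__":
    test_parse_duration()
    print("automated check passed; review comments would still block the merge")
```

The point of the sketch is that a grading signal built only from such tests rewards the happy path, while a maintainer's judgment also prices in robustness, readability, and fit with the surrounding codebase.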