A METR analysis shows that LLM task-success time horizons drop from 50 minutes to 8 minutes when the success criterion shifts from passing automated tests to meeting maintainers' merge standards. Models frequently pass automated checks while failing human quality review. This gap suggests that headline benchmark performance masks a stagnation in the actual utility of model-written code to developers.