The HORIZON benchmark analyzed 3,100 trajectories from GPT-5 and Claude models to pinpoint why agents fail during extended tasks. While agents excel at short-term goals, they struggle with interdependent action sequences. This diagnostic tool allows researchers to isolate specific failure points. Practitioners can now systematically identify where agentic workflows break across different domains.