The HORIZON benchmark analyzes 3,100+ trajectories across four domains to pinpoint why agents fail during extended action sequences. Researchers tested GPT-5 variants and Claude models, finding consistent breakdowns on tasks with interdependent steps. This diagnostic framework lets developers move beyond aggregate performance metrics to identify specific failure points in agentic workflows.