The HORIZON benchmark analyzes over 3,100 trajectories to pinpoint why LLM agents fail during extended, interdependent action sequences. Researchers tested GPT-5 and Claude variants across four domains to characterize these breakdowns. The data reveals a sharp performance drop as task length increases. This diagnostic framework allows developers to isolate specific failure points in agentic workflows.
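
The headline pattern here, success rate falling as horizon length grows, is the kind of aggregate a developer can reproduce from their own trajectory logs. Below is a minimal sketch of that analysis; the `Trajectory` record and `success_by_horizon` helper are illustrative names under an assumed log schema (domain label, step count, success flag), not part of HORIZON's actual API.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Trajectory:
    domain: str      # task domain, e.g. "web" or "code"
    num_steps: int   # horizon length: actions taken in the episode
    success: bool    # whether the agent completed the task

def success_by_horizon(trajectories, bucket_size=5):
    """Bucket trajectories by horizon length and compute success rate per bucket."""
    buckets = defaultdict(lambda: [0, 0])  # bucket start -> [successes, total]
    for t in trajectories:
        b = (t.num_steps // bucket_size) * bucket_size
        buckets[b][0] += t.success
        buckets[b][1] += 1
    return {b: successes / total for b, (successes, total) in sorted(buckets.items())}

# Toy data: in a long-horizon benchmark, the longer buckets tend to score lower.
demo = [
    Trajectory("web", 3, True),
    Trajectory("web", 12, True),
    Trajectory("code", 14, False),
    Trajectory("code", 27, False),
]
print(success_by_horizon(demo))  # e.g. {0: 1.0, 10: 0.5, 25: 0.0}
```

Plotting the resulting rates against bucket start would surface the drop-off the benchmark describes, and splitting by the `domain` field would show whether the decline is uniform across task types.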