The HORIZON benchmark analyzes more than 3,100 trajectories from GPT-5 and Claude models to pinpoint why agents fail on extended, interdependent tasks. The researchers found that while short-horizon performance remains strong, longer interdependent sequences trigger systemic breakdowns. The resulting diagnostic framework lets developers isolate specific failure modes rather than guess at why agents loop or stall.