No frontier model scored above 50% on the new ITBench-AA benchmark. Developed by IBM and Artificial Analysis, the benchmark evaluates agentic workflows for complex IT tasks. Current models struggle with the multi-step reasoning required for enterprise environments. This gap suggests that general-purpose LLMs are not yet ready for autonomous IT operations.