No frontier model scored above 50% on ITBench-AA, a new benchmark for agentic IT tasks. Developed by IBM and Artificial Analysis, the benchmark exposes a gap in complex tool use and reasoning: current agents struggle with multi-step enterprise workflows. The results suggest that general-purpose LLMs still lack the reliability needed for autonomous IT operations.