No frontier model scored above 50% on ITBench-AA, a new benchmark for agentic IT tasks. Developed by IBM and Artificial Analysis, the benchmark evaluates complex troubleshooting and system administration. The results indicate that current LLMs struggle with the multi-step reasoning that enterprise environments require. The gap suggests that autonomous IT agents are not yet reliable enough for production use.