No frontier model scored above 50% on ITBench-AA, a new evaluation of agentic IT tasks. IBM and Artificial Analysis tested models on complex system administration and troubleshooting workflows. The results expose a gap between general reasoning ability and reliable execution in enterprise settings. Practitioners should expect significant failure rates when deploying autonomous agents to manage technical infrastructure.