Debugging GLM-5 at scale surfaced critical bottlenecks in the agent serving infrastructure. Engineers hit severe latency spikes when handling long-context coding tasks across distributed clusters. These findings underscore the gap between single-prompt benchmarks and production agentic workflows: practitioners must prioritize memory management and state persistence to prevent cascading failures during long multi-step autonomous runs.
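The source does not describe GLM-5's actual serving internals, but the two mitigations it calls out can be illustrated with a minimal sketch: checkpointing agent state to disk after every iteration (so a crash loses at most one step) and bounding the retained context history (so memory does not grow without limit across long runs). All names here (`AgentState`, `MAX_CONTEXT_MESSAGES`, the checkpoint layout) are hypothetical, not part of any real GLM-5 API.

```python
# Illustrative sketch only: hypothetical agent-state persistence with
# bounded history. Not GLM-5's actual implementation.
import json
import os
from dataclasses import dataclass, field, asdict

# Assumed bound on retained context messages (illustrative value).
MAX_CONTEXT_MESSAGES = 50


@dataclass
class AgentState:
    task_id: str
    iteration: int = 0
    history: list = field(default_factory=list)


def save_checkpoint(state: AgentState, path: str) -> None:
    # Write to a temp file, then atomically rename, so a crash mid-write
    # never leaves a corrupt checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(asdict(state), f)
    os.replace(tmp, path)


def load_checkpoint(path: str, task_id: str) -> AgentState:
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return AgentState(**json.load(f))
    return AgentState(task_id=task_id)


def run_step(state: AgentState, observation: str) -> None:
    # Append the new observation, then truncate to the newest
    # MAX_CONTEXT_MESSAGES entries to bound memory growth.
    state.history.append(observation)
    if len(state.history) > MAX_CONTEXT_MESSAGES:
        state.history = state.history[-MAX_CONTEXT_MESSAGES:]
    state.iteration += 1
```

A caller would invoke `run_step` followed by `save_checkpoint` on each loop iteration; after a process crash, `load_checkpoint` restores the task at its last completed step rather than restarting from scratch.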