In a controlled experiment, nine autonomous Claude instances outperformed human researchers on an open alignment problem. When Anthropic attempted to transfer the winning method to production models, the performance gains vanished. The discrepancy highlights a gap between success in an isolated agentic setting and deployment at scale. Practitioners are left to reconcile results that held in the experimental environment but not in production.