Thirteen open-weight models were tested on Omni-MATH and Codeforces to determine whether natural-language feedback actually improves performance. Researchers used a student-teacher protocol to separate genuine learning from the effects of simple resampling or format corrections. The findings reveal that multi-turn improvements often stem from test-time computation rather than from useful guidance, a result that challenges current assumptions about agentic refinement.
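
To make the idea of separating these effects concrete, here is a minimal sketch of one way a multi-turn gain could be split into parts. It is not the study's actual protocol: the four conditions, the `Outcome` record, and the `decompose_gain` function are all hypothetical names, and the sketch assumes the conditions are nested (each adds one ingredient to the previous one) so that the parts sum to the total gain.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Outcome:
    """Per-problem correctness under four hypothetical conditions."""
    single_turn: bool  # one attempt, no follow-up turns
    resample: bool     # extra attempts, no feedback at all
    format_only: bool  # extra attempts plus format-only corrections
    feedback: bool     # extra attempts plus natural-language feedback


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of problems solved; 0.0 for an empty input."""
    values = list(flags)
    return sum(values) / len(values) if values else 0.0


def decompose_gain(outcomes: List[Outcome]) -> Dict[str, float]:
    """Split the multi-turn gain into resampling, format, and feedback parts.

    Assumption: conditions are nested, so the three parts add up to
    total_gain. Real experiments may not satisfy this exactly.
    """
    base = accuracy(o.single_turn for o in outcomes)
    resample = accuracy(o.resample for o in outcomes)
    format_only = accuracy(o.format_only for o in outcomes)
    feedback = accuracy(o.feedback for o in outcomes)
    return {
        "total_gain": feedback - base,
        "resampling_gain": resample - base,
        "format_gain": format_only - resample,
        "feedback_gain": feedback - format_only,
    }


# Toy, made-up outcomes for four problems (illustration only, not real data).
toy = [
    Outcome(True, True, True, True),
    Outcome(False, True, True, True),
    Outcome(False, False, True, True),
    Outcome(False, False, False, False),
]
print(decompose_gain(toy))
```

In the toy data the whole improvement over the single-turn baseline is accounted for by the resampling and format-only controls, leaving a feedback-specific gain of zero. That is the kind of pattern the summary describes, where apparent multi-turn improvement does not come from the guidance itself.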