Thirteen open-weight models were evaluated on Omni-MATH and ARC-AGI to determine whether natural-language feedback improves accuracy. The researchers used a student-teacher protocol to distinguish genuine learning from the effects of simple resampling or format corrections. The findings suggest that multi-turn improvements often stem from additional test-time computation rather than from the quality of the instruction. This result challenges the perceived efficacy of self-refinement.
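
One way to picture the separation of feedback-driven learning from resampling is to compare a feedback condition against a control that receives the same number of attempts with no feedback. The sketch below is illustrative only: the three conditions, the `decompose_gain` function, and the `GainBreakdown` type are hypothetical names chosen here, and the summary above does not specify how the study computes its comparison.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GainBreakdown:
    """Hypothetical container for an accuracy decomposition."""
    baseline: float       # accuracy on the first attempt
    resample_gain: float  # gain from extra attempts with no feedback
    feedback_gain: float  # further gain beyond the resampling control


def accuracy(outcomes: Sequence[bool]) -> float:
    """Fraction of problems solved; outcomes[i] is True if problem i was solved."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


def decompose_gain(
    first_try: Sequence[bool],
    resample_only: Sequence[bool],
    with_feedback: Sequence[bool],
) -> GainBreakdown:
    """Split a multi-turn gain into a resampling part and a feedback part.

    Assumption (not from the source): all three sequences cover the same
    problems in the same order, and the two multi-turn conditions use the
    same number of attempts.
    """
    if not (len(first_try) == len(resample_only) == len(with_feedback)):
        raise ValueError("all conditions must cover the same problems")
    base = accuracy(first_try)
    control = accuracy(resample_only)
    treated = accuracy(with_feedback)
    return GainBreakdown(
        baseline=base,
        resample_gain=control - base,
        feedback_gain=treated - control,
    )
```

Under this framing, a `feedback_gain` near zero alongside a positive `resample_gain` would indicate that the multi-turn improvement comes from the extra attempts rather than from the content of the feedback.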