Thirteen open-weight models were tested on Omni-MATH and ARC-AGI1 to isolate the effect of natural-language feedback. The researchers found that multi-turn accuracy gains often stem from resampling or format correction rather than from the model actually using the content of the feedback. This suggests that evaluations of current self-refinement loops may overstate the utility of feedback. Practitioners should therefore prioritize controlled protocols that separate these effects over raw multi-turn accuracy.
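
One way to picture a controlled protocol of this kind is to score every problem under matched conditions and attribute the multi-turn gain to each source separately. The sketch below is illustrative only and is not the study's actual procedure: the `Trial` fields, the three conditions, and the `decompose_gain` helper are all hypothetical names introduced here, and the toy outcomes are made up to exercise the function.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Trial:
    """Outcome of one problem under three hypothetical conditions."""
    first_attempt: bool   # correct on the initial turn
    resample_only: bool   # correct on a second attempt given no feedback
    with_feedback: bool   # correct on a second attempt given feedback


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def decompose_gain(trials: List[Trial]) -> Dict[str, float]:
    """Split the raw multi-turn gain into a resampling part and a feedback part."""
    base = accuracy(t.first_attempt for t in trials)
    resample = accuracy(t.resample_only for t in trials)
    feedback = accuracy(t.with_feedback for t in trials)
    return {
        "raw_gain": feedback - base,                  # what an uncontrolled report would show
        "resampling_gain": resample - base,           # gain from simply trying again
        "feedback_specific_gain": feedback - resample,  # gain beyond a second try
    }


# Toy, made-up outcomes used only to exercise the function (not study results).
toy_trials = [
    Trial(True, True, True),
    Trial(False, True, True),
    Trial(False, False, True),
    Trial(False, False, False),
]
print(decompose_gain(toy_trials))
```

A format-only condition (a second attempt that receives only a formatting hint) could be added as a fourth field in the same way, so that format correction is separated from the substantive feedback as well.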