A new testbed built on Qwen2.5-0.5B-Instruct shows that apparent success in reinforcement learning with verifiable rewards (RLVR) often reflects the measurement instrument rather than genuine model improvement. By separating reward channels from extractors, the study shows how an identical run can be scored as a success or as a failure depending on which extractor measures it. This exposes a critical measurement flaw in GRPO pipelines: practitioners need to decouple the two before treating training gains as valid.
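
To make the extractor effect concrete, here is a minimal sketch rather than the study's code. It scores one fixed set of completions with two answer extractors. The `#### <number>` answer format, the function names `strict_extract` and `lenient_extract`, and the toy completions are all assumptions made for illustration.

```python
import re

# Matches an integer or decimal, with optional leading minus sign.
NUMBER = r"-?\d+(?:\.\d+)?"


def strict_extract(text):
    """Accept an answer only if the text ends with '#### <number>'.

    The '####' marker is a hypothetical format, chosen for illustration.
    """
    match = re.search(r"####\s*(" + NUMBER + r")\s*$", text.strip())
    return match.group(1) if match else None


def lenient_extract(text):
    """Accept the last number that appears anywhere in the text."""
    numbers = re.findall(NUMBER, text)
    return numbers[-1] if numbers else None


def accuracy(completions, golds, extractor):
    """Fraction of completions whose extracted answer equals the gold answer."""
    hits = sum(extractor(c) == g for c, g in zip(completions, golds))
    return hits / len(golds)


# Toy completions (invented for this sketch): the model outputs are fixed,
# only the measuring instrument changes between the two scores below.
completions = [
    "6 * 7 = 42\n#### 42",           # right answer, expected format
    "The answer is 42.",             # right answer, wrong format
    "So we get 15 apples in total",  # right answer, wrong format
    "#### 9",                        # wrong answer, expected format
]
golds = ["42", "42", "15", "10"]

strict_acc = accuracy(completions, golds, strict_extract)
lenient_acc = accuracy(completions, golds, lenient_extract)
print(f"strict: {strict_acc:.2f}  lenient: {lenient_acc:.2f}")
```

The same four completions score 0.25 under the strict extractor and 0.75 under the lenient one, so a change in the reported number can come from the instrument alone. Reporting both scores side by side is one simple way to keep a formatting gain from being read as a reasoning gain.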