The vLLM team shifted its reinforcement learning focus to prioritize answer correctness over surface-level corrections. By refining how models evaluate their own errors, the team reduces the risk of reward hacking, where a policy learns to game the reward signal instead of solving the underlying task. This approach helps ensure that RLHF updates improve actual reasoning rather than merely mimicking preferred formatting, so practitioners can expect more stable convergence during training.
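To make the correctness-versus-formatting distinction concrete, here is a minimal sketch contrasting two reward functions: a format-based reward that a model can hack by emitting the right template, and a verifiable correctness-based reward that pays out only for a right answer. The function names (`format_reward`, `correctness_reward`), the `Answer:` template, and the `reference_answer` parameter are illustrative assumptions, not vLLM's actual implementation.

```python
import re

# Hypothetical, easily hacked reward: pays out for matching the expected
# answer template, regardless of whether the answer is right.
def format_reward(response: str) -> float:
    return 1.0 if re.search(r"Answer:\s*\S+", response) else 0.0

# Verifiable reward: pays out only when the extracted final answer
# matches a known reference, closing that reward-hacking channel.
def correctness_reward(response: str, reference_answer: str) -> float:
    match = re.search(r"Answer:\s*(.+)", response)
    if match is None:
        return 0.0  # no parsable answer, no reward
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == reference_answer.strip().lower() else 0.0

# A well-formatted but wrong response fools the format reward
# while earning nothing from the correctness reward.
response = "Reasoning: 2 + 2 is clearly 5.\nAnswer: 5"
print(format_reward(response))            # 1.0 -- rewarded for formatting alone
print(correctness_reward(response, "4"))  # 0.0 -- wrong answer, no reward
```

Grading against a verifiable reference rather than stylistic cues is what removes the incentive to optimize formatting at the expense of reasoning.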