Post-training now relies on a precise mix of SFT and RLHF to refine model reasoning. Finbarr Timbers explains how iterative preference optimization replaces simple supervised learning to maximize performance. This shift prioritizes high-quality data over sheer volume. Practitioners should focus on reward model calibration to prevent reward hacking in complex reasoning tasks.