Direct Preference Optimization (DPO) now extends to vision-language models and reward functions. Research has shifted from simple chat alignment toward optimizing preferences over complex visual reasoning and image generation. Because DPO trains the policy directly on pairs of preferred and rejected outputs, it does not need a separately trained reward model. Practitioners can therefore refine multimodal outputs in a single training stage, which costs less compute than a pipeline that first fits a reward model and then runs reinforcement learning against it.
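To make the single-stage idea concrete, here is a minimal sketch of the standard DPO loss for one preference pair. It assumes the summed log-probabilities of the preferred and rejected responses have already been computed, both under the policy being trained and under a frozen reference model. In a multimodal setting those log-probabilities would be conditioned on the image as well as the prompt. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) preference pair.

    Each argument is a summed log-probability of a full response.
    All names here are hypothetical and used only for illustration.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin: positive when the policy favors the
    # chosen response more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written as softplus(-margin) so that large
    # positive or negative margins do not overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy and reference agree exactly, so the margin is zero.
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)

# Policy has moved toward the chosen response and away from the rejected one.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

No reward model appears anywhere in the computation: the difference of log-ratios plays the role of the reward, which is why the text describes the approach as needing no separate reward model. In practice the same expression would be evaluated on batched tensors in an autodiff framework so that gradients flow into the policy.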