Direct Preference Optimization (DPO) now applies to vision-language models and reward-free alignment. Researchers are moving past simple chat preferences to optimize complex multimodal outputs. This shift simplifies the training pipeline by removing the need for a separate reward model. Practitioners can now align non-textual AI outputs with significantly less computational overhead.