Direct Preference Optimization (DPO) now applies to vision and audio models, moving past simple text alignment. Researchers use this method to refine model outputs based on human preferences without needing a separate reward model. This streamlines the training pipeline for multimodal systems. Practitioners can now align non-textual outputs with significantly less computational overhead.