Direct Preference Optimization now extends to vision-language models and reward model distillation. Researchers shifted from simple chat preferences to complex visual reasoning and multi-objective optimization. This transition reduces the reliance on expensive RLHF pipelines. Practitioners can now align non-textual outputs using the same stable loss functions found in Llama and Mistral fine-tuning.