The latest Sentence Transformers update adds native support for training multimodal embedding and reranker models. The library now streamlines aligning text and image vectors in a shared latent space, so developers can fine-tune vision-language models with far less boilerplate code. This in turn simplifies deploying accurate multimodal retrieval systems in production.
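
To make "aligning text and image vectors in a shared latent space" concrete, here is a minimal sketch of an in-batch contrastive objective, a common way to train this kind of alignment. It uses only the Python standard library and does not call the Sentence Transformers API. The function names, the toy 3-dimensional vectors, and the `scale` value are all assumptions chosen for illustration, not details taken from the release.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def in_batch_contrastive_loss(text_embs, image_embs, scale=20.0):
    """Mean cross-entropy where text i should match image i.

    Every other image in the batch acts as a negative for text i.
    `scale` is a hypothetical temperature-like multiplier on the similarities.
    """
    total = 0.0
    for i, text in enumerate(text_embs):
        logits = [scale * cosine(text, image) for image in image_embs]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(text_embs)


# Hypothetical toy embeddings: row i of each list describes the same item.
text_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1]]
aligned_images = [[0.9, 0.2, 0.0], [0.1, 0.9, 0.0]]
# Same image vectors with the pairing swapped, so each text faces the wrong image.
misaligned_images = [[0.1, 0.9, 0.0], [0.9, 0.2, 0.0]]

aligned_loss = in_batch_contrastive_loss(text_embs, aligned_images)
misaligned_loss = in_batch_contrastive_loss(text_embs, misaligned_images)

print(f"aligned loss:    {aligned_loss:.4f}")
print(f"misaligned loss: {misaligned_loss:.4f}")
```

The loss is lower when each text vector sits closest to its own image vector, and that is the signal fine-tuning uses to pull matching pairs together in the shared space. In real training the embeddings would come from a vision-language encoder rather than hand-written lists.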