The Sentence Transformers library now supports training and finetuning multimodal embedding and reranker models. With this update, developers can align text and image representations within a single framework, which simplifies building cross-modal retrieval systems. Contrastive learning workflows that previously required manual implementations can now be set up with far less boilerplate code.
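To make the contrastive objective behind text-image alignment concrete, here is a minimal plain-Python sketch of a symmetric in-batch contrastive loss. It is an illustration only, not the library's API or implementation: the function names (`l2_normalize`, `log_softmax`, `in_batch_contrastive_loss`), the toy 2-dimensional embeddings, and the `scale` value of 20.0 are all assumptions chosen for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def in_batch_contrastive_loss(text_embs, image_embs, scale=20.0):
    """Symmetric contrastive loss with in-batch negatives (hypothetical helper).

    text_embs[i] and image_embs[i] are assumed to be a matching pair;
    every other item in the batch serves as a negative.
    """
    texts = [l2_normalize(v) for v in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    n = len(texts)

    # sims[i][j] = scaled cosine similarity between text i and image j
    sims = [
        [scale * sum(a * b for a, b in zip(t, im)) for im in images]
        for t in texts
    ]

    # Text-to-image direction: each row should peak on its diagonal entry
    t2i = -sum(log_softmax(sims[i])[i] for i in range(n)) / n

    # Image-to-text direction: same idea over the columns
    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    i2t = -sum(log_softmax(cols[j])[j] for j in range(n)) / n

    return 0.5 * (t2i + i2t)


# Toy batch: two text embeddings and their paired image embeddings
text_batch = [[1.0, 0.0], [0.0, 1.0]]
image_batch = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = in_batch_contrastive_loss(text_batch, image_batch)
shuffled_loss = in_batch_contrastive_loss(text_batch, image_batch[::-1])
print(f"aligned: {aligned_loss:.4f}, shuffled: {shuffled_loss:.4f}")
```

The loss is low when each text embedding is closest to its own image and high when the pairs are mismatched, which is the signal that training uses to pull matching text and image representations together.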