The Sentence Transformers library now supports training and finetuning for multimodal embedding and reranker models. This update allows developers to map images and text into a shared vector space using contrastive loss. It simplifies the pipeline for building visual search systems. Practitioners can now implement cross-modal retrieval without building custom training loops from scratch.