Gemma 4 12B, a 12-billion-parameter model, drops the separate modality-specific encoders that multimodal systems traditionally place in front of the language model. In this encoder-free design, text, images, and audio are all processed by a single network, which reduces computational overhead during inference because no separate encoder pass has to run first. Developers can deploy one set of weights for complex multimodal tasks instead of managing separate modality-specific components.
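To make "encoder-free" concrete, here is a minimal sketch of one way such a design can work in general: raw image patches and audio frames are mapped into the same embedding space as text tokens by plain linear projections, and the resulting vectors form a single sequence for one decoder stack. This is an illustration under stated assumptions, not Gemma's actual implementation. Every name in it (`D_MODEL`, `patchify`, `frame_audio`, `build_sequence`, the patch and frame sizes) is hypothetical, and the weights are random stand-ins for learned parameters.

```python
import random

D_MODEL = 8  # hypothetical shared embedding width


def make_linear(in_dim, out_dim, rng):
    """Random in_dim x out_dim matrix standing in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(vec, weight):
    """Multiply a flat input vector by a weight matrix."""
    out_dim = len(weight[0])
    return [sum(v * weight[i][j] for i, v in enumerate(vec)) for j in range(out_dim)]


def patchify(image, patch):
    """Split an H x W grayscale image into flattened patch x patch blocks."""
    h, w = len(image), len(image[0])
    return [
        [image[r + dr][c + dc] for dr in range(patch) for dc in range(patch)]
        for r in range(0, h - patch + 1, patch)
        for c in range(0, w - patch + 1, patch)
    ]


def frame_audio(samples, frame_len):
    """Split a 1-D waveform into non-overlapping fixed-length frames."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def build_sequence(token_ids, image, samples, params):
    """Embed all three modalities into one sequence for a single decoder."""
    text = [params["tok_emb"][t] for t in token_ids]
    img = [project(p, params["img_proj"]) for p in patchify(image, params["patch"])]
    aud = [project(f, params["aud_proj"]) for f in frame_audio(samples, params["frame_len"])]
    # No per-modality encoder runs here: just lookups and linear maps.
    return text + img + aud


rng = random.Random(0)
params = {
    "patch": 2,                                # assumed patch size
    "frame_len": 4,                            # assumed audio frame length
    "tok_emb": make_linear(10, D_MODEL, rng),  # toy vocabulary of 10 tokens
    "img_proj": make_linear(2 * 2, D_MODEL, rng),
    "aud_proj": make_linear(4, D_MODEL, rng),
}
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # toy 4x4 image
samples = [0.1 * i for i in range(8)]                             # toy waveform
seq = build_sequence([1, 2, 3], image, samples, params)
print(len(seq), len(seq[0]))
```

In this toy setup the three text tokens, four image patches, and two audio frames all become vectors of width `D_MODEL` in one list, so a single set of decoder weights would see every modality through the same interface. Whether a given model uses linear projections, small convolutional stems, or something else at the input is a design detail the sketch does not settle.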