The 12B-parameter Gemma 4 drops the separate vision encoder and processes text and images within a single unified architecture. With no second model to run ahead of the language model, multimodal inference is simpler and latency is lower. Google DeepMind is releasing the model with open weights so that developers can build native vision-language applications. It sets a new efficiency benchmark for medium-sized open models.
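
To make the idea of a single unified architecture concrete, here is a minimal NumPy sketch of how an encoder-free model could, in principle, feed both modalities into one sequence: image patches are flattened and linearly projected into the same embedding space as text tokens, then concatenated. This is an illustration under assumptions, not Gemma 4's actual implementation. The names `patchify`, `build_sequence`, `embed_table`, and `patch_proj`, as well as the patch size, the model width, and the image-before-text ordering, are all hypothetical.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    rows, cols = h // patch, w // patch
    x = image.reshape(rows, patch, cols, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, c)
    return x.reshape(rows * cols, patch * patch * c)


def build_sequence(token_ids, image, embed_table, patch_proj, patch):
    """Map text tokens and image patches into one shared sequence.

    Hypothetical design: no separate vision encoder, only a single
    linear projection from raw patch pixels to the model width.
    """
    text = embed_table[token_ids]                   # (T, d_model)
    patches = patchify(image, patch) @ patch_proj   # (P, d_model)
    # Assumed ordering: image patches first, then text tokens.
    return np.concatenate([patches, text], axis=0)


# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
D_MODEL, VOCAB, PATCH = 64, 1000, 16
embed_table = rng.normal(size=(VOCAB, D_MODEL))
patch_proj = rng.normal(size=(PATCH * PATCH * 3, D_MODEL))

image = rng.random((64, 64, 3))      # 4 x 4 = 16 patches
token_ids = np.array([5, 42, 7])     # 3 text tokens

seq = build_sequence(token_ids, image, embed_table, patch_proj, PATCH)
print(seq.shape)  # (19, 64)
```

The resulting sequence would then pass through one transformer stack, which is the sense in which a design like this avoids running a second model for images.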