KV sharing and compressed attention now drive efficiency in Gemma 4 and DeepSeek V4. Both techniques shrink the key-value (KV) cache, whose memory footprint grows with prompt length, so very long prompts cost less memory to process. As a result, developers can serve longer context windows on cheaper hardware. The trend prioritizes inference speed over raw parameter count in open-weight models.
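
To make the memory argument concrete, here is a minimal back-of-the-envelope sketch of how the KV cache scales with context length, and how sharing or compressing it changes the total. The function names and every model dimension below are illustrative assumptions, not the published configurations of Gemma 4 or DeepSeek V4. The two reduced variants are simplified stand-ins for cross-layer KV sharing and latent KV compression in general, not a description of either model's actual mechanism.

```python
GIB = 1024 ** 3


def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Standard cache: one key and one value vector per token, per KV head, per layer."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value


def shared_kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                          layers_per_group, bytes_per_value=2):
    """KV sharing (simplified): each group of layers stores a single KV cache."""
    n_cached_layers = -(-n_layers // layers_per_group)  # ceiling division
    return kv_cache_bytes(seq_len, n_cached_layers, n_kv_heads, head_dim,
                          bytes_per_value)


def compressed_kv_cache_bytes(seq_len, n_layers, latent_dim, bytes_per_value=2):
    """Compressed attention (simplified): cache one latent vector per token, per layer."""
    return seq_len * n_layers * latent_dim * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit values,
# serving a 128K-token (131072) context. These numbers are assumptions.
seq_len, n_layers, n_kv_heads, head_dim = 131072, 32, 8, 128

baseline = kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim)
shared = shared_kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                               layers_per_group=2)
compressed = compressed_kv_cache_bytes(seq_len, n_layers, latent_dim=512)

print(f"baseline:   {baseline / GIB:.1f} GiB")    # baseline:   16.0 GiB
print(f"shared:     {shared / GIB:.1f} GiB")      # shared:     8.0 GiB
print(f"compressed: {compressed / GIB:.1f} GiB")  # compressed: 4.0 GiB
```

Under these assumed dimensions, a single 128K-token request drops from 16 GiB of cache to 8 GiB or 4 GiB, which is the kind of difference that decides whether a long context fits on a smaller accelerator. The parameter count of the model is unchanged in all three cases; only the per-token cache cost moves.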