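KV sharing and compressed attention are defining features of recent open-weight models such as Gemma 4 and DeepSeek V4. Both techniques target the memory cost of long context windows, where the key-value (KV) cache grows linearly with sequence length and can come to dominate inference memory. By shrinking that cache, either by reusing the same keys and values across several heads or layers or by storing them in a compressed form, these models let developers serve longer contexts on cheaper hardware. The trend prioritizes inference efficiency over raw parameter growth.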
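
To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales and how sharing or compression reduces it. Every number in it (layer count, head count, head dimension, latent size, context length) is a hypothetical placeholder chosen for easy arithmetic, not the actual configuration of Gemma 4, DeepSeek V4, or any other model, and the function names are illustrative only.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Standard KV cache: one key and one value vector per token,
    per KV head, per layer that keeps its own cache."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def compressed_cache_bytes(num_layers, latent_dim, seq_len,
                           batch_size=1, bytes_per_elem=2):
    """Compressed cache: a single latent vector per token per layer,
    from which keys and values are reconstructed at attention time."""
    return num_layers * latent_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model: 32 layers, 32 heads of dimension 128,
# a 128K-token context, 16-bit cache entries.
layers, heads, head_dim, seq_len = 32, 32, 128, 131_072

baseline = kv_cache_bytes(layers, heads, head_dim, seq_len)

# Sharing across heads: 32 query heads reuse 8 KV heads.
head_shared = kv_cache_bytes(layers, 8, head_dim, seq_len)

# Sharing across layers as well: assume only half the layers keep
# their own cache and the rest reuse a neighbour's.
layer_shared = kv_cache_bytes(layers // 2, 8, head_dim, seq_len)

# Compression: assume a 512-dimensional latent per token per layer.
compressed = compressed_cache_bytes(layers, 512, seq_len)

for name, size in [("baseline", baseline),
                   ("shared KV heads", head_shared),
                   ("shared heads + layers", layer_shared),
                   ("compressed latent", compressed)]:
    print(f"{name:>22}: {size / GIB:.1f} GiB")
```

Under these assumed numbers the cache for a single sequence drops from 64 GiB to 16 GiB, 8 GiB, and 4 GiB respectively, which is the difference between needing multiple accelerators and fitting comfortably on one. Real savings depend on each model's actual architecture and cache precision.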