Apple researchers introduced Stochastic KV Routing to reduce the memory footprint of transformer language models. Rather than compressing or evicting cached entries along the sequence (temporal) dimension, the method shares key-value (KV) caches across layers. Because it targets the depth dimension, fewer layers need to store their own cache, which lowers serving costs. The smaller cache offers a more memory-efficient way to sustain high throughput during generation.
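
To make the depth-dimension argument concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales when several layers share one cache. It is not the authors' method and does not model the stochastic routing itself. The function name `kv_cache_bytes`, the `share_group` parameter, the assumption that layers share in fixed-size contiguous groups, and the model dimensions are all hypothetical, chosen only for illustration.

```python
import math


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2, share_group=1):
    """Estimate KV cache size in bytes.

    share_group is a hypothetical knob: how many consecutive layers
    reuse a single cache. share_group=1 means every layer keeps its own.
    """
    # Number of layers that actually store their own K and V tensors.
    distinct_caches = math.ceil(num_layers / share_group)
    # Factor of 2: one tensor for keys and one for values.
    return (2 * distinct_caches * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Illustrative dimensions only (assumed, not taken from the source):
# 32 layers, 8 KV heads, head_dim 128, 4096 tokens, 16-bit elements.
baseline = kv_cache_bytes(32, 8, 128, 4096)
shared = kv_cache_bytes(32, 8, 128, 4096, share_group=4)

print(f"per-layer caches: {baseline / 2**20:.0f} MiB")
print(f"shared in groups of 4: {shared / 2**20:.0f} MiB")
```

Under these assumed dimensions, the estimate falls from 512 MiB to 128 MiB per sequence. The saving comes from the number of distinct caches, independent of sequence length, so it could in principle be combined with temporal techniques such as compression or eviction.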