A new Apple research paper introduces stochastic KV routing, a technique for reducing the memory footprint of transformer models. Instead of keeping a separate Key-Value (KV) cache for every layer, the method shares KV caches across layers. Because a standard KV cache grows linearly with the number of layers, sharing along this depth dimension targets serving costs directly. Practitioners can expect lower VRAM requirements during high-throughput autoregressive generation.
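To make the depth-dimension savings concrete, the sketch below estimates KV cache memory with and without cross-layer sharing. It is a back-of-the-envelope calculation under stated assumptions, not the paper's method. The `layers_per_cache` parameter is a hypothetical sharing factor: 1 means every layer stores its own cache (the standard setup), and k means each group of k consecutive layers stores a single shared cache. The arithmetic only counts how many distinct caches are kept in memory. It does not model how stochastic KV routing decides which layers read from which cache, and the model configuration in the usage example is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2, layers_per_cache=1):
    """Estimate KV cache size in bytes.

    layers_per_cache is a hypothetical sharing factor (an assumption for
    illustration, not the paper's routing scheme): 1 = one cache per layer,
    k = one cache shared by each group of k consecutive layers.
    """
    if layers_per_cache < 1:
        raise ValueError("layers_per_cache must be >= 1")
    # Number of distinct caches actually stored (ceil for a partial last group).
    num_caches = -(-num_layers // layers_per_cache)
    # The factor 2 accounts for storing both keys and values.
    per_cache = 2 * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
    return num_caches * per_cache


if __name__ == "__main__":
    # Hypothetical 32-layer model with 8 KV heads, head_dim 128, fp16 (2 bytes).
    cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128,
               seq_len=4096, batch_size=16)
    for k in (1, 2, 4):
        gib = kv_cache_bytes(**cfg, layers_per_cache=k) / 2**30
        print(f"layers_per_cache={k}: {gib:.1f} GiB")
```

For this illustrative configuration the script prints 8.0 GiB, 4.0 GiB, and 2.0 GiB: memory falls in proportion to the number of distinct caches. Batch size and sequence length enter the formula multiplicatively, so the absolute savings matter most in exactly the high-throughput, long-context regime where VRAM is tight.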