A new Apple research paper introduces Stochastic KV Routing, a method for reducing the memory footprint of the Key-Value (KV) cache in transformer models. Instead of relying solely on temporal compression, which shrinks the cache along the sequence dimension, the method shares KV caches across layers. Targeting the depth dimension in this way is meant to lower serving costs. The goal is to let practitioners improve inference throughput without a significant loss in model accuracy.
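To give a rough sense of why the depth dimension matters, the sketch below estimates KV cache size with and without cross-layer sharing. It assumes the simplest possible scheme, in which fixed groups of consecutive layers reuse a single cache; this is an illustration of the memory arithmetic only, not the paper's routing procedure. The function name `kv_cache_bytes`, the `share_group` parameter, and the model dimensions are all hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2, share_group=1):
    """Estimate KV cache size in bytes.

    share_group is a hypothetical knob: how many consecutive layers
    reuse one cache (1 means every layer keeps its own cache).
    """
    if share_group < 1:
        raise ValueError("share_group must be >= 1")
    # Ceiling division: number of distinct caches actually stored.
    distinct_caches = -(-num_layers // share_group)
    # Factor of 2 accounts for storing both keys and values.
    return (2 * distinct_caches * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model: 32 layers, 8 KV heads of width 128,
# 8192-token context, 16-bit cache entries.
baseline = kv_cache_bytes(32, 8, 128, 8192)
shared = kv_cache_bytes(32, 8, 128, 8192, share_group=2)

print(f"per-layer caches: {baseline / 2**30:.2f} GiB")  # 1.00 GiB
print(f"shared in pairs:  {shared / 2**30:.2f} GiB")    # 0.50 GiB
```

Under these assumptions the cache shrinks in proportion to the number of layers that share it, independently of any compression applied along the sequence dimension, so the two kinds of savings can in principle be combined.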