Linear layers struggle with rare tokens, but Hugging Face researchers found that hybrid architectures bridge this gap. By combining traditional transformers with specialized components, these models predict infrequent tokens with higher accuracy. This finding helps developers optimize vocabulary size. It suggests a path toward more efficient tokenization for niche domains.