A new benchmark shows that frontier automatic speech recognition (ASR) models lose substantial accuracy when speakers switch languages mid-sentence. Researchers evaluated several top-tier systems on code-switched speech and found a sharp drop in accuracy relative to monolingual input. This gap undermines the reliability of voice agents for bilingual users. The results suggest that developers should prioritize diverse, mixed-language training and evaluation data to improve robustness.
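The gap described above is typically measured with word error rate (WER). The summary does not say which metric the benchmark reports, so the following is only a minimal sketch assuming corpus-level WER. The function names `edit_distance` and `corpus_wer` and the sample transcripts are hypothetical illustrations, not data or code from the benchmark.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """Total word edits divided by total reference words over (reference, hypothesis) pairs."""
    edits = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    if words == 0:
        raise ValueError("reference transcripts contain no words")
    return edits / words


# Made-up transcripts for illustration only (not benchmark data).
monolingual = [("turn on the kitchen lights", "turn on the kitchen lights")]
code_switched = [("turn on the luces de la cocina", "turn on the lose day la cochina")]

gap = corpus_wer(code_switched) - corpus_wer(monolingual)
print(f"WER gap (code-switched minus monolingual): {gap:.3f}")
```

Reporting the difference between the two subsets, rather than a single pooled score, keeps a code-switching regression from being hidden by a large monolingual majority in the test set.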