A new benchmark tests how frontier automatic speech recognition (ASR) models handle code-switching, where speakers mix languages mid-sentence. Researchers found that while Whisper and SeamlessM4T perform well on monolingual data, their accuracy drops sharply when a speaker switches languages. This gap limits the reliability of voice agents for bilingual users. The results suggest that developers should prioritize mixed-language training sets.
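
One way to make the reported gap concrete is to score monolingual and code-switched utterances separately. The sketch below is not the benchmark's own evaluation code. It assumes word error rate (WER) as the accuracy metric, and the function names, the `group` labels, and the two sample transcripts are all invented for illustration.

```python
from typing import Dict, List, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples: List[dict]) -> Dict[str, float]:
    """Pool errors and reference lengths per group, then divide."""
    totals: Dict[str, List[int]] = {}
    for sample in samples:
        errors, length = word_errors(sample["reference"], sample["hypothesis"])
        bucket = totals.setdefault(sample["group"], [0, 0])
        bucket[0] += errors
        bucket[1] += length
    return {group: e / n for group, (e, n) in totals.items() if n > 0}


# Toy transcripts made up for this example, not benchmark data.
SAMPLES = [
    {"group": "monolingual",
     "reference": "turn on the kitchen lights",
     "hypothesis": "turn on the kitchen lights"},
    {"group": "code_switched",
     "reference": "please call my abuela tomorrow",
     "hypothesis": "please call my way la tomorrow"},
]

if __name__ == "__main__":
    print(wer_by_group(SAMPLES))
```

Pooling errors per group, rather than averaging per-utterance scores, keeps short utterances from dominating the result. A real evaluation would also normalize casing and punctuation before scoring, which this sketch leaves out.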