A new benchmark tests how frontier automatic speech recognition (ASR) models handle code-switching, where speakers mix two languages within a single sentence. The researchers found that Whisper and other top models perform well on monolingual data, but their accuracy drops sharply when the speaker switches languages. This gap limits the reliability of voice agents for bilingual users in multilingual markets.