A new benchmark tests how frontier automatic speech recognition (ASR) models handle code-switching, where speakers mix two languages within a single sentence. The results show that Whisper and SeamlessM4T perform well on monolingual speech but struggle when speakers switch languages spontaneously. This gap limits the reliability of voice agents for bilingual users in multilingual markets.
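
One way to make a gap like this concrete is to compare word error rate (WER) on monolingual and code-switched utterances. The summary does not say which metric the benchmark uses, so WER is an assumption here, and the function names and the two toy transcript pairs below are hypothetical placeholders, not benchmark data or model output. A minimal sketch:

```python
from typing import Iterable, Sequence, Tuple


def word_errors(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Word-level edit distance: substitutions + insertions + deletions."""
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) transcript pairs, split on whitespace."""
    pairs = list(pairs)
    errors = sum(word_errors(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    if words == 0:
        raise ValueError("reference transcripts contain no words")
    return errors / words


# Hypothetical toy pairs, invented purely to exercise the functions.
monolingual = [("turn on the kitchen lights", "turn on the kitchen lights")]
code_switched = [("turn on the luces de la cocina", "turn on the lose a la cocina")]

# Difference in WER between the two conditions for the toy pairs above.
wer_gap = corpus_wer(code_switched) - corpus_wer(monolingual)
```

Reporting the two rates separately, rather than one pooled number, keeps a strong monolingual score from masking errors concentrated around language switches. A real evaluation would also need text normalization (casing, punctuation, script) before scoring, which this sketch leaves out.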