A new benchmark tests how frontier ASR models handle code-switching, where speakers blend two languages within a single sentence. Results show that while Whisper and SeamlessM4T excel at monolingual tasks, they struggle with rapid language shifts. This gap limits the reliability of voice agents for bilingual users in linguistically diverse markets.
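
The summary does not say which metric the benchmark reports. As a rough illustration only, the sketch below assumes word error rate (WER), a common ASR metric, together with a simple count of language switch points per utterance. The function names, the language tags, and the sample sentences are all hypothetical and are not taken from the benchmark or from either model's output.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def count_switch_points(language_tags: list[str]) -> int:
    """Count adjacent token pairs whose language tags differ."""
    return sum(1 for a, b in zip(language_tags, language_tags[1:]) if a != b)


# Hypothetical English/Spanish utterance with per-word language tags.
reference = "I need to buy leche and pan today"
tags = ["en", "en", "en", "en", "es", "en", "es", "en"]

# Hypothetical transcript in which the first Spanish word is misheard.
hypothesis = "I need to buy let you and pan today"

print(count_switch_points(tags))               # 4
print(word_error_rate(reference, hypothesis))  # 0.25
```

Grouping utterances by switch count and comparing average WER across the groups is one way to show whether errors rise as language shifts become more frequent.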