Three major AI systems struggled with 331 real dental examination questions in a new PLOS study. Researchers found critical gaps in confidence calibration and citation reliability: many of the sources the models cited simply did not exist. The findings caution medical educators against relying on LLMs for high-stakes learning unless AI outputs undergo rigorous human-in-the-loop verification.