ThermoQA is a new benchmark of 293 open-ended engineering problems that evaluates thermodynamic reasoning across three complexity tiers. Claude Opus led the leaderboard with 94.1% accuracy. Performance drops significantly on full cycle analysis, which suggests that memorizing thermodynamic properties is not the same as physical reasoning. This gap highlights a persistent struggle for LLMs in complex engineering simulations.