A new benchmark targets real software engineering tasks rather than isolated coding snippets. It challenges LLMs to work across complex, multi-file repositories and fix real-world bugs. This moves evaluation away from synthetic puzzles and toward practical utility. Developers get a more realistic measure of how AI agents are likely to perform on production codebases.