The ScarfBench benchmark evaluates how well AI agents migrate legacy Java frameworks to modern versions. It uses a set of real-world software engineering tasks to measure precision and success rates, giving developers who build autonomous agents a concrete baseline. Most current models still struggle with the complex dependency chains found in enterprise codebases.
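
To make the two metrics concrete, here is a minimal sketch of how precision and success rate could be computed from per-task outcomes. It is not ScarfBench's actual scoring code. The `TaskResult` record, its field names, and the sample data are hypothetical, and the definitions are assumptions: success rate is taken to be the share of all tasks whose migration passes verification, and precision the share of migrations the agent reported as complete that actually pass.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; not a ScarfBench data structure."""
    task_id: str
    agent_claimed_success: bool  # the agent reported the migration as done
    passed_verification: bool    # the migrated project built and passed its checks


def success_rate(results):
    """Assumed definition: fraction of all tasks that pass verification."""
    if not results:
        return 0.0
    return sum(r.passed_verification for r in results) / len(results)


def precision(results):
    """Assumed definition: of the tasks the agent claimed to finish,
    the fraction that actually pass verification."""
    claimed = [r for r in results if r.agent_claimed_success]
    if not claimed:
        return 0.0
    return sum(r.passed_verification for r in claimed) / len(claimed)


# Illustrative data only; these are not benchmark results.
results = [
    TaskResult("task-a", agent_claimed_success=True, passed_verification=True),
    TaskResult("task-b", agent_claimed_success=True, passed_verification=False),
    TaskResult("task-c", agent_claimed_success=False, passed_verification=False),
    TaskResult("task-d", agent_claimed_success=True, passed_verification=True),
]

print(f"success rate: {success_rate(results):.2f}")
print(f"precision:    {precision(results):.2f}")
```

Under these assumed definitions the two numbers can diverge: an agent that attempts few migrations but gets them right scores high on precision and low on success rate, which is why reporting both is useful.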