The ScarfBench benchmark evaluates how AI agents handle complex migrations of enterprise Java frameworks. It uses 15 real-world migration tasks to measure code correctness and efficiency. Current models struggle with the long-context dependencies that these migrations involve. The dataset provides a concrete baseline for developers building autonomous agents that automate legacy software updates and reduce technical debt.