The ScarfBench benchmark evaluates how AI agents handle complex enterprise Java framework migrations. It measures accuracy and efficiency on 10 real-world migration tasks, and most current agents struggle with the large-scale structural changes these tasks require. The benchmark provides a concrete baseline for developers building autonomous software engineering tools for legacy codebases.