The ScarfBench benchmark evaluates how AI agents handle complex migrations between enterprise Java frameworks. It uses 15 real-world migration tasks to measure code correctness and efficiency. Current agents struggle with these large-scale structural changes. The benchmark provides a concrete baseline for developers building autonomous coding tools for legacy software maintenance.