The LABBench2 benchmark shifts AI evaluation from rote knowledge recall to the execution of meaningful scientific work. It evolves from the original LAB-Bench to better measure real-world capabilities in autonomous hypothesis generation. Researchers can use it to quantify how effectively AI agents perform complex biology tasks, which gives them a concrete metric for tracking progress in autonomous labs.