Researchers tested four LLM scaffolds on 48 papers to see whether agents could recreate published results using only the papers' method descriptions and raw data. The system isolates agents from the original code to ensure authentic reimplementation rather than copying, and an error-attribution step traces discrepancies between reproduced and reported results. The result is a deterministic benchmark for verifying academic reproducibility.
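The error-attribution step described above might be sketched as a comparison between reported and agent-reproduced metrics. This is a minimal illustration, not the authors' implementation: the `attribute_errors` function, the `Discrepancy` record, and the 5% relative tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Discrepancy:
    metric: str
    reported: float
    reproduced: float
    relative_error: float

def attribute_errors(reported: dict, reproduced: dict, rel_tol: float = 0.05):
    """Flag metrics whose reproduced value diverges from the reported one.

    Hypothetical sketch: each metric outside the relative tolerance is
    recorded as a Discrepancy for downstream attribution (e.g. deciding
    whether the gap stems from data, method, or implementation).
    """
    discrepancies = []
    for name, ref in reported.items():
        if name not in reproduced:
            # Missing metric: the agent failed to produce it at all.
            discrepancies.append(Discrepancy(name, ref, float("nan"), float("inf")))
            continue
        got = reproduced[name]
        rel = abs(got - ref) / max(abs(ref), 1e-12)
        if rel > rel_tol:
            discrepancies.append(Discrepancy(name, ref, got, rel))
    return discrepancies

# Example: accuracy is within tolerance, F1 diverges and is flagged.
issues = attribute_errors(
    reported={"accuracy": 0.91, "f1": 0.88},
    reproduced={"accuracy": 0.90, "f1": 0.70},
)
```

Because the check reduces to a fixed tolerance comparison, the pass/fail outcome is deterministic for a given set of reproduced numbers, which is what makes such a benchmark suitable for repeatable reproducibility verification.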