Researchers evaluated four LLM scaffolds on 48 papers to test whether agents can reproduce published empirical results from only the methods description and the raw data. The system withholds the original code from the agents, so any match must come from an independent reimplementation rather than from copying. When a reproduced result differs from the published one, an error attribution step traces the discrepancy to its root cause. Together, these steps provide a systematic framework for verifying academic reproducibility.
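To make the comparison step concrete, here is a minimal sketch of how reproduced values might be checked against published ones before error attribution. It is an illustration only, not the researchers' implementation: the names `Comparison`, `compare_results`, and `needs_attribution`, the 5% relative tolerance, and the example numbers are all assumptions introduced here.

```python
from dataclasses import dataclass
from math import isclose


@dataclass
class Comparison:
    """One published value paired with the agent's reproduced value."""
    name: str
    reported: float
    reproduced: float
    matches: bool


def compare_results(reported, reproduced, rel_tol=0.05):
    """Pair each reported value with its reproduction and flag mismatches.

    rel_tol is a hypothetical tolerance; the source does not state one.
    """
    comparisons = []
    for name, reported_value in reported.items():
        if name not in reproduced:
            # The agent produced no value for this result: count it as a mismatch.
            comparisons.append(Comparison(name, reported_value, float("nan"), False))
            continue
        value = reproduced[name]
        ok = isclose(reported_value, value, rel_tol=rel_tol)
        comparisons.append(Comparison(name, reported_value, value, ok))
    return comparisons


def needs_attribution(comparisons):
    """Return the names of results that should go to the error attribution step."""
    return [c.name for c in comparisons if not c.matches]


if __name__ == "__main__":
    # Made-up numbers for illustration; they are not taken from any paper.
    reported = {"effect_size": 0.42, "odds_ratio": 1.80, "n_obs": 1200.0}
    reproduced = {"effect_size": 0.43, "odds_ratio": 2.50}
    for comparison in compare_results(reported, reproduced):
        print(comparison)
    print(needs_attribution(compare_results(reported, reproduced)))
```

A relative tolerance is used in this sketch so that small numerical differences, such as those from rounding or library versions, are not sent to attribution. The attribution step itself, which identifies why a value differs, is outside the scope of this example.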