Researchers tested four LLMs across 48 papers to determine whether agents could reproduce empirical results using only method descriptions and raw data. The evaluation system isolates agents from the original code, so each result must be reimplemented rather than copied. An error attribution step traces each discrepancy back to the specific step in the chain that failed. Together, these components provide a scalable framework for auditing scientific reproducibility.
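The summary does not say how a discrepancy is measured or how it is attributed to a step, so the following is only a minimal sketch under stated assumptions. It assumes a result counts as reproduced when it falls within a relative tolerance of the reported value, and that attribution means finding the earliest failing step in an ordered chain. The names `StageCheck`, `is_reproduced`, `attribute_error`, the 5% tolerance, and the stage labels are all hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StageCheck:
    """Outcome of one step in the agent's chain (hypothetical structure)."""
    stage: str
    passed: bool


def relative_error(reported: float, reproduced: float) -> float:
    """Relative gap between the paper's value and the agent's value."""
    if reported == 0:
        # Fall back to the absolute gap when the reported value is zero.
        return abs(reproduced)
    return abs(reproduced - reported) / abs(reported)


def is_reproduced(reported: float, reproduced: float, tol: float = 0.05) -> bool:
    """Assumed criterion: the reproduced value lies within `tol` relative error."""
    return relative_error(reported, reproduced) <= tol


def attribute_error(checks: List[StageCheck]) -> Optional[str]:
    """Return the earliest failing stage, or None if every stage passed."""
    for check in checks:
        if not check.passed:
            return check.stage
    return None


# Illustrative chain with made-up stage labels; later stages fail
# because an earlier one did, so the earliest failure is reported.
example_checks = [
    StageCheck("data_loading", True),
    StageCheck("preprocessing", False),
    StageCheck("estimation", False),
]
```

Reporting the earliest failing step is one simple design choice: downstream steps often fail only because an upstream one did, so the first failure is the most useful place to start an audit. A real attribution step would likely need richer evidence than a pass/fail flag per stage.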