Researchers tested four LLMs on 48 papers to assess whether agents can reproduce reported results using only the methods descriptions and raw data. The system withholds each paper's original code from the agents, so every reimplementation is independent rather than a copy. An error-attribution step then traces any discrepancy back to its root cause. Together this provides a scalable framework for verifying empirical research without relying on the authors' original scripts.
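The verification loop described above can be sketched in a few lines. This is a minimal illustration, not the study's actual harness: the `Attempt` dataclass, the tolerance value, and the attribution labels are all hypothetical stand-ins for whatever the real system uses.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    paper_id: str
    reported: dict    # metrics as stated in the paper
    reproduced: dict  # metrics from the agent's independent reimplementation

def compare(attempt: Attempt, tol: float = 0.01) -> dict:
    """Flag each reported metric as match / discrepant / missing,
    using a relative tolerance (a hypothetical threshold)."""
    flags = {}
    for name, ref in attempt.reported.items():
        got = attempt.reproduced.get(name)
        if got is None:
            flags[name] = "missing"
        elif abs(got - ref) > tol * max(abs(ref), 1e-12):
            flags[name] = "discrepant"
        else:
            flags[name] = "match"
    return flags

def attribute(flags: dict) -> str:
    """Coarse stand-in for the error-attribution step: decide whether
    the attempt reproduced the paper or needs root-cause analysis."""
    if all(v == "match" for v in flags.values()):
        return "reproduced"
    if any(v == "missing" for v in flags.values()):
        return "incomplete-reimplementation"
    return "needs-attribution"  # hand off to deeper error tracing

attempt = Attempt(
    "paper-001",
    reported={"accuracy": 0.91, "f1": 0.88},
    reproduced={"accuracy": 0.905, "f1": 0.71},
)
flags = compare(attempt)
print(flags["accuracy"], flags["f1"], attribute(flags))
# → match discrepant needs-attribution
```

The key design point mirrored here is that the comparison sees only reported numbers and reproduced numbers, never the original code, so a "discrepant" flag genuinely reflects the reimplementation rather than leaked implementation details.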