Researchers developed controllable 2D grid environments to quantify how language-model agents balance exploration (discovery) and exploitation (using what they already know). The framework keeps each task's Directed Acyclic Graph (DAG) of subtask dependencies hidden from the agent, which makes it possible to isolate specific decision errors and measure agent performance without access to internal policies. The result is a concrete benchmark for improving reliability in complex, open-ended embodied AI tasks.
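A hidden task DAG of this kind can be used to classify each agent action against subtask prerequisites. The sketch below is a hypothetical illustration (the subtask names, DAG structure, and error labels are assumptions, not the paper's actual benchmark): an attempted subtask is scored as valid, premature (prerequisites unmet), or redundant (already completed), the sort of decision errors such a framework could isolate.

```python
# Hypothetical task DAG: subtask -> set of prerequisite subtasks.
# Structure and names are illustrative assumptions only.
TASK_DAG = {
    "get_wood": set(),
    "get_stone": set(),
    "make_axe": {"get_wood", "get_stone"},
    "build_hut": {"get_wood", "make_axe"},
}

def classify_action(subtask: str, completed: set) -> str:
    """Score an attempted subtask against the DAG and the completed set."""
    if subtask in completed:
        return "redundant"   # repeats work already done
    if not TASK_DAG[subtask] <= completed:
        return "premature"   # attempted before prerequisites were met
    return "valid"

completed = {"get_wood"}
print(classify_action("get_wood", completed))   # redundant
print(classify_action("build_hut", completed))  # premature
print(classify_action("get_stone", completed))  # valid
```

Because the classification needs only the agent's observable actions and the ground-truth DAG, it requires no access to the agent's internal policy, matching the black-box evaluation described above.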