A new framework uses 2D grid maps and Directed Acyclic Graphs (DAGs) to quantify how language model (LM) agents balance exploration and exploitation. Researchers can programmatically adjust environment difficulty to isolate specific decision-making failures. Because the method treats agents as black boxes, developers can measure errors from observed behavior alone, without access to internal policies. It provides a concrete benchmark for improving embodied AI reliability.
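To make the setup concrete, here is a minimal sketch of what such a benchmark environment might look like. Everything here is a hypothetical illustration, not the framework's actual API: the class name `GridTaskEnv`, the `difficulty` parameter, and the error counter are all assumptions. The idea shown is that subgoals form a DAG (each subgoal lists its prerequisites), grid size scales with a difficulty knob, and ordering mistakes are counted purely from external behavior, with no access to the agent's policy.

```python
import random

class GridTaskEnv:
    """Hypothetical 2D grid benchmark with DAG-ordered subgoals.

    A subgoal can only be completed once all of its prerequisites are
    done. `difficulty` scales the grid side length, so exploration cost
    grows while the task structure (the DAG) stays fixed. Errors are
    counted from behavior alone, treating the agent as a black box.
    """

    def __init__(self, dag, difficulty=1, seed=0):
        rng = random.Random(seed)
        self.dag = dag                      # subgoal -> list of prerequisite subgoals
        self.size = 4 * difficulty          # grid grows with difficulty
        # Place each subgoal at a random cell of the grid
        self.goal_pos = {g: (rng.randrange(self.size), rng.randrange(self.size))
                         for g in dag}
        self.pos = (0, 0)
        self.done = set()
        self.steps = 0                      # exploration cost (moves taken)
        self.errors = 0                     # out-of-order subgoal attempts

    def step(self, move):
        """Move one cell; clamped at the grid boundary."""
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}[move]
        x, y = self.pos
        self.pos = (min(max(x + dx, 0), self.size - 1),
                    min(max(y + dy, 0), self.size - 1))
        self.steps += 1

    def attempt(self, goal):
        """Try to complete a subgoal at the current cell.

        An attempt whose DAG prerequisites are unmet is logged as an
        error — observable externally, with no policy introspection.
        """
        if self.pos != self.goal_pos[goal]:
            return False                    # not at the subgoal's location
        if any(p not in self.done for p in self.dag[goal]):
            self.errors += 1                # decision-making failure
            return False
        self.done.add(goal)
        return True
```

Under this sketch, raising `difficulty` inflates exploration cost without changing the DAG, which is one way the two pressures could be varied independently to isolate where an agent fails.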