Pando is a new benchmark of more than 720 fine-tuned LLMs, designed to test how well interpretability methods recover known decision rules. Gradient-based techniques outperformed blackbox baselines, while non-gradient methods struggled. Because each model's decision rule is known in advance, researchers can measure rationale faithfulness precisely. The benchmark provides a controlled environment for validating interpretability tools before they are applied to opaque models.
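
To make "recovering a known decision rule" concrete, here is a minimal sketch of one way such recovery could be scored. It is not Pando's actual metric. The function name `rule_recovery_score`, the feature indices, and the attribution values are all hypothetical, and the sketch assumes that an interpretability method outputs one attribution score per input feature and that the planted rule depends on a known set of features.

```python
from typing import Sequence, Set


def rule_recovery_score(attributions: Sequence[float], true_features: Set[int]) -> float:
    """Fraction of the planted rule's features found among the top-k
    attributed features, where k is the number of features in the rule.

    Hypothetical scoring function, not taken from the benchmark itself.
    """
    k = len(true_features)
    if k == 0:
        raise ValueError("the planted rule must use at least one feature")
    # Rank feature indices by absolute attribution, largest first.
    ranked = sorted(
        range(len(attributions)),
        key=lambda i: abs(attributions[i]),
        reverse=True,
    )
    top_k = set(ranked[:k])
    return len(top_k & true_features) / k


# Illustrative (made-up) data: the planted rule depends on features 0 and 3.
true_features = {0, 3}
method_a = [0.9, 0.1, 0.05, -0.7, 0.2]  # attributions from a hypothetical method A
method_b = [0.1, 0.8, 0.6, 0.05, 0.3]   # attributions from a hypothetical method B

score_a = rule_recovery_score(method_a, true_features)
score_b = rule_recovery_score(method_b, true_features)
print(score_a, score_b)
```

A score of 1.0 means every feature in the planted rule ranks among the method's top attributions, and 0.0 means none does. Ranking by absolute value treats strongly negative attributions as evidence of relevance, which is a design choice rather than a requirement.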