The BTF-2 benchmark pairs 1,417 pastcasting questions with a 15-million-document corpus to separate an agent's research ability from its judgment. The setup is sensitive enough to detect Brier score differences as small as 0.004. High-performing agents succeed by analyzing blind spots and black swan events. Practitioners can now distinguish whether an agent fails because of poor data retrieval or because of flawed logic.
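To make the Brier score comparison concrete, the following is a minimal sketch of how a score gap between two agents could be computed. It assumes binary questions with outcomes coded as 0 or 1 and uses the standard mean-squared-error form of the Brier score; the benchmark's actual scoring pipeline is not described here, and the function names and sample forecasts are hypothetical.

```python
from statistics import mean


def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecast, and always answering 0.5
    yields 0.25.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("at least one forecast is required")
    return mean((p - o) ** 2 for p, o in zip(forecasts, outcomes))


def brier_difference(forecasts_a, forecasts_b, outcomes):
    """Brier score of agent A minus agent B on the same questions.

    A negative value means agent A scored better (lower) than agent B.
    """
    return brier_score(forecasts_a, outcomes) - brier_score(forecasts_b, outcomes)


# Hypothetical data: four resolved questions and two agents' probabilities.
outcomes = [1, 0, 1, 0]
agent_a = [0.9, 0.2, 0.7, 0.1]
agent_b = [0.8, 0.3, 0.6, 0.2]

score_a = brier_score(agent_a, outcomes)
score_b = brier_score(agent_b, outcomes)
gap = brier_difference(agent_a, agent_b, outcomes)
print(f"A={score_a:.4f}  B={score_b:.4f}  gap={gap:+.4f}")
```

Scoring both agents on the same set of questions is what makes a small gap interpretable. Whether a gap of a given size counts as detectable depends on the number of questions and the variance of the per-question differences, which this sketch does not estimate.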