The BTF-2 benchmark pairs 1,417 pastcasting questions with a 15M-document corpus to isolate agents' strengths in research and in judgment. Researchers built a forecaster that outperforms single frontier agents by 0.011 in Brier score (lower is better). Its success depends on pre-mortem analysis of blind spots and black swan events. The benchmark provides a reproducible way to measure strategic reasoning without hindsight bias.
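To make the reported 0.011 gap concrete, here is a minimal sketch of how a Brier score difference between two forecasters could be computed. It assumes binary questions resolved as 0 or 1 and uses the standard mean-squared-error definition of the Brier score. The function name `brier_score` and the example forecasts are hypothetical illustrations, not the benchmark's own scoring code or data.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes.

    Assumes each outcome is 0 or 1. Lower scores are better; a constant
    forecast of 0.5 scores exactly 0.25.
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    if len(probs) == 0:
        raise ValueError("at least one forecast is required")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical resolved questions and forecasts, for illustration only.
outcomes = [1, 0, 1, 0]
baseline = [0.6, 0.4, 0.6, 0.4]   # stand-in for a single agent
improved = [0.7, 0.3, 0.7, 0.3]   # stand-in for a stronger forecaster

# A positive gap means the improved forecaster has the lower (better) score.
gap = brier_score(baseline, outcomes) - brier_score(improved, outcomes)
print(f"Brier improvement: {gap:.3f}")
```

Because the score is an average of squared errors on a 0 to 1 scale, differences between strong forecasters tend to be small in absolute terms, so a gap is best read relative to the baseline score rather than in isolation.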