The BTF-2 benchmark pairs 1,417 pastcasting questions with a 15M-document corpus to separate an agent's research from its judgment. Researchers built a forecaster that improves on single frontier agents by 0.011 in Brier score (lower is better). Its success depends on pre-mortem analysis of blind spots and black swan events. The benchmark provides a reproducible way to measure strategic reasoning without hindsight bias.
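
To make the 0.011 figure concrete, here is a minimal sketch of the Brier score for binary questions, computed with the standard definition: the mean squared difference between forecast probability and the resolved outcome. The function name `brier_score` and all forecasts and outcomes below are hypothetical illustrations, not BTF-2 data or the benchmark's own scoring code.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes.

    Lower is better: 0.0 is a perfect forecast, and always answering 0.5
    scores 0.25.
    """
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical data (assumption, for illustration only): four resolved
# questions and two sets of forecasts for them.
outcomes = [1, 0, 1, 0]
baseline = [0.6, 0.4, 0.7, 0.3]   # a less confident forecaster
improved = [0.7, 0.3, 0.8, 0.2]   # a forecaster that is closer to the outcomes

baseline_score = brier_score(baseline, outcomes)
improved_score = brier_score(improved, outcomes)

# A positive gap means the improved forecaster has the lower (better) score.
print(f"baseline={baseline_score:.3f} improved={improved_score:.3f} "
      f"gap={baseline_score - improved_score:.3f}")
```

Because the score is bounded between 0 and 1 and an uninformative 0.5 forecast already scores 0.25, gaps between competent forecasters tend to be small in absolute terms, which is why a difference of 0.011 is reported to three decimal places.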