The BTF-2 benchmark pairs 1,417 pastcasting questions with a 15M-document corpus to isolate why AI agents succeed or fail at forecasting. It separates an agent's research ability from its final judgment, so that each can be assessed on its own. Using the benchmark, the researchers built a forecaster that outperforms frontier agents by emphasizing pre-mortem analysis and black swan events. The result is a precise measure of strategic reasoning that is free of hindsight bias.
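One way to picture how a pastcasting setup can avoid hindsight bias is to restrict the corpus to documents published before a per-question cutoff date. The sketch below illustrates that idea only; it is not BTF-2's actual pipeline. The `Document` and `Question` types, the `cutoff` field, and the `visible_documents` helper are all hypothetical names introduced here for illustration, and the sample data is made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    doc_id: str
    published: date
    text: str


@dataclass(frozen=True)
class Question:
    question_id: str
    text: str
    cutoff: date  # assumption: last date of information the forecaster may see


def visible_documents(question: Question, corpus: list[Document]) -> list[Document]:
    """Keep only documents published strictly before the question's cutoff."""
    return [doc for doc in corpus if doc.published < question.cutoff]


# Toy corpus (invented for this example, not benchmark data).
corpus = [
    Document("a", date(2024, 1, 10), "early report"),
    Document("b", date(2024, 3, 5), "later report"),
    Document("c", date(2024, 6, 1), "post-resolution recap"),
]
question = Question("q1", "Will X happen by May 2024?", cutoff=date(2024, 4, 1))

visible_ids = [doc.doc_id for doc in visible_documents(question, corpus)]
print(visible_ids)  # ['a', 'b']
```

Document `c` is excluded because it was published after the cutoff, so a forecaster working from `visible_ids` cannot see how the question resolved.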