The BTF-2 benchmark pairs 1,417 pastcasting questions with a corpus of 15 million documents to analyze how agents reason when forecasting. It separates an agent's research ability from its final judgment. Researchers built a forecaster that beats single frontier agents by 0.011 in Brier score, where lower is better. Its success depends on pre-mortem analysis of blind spots and black swan events, which offers a blueprint for better strategic reasoning.
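To make the Brier score margin concrete, here is a minimal sketch of how the metric is computed, assuming binary questions that resolve to 0 or 1. The function name `brier_score` and the toy probabilities are illustrative assumptions, not BTF-2 data or code from the benchmark.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes.

    0.0 is a perfect score; always answering 0.5 scores 0.25.
    """
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical toy data (not from BTF-2): four resolved yes/no questions.
outcomes = [1, 0, 1, 0]
baseline = [0.7, 0.4, 0.6, 0.2]   # assumed forecasts from one agent
improved = [0.8, 0.3, 0.7, 0.2]   # assumed forecasts from a better forecaster

# A positive gap means the improved forecaster has the lower (better) score.
gap = brier_score(baseline, outcomes) - brier_score(improved, outcomes)
print(f"Brier improvement: {gap:.4f}")
```

Because the score is an average of squared errors bounded between 0 and 1, and an uninformative 0.5 forecast already scores 0.25, differences between competent forecasters tend to be small in absolute terms. A margin such as 0.011 should be read on that scale.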