A test of OpenAI's Reinforcement Fine-Tuning (RFT) on o4-mini produced a numeric forecasting score of +14.59, beating the +9.25 baseline. On binary questions, however, the fine-tuned model scored -0.7 against a +2.4 baseline. This suggests reinforcement fine-tuning can improve accuracy on the quantitative task it was trained on while degrading performance on binary questions, so practitioners should match the fine-tuning method to the question types they actually care about.
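The trade-off is easier to see as deltas against baseline: a net gain on numeric questions and a net loss on binary ones. A minimal sketch tabulating the reported scores (the dictionary layout and names are illustrative, not from the original experiment):

```python
# Scores reported in the experiment; "rft" is the fine-tuned model,
# "baseline" is the untuned comparison.
scores = {
    "numeric": {"rft": 14.59, "baseline": 9.25},
    "binary": {"rft": -0.7, "baseline": 2.4},
}

for task, s in scores.items():
    delta = s["rft"] - s["baseline"]  # positive = RFT helped, negative = hurt
    print(f"{task}: RFT {s['rft']:+.2f} vs baseline {s['baseline']:+.2f} "
          f"(delta {delta:+.2f})")
```

Run as written, this shows roughly a +5.3 gain on numeric forecasting and a -3.1 regression on binary questions, which is the asymmetry the note is pointing at.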