A specialized open-weight model outperformed GPT and Claude on financial document evaluations. Bridgewater and Thinking Machines Lab found that the general-purpose models failed because the correct answers were never public and were therefore unlikely to be in their training data. The result suggests that domain-specific tuning can beat raw scale on niche finance tasks. For such tasks, practitioners should prioritize curated private data over larger general models.