Capability evaluations primarily measure how well models perform tasks such as writing code or solving science problems. Writing on the AI Alignment Forum argues that this focus creates negative externalities by accelerating capability research. Shifting toward behavioral evaluations helps forecast specific risks more accurately. Practitioners should therefore prioritize safety-centric metrics over raw performance to better anticipate the failures of autonomous agents.