The MolmoMotion model predicts future 3D human poses based on natural language instructions. It leverages a vision-language model to map textual cues to spatial trajectories. This approach outperforms previous baselines in motion forecasting accuracy. Developers can now generate precise, text-guided movement for digital avatars, reducing the need for manual keyframing in 3D environments.
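The text above does not say how motion forecasting accuracy is measured. As a minimal sketch, assuming a metric such as mean per-joint position error (MPJPE), which is commonly used for 3D pose prediction, a predicted trajectory could be compared with a ground-truth one as follows. The type aliases, the `mpjpe` function name, and the toy coordinates are illustrative assumptions and are not taken from MolmoMotion.

```python
import math
from typing import List, Tuple

# Hypothetical data layout (an assumption, not MolmoMotion's actual format):
Joint = Tuple[float, float, float]   # (x, y, z) position of one joint
Pose = List[Joint]                   # all joints at one time step
Trajectory = List[Pose]              # poses over the predicted future frames


def mpjpe(predicted: Trajectory, target: Trajectory) -> float:
    """Mean Euclidean distance between matching joints, averaged over all frames."""
    if len(predicted) != len(target):
        raise ValueError("trajectories must have the same number of frames")
    total, count = 0.0, 0
    for pred_pose, true_pose in zip(predicted, target):
        if len(pred_pose) != len(true_pose):
            raise ValueError("poses must have the same number of joints")
        for p, t in zip(pred_pose, true_pose):
            total += math.dist(p, t)  # per-joint Euclidean distance
            count += 1
    if count == 0:
        raise ValueError("trajectories must contain at least one joint")
    return total / count


# Toy data: 2 frames, 2 joints each (made-up values for illustration only).
target = [[(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
          [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]]
predicted = [[(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
             [(3.0, 4.0, 0.0), (0.0, 1.0, 0.0)]]

error = mpjpe(predicted, target)
print(f"MPJPE: {error:.2f}")
```

A lower value means the predicted joints sit closer to the reference motion. The result is in whatever unit the joint coordinates use.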