The MolmoMotion framework predicts 3D human motion by combining language guidance with spatial awareness of the scene. It uses a vision-language model to map textual descriptions to physical trajectories. This approach outperforms traditional baselines in complex scene navigation. Developers can use natural language prompts to generate more realistic synthetic movements for robotics and animation.
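
To make the idea of mapping a text prompt to a 3D trajectory concrete, here is a minimal Python sketch of what the input and output of such a system might look like. It is not MolmoMotion's API: `MotionRequest`, `placeholder_predict`, and `path_length` are hypothetical names, the start and goal positions are assumed to have already been grounded in the scene, and straight-line interpolation stands in for the learned vision-language model.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class MotionRequest:
    """Hypothetical request: a prompt plus positions in scene coordinates."""
    prompt: str      # natural language instruction, e.g. "walk to the door"
    start: Point3D   # assumed: current position of the person
    goal: Point3D    # assumed: target position already grounded from the prompt


def placeholder_predict(request: MotionRequest, num_steps: int = 5) -> List[Point3D]:
    """Stand-in for a learned model: evenly spaced waypoints on a straight line.

    A real system would condition on the prompt and on scene observations;
    this placeholder ignores the prompt text entirely.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    trajectory = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # interpolation parameter from 0.0 to 1.0
        waypoint = tuple(s + t * (g - s) for s, g in zip(request.start, request.goal))
        trajectory.append(waypoint)
    return trajectory


def path_length(trajectory: List[Point3D]) -> float:
    """Total Euclidean length of the polyline through the waypoints."""
    return sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))


if __name__ == "__main__":
    request = MotionRequest(
        prompt="walk to the door",
        start=(0.0, 0.0, 0.0),
        goal=(4.0, 0.0, 0.0),
    )
    trajectory = placeholder_predict(request, num_steps=5)
    print(trajectory)
    print(path_length(trajectory))
```

The sketch only fixes the shape of the data: a prompt goes in and an ordered list of 3D waypoints comes out, which a robotics or animation pipeline could then consume. Swapping the placeholder for a trained model would leave that interface unchanged.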