The MolmoMotion framework predicts future 3D trajectories of objects based on natural language descriptions. It integrates a vision-language model with a motion forecasting head to handle complex spatial reasoning. This approach outperforms traditional baselines on the NuScenes dataset. Developers can now map linguistic cues directly to physical movement for better robotic navigation.