MolmoMotion is a new model that predicts 3D human motion from natural language prompts. It uses a vision-language architecture to translate text descriptions into precise spatial trajectories, which simplifies how developers generate realistic character animations. The system outperforms previous baselines by aligning linguistic cues more closely with physical movement, reducing the manual keyframing that creators have to do.