By operating directly on long-term motion embeddings rather than raw pixels, Apple's system models scene dynamics orders of magnitude more efficiently than full video synthesis. Trained on large-scale trajectory data, it generates realistic motions conditioned on text prompts or spatial pokes. Because prediction happens in a compact embedding space, sampling multiple candidate future motion scenarios becomes cheap enough to be practical for visual intelligence.
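To make the idea concrete, here is a minimal PyTorch sketch of what prediction in motion-embedding space can look like. This is not Apple's published architecture: the `MotionEmbeddingPredictor` class, the transformer backbone, the embedding dimensions, and the noise-injection scheme for sampling diverse futures are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MotionEmbeddingPredictor(nn.Module):
    """Hypothetical sketch: predict future motion embeddings from past ones,
    conditioned on a text-prompt embedding or an encoded spatial poke."""

    def __init__(self, d_embed=256, n_heads=8, n_layers=4, horizon=16):
        super().__init__()
        self.horizon = horizon
        layer = nn.TransformerEncoderLayer(
            d_model=d_embed, nhead=n_heads, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One learned query per future step, decoded into a motion embedding.
        self.future_queries = nn.Parameter(torch.randn(horizon, d_embed))
        # Projects a per-sample noise latent so repeated draws yield
        # different plausible futures (assumed mechanism, not Apple's).
        self.noise_proj = nn.Linear(d_embed, d_embed)

    def forward(self, past, condition, noise=None):
        # past: (B, T, d) past motion embeddings
        # condition: (B, d) text-prompt or poke embedding
        B, d = past.size(0), past.size(-1)
        if noise is None:
            noise = torch.randn(B, d, device=past.device)
        queries = self.future_queries.unsqueeze(0).expand(B, -1, -1)
        queries = queries + self.noise_proj(noise).unsqueeze(1)
        # Concatenate condition token, past trajectory, and future queries.
        seq = torch.cat([condition.unsqueeze(1), past, queries], dim=1)
        out = self.backbone(seq)
        return out[:, -self.horizon:]  # (B, horizon, d) predicted embeddings

# Sampling several futures is cheap: each draw is one forward pass over
# low-dimensional embeddings rather than a full video rollout.
model = MotionEmbeddingPredictor()
past = torch.randn(2, 32, 256)   # a batch of past trajectory embeddings
cond = torch.randn(2, 256)       # e.g. an encoded text prompt or poke
futures = [model(past, cond) for _ in range(4)]  # four sampled scenarios
```

The point of the sketch is the cost asymmetry: a transformer pass over a few dozen 256-dimensional vectors is trivially cheap compared with synthesizing even a single frame of video, which is what makes drawing many candidate futures feasible.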