By operating directly on long-term motion embeddings rather than raw pixels, Apple's system predicts scene dynamics orders of magnitude faster than full video synthesis. Trained on large-scale trajectory data, it generates realistic motion conditioned on text prompts or spatial pokes. Because each rollout stays in the compact embedding space, simulating many candidate futures is far cheaper, making the approach a fast alternative to full kinematics generation.
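The efficiency argument can be made concrete with a toy sketch. The snippet below is illustrative only, not Apple's implementation: all names, dimensions, and the stand-in dynamics model (`predict_next_embedding`, `EMBED_DIM`, `FRAME_SHAPE`, the tanh-linear step) are hypothetical. It rolls out several candidate futures as small embedding vectors and compares the per-step output size against synthesizing full video frames.

```python
import numpy as np

# Toy illustration (all names and sizes are hypothetical): predicting
# scene dynamics in a compact motion-embedding space instead of pixel space.

EMBED_DIM = 128                  # assumed size of a long-term motion embedding
FRAME_SHAPE = (16, 256, 256, 3)  # assumed clip shape for full video synthesis

rng = np.random.default_rng(0)

def predict_next_embedding(z: np.ndarray, prompt: np.ndarray) -> np.ndarray:
    """Stand-in for a learned dynamics model: one nonlinear step in embedding space."""
    W = rng.standard_normal((EMBED_DIM, EMBED_DIM)) * 0.01
    return np.tanh(z @ W + prompt)

# Simulate several candidate futures from one initial state; a "spatial poke"
# is modeled here as a small perturbation added to the conditioning vector.
z0 = rng.standard_normal(EMBED_DIM)
futures = []
for _ in range(4):
    poke = rng.standard_normal(EMBED_DIM) * 0.1
    z = z0
    for _ in range(8):               # 8 prediction steps per candidate future
        z = predict_next_embedding(z, poke)
    futures.append(z)

pixels_per_step = int(np.prod(FRAME_SHAPE))
print(f"values predicted per step: embedding={EMBED_DIM}, video={pixels_per_step}")
print(f"embedding space predicts {pixels_per_step // EMBED_DIM}x fewer values per step")
```

The point of the comparison is structural: each prediction step emits a 128-value embedding instead of millions of pixel values, so exploring multiple futures scales with the embedding size rather than the frame resolution.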