Apple researchers developed a method that models scene dynamics using long-term motion embeddings derived from large-scale tracker trajectories. Rather than synthesizing full video frames, the approach generates realistic motion directly, conditioned on text prompts or spatial "pokes" (localized user-specified perturbations of the scene). Because only motion is predicted, multiple future scenarios can be explored far faster, letting practitioners generate complex kinematics without the overhead of traditional video models.
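To make the idea concrete, here is a minimal sketch of the general pattern the summary describes, not Apple's actual implementation: observed point-track trajectories are compressed into compact motion embeddings, and a decoder predicts future per-point displacements from those embeddings plus a conditioning vector (a pooled text embedding or an encoded "poke"). All module names, dimensions, and the toy conditioning signal below are illustrative assumptions.

```python
# Hypothetical sketch of trajectory-embedding motion prediction; not the paper's model.
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Encodes point tracks of T (x, y) positions into one motion embedding each."""
    def __init__(self, d_model=128, n_layers=2):
        super().__init__()
        self.in_proj = nn.Linear(2, d_model)           # lift (x, y) to model width
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, tracks):                         # tracks: (N, T, 2)
        h = self.temporal(self.in_proj(tracks))        # attend over time steps
        return h.mean(dim=1)                           # (N, d_model) motion embeddings

class MotionDecoder(nn.Module):
    """Predicts future displacements for each track from its motion embedding
    plus a condition vector (e.g. a text embedding or an encoded poke)."""
    def __init__(self, d_model=128, horizon=16):
        super().__init__()
        self.horizon = horizon
        self.net = nn.Sequential(
            nn.Linear(d_model * 2, 256), nn.GELU(),
            nn.Linear(256, horizon * 2),               # (dx, dy) per future step
        )

    def forward(self, motion_emb, cond):               # both: (N, d_model)
        out = self.net(torch.cat([motion_emb, cond], dim=-1))
        return out.view(-1, self.horizon, 2)           # (N, horizon, 2)

# Toy usage: 64 tracks of 32 observed steps, conditioned on a random stand-in
# for an encoded poke or text prompt.
enc, dec = TrajectoryEncoder(), MotionDecoder()
tracks = torch.randn(64, 32, 2)                        # observed tracker trajectories
cond = torch.randn(64, 128)                            # placeholder conditioning vector
future = dec(enc(tracks), cond)                        # (64, 16, 2): motion only, no pixels
```

The efficiency claim in the summary falls out of this shape: the decoder emits a few thousand displacement values rather than full pixel frames, so sampling several candidate futures costs a fraction of what a full video synthesis model would.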