Apple researchers developed a method to model scene dynamics using long-term motion embeddings derived from point-tracker trajectories. The approach generates realistic motion from text prompts or spatial "pokes" without the overhead of full video synthesis, running orders of magnitude faster than standard video models and making motion generation far cheaper for downstream visual intelligence tasks.
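To make the "spatial poke" idea concrete, here is a toy sketch (not the paper's actual method; the function name, weighting scheme, and parameters are all illustrative assumptions): a user's poke vector is spread over the frame as a Gaussian-weighted displacement field, standing in for one step of poke-conditioned motion generation.

```python
import numpy as np

def poke_displacement_field(h, w, poke_xy, poke_vec, sigma=8.0):
    """Toy stand-in for poke-conditioned motion generation:
    spread a poke vector over an h x w grid as a Gaussian-weighted
    displacement field. All names and the weighting are illustrative,
    not the published model."""
    ys, xs = np.mgrid[0:h, 0:w]
    px, py = poke_xy
    # Gaussian falloff centered on the poke location
    weight = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
    # Broadcast the 2-D poke vector across the grid -> (h, w, 2) field
    field = weight[..., None] * np.asarray(poke_vec, dtype=float)
    return field

field = poke_displacement_field(32, 32, poke_xy=(16, 16), poke_vec=(3.0, -1.0))
print(field.shape)     # (32, 32, 2)
print(field[16, 16])   # response is strongest at the poke location
```

A real system would replace the Gaussian weighting with a learned mapping from the poke (or a text prompt) into the trajectory embedding space, but the output shape, a per-pixel displacement field rather than synthesized frames, is what makes the approach cheap compared with full video generation.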