Apple researchers have developed a method for modeling scene dynamics using long-term motion embeddings derived from tracker models. Instead of synthesizing full video, the approach generates realistic motion directly from text prompts or spatial pokes, operating orders of magnitude more efficiently than current video models. Practitioners can simulate complex trajectories without the compute overhead of frame-by-frame generation.
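To make the idea concrete, here is a minimal sketch of predicting point trajectories from a conditioning signal rather than rendering frames. This is an illustrative toy, not Apple's implementation: the module name `MotionTrajectoryDecoder`, the transformer backbone, and all dimensions are assumptions chosen for clarity.

```python
# Illustrative sketch (hypothetical, not the paper's code): map a conditioning
# vector (a text embedding or an encoded spatial "poke") plus initial query
# points to long-horizon point trajectories, skipping frame synthesis entirely.
import torch
import torch.nn as nn

class MotionTrajectoryDecoder(nn.Module):
    def __init__(self, cond_dim=512, embed_dim=256, horizon=64, num_points=128):
        super().__init__()
        self.horizon = horizon
        self.num_points = num_points
        # Project the conditioning signal into the motion-embedding space.
        self.cond_proj = nn.Linear(cond_dim, embed_dim)
        # Each query point starts from its (x, y) location.
        self.point_proj = nn.Linear(2, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Decode each point token into a full trajectory of (x, y) offsets.
        self.traj_head = nn.Linear(embed_dim, horizon * 2)

    def forward(self, query_xy, cond):
        # query_xy: (B, num_points, 2) initial point locations in [0, 1]
        # cond:     (B, cond_dim) text embedding or encoded spatial poke
        tokens = self.point_proj(query_xy) + self.cond_proj(cond)[:, None, :]
        tokens = self.backbone(tokens)
        offsets = self.traj_head(tokens).view(
            -1, self.num_points, self.horizon, 2)
        # Trajectories are cumulative offsets from each starting location.
        return query_xy[:, :, None, :] + offsets.cumsum(dim=2)

# Usage: 128 query points on a frame, conditioned on a single poke vector.
model = MotionTrajectoryDecoder()
pts = torch.rand(1, 128, 2)
poke = torch.randn(1, 512)
tracks = model(pts, poke)   # (1, 128, 64, 2): a 64-step trajectory per point
print(tracks.shape)
```

The efficiency argument follows from the output size: a set of point tracks is far smaller than a stack of rendered frames, so a compact trajectory decoder can stand in for frame-by-frame video generation when only the motion is needed.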