Apple researchers have developed a method for modeling scene dynamics using long-term motion embeddings derived from tracker models. Rather than performing expensive full video synthesis, the approach generates realistic motion trajectories directly from text prompts or spatial pokes, and it operates orders of magnitude more efficiently than current video models. Practitioners can thus simulate complex motions without the overhead of frame-by-frame generation.
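The core idea, decoding a motion trajectory directly instead of rendering every frame, can be illustrated with a toy sketch. Everything here is hypothetical (the function name, the eased-interpolation "decoder", the poke representation); the actual method decodes learned motion embeddings, which this stand-in does not attempt to reproduce:

```python
import numpy as np

def trajectory_from_poke(start, poke, T=16):
    """Toy stand-in for a trajectory decoder: map a spatial 'poke'
    (a start point plus a drag vector) directly to a T-step point
    trajectory, with no per-frame image synthesis involved.
    Hypothetical; not the paper's actual model or API."""
    start = np.asarray(start, dtype=float)
    poke = np.asarray(poke, dtype=float)
    t = np.linspace(0.0, 1.0, T)
    # Ease-out profile: the poked point moves fast, then settles,
    # mimicking a plausible physical response to a drag.
    ease = 1.0 - (1.0 - t) ** 2
    return start + ease[:, None] * poke  # shape (T, 2): one (x, y) per step

# Poke the point at (40, 60) with a drag of (30, -10):
traj = trajectory_from_poke(start=(40, 60), poke=(30, -10), T=16)
# traj[0] is the poked point; traj[-1] is start + poke.
```

The contrast with frame-by-frame generation is the point of the sketch: the output is a small (T, 2) array of positions, not T full images, which is where the efficiency gain in the described approach comes from.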