Apple researchers now generate long, realistic motions by operating directly on embeddings learned from large-scale tracker trajectories, bypassing the high computational cost of full video synthesis. Users can specify goals via text prompts or spatial pokes, making the approach a faster alternative for predicting scene dynamics in visual intelligence systems.
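The core idea above can be illustrated with a toy sketch: trajectories are mapped into a compact embedding space, motion is generated in that space (conditioned on a user signal such as a poke), and the result is decoded back to trajectories. Everything here is a hypothetical stand-in; the encoder, decoder, `generate_motion` routine, and all dimensions are illustrative, not the researchers' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear autoencoder standing in for the learned trajectory-embedding
# space (hypothetical; the actual model is not specified in the summary).
D_TRAJ = 16 * 2    # 16 timesteps of (x, y) for one tracked point
D_LATENT = 8

W_enc = rng.normal(size=(D_LATENT, D_TRAJ)) / np.sqrt(D_TRAJ)
W_dec = np.linalg.pinv(W_enc)

def encode(traj):
    """Flatten a (16, 2) trajectory and project into the latent space."""
    return W_enc @ traj.reshape(-1)

def decode(z):
    """Map a latent vector back to a (16, 2) trajectory."""
    return (W_dec @ z).reshape(16, 2)

def generate_motion(condition, steps=50, lr=0.1):
    """Refine a random latent toward a conditioning embedding -- a crude
    stand-in for generating motion in embedding space rather than pixels."""
    z = rng.normal(size=D_LATENT)
    for _ in range(steps):
        z = z - lr * (z - condition)  # pull the latent toward the condition
    return decode(z)

# A "spatial poke" rendered as a straight drift, then used as the condition.
poke = np.cumsum(np.tile([[0.5, 0.1]], (16, 1)), axis=0)
traj = generate_motion(encode(poke))
print(traj.shape)  # -> (16, 2)
```

Because generation happens on low-dimensional embeddings instead of frames, each step costs a small matrix-vector product rather than a full video-synthesis pass, which is where the speedup comes from.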