Apple researchers now generate realistic motion trajectories by operating on learned embeddings rather than synthesizing full video. Trained on large-scale trajectory data produced by tracking models, the approach reduces computational overhead by orders of magnitude compared with video-synthesis pipelines. Users can specify movement goals via text prompts or spatial pokes, which streamlines how visual models predict complex scene dynamics.
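To make the idea concrete, here is a minimal sketch of such a conditional trajectory generator: a model that maps a conditioning signal (a text embedding or an encoded spatial poke) into a learned motion embedding and decodes it directly into 2D point tracks, with no frame synthesis. All module names, dimensions, and the poke encoding below are illustrative assumptions, not details from the research itself.

```python
import torch
import torch.nn as nn

class TrajectoryGenerator(nn.Module):
    """Illustrative sketch: predicts motion trajectories in a learned
    embedding space instead of synthesizing full video frames.
    Every dimension and module choice here is an assumption."""

    def __init__(self, cond_dim=512, embed_dim=256, num_points=64, num_steps=16):
        super().__init__()
        self.cond_dim = cond_dim
        self.num_points = num_points
        self.num_steps = num_steps
        # Map a conditioning vector (text embedding or encoded poke)
        # into a latent motion embedding.
        self.cond_to_latent = nn.Sequential(
            nn.Linear(cond_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Decode the latent embedding into (x, y) positions for each
        # tracked point at each time step.
        self.decoder = nn.Linear(embed_dim, num_points * num_steps * 2)

    def encode_poke(self, location, displacement):
        # A "poke" here is a click location plus a desired displacement,
        # lifted into the same conditioning space as a text embedding.
        # Zero-padding to cond_dim is purely illustrative.
        poke = torch.cat([location, displacement], dim=-1)        # (B, 4)
        pad = torch.zeros(poke.shape[0], self.cond_dim - poke.shape[-1])
        return torch.cat([poke, pad], dim=-1)                     # (B, cond_dim)

    def forward(self, cond):
        latent = self.cond_to_latent(cond)        # (B, embed_dim)
        tracks = self.decoder(latent)             # (B, P * T * 2)
        return tracks.view(-1, self.num_points, self.num_steps, 2)

# Usage: condition on a spatial poke instead of a text prompt.
model = TrajectoryGenerator()
loc = torch.tensor([[0.4, 0.5]])       # normalized click position
disp = torch.tensor([[0.1, -0.2]])     # desired motion direction
cond = model.encode_poke(loc, disp)
trajectories = model(cond)             # (1, 64, 16, 2) point tracks
```

The key efficiency point the sketch illustrates: the output is a compact tensor of point tracks rather than pixels, so decoding a trajectory costs a few matrix multiplies instead of rendering every frame.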