Apple researchers now generate realistic motion trajectories by operating on learned embeddings rather than synthesizing full video. The approach trains on large-scale trajectory data produced by tracker models, letting it predict scene dynamics orders of magnitude faster than video diffusion. Practitioners can specify goals via text prompts or spatial "pokes" to produce long-term kinematics without the heavy compute of video generation.
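The core idea can be sketched in a few lines: instead of rendering frames, a small model maps a scene embedding plus a conditioning signal (a text embedding or a spatial "poke") directly to per-point 2D displacements, which are rolled out autoregressively into trajectories. This is a minimal illustrative sketch with toy random weights; all names, dimensions, and the MLP head are assumptions, not Apple's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 64    # scene embedding size (assumed)
COND_DIM = 16   # conditioning embedding size (assumed)
HIDDEN = 128

# Toy "trained" weights for a 2-layer MLP displacement head.
W1 = rng.normal(0, 0.05, (EMB_DIM + COND_DIM + 2, HIDDEN))
W2 = rng.normal(0, 0.05, (HIDDEN, 2))

def predict_step(scene_emb, cond, points):
    """Predict a 2D displacement for each query point."""
    n = points.shape[0]
    feats = np.concatenate(
        [np.tile(scene_emb, (n, 1)), np.tile(cond, (n, 1)), points], axis=1
    )
    return np.tanh(feats @ W1) @ W2  # (n, 2) displacements

def rollout(scene_emb, cond, points, steps=50):
    """Integrate predicted displacements into long-term trajectories."""
    traj = [points]
    for _ in range(steps):
        points = points + predict_step(scene_emb, cond, points)
        traj.append(points)
    return np.stack(traj)  # (steps + 1, n_points, 2)

scene_emb = rng.normal(size=EMB_DIM)
poke_cond = rng.normal(size=COND_DIM)  # stand-in for a text/poke embedding
query_pts = rng.uniform(0, 1, (8, 2))  # 8 scene points to track

traj = rollout(scene_emb, poke_cond, query_pts, steps=50)
print(traj.shape)  # (51, 8, 2): 51 timesteps, 8 points, (x, y)
```

The compute savings in the sketch mirror the claimed advantage: each step produces a handful of 2D coordinates instead of a full frame of pixels, so rollouts over many timesteps stay cheap.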