Apple researchers have trained a new long-term motion embedding on large-scale trajectories extracted by point-tracker models. By predicting scene dynamics directly as trajectories rather than synthesizing full video, the approach is orders of magnitude more efficient. The resulting model can generate realistic motions from text prompts or spatial pokes, streamlining how visual intelligence systems simulate complex future movements.
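To make the efficiency argument concrete, here is a minimal, untrained toy sketch of the general idea: a scene's motion is represented as a small set of 2D point tracks, pooled into a single embedding, from which future per-step displacements are decoded. Every name, shape, and design choice below (the mean-pooled embedding, the linear decoder, the `encode_tracks`/`decode_future` helpers) is an assumption for illustration, not the researchers' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_tracks(tracks, W_enc):
    """Pool a set of point tracks into one motion embedding.

    tracks: (N points, T observed steps, 2) array of xy positions.
    Flatten each track, project it, and mean-pool across points.
    (Toy stand-in for a learned trajectory encoder.)
    """
    n, t, _ = tracks.shape
    flat = tracks.reshape(n, t * 2)           # (N, T*2)
    return np.tanh(flat @ W_enc).mean(axis=0)  # (D,)

def decode_future(embedding, last_positions, W_dec, horizon):
    """Predict future positions as cumulative displacements.

    Decoding a handful of (x, y) deltas per step is far cheaper than
    rendering full video frames -- the core efficiency claim.
    """
    deltas = (embedding @ W_dec).reshape(horizon, 2)       # (H, 2)
    offsets = np.cumsum(deltas, axis=0)                    # running motion
    return last_positions[:, None, :] + offsets[None, :, :]  # (N, H, 2)

# Toy sizes: 8 tracked points, 4 observed steps, embedding dim 16, 3-step horizon.
N, T, D, H = 8, 4, 16, 3
tracks = rng.normal(size=(N, T, 2))
W_enc = rng.normal(size=(T * 2, D)) * 0.1  # random weights: untrained sketch
W_dec = rng.normal(size=(D, H * 2)) * 0.1

emb = encode_tracks(tracks, W_enc)
future = decode_future(emb, tracks[:, -1, :], W_dec, H)
print(emb.shape, future.shape)  # (16,) (8, 3, 2)
```

A conditioning signal (a text prompt or a spatial poke) would, in the real system, modulate the embedding before decoding; here the embedding comes only from the observed tracks.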