Apple researchers created a system that generates long, realistic motion trajectories by operating on learned embeddings rather than synthesizing full video. Trained on large-scale point-tracker data, the model predicts scene dynamics conditioned on text prompts or spatial "pokes". Because it works in embedding space instead of pixel space, sampling multiple plausible futures is far cheaper, a practical benefit for visual-intelligence applications.
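The core idea, that rolling a compact embedding forward in time is much cheaper than rendering video frames, can be illustrated with a toy sketch. Everything here (the dimensions, the linear transition, the conditioning vector standing in for a prompt or poke) is invented for illustration and is not Apple's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 16   # dimensionality of a learned trajectory embedding (assumed)
COND_DIM = 4   # conditioning vector, e.g. an encoded prompt or poke (assumed)
HORIZON = 8    # number of future steps to roll out

# Stand-ins for learned weights; a trained model would supply these.
W_state = rng.normal(scale=0.1, size=(EMB_DIM, EMB_DIM))
W_cond = rng.normal(scale=0.1, size=(EMB_DIM, COND_DIM))

def rollout(z0, cond, steps=HORIZON):
    """Autoregressively predict future trajectory embeddings.

    Each step is a small matrix multiply in embedding space, not a
    full video-frame synthesis, which is where the cost savings come from.
    """
    z, future = z0, []
    for _ in range(steps):
        z = np.tanh(W_state @ z + W_cond @ cond)  # toy transition function
        future.append(z)
    return np.stack(future)

# Sampling several futures is cheap: re-roll with perturbed conditioning
# (a crude stand-in for stochastic sampling of alternative outcomes).
z0 = rng.normal(size=EMB_DIM)
poke = rng.normal(size=COND_DIM)
futures = [rollout(z0, poke + 0.1 * rng.normal(size=COND_DIM))
           for _ in range(3)]
print(len(futures), futures[0].shape)
```

Each rollout costs a handful of small matrix multiplies, so generating many candidate futures scales linearly in the horizon rather than in rendered pixels.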