Apple researchers developed a long-term motion embedding learned from large-scale tracker trajectories. By predicting scene dynamics directly in this compact motion space, the approach bypasses expensive full video synthesis and is orders of magnitude more efficient. Users can generate realistic, long-term motion from text prompts or spatial pokes, streamlining how visual models simulate complex physical movement.
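To make the idea concrete, here is a minimal conceptual sketch of the general technique the blurb describes: compress long point-track trajectories into a compact motion embedding, then decode motion from that embedding plus a conditioning signal such as a spatial poke, never touching pixels. This is not Apple's actual architecture; every module name, tensor shape, and hyperparameter below is an illustrative assumption.

```python
# Hypothetical sketch of trajectory-space motion modeling (PyTorch).
# Assumed data layout: tracks of shape (B, N, T, 2) = batch, tracked
# points, timesteps, (x, y) coordinates; poke of shape (B, 4) =
# (x, y, dx, dy) for a user drag. All names here are invented.
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Encodes tracker trajectories (B, N, T, 2) into one motion embedding."""
    def __init__(self, t_steps: int, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(t_steps * 2, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, tracks: torch.Tensor) -> torch.Tensor:
        b, n, t, _ = tracks.shape
        per_point = self.mlp(tracks.reshape(b, n, t * 2))  # (B, N, dim)
        return per_point.mean(dim=1)  # pool over points -> (B, dim)

class PokeConditionedDecoder(nn.Module):
    """Decodes full trajectories from the embedding plus a spatial poke."""
    def __init__(self, t_steps: int, n_points: int, dim: int = 256):
        super().__init__()
        self.t_steps, self.n_points = t_steps, n_points
        self.mlp = nn.Sequential(
            nn.Linear(dim + 4, dim), nn.GELU(),
            nn.Linear(dim, n_points * t_steps * 2),
        )

    def forward(self, emb: torch.Tensor, poke: torch.Tensor) -> torch.Tensor:
        out = self.mlp(torch.cat([emb, poke], dim=-1))
        return out.reshape(-1, self.n_points, self.t_steps, 2)

# Toy usage: 8 clips, 64 tracked points, 48 timesteps of (x, y) positions.
tracks = torch.randn(8, 64, 48, 2)
poke = torch.randn(8, 4)
enc, dec = TrajectoryEncoder(48), PokeConditionedDecoder(48, 64)
pred = dec(enc(tracks), poke)  # (8, 64, 48, 2) predicted trajectories
loss = nn.functional.mse_loss(pred, tracks)  # toy reconstruction objective
```

The efficiency claim follows from the output space: a few thousand (x, y) coordinates per clip instead of millions of pixels per frame, which is why trajectory-space prediction can be far cheaper than full video synthesis.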