Apple researchers created a system that predicts scene dynamics by operating on motion embeddings derived from large-scale tracker trajectories, bypassing the cost of full video synthesis. Conditioned on text prompts or spatial pokes, it generates realistic, long-term motion, enabling faster iteration for visual intelligence and kinematics generation.
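The core idea is that predicting where tracked points go is far cheaper than rendering pixels. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's actual method: a fixed random linear map stands in for the learned trajectory encoder, and a constant-velocity rollout stands in for the learned dynamics model; all function names and shapes are assumptions.

```python
import numpy as np

def embed_tracks(tracks, dim=16, seed=0):
    """Project (N, T, 2) point trajectories into a compact motion embedding.
    A fixed random linear map stands in for a learned encoder (assumption)."""
    n, t, _ = tracks.shape
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((t * 2, dim)) / np.sqrt(t * 2)
    return tracks.reshape(n, t * 2) @ proj  # (N, dim)

def extrapolate_tracks(tracks, horizon):
    """Roll trajectories forward in track space at constant velocity --
    a stand-in for the learned model that predicts long-term motion."""
    vel = tracks[:, -1] - tracks[:, -2]              # (N, 2) last-step velocity
    steps = np.arange(1, horizon + 1)[None, :, None]  # (1, horizon, 1)
    future = tracks[:, -1:, :] + vel[:, None, :] * steps  # (N, horizon, 2)
    return np.concatenate([tracks, future], axis=1)

# Usage: 8 tracked points observed over 4 frames, predict 6 more steps.
obs = np.cumsum(np.random.default_rng(1).standard_normal((8, 4, 2)), axis=1)
full = extrapolate_tracks(obs, horizon=6)   # (8, 10, 2)
emb = embed_tracks(full)                    # (8, 16)
print(full.shape, emb.shape)
```

The payoff of operating in this space is visible in the sizes involved: a few hundred floats per scene versus millions of pixels per synthesized frame.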