Apple researchers introduced MixAtlas to optimize data mixtures for multimodal LLM midtraining. The framework uses systematic domain decomposition and small proxy models to improve sample efficiency. This approach replaces manual tuning of data formats and task types. It allows developers to refine training sets without the cost of full-scale model runs.