The MixAtlas framework uses small proxy models to optimize data mixtures for multimodal LLM midtraining. By decomposing the data into domains systematically, it improves sample efficiency and generalization, replacing manual tuning of data formats and task types. Practitioners can refine training recipes at a fraction of the compute that full-scale model iterations would require.
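
To make the proxy-model idea concrete, here is a minimal sketch of what a mixture search driven by cheap proxies could look like. This is an illustration only, not MixAtlas's actual API: the domain names, the `evaluate_proxy` function, and the Dirichlet random search are all assumptions, and the proxy evaluation is stubbed with a synthetic loss where a real pipeline would train a small model on the candidate mixture and report its validation loss.

```python
"""Illustrative proxy-model mixture search (hypothetical, not the MixAtlas API).

Idea: rather than tuning the data mixture on the full model, score many
candidate mixtures with cheap proxy models and keep the best one.
"""
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical midtraining data domains after decomposition.
DOMAINS = ["captioning", "ocr", "charts", "interleaved_text"]


def evaluate_proxy(weights: np.ndarray) -> float:
    """Stand-in for training a small proxy model on the mixture defined
    by `weights` and returning its validation loss. Here we use a
    synthetic quadratic bowl around an arbitrary optimum so the sketch
    runs end to end."""
    target = np.array([0.4, 0.2, 0.1, 0.3])
    return float(np.sum((weights - target) ** 2))


def search_mixture(n_trials: int = 256) -> tuple[np.ndarray, float]:
    """Random search over the probability simplex: sample candidate
    mixtures from a uniform Dirichlet prior, score each with the proxy,
    and keep the lowest-loss candidate."""
    best_w, best_loss = None, float("inf")
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(len(DOMAINS)))  # candidate mixture weights
        loss = evaluate_proxy(w)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


if __name__ == "__main__":
    weights, loss = search_mixture()
    for domain, w in zip(DOMAINS, weights):
        print(f"{domain:>18s}: {w:.3f}")
    print(f"best proxy val loss: {loss:.4f}")
```

The compute saving comes from the loop body: each trial costs one small proxy run instead of a full-scale training iteration, so hundreds of candidate mixtures can be screened for roughly the price of a single large run.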