The MixAtlas framework optimizes multimodal training by using small proxy models to determine ideal data mixtures. It systematically decomposes the training data into domains to improve sample efficiency and downstream generalization. This approach replaces manual tuning of the mix of data formats and task types. Practitioners can reduce compute costs during midtraining while maintaining high model performance.
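
The proxy-based idea can be illustrated with a minimal sketch. This is not MixAtlas's actual procedure, which the summary does not specify. It is a generic random search over mixture weights, under several assumptions: the domain names are hypothetical, `proxy_score` is a synthetic stand-in for "train a small proxy model on this mixture and evaluate it", and the candidates are drawn from a symmetric Dirichlet distribution.

```python
import random

# Hypothetical domain names, chosen only for illustration.
DOMAINS = ["captioning", "ocr", "vqa", "interleaved"]


def sample_mixture(domains, rng, alpha=1.0):
    """Draw mixture weights from a symmetric Dirichlet over the domains."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in domains]
    total = sum(draws)
    return {d: x / total for d, x in zip(domains, draws)}


def proxy_score(mixture):
    """Synthetic stand-in for training and evaluating a small proxy model.

    Assumption: in practice this would launch a cheap training run on data
    sampled according to `mixture` and return a validation metric. Here it
    returns the negative squared distance to an arbitrary made-up target.
    """
    target = {"captioning": 0.4, "ocr": 0.1, "vqa": 0.3, "interleaved": 0.2}
    return -sum((mixture[d] - target[d]) ** 2 for d in mixture)


def search_mixture(domains, score_fn, n_trials=200, seed=0):
    """Score random candidate mixtures and keep the best one."""
    rng = random.Random(seed)
    best_mix, best_score = None, float("-inf")
    for _ in range(n_trials):
        mix = sample_mixture(domains, rng)
        score = score_fn(mix)
        if score > best_score:
            best_mix, best_score = mix, score
    return best_mix, best_score
```

Calling `search_mixture(DOMAINS, proxy_score)` returns the best-scoring weight dictionary and its score. The selected weights would then set the sampling proportions for the full-scale run, so the cost of the search is paid on the small proxy rather than on the target model.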