The MixAtlas framework uses small proxy models to optimize data mixtures for multimodal LLM midtraining. It replaces manual tuning with a systematic decomposition of the corpus into domains, improving sample efficiency. Because the mixture search runs on small proxies rather than the full model, the compute needed to find near-optimal domain weights is greatly reduced, and practitioners can tune their training mixtures more precisely to improve downstream generalization.
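To make the proxy-based search concrete, below is a minimal sketch of the general idea under stated assumptions: candidate domain mixtures are sampled, each is scored by training a small proxy model, and the best-scoring mixture is kept. The domain names, the `train_and_score_proxy` stub, and the Dirichlet sampling are hypothetical illustrations, not MixAtlas's actual interface or algorithm; the stub is replaced by a synthetic objective so the sketch runs end to end.

```python
# Hypothetical sketch of proxy-model mixture search; not the MixAtlas API.
# A real run would train a small proxy model per candidate mixture and
# score it on a held-out multimodal validation set.
import numpy as np

rng = np.random.default_rng(0)

# Assumed domain decomposition of the midtraining corpus (illustrative only).
DOMAINS = ["image_captions", "interleaved_docs", "ocr_text", "code", "web_text"]

def train_and_score_proxy(weights: np.ndarray) -> float:
    """Stand-in for training a small proxy on a mixture defined by `weights`
    (one weight per domain, summing to 1) and returning its held-out
    validation loss. Replaced by a synthetic quadratic objective here."""
    target = np.array([0.35, 0.25, 0.15, 0.10, 0.15])  # pretend-optimal mixture
    return float(np.sum((weights - target) ** 2) + 0.01 * rng.normal())

def search_mixture(n_candidates: int = 64) -> tuple[np.ndarray, float]:
    """Sample candidate mixtures from a Dirichlet prior, score each with a
    proxy run, and return the best-scoring mixture."""
    candidates = rng.dirichlet(np.ones(len(DOMAINS)), size=n_candidates)
    losses = np.array([train_and_score_proxy(w) for w in candidates])
    best = int(np.argmin(losses))
    return candidates[best], float(losses[best])

if __name__ == "__main__":
    best_weights, best_loss = search_mixture()
    for name, weight in zip(DOMAINS, best_weights):
        print(f"{name:>16}: {weight:.3f}")
    print(f"proxy val loss: {best_loss:.4f}")
```

In practice, a fitted surrogate (for example, a regression from mixture weights to proxy loss) could replace the pure random search so that the final weights interpolate between sampled candidates; the point of the sketch is simply the cheap proxy-evaluation loop that stands in for full-scale training.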