The MixAtlas framework uses small proxy models to optimize data mixtures for multimodal LLM midtraining. By replacing manual tuning with systematic domain decomposition, it improves sample efficiency and reduces the compute needed to find near-optimal data weights, letting researchers refine multimodal training recipes without exhaustive full-scale experiments.
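
The general idea can be sketched as follows. This is a minimal, hypothetical illustration of proxy-based mixture search, not the MixAtlas implementation: the domain names, the synthetic `proxy_loss` function, and the quadratic surrogate are all assumptions standing in for real proxy-model training runs.

```python
import numpy as np

rng = np.random.default_rng(0)
DOMAINS = ["image-text", "ocr", "code", "web-text"]  # assumed domain split

def sample_simplex(n, k, rng):
    """Draw n random mixture-weight vectors over k domains (each sums to 1)."""
    return rng.dirichlet(np.ones(k), size=n)

def proxy_loss(weights):
    """Stand-in for training a small proxy model at these mixture weights
    and measuring validation loss. Purely synthetic for illustration."""
    target = np.array([0.4, 0.2, 0.1, 0.3])  # pretend optimum, unknown in practice
    return float(np.sum((weights - target) ** 2) + 1.0)

# 1) Evaluate a handful of candidate mixtures with cheap proxy runs.
candidates = sample_simplex(32, len(DOMAINS), rng)
losses = np.array([proxy_loss(w) for w in candidates])

# 2) Fit a simple quadratic surrogate: loss(w) ~ features(w) @ theta.
def features(w):
    # constant + linear + pairwise-quadratic terms of the weight vector
    quad = np.outer(w, w)[np.triu_indices(len(w))]
    return np.concatenate([[1.0], w, quad])

X = np.stack([features(w) for w in candidates])
theta, *_ = np.linalg.lstsq(X, losses, rcond=None)

# 3) Score a dense pool of fresh candidates with the surrogate and pick
#    the predicted-best mixture for the full-scale run.
pool = sample_simplex(5000, len(DOMAINS), rng)
pred = np.stack([features(w) for w in pool]) @ theta
best = pool[np.argmin(pred)]
print({d: round(float(v), 3) for d, v in zip(DOMAINS, best)})
```

The design choice this illustrates is the core trade: many cheap proxy evaluations plus a surrogate model replace a grid of expensive full-scale training runs, so compute scales with the number of proxy runs rather than with the resolution of the mixture search.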