The MixAtlas framework uses small proxy models to optimize data mixtures for multimodal LLM midtraining. It replaces ad hoc, trial-and-error mixture tuning with systematic domain decomposition to improve sample efficiency. This approach reduces the compute cost of finding effective training recipes: mixture candidates are evaluated on cheap proxies rather than the full model. Practitioners can thus refine multimodal datasets without exhaustive, full-scale model iterations.
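To make the idea concrete, here is a minimal sketch of proxy-based mixture search. Everything in it is a hypothetical stand-in, not the MixAtlas implementation: the domain names, the `proxy_loss` function (which substitutes a synthetic score for actually training a small proxy model on each candidate mixture), and the plain random search over the mixture simplex.

```python
import random

# Hypothetical data domains after decomposition; MixAtlas's actual
# domain taxonomy is not specified here.
DOMAINS = ["captions", "ocr", "interleaved", "text_only"]


def proxy_loss(weights):
    # Stand-in for the expensive step: train a small proxy model on
    # data sampled with these mixture weights, then measure validation
    # loss. Here we fake it with a bowl-shaped function whose optimum
    # is an arbitrary, made-up target mixture.
    target = [0.4, 0.2, 0.3, 0.1]
    return sum((w - t) ** 2 for w, t in zip(weights, target))


def sample_mixture(rng):
    # Draw nonnegative weights and normalize them onto the simplex,
    # so they form a valid sampling distribution over domains.
    raw = [rng.random() for _ in DOMAINS]
    total = sum(raw)
    return [r / total for r in raw]


def search_mixture(trials=2000, seed=0):
    # Random search: evaluate many candidate mixtures on the cheap
    # proxy and keep the best. A real system could use a smarter
    # optimizer (regression fits, Bayesian optimization, etc.).
    rng = random.Random(seed)
    best_weights, best_loss = None, float("inf")
    for _ in range(trials):
        w = sample_mixture(rng)
        loss = proxy_loss(w)
        if loss < best_loss:
            best_weights, best_loss = w, loss
    return best_weights, best_loss


if __name__ == "__main__":
    weights, loss = search_mixture()
    print({d: round(w, 3) for d, w in zip(DOMAINS, weights)}, round(loss, 5))
```

The point of the sketch is the cost structure: each `proxy_loss` call is cheap, so thousands of candidate mixtures can be screened for less than one full-scale training run, and only the winning recipe is promoted to the large model.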