Researchers at Apple found that conditional diffusion models learn compositional structure only inconsistently. Experiments on the CLEVR dataset showed that length generalization—generating scenes with more objects than any seen during training—succeeds only under specific conditions, which suggests the models often fail to capture the underlying compositional logic rather than just the training distribution. Practitioners should therefore expect unpredictable results when prompting for object counts outside the training range.
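The length-generalization probe described above can be sketched as a simple evaluation loop. Everything in the sketch below—the sampler stub, the count ranges, the exact-count accuracy metric—is a hypothetical illustration of the idea, not the study's actual protocol or model:

```python
# Hypothetical sketch of a length-generalization check. In the real study a
# conditional diffusion model generates CLEVR scenes; here a toy sampler stands
# in, behaving reliably inside an assumed training range of object counts and
# drifting beyond it.
import random

TRAIN_COUNTS = range(1, 6)    # object counts assumed seen during training
TEST_COUNTS = range(1, 11)    # probes that extend past the training range

def sample_scene(requested_count: int, rng: random.Random) -> int:
    """Stand-in for model sampling plus object detection: returns how many
    objects the generated scene actually contains."""
    if requested_count in TRAIN_COUNTS:
        return requested_count  # in-distribution: count is matched exactly
    # out-of-distribution: the count drifts, mimicking unreliable extrapolation
    return max(1, requested_count + rng.choice([-2, -1, 0, 1]))

def exact_count_accuracy(count: int, trials: int = 200, seed: int = 0) -> float:
    """Fraction of samples whose object count exactly matches the request."""
    rng = random.Random(seed)
    hits = sum(sample_scene(count, rng) == count for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    for c in TEST_COUNTS:
        tag = "train" if c in TRAIN_COUNTS else "extrapolate"
        print(f"count={c:2d} ({tag}): exact-count accuracy = "
              f"{exact_count_accuracy(c):.2f}")
```

Plotting accuracy against requested count makes the failure mode visible: accuracy stays high inside the training range and degrades unevenly beyond it, which is the kind of inconsistent extrapolation the findings describe.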