Testing in a controlled CLEVR setting reveals that conditional diffusion models only sporadically learn the underlying compositional structure. The researchers focused on length generalization: the ability to generate scenes containing more objects than were present during training. These findings suggest that the out-of-distribution compositional abilities observed in Apple's study are inconsistent and lack a reliable underlying mechanism.
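A length-generalization test of this kind can be sketched as a simple evaluation harness: condition on a requested object count, sample, and measure how often the generated scene matches the request, separately for counts seen during training and counts beyond them. Everything below is hypothetical; `generate_scene` is a stand-in for a real conditional diffusion sampler (plus an object detector to count objects), and the 0.3 out-of-distribution success rate is an illustrative assumption, not a result from the study.

```python
import random

# Assumed training regime: scenes with up to MAX_TRAIN_OBJECTS objects.
MAX_TRAIN_OBJECTS = 5

def generate_scene(requested_count, rng):
    """Stand-in for a conditional diffusion sampler followed by an
    object counter. Simulates a model that honors the conditioning
    in-distribution but only sporadically out-of-distribution."""
    if requested_count <= MAX_TRAIN_OBJECTS:
        return requested_count  # in-distribution: matches the condition
    # Out-of-distribution: succeeds only 30% of the time (illustrative),
    # otherwise falls back to a count seen during training.
    if rng.random() < 0.3:
        return requested_count
    return rng.randint(1, MAX_TRAIN_OBJECTS)

def count_accuracy(counts, trials, seed=0):
    """Fraction of samples whose object count matches the request."""
    rng = random.Random(seed)
    return {
        n: sum(generate_scene(n, rng) == n for _ in range(trials)) / trials
        for n in counts
    }

in_dist = count_accuracy(range(1, MAX_TRAIN_OBJECTS + 1), trials=200)
out_dist = count_accuracy(range(MAX_TRAIN_OBJECTS + 1, 11), trials=200)
print("in-distribution accuracy:", in_dist)
print("out-of-distribution accuracy:", out_dist)
```

The gap between the two accuracy tables is the signature the study reports: near-perfect adherence to the conditioning inside the training range, and unreliable adherence beyond it.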