Dreamweaver: Learning Compositional World Models from Pixels

💫ICLR 2025🎉

1KAIST, 2Rutgers University

Abstract

Humans have an innate ability to decompose their perceptions of the world into objects and their attributes, such as colors, shapes, and movement patterns. This cognitive process enables us to imagine novel futures by recombining familiar concepts. However, replicating this ability in artificial intelligence systems has proven challenging, particularly when it comes to modeling videos into compositional concepts and generating unseen, recomposed futures without relying on auxiliary data such as text, masks, or bounding boxes. In this paper, we propose Dreamweaver, a neural architecture designed to discover hierarchical and compositional representations from raw videos and to generate compositional future simulations. Our approach leverages a novel Recurrent Block-Slot Unit (RBSU) to decompose videos into their constituent objects and attributes. In addition, Dreamweaver uses a multi-future-frame prediction objective to more effectively capture disentangled representations for dynamic concepts as well as static ones. In experiments, we demonstrate that our model outperforms current state-of-the-art baselines for world modeling when evaluated under the DCI framework across multiple datasets. Furthermore, we show how the modularized concept representations of our model enable compositional imagination, allowing novel videos to be generated by recombining attributes from previously seen objects.

Dreamweaver Framework


Our aim is to take a sequential, unstructured sensory stream and bind its low-level information into abstract, modular concepts, building a memory of reusable concepts called a concept library, all without text and in an unsupervised way. These concepts include both static factors, such as color and shape, and dynamic factors, such as direction and speed of motion. Finally, we seek to recombine these concepts, e.g., in a novel configuration, to imagine an unseen world.




Architecture

Dreamweaver Architecture

Left: The Recurrent Block-Slot Unit (RBSU) is a recurrent unit designed for processing sequences in which each item is a set of vectors. The RBSU maintains and updates block-slots, which represent compositional, semantic concepts such as shape, color, and motion direction. Right: The Dreamweaver model encodes video inputs into block-slot representations, which are passed through a series of RBSUs in a recurrent structure. It then predicts future frames by decoding the extracted block-slots with a transformer decoder, and is trained to minimize this predictive objective.
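For intuition, here is a minimal PyTorch sketch of the attend-then-update loop an RBSU-style unit performs, assuming a slot-attention-style read-out and a GRU update. The dimensions, the flattened block layout, and the choice of update rule are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class RecurrentBlockSlotUnit(nn.Module):
    """Minimal sketch of one RBSU step (layout and update rule are illustrative)."""

    def __init__(self, num_blocks=8, block_dim=32, feat_dim=256):
        super().__init__()
        d = num_blocks * block_dim              # full slot dimension (M blocks of size block_dim)
        self.scale = d ** -0.5
        self.to_q = nn.Linear(d, d, bias=False)
        self.to_k = nn.Linear(feat_dim, d, bias=False)
        self.to_v = nn.Linear(feat_dim, d, bias=False)
        self.update = nn.GRUCell(d, d)          # assumed recurrent update

    def forward(self, slots, feats):
        # slots: (B, S, d) block-slots; feats: (B, N, feat_dim) current-frame features
        B, S, d = slots.shape
        q = self.to_q(slots)
        k, v = self.to_k(feats), self.to_v(feats)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=1)  # slots compete per feature
        attn = attn / attn.sum(dim=-1, keepdim=True)                     # normalize over inputs
        updates = attn @ v                                               # (B, S, d) aggregated evidence
        slots = self.update(updates.reshape(B * S, d), slots.reshape(B * S, d))
        return slots.reshape(B, S, d)

# Schematic outer loop: update block-slots frame by frame; a transformer decoder
# trained on a multi-future-frame prediction loss would then decode from them (not shown).
rbsu = RecurrentBlockSlotUnit()
slots = torch.zeros(2, 4, 8 * 32)               # (batch, num_slots, d) initial state
for feats in torch.randn(10, 2, 64, 256):       # 10 frames of (batch, tokens, feat_dim)
    slots = rbsu(slots, feats)
```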




Unsupervised Modular Concept Discovery from Videos

(1) DCI Performance

Concept Discovery

We compare our model with the baselines in terms of Disentanglement (D), Completeness (C), Informativeness (I), and Informativeness-Dynamic (I-D). I-D is the informativeness score computed over dynamic concepts only (e.g., direction of motion or dance patterns), which evaluates how effectively the models capture such concepts.
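For reference, DCI scores are conventionally computed by fitting a simple probe from the learned code to each ground-truth factor and scoring the resulting importance matrix (Eastwood & Williams, 2018). Below is a minimal sketch of that protocol, assuming scikit-learn, discrete factors, and a gradient-boosted probe; the I-D column follows the same recipe with `factors` restricted to the dynamic ones.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.ensemble import GradientBoostingClassifier

def dci_scores(codes, factors):
    """codes: (N, D) learned representations; factors: (N, K) discrete ground-truth factors."""
    D, K = codes.shape[1], factors.shape[1]
    R = np.zeros((D, K))                             # importance of code dim d for factor k
    informativeness = []
    for k in range(K):
        probe = GradientBoostingClassifier().fit(codes, factors[:, k])
        R[:, k] = probe.feature_importances_
        informativeness.append(probe.score(codes, factors[:, k]))  # use a held-out split in practice

    P = R / (R.sum(axis=1, keepdims=True) + 1e-12)   # per-dimension distribution over factors
    disent = 1.0 - entropy(P.T, base=K)              # one score per code dimension
    rho = R.sum(axis=1) / R.sum()                    # weight dims by total importance
    d_score = float((disent * rho).sum())

    Q = R / (R.sum(axis=0, keepdims=True) + 1e-12)   # per-factor distribution over dimensions
    c_score = float((1.0 - entropy(Q, base=D)).mean())
    return d_score, c_score, float(np.mean(informativeness))
```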

(2) Visualizing Captured Concepts

This figure illustrates the concepts represented by each block within the learned block-slot representations. To obtain it, we gather block representations sharing the same index and apply a clustering method such as k-means, following the approach of Singh et al. (2023).
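A minimal sketch of this procedure, assuming scikit-learn; the array shapes, cluster count, and block index are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# blocks: (num_samples, num_blocks, block_dim) block-slot representations
# gathered from the trained model (random placeholders here).
blocks = np.random.randn(1000, 8, 32).astype(np.float32)

block_idx = 3                                   # inspect one block position
X = blocks[:, block_idx, :]                     # all vectors stored at that block index
labels = KMeans(n_clusters=6, n_init=10).fit_predict(X)
# Decoding a few samples from each cluster then reveals which concept
# (e.g., a particular color or motion direction) that block encodes.
```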



Compositional Imagination


We show compositionally novel videos generated by Dreamweaver. In this visualization, we (1) infer the block-slot representation from an initial context video, (2) manipulate the inferred block-slot representation, and (3) perform a rollout starting from the manipulated representation. At the top, we also visualize the rollout that would have occurred had no manipulation been applied. Left: For the Moving-Sprites dataset, we visualize manipulations such as swapping color and shape, changing the direction of motion of a specific object, and changing the speed of a specific object's movement. Right: For the Dancing-CLEVR dataset, we visualize manipulations such as swapping object shapes and changing the dance patterns.
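Concretely, step (2) amounts to editing individual blocks inside the inferred block-slot tensor before rollout. A minimal sketch, where the block index and the `rollout` call are hypothetical placeholders:

```python
import torch

# z: (num_slots, num_blocks, block_dim) block-slots inferred from the
# context frames (random placeholder here).
z = torch.randn(4, 8, 32)

COLOR_BLOCK = 2                                  # assumed index of the color block
z[[0, 1], COLOR_BLOCK] = z[[1, 0], COLOR_BLOCK]  # swap color between objects 0 and 1
# frames = model.rollout(z, steps=16)            # then decode the imagined future
```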




Compositional Scene Prediction and Reasoning

Scene Prediction Results

We compare our model with the baselines in terms of prediction accuracy at different frame offsets. A frame offset of zero corresponds to the last context frame, a frame offset of one to the first predicted frame after the context, and so on.
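A minimal sketch of how such a per-offset curve can be computed from aligned predicted and ground-truth frames; the paper's exact accuracy metric is not specified here, so this per-element comparison is illustrative.

```python
import numpy as np

def accuracy_by_offset(pred, target):
    """pred/target: (num_videos, num_frames, ...) arrays aligned so that
    frame index 0 is the last context frame."""
    flat = (pred == target).reshape(pred.shape[0], pred.shape[1], -1)
    per_frame = flat.mean(axis=-1)   # per-video, per-frame accuracy
    return per_frame.mean(axis=0)    # one accuracy value per frame offset
```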



BibTeX

@inproceedings{baek2025dreamweaver,
  title={Dreamweaver: Learning Compositional World Models from Pixels},
  author={Junyeob Baek and Yi-Fu Wu and Gautam Singh and Sungjin Ahn},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=e5mTvjXG9u}
}