LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
arXiv:2603.19312v1 Announce Type: new

Abstract: Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.
Executive Summary
This article presents LeWorldModel (LeWM), a Joint Embedding Predictive Architecture (JEPA) that achieves stable end-to-end training from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. Compared with the only existing end-to-end alternative, LeWM reduces tunable loss hyperparameters from six to one, and with ~15M parameters it trains on a single GPU in a few hours while planning up to 48x faster than foundation-model-based world models, remaining competitive across diverse 2D and 3D control tasks. The model's latent space is shown to encode meaningful physical structure, enabling reliable detection of physically implausible events. While LeWorldModel demonstrates a significant advancement in world modeling, its scalability and applicability in real-world scenarios remain to be explored. This research points toward more efficient and effective learning of complex world models.
Key Points
- ▸ LeWorldModel is the first JEPA that trains stably end-to-end from raw pixels using only two loss terms
- ▸ The model reduces the number of tunable loss hyperparameters from six to one
- ▸ LeWorldModel remains competitive across diverse 2D and 3D control tasks while planning up to 48x faster than foundation-model-based world models
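The two-term objective summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the exact form of the Gaussian regularizer (here, pushing batch mean and covariance toward N(0, I)) and the names `lewm_loss` and `lam` are assumptions made for clarity.

```python
import numpy as np

def lewm_loss(pred_next, target_next, z_batch, lam=1.0):
    """Sketch of a two-term JEPA objective (assumed form, not the paper's exact loss).

    pred_next:   (B, D) predicted next-step embeddings
    target_next: (B, D) encoder embeddings of the actual next frames
    z_batch:     (B, D) current-batch embeddings to regularize toward N(0, I)
    lam:         the single tunable loss weight
    """
    # Term 1: next-embedding prediction error (mean squared error).
    pred_loss = np.mean((pred_next - target_next) ** 2)

    # Term 2: Gaussian regularizer -- push batch statistics toward zero mean
    # and identity covariance so the latents stay spread out (an assumed
    # concrete form; this is what discourages collapse to a constant).
    mu = z_batch.mean(axis=0)
    cov = np.cov(z_batch, rowvar=False)
    d = z_batch.shape[1]
    reg = np.sum(mu ** 2) + np.sum((cov - np.eye(d)) ** 2)

    return pred_loss + lam * reg
```

Note the mechanism: a collapsed batch (all embeddings identical) makes the prediction term trivially small, but its covariance is zero, so the regularizer stays large. With only `lam` to tune, this matches the paper's claim of a single loss hyperparameter.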
Merits
Training Stability
LeWorldModel achieves stable end-to-end training from raw pixels with a two-term loss (next-embedding prediction plus a Gaussian regularizer), avoiding representation collapse without exponential moving averages, pre-trained encoders, or auxiliary supervision
Competitive Performance
LeWorldModel demonstrates competitive performance across diverse 2D and 3D control tasks while planning up to 48x faster than foundation-model-based world models
Physical Structure Encoding
The model's latent space encodes meaningful physical structure, as demonstrated by probing of physical quantities, and surprise evaluation confirms reliable detection of physically implausible events
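The surprise evaluation mentioned above can be understood as thresholding the model's own prediction error in latent space: a physically implausible event produces a next frame whose embedding the predictor did not anticipate. The sketch below assumes this interpretation; the function names and the mean-plus-k-sigma threshold rule are illustrative choices, not the paper's method.

```python
import numpy as np

def surprise_scores(pred_next, actual_next):
    """Per-step surprise as latent prediction error (assumed metric:
    squared L2 distance between predicted and encoded next embeddings)."""
    return np.sum((pred_next - actual_next) ** 2, axis=1)

def flag_implausible(scores, k=2.0):
    """Flag steps whose surprise exceeds mean + k * std over the trajectory
    (a simple threshold rule chosen for illustration)."""
    threshold = scores.mean() + k * scores.std()
    return scores > threshold
```

On a trajectory where the world behaves as learned, scores stay near their baseline; a single implausible transition stands out as an outlier and is flagged.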
Demerits
Scalability
The scalability and applicability of LeWorldModel in real-world scenarios remain to be explored
Evaluation on Real-World Data
The model's performance on real-world data and its ability to generalize to diverse environments require further evaluation
Interpretability and Explainability
The interpretability and explainability of LeWorldModel's latent space and decision-making processes require further investigation
Expert Commentary
LeWorldModel represents a significant advancement in the field of artificial intelligence, demonstrating the potential for end-to-end learning of complex world models from raw pixels. The model's ability to encode meaningful physical structure in its latent space and detect physically implausible events is a notable achievement. However, the scalability and applicability of LeWorldModel in real-world scenarios require further evaluation and investigation. As the field continues to evolve, it is essential to explore the implications of LeWorldModel on various industries and policy domains, ensuring a responsible and beneficial development of artificial intelligence.
Recommendations
- ✓ Future research should focus on exploring the scalability and applicability of LeWorldModel in real-world scenarios and evaluating its performance on diverse environments
- ✓ Investigate the interpretability and explainability of LeWorldModel's latent space and decision-making processes to ensure a deeper understanding of the model's behavior and limitations
- ✓ Explore the potential applications of LeWorldModel in various industries and policy domains, ensuring a responsible and beneficial development of artificial intelligence
Sources
Original: arXiv - cs.LG