The first Joint Embedding Predictive Architecture (JEPA) to train stably end-to-end from raw pixels, delivering massive planning speedups.
March 23, 2026
Original Paper
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
arXiv · 2603.19312
The Takeaway
By reducing hyperparameters and simplifying the loss to just two terms, LeWM achieves 48x faster planning than foundation-model-based world models. It demonstrates that stable, pixel-to-latent world models can be trained on a single GPU in hours rather than days.
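The planning speedup comes from rolling out candidate action sequences entirely in the compact latent space, never decoding back to pixels or querying a large foundation model. The sketch below illustrates this with a simple random-shooting planner; the planner, the `reward_fn`, and the action-conditioned `predictor` interface are illustrative assumptions, not details from the paper.

```python
import torch


@torch.no_grad()
def plan(encoder, predictor, reward_fn, obs, action_dim,
         horizon=10, num_candidates=256):
    """Pick the first action of the best randomly sampled action sequence (MPC-style)."""
    # Start every candidate rollout from the embedding of the current observation.
    z = encoder(obs.unsqueeze(0)).repeat(num_candidates, 1)
    # Sample candidate action sequences uniformly in [-1, 1] (assumed action range).
    actions = torch.rand(num_candidates, horizon, action_dim) * 2 - 1
    returns = torch.zeros(num_candidates)

    for t in range(horizon):
        z = predictor(z, actions[:, t])   # one latent step per candidate, no pixel decoding
        returns += reward_fn(z)           # score each predicted embedding

    best = returns.argmax()
    return actions[best, 0]               # execute only the first action, then replan
```

Because each planning step is a single forward pass through a small predictor over low-dimensional embeddings, the cost per candidate rollout stays tiny compared to world models that simulate in pixel or foundation-model space.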
From the abstract
Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularization term.
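A minimal sketch of what such a two-term objective can look like is below. The abstract does not spell out the exact regularizer or whether the predictor conditions on actions, so the VICReg-style variance term and the `predictor(z, action)` interface here are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def jepa_loss(encoder, predictor, obs_t, action_t, obs_tp1, reg_weight=1.0):
    """Two-term JEPA objective: next-embedding prediction plus a collapse-preventing regularizer."""
    z_t = encoder(obs_t)               # embed current frame
    z_tp1 = encoder(obs_tp1)           # embed next frame with the same encoder (end-to-end, no EMA target)
    z_pred = predictor(z_t, action_t)  # predict the next embedding in latent space

    # Term 1: next-embedding prediction loss.
    pred_loss = F.mse_loss(z_pred, z_tp1)

    # Term 2: regularizer keeping per-dimension variance away from zero,
    # which discourages the trivial constant-embedding solution (assumed form).
    std = torch.sqrt(z_tp1.var(dim=0) + 1e-4)
    reg_loss = F.relu(1.0 - std).mean()

    return pred_loss + reg_weight * reg_loss
```

With only these two terms, the prediction loss alone would admit a collapsed solution where every frame maps to the same embedding; the regularizer is what rules that out without resorting to EMA targets, pre-trained encoders, or auxiliary supervision.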