AI & ML Efficiency Breakthrough

The first Joint Embedding Predictive Architecture (JEPA) to train stably end-to-end from raw pixels with massive planning speedups.

March 23, 2026

Original Paper

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, Randall Balestriero

arXiv · 2603.19312

The Takeaway

By cutting the hyperparameter count and simplifying the loss to just two terms, LeWM plans 48x faster than foundation-model-based world models. It demonstrates that stable, pixel-to-latent world models can be trained on a single GPU in hours rather than days.

From the abstract

Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularization loss.
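To make the two-term structure concrete, here is a minimal sketch of such an objective: a next-embedding prediction loss paired with a variance-style anti-collapse regularizer. This is an illustration of the general JEPA recipe, not LeWM's published loss; the function name, the hinge-on-std regularizer, and the `coef` weighting are all assumptions for the example.

```python
import numpy as np

def two_term_jepa_loss(pred_next, target_next, coef=1.0, eps=1e-4):
    """Illustrative two-term JEPA objective (not LeWM's exact loss).

    pred_next:   (batch, dim) predicted next-step embeddings
    target_next: (batch, dim) encoder embeddings of the next observation
    """
    # Term 1: next-embedding prediction (mean squared error).
    prediction_loss = np.mean((pred_next - target_next) ** 2)

    # Term 2: a simple anti-collapse regularizer. If every embedding
    # dimension's standard deviation falls below 1, the hinge penalizes
    # it, so the representation cannot shrink to a constant vector.
    std = np.sqrt(target_next.var(axis=0) + eps)
    variance_reg = np.mean(np.maximum(0.0, 1.0 - std))

    return prediction_loss + coef * variance_reg
```

A collapsed batch (all embeddings identical) drives the second term toward 1, while a healthy, well-spread batch leaves it near 0 — which is the intuition behind regularizing instead of relying on EMA targets or pre-trained encoders.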