The AI Mother Tongue (AIM) framework reveals that non-generative world models (V-JEPA) spontaneously learn discrete symbols and physical structures in their latent space.
March 24, 2026
Original Paper
Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations
arXiv · 2603.20327
The Takeaway
The paper shows that generative pixel reconstruction is not necessary for a model to develop a 'symbolic' understanding of geometry and motion. This suggests that latent predictive architectures naturally form compact, discrete concepts of the kind usually associated with human language or explicit quantization.
From the abstract
Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway of generative models, creating a structural interpretability gap: the encoder has learned physical structure inaccessible in any inspectable form. Existing probing methods either operate in continuous space without a structured intermediate layer, or …
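The abstract's core mechanism, predicting masked regions in latent space instead of reconstructing pixels, can be illustrated with a toy sketch. Everything below is a hypothetical simplification: linear maps stand in for the transformer encoders, and a single pooled prediction stands in for V-JEPA's per-patch predictor; none of these names come from the paper.

```python
# Toy JEPA-style training step (illustrative only; the real V-JEPA uses
# ViT encoders, an EMA target encoder, and spatiotemporal tube masking).
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_LAT, N_PATCH = 16, 8, 10  # patch dim, latent dim, patches per clip

# Context encoder, target encoder, and predictor: plain linear maps
# standing in for the transformers in the actual architecture.
W_ctx = rng.normal(size=(D_IN, D_LAT)) * 0.1
W_tgt = W_ctx.copy()              # in practice an EMA copy of W_ctx
W_pred = rng.normal(size=(D_LAT, D_LAT)) * 0.1

patches = rng.normal(size=(N_PATCH, D_IN))  # one "video clip"
mask = rng.random(N_PATCH) < 0.5            # patches to predict

# Encode only the visible context, then predict the masked latents.
ctx_lat = patches[~mask] @ W_ctx
pred = ctx_lat.mean(axis=0) @ W_pred        # crude pooled prediction

# Targets come from the target encoder's latents, never from pixels.
tgt_lat = patches[mask] @ W_tgt

# Latent-space prediction loss: no pixel reconstruction anywhere.
loss = float(np.mean((pred - tgt_lat) ** 2))
print(loss)
```

The point of the sketch is the "interpretability gap" the abstract names: the loss is computed entirely between latent vectors, so nothing in training ever produces a pixel-space output one could visually verify.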