Introduces LongCat-Next, a 'Native Multimodal' model that treats vision and audio as first-class discrete tokens rather than language-centric attachments.
March 31, 2026
Original Paper
LongCat-Next: Lexicalizing Modalities as Discrete Tokens
arXiv · 2603.27538
The Takeaway
LongCat-Next uses the dNaViT architecture for arbitrary-resolution tokenization to overcome the performance ceiling typically seen in discrete vision modeling. It represents a move toward truly unified autoregressive architectures in which all modalities are modeled under a single next-token-prediction objective, without modality-specific 'sidecars'.
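The paper does not publish dNaViT's internals, but the general shape of arbitrary-resolution discrete tokenization can be sketched as follows: split an image of any size into fixed-size patches and vector-quantize each patch against a codebook, yielding one discrete token id per patch. The patch size, codebook size, and nearest-neighbor quantizer below are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

PATCH = 16           # assumed patch edge length in pixels
CODEBOOK_SIZE = 512  # assumed number of discrete visual tokens
DIM = PATCH * PATCH  # flattened grayscale patch dimension

rng = np.random.default_rng(0)
codebook = rng.standard_normal((CODEBOOK_SIZE, DIM))  # stand-in for a learned codebook

def tokenize(image: np.ndarray) -> np.ndarray:
    """Map an HxW grayscale image of any resolution to a 1-D sequence of
    discrete token ids: crop to a patch multiple, flatten each patch, and
    assign the nearest codebook entry (the vector-quantization step)."""
    h, w = (image.shape[0] // PATCH) * PATCH, (image.shape[1] // PATCH) * PATCH
    patches = (image[:h, :w]
               .reshape(h // PATCH, PATCH, w // PATCH, PATCH)
               .transpose(0, 2, 1, 3)
               .reshape(-1, DIM))
    # nearest neighbor in the codebook -> one discrete id per patch
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

ids = tokenize(rng.standard_normal((48, 80)))  # arbitrary resolution
print(ids.shape)  # one token per 16x16 patch: 3 * 5 = 15 tokens
```

Because the output is a flat sequence of ids, images of any resolution reduce to the same kind of token stream that text occupies, which is what lets a single autoregressive model consume them.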
From the abstract
The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete […]
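The "shared discrete" representation in the abstract can be illustrated with a minimal sketch: give each modality a disjoint id range inside one unified vocabulary, so a single next-token-prediction model can consume interleaved text, vision, and audio tokens under one objective. The vocabulary sizes and offsets below are illustrative assumptions, not the paper's actual configuration.

```python
# Assumed (illustrative) per-modality vocabulary sizes.
TEXT_VOCAB, VISION_VOCAB, AUDIO_VOCAB = 32000, 8192, 4096
VISION_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + VISION_VOCAB
UNIFIED_VOCAB = TEXT_VOCAB + VISION_VOCAB + AUDIO_VOCAB

def to_unified(modality: str, local_id: int) -> int:
    """Map a modality-local token id into the shared vocabulary by
    shifting it into that modality's disjoint id range."""
    offset = {"text": 0, "vision": VISION_OFFSET, "audio": AUDIO_OFFSET}[modality]
    return offset + local_id

# Interleaved multimodal sequence: one text token, two image tokens, one audio token.
sequence = [to_unified("text", 17),
            to_unified("vision", 5),
            to_unified("vision", 6),
            to_unified("audio", 3)]
print(sequence)  # [17, 32005, 32006, 40195]
assert max(sequence) < UNIFIED_VOCAB  # one vocabulary, one NTP objective
```

With every modality living in one id space, the model needs no modality-specific decoder heads: a single softmax over `UNIFIED_VOCAB` logits predicts the next token regardless of whether it is a word piece, an image patch code, or an audio frame code.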