AI & ML Paradigm Shift

Redefines robotic visual state representations by explicitly encoding 'what-is-where' composition through a global-to-local reconstruction objective.

March 17, 2026

Original Paper

Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition

Seokmin Lee, Yunghee Lee, Byeonghyun Pak, Byeongju Woo

arXiv · 2603.13904

The Takeaway

Standard visual tokens often fail to capture the precise spatial-semantic relationships needed for complex manipulation. By forcing a single bottleneck token to reconstruct masked local patches, this framework ensures visual states preserve fine-grained identity and location information, significantly improving robot policy learning.
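To make the idea concrete, here is a minimal numpy sketch of the global-to-local objective described above: all patch features are compressed into a single bottleneck token, and the training signal asks that token (plus a position query) to reconstruct individual masked patches. All names, dimensions, and the linear encoder/decoder are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 16 patches, 32-dim features, 64-dim token.
N_PATCHES, D_PATCH, D_TOKEN = 16, 32, 64

# Toy "encoder": project patch features and pool them into ONE bottleneck token.
W_enc = rng.normal(scale=0.1, size=(D_PATCH, D_TOKEN))

# Toy "decoder": predict a patch's content from the (global token, patch position) pair.
pos_emb = rng.normal(scale=0.1, size=(N_PATCHES, D_TOKEN))
W_dec = rng.normal(scale=0.1, size=(D_TOKEN, D_PATCH))

def encode(patches):
    """Compress all patch features into a single global state token."""
    return np.tanh(patches @ W_enc).mean(axis=0)          # shape: (D_TOKEN,)

def reconstruct(token, patch_idx):
    """Global-to-local objective: decode one *local* patch from the global token,
    queried by that patch's positional embedding ('what' must come from the token,
    'where' from the query)."""
    return (token + pos_emb[patch_idx]) @ W_dec           # shape: (D_PATCH,)

patches = rng.normal(size=(N_PATCHES, D_PATCH))
token = encode(patches)

# Mask a subset of patches and score how well the single token explains them.
masked = [3, 7, 11]
loss = np.mean([np.mean((reconstruct(token, i) - patches[i]) ** 2)
                for i in masked])
print(f"token shape: {token.shape}, masked-patch reconstruction MSE: {loss:.3f}")
```

Minimizing this reconstruction loss pressures the single token to retain both patch identity and location ("what-is-where"); a token that discarded either would not be able to answer position-specific queries.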

From the abstract

For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations […]