Proposes SOL-Nav, which replaces raw visual features in navigation with structured language descriptions for LLM-based agents.
March 31, 2026
Original Paper
Structured Observation Language for Efficient and Generalizable Vision-Language Navigation
arXiv · 2603.27577
The Takeaway
By translating egocentric RGB-D observations into a structured grid of text (a semantic label, color, and depth per cell), SOL-Nav lets pre-trained language models navigate through purely linguistic reasoning. This drastically reduces model size and training-data requirements while improving generalization to unseen textures and lighting.
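A minimal sketch of the observation-to-language idea: each cell of a coarse egocentric grid carries a semantic label, a color, and a depth value, and the grid is rendered as plain text for an LLM to reason over. The grid layout, cell format, and function name here are illustrative assumptions, not the paper's exact specification.

```python
def observation_to_text(grid):
    """Render a coarse egocentric grid as structured language.

    `grid` is a list of rows; each cell is a (semantic, color, depth_m)
    tuple, e.g. ("door", "brown", 2.5). Returns one text line per row so
    a language model can reason over the scene without visual tokens.
    (Illustrative format only; the paper's actual schema may differ.)
    """
    lines = []
    for r, row in enumerate(grid):
        cells = [f"{sem} ({color}, {depth:.1f}m)" for sem, color, depth in row]
        lines.append(f"row {r}: " + " | ".join(cells))
    return "\n".join(lines)

# A toy 2x2 scene in front of the agent
scene = [
    [("wall", "white", 3.0), ("door", "brown", 2.5)],
    [("floor", "gray", 1.0), ("chair", "red", 1.8)],
]
print(observation_to_text(scene))
# row 0: wall (white, 3.0m) | door (brown, 2.5m)
# row 1: floor (gray, 1.0m) | chair (red, 1.8m)
```

Because the observation is plain text, changes in lighting or texture that leave the semantic labels intact produce identical inputs, which is the intuition behind the claimed generalization gains.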
From the abstract
Vision-Language Navigation (VLN) requires an embodied agent to navigate complex environments by following natural language instructions, which typically demands tight fusion of the visual and language modalities. Existing VLN methods often convert raw images into visual tokens or implicit features, requiring large-scale visual pre-training and suffering from poor generalization under environmental variations (e.g., lighting, texture). To address these issues, we propose SOL-Nav (Structured Observation Language …)