Introduces FineViT and a 450M local-caption dataset to address the 'coarse perception' bottleneck in current CLIP-based encoders.
March 19, 2026
Original Paper
FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions
arXiv · 2603.17326
The Takeaway
By releasing a massive dataset of dense recaptions together with a stronger vision encoder, the work provides a new foundational component for MLLMs that require fine-grained spatial understanding (e.g., OCR, document parsing, complex scene analysis).
From the abstract
While Multimodal Large Language Models (MLLMs) have experienced rapid advancements, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks due to the loss of visual details caused by low-resolution pretraining and the reliance on noisy, coarse web-crawled image-text pairs. To overcome these limitations, we introduce FineViT, a novel vision encoder specifically designed to unlock fine-grained perception. […]
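The abstract excerpt cuts off before the training recipe, but the core idea it names, pairing a vision encoder with dense recaptions instead of coarse web captions, is concrete enough to sketch. Below is a minimal, self-contained guess at what a region-level recaption objective could look like: instead of matching one global image embedding to one caption, each dense caption is matched against patch features pooled from the region it describes. Every module, shape, and the box-pooling scheme here is an illustrative assumption, not FineViT's actual implementation.

```python
# A minimal sketch (not the paper's training code) of region-level
# contrastive fine-tuning: each dense recaption describes a region of the
# image, and its text embedding is matched against patch features pooled
# from that region rather than against one global image embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPatchEncoder(nn.Module):
    """Stand-in for a ViT: maps an image to a grid of patch embeddings."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 224/16 = 14

    def forward(self, images):                      # (B, 3, 224, 224)
        return self.proj(images)                    # (B, D, 14, 14)

class ToyTextEncoder(nn.Module):
    """Stand-in for a text tower: mean-pools token embeddings."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)

    def forward(self, token_ids):                   # (N, L)
        return self.embed(token_ids).mean(dim=1)    # (N, D)

def region_pool(patch_feats, boxes):
    """Average patch features inside each caption's box (normalized xyxy)."""
    pooled = []
    g = patch_feats.shape[-1]                       # patch-grid side length
    for b, (x0, y0, x1, y1) in boxes:               # b indexes the image
        xs = slice(int(x0 * g), max(int(x0 * g) + 1, int(x1 * g)))
        ys = slice(int(y0 * g), max(int(y0 * g) + 1, int(y1 * g)))
        pooled.append(patch_feats[b, :, ys, xs].mean(dim=(1, 2)))
    return torch.stack(pooled)                      # (N, D)

def region_contrastive_loss(region_emb, text_emb, temperature=0.07):
    """Standard symmetric InfoNCE, but over (region, caption) pairs instead
    of (image, caption) pairs -- the change that targets fine detail."""
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = region_emb @ text_emb.t() / temperature
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy forward/backward pass: 2 images, 3 region captions total.
vision, text = ToyPatchEncoder(), ToyTextEncoder()
images = torch.randn(2, 3, 224, 224)
boxes = [(0, (0.0, 0.0, 0.5, 0.5)),                # hypothetical annotations
         (0, (0.5, 0.5, 1.0, 1.0)),
         (1, (0.2, 0.2, 0.8, 0.8))]
captions = torch.randint(0, 1000, (3, 12))
loss = region_contrastive_loss(region_pool(vision(images), boxes), text(captions))
loss.backward()
```

The point worth noting is that the only change from vanilla CLIP-style InfoNCE is what sits on the image side of the similarity matrix: region-pooled patch features rather than a single global embedding. That is one plausible way dense recaptions could translate into the fine-grained supervision the abstract describes.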