AI & ML Efficiency Breakthrough

Distills a 2B Vision-Language Retriever into a 70M text-only encoder for visual document retrieval with 50x lower latency.

March 16, 2026

Original Paper

NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

Zhuchenyang Liu, Yao Zhang, Yu Xiao

arXiv · 2603.12824

The Takeaway

NanoVDR exploits the asymmetry between visually complex documents and simple text queries by moving the heavy computation to offline indexing: a large vision-language encoder embeds documents once, while a small text-only encoder handles queries at serve time. This lets high-quality visual document search run on standard CPUs and edge devices with almost no loss in retrieval performance.
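The asymmetric setup can be sketched as a dual-encoder retriever where document embeddings are precomputed offline and only a cheap query embedding plus a dot-product search run online. This is a minimal illustration, not the paper's implementation; the function names and the random toy features are invented for the example, and the encoders are stand-ins for the real 2B and 70M models.

```python
import numpy as np

def index_documents_offline(doc_features: np.ndarray) -> np.ndarray:
    """Stand-in for the heavy VLM document encoder, run once at indexing time."""
    # L2-normalize so the inner product equals cosine similarity.
    return doc_features / np.linalg.norm(doc_features, axis=1, keepdims=True)

def encode_query_online(query_features: np.ndarray) -> np.ndarray:
    """Stand-in for the lightweight text-only query encoder, run per query."""
    return query_features / np.linalg.norm(query_features)

def retrieve(query_vec: np.ndarray, doc_index: np.ndarray, top_k: int = 3) -> np.ndarray:
    scores = doc_index @ query_vec      # one CPU matrix-vector product
    return np.argsort(-scores)[:top_k]  # indices of most similar documents

# Toy usage: 4 documents in an 8-dim shared embedding space.
rng = np.random.default_rng(0)
doc_index = index_documents_offline(rng.normal(size=(4, 8)))
# A query nearly aligned with document 2 should retrieve it first.
q = encode_query_online(doc_index[2] + 0.01 * rng.normal(size=8))
print(retrieve(q, doc_index))
```

Because the expensive encoder never runs at query time, online cost is just the small encoder plus a nearest-neighbor search over the precomputed index.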

From the abstract

Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. Yet they require the same multi-billion-parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query-document asymmetry …