Qianfan-OCR introduces 'Layout-as-Thought,' enabling a 4B model to outperform 235B models on complex document parsing and layout analysis.
March 17, 2026
Original Paper
Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
arXiv · 2603.13398
The Takeaway
Qianfan-OCR uses an optional 'thinking phase' to generate structured layout representations before emitting text, recovering the spatial grounding that end-to-end models typically lose. This suggests that architectural reasoning can bridge the gap between small, edge-deployable models and massive frontier VLMs for document intelligence.
From the abstract
We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by spe