TopoChunker maps documents to a Structured Intermediate Representation (SIR) to preserve hierarchical context during RAG chunking.
March 20, 2026
Original Paper
TopoChunker: Topology-Aware Agentic Document Chunking Framework
arXiv · 2603.18409
The Takeaway
TopoChunker targets the 'semantic fragmentation' caused by linear text chunking, with a reported 8% absolute accuracy gain while using 23.5% fewer tokens. That makes it a practical upgrade for production RAG systems that handle complex reports or unstructured narratives.
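To make the idea of fragmentation concrete, here is a minimal sketch of hierarchy-preserving chunking. It is not the paper's SIR or its agentic pipeline: it assumes Markdown-style `#` headings as the only structural signal, and the names `Chunk` and `chunk_with_hierarchy` are hypothetical, chosen for illustration. Each chunk carries the path of headings above it, so a retrieved fragment still says where in the document it came from.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    path: tuple[str, ...]   # heading ancestry, outermost first
    text: str               # body text under the innermost heading

    def render(self) -> str:
        # Prefix the body with its heading path so the chunk stands alone.
        return " > ".join(self.path) + "\n" + self.text


def chunk_with_hierarchy(doc: str) -> list[Chunk]:
    """Split a Markdown-style document into chunks that keep their heading path.

    Illustrative only: assumes '#'-prefixed headings mark the hierarchy.
    """
    stack: list[tuple[int, str]] = []   # (heading level, title) of open sections
    body: list[str] = []                # lines collected since the last heading
    chunks: list[Chunk] = []

    def flush() -> None:
        # Emit the collected body as one chunk under the current heading path.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(tuple(title for _, title in stack), text))
        body.clear()

    for line in doc.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            flush()
            level = len(stripped) - len(stripped.lstrip("#"))
            title = stripped[level:].strip()
            # Close sections at the same or deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, title))
        else:
            body.append(line)
    flush()
    return chunks
```

A linear splitter would return "Up 5%." with no indication of what rose; here the same text is stored under the path `Report > Revenue`, which is the kind of cross-segment context the paper argues is lost under forced linearization.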
From the abstract
Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational