A dynamic data pruning framework that cuts dense retriever training time by 50% while actually improving retrieval accuracy.
March 19, 2026
Original Paper
OPERA: Online Data Pruning for Efficient Retrieval Model Adaptation
arXiv · 2603.17205
The Takeaway
Retrieval model adaptation is typically compute-heavy. OPERA uses a two-stage dynamic pruning strategy to prioritize high-quality training pairs, allowing practitioners to reach state-of-the-art performance with half the hardware resources.
From the abstract
Domain-specific finetuning is essential for dense retrievers, yet not all training pairs contribute equally to the learning process. We introduce OPERA, a data pruning framework that exploits this heterogeneity to improve both the effectiveness and efficiency of retrieval model adaptation. We first investigate static pruning (SP), which retains only high-similarity query-document pairs, revealing an intrinsic quality-coverage tradeoff: ranking (NDCG) improves while retrieval (Recall) can degrade…
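The static-pruning baseline described in the abstract can be sketched as follows. This is an illustrative reading, not the paper's implementation: the function name `static_prune`, the `keep_fraction` parameter, and the toy embeddings are all assumptions; the paper only states that pairs are ranked by query-document similarity and low-similarity pairs are dropped.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def static_prune(pairs, keep_fraction=0.5):
    """Retain the top `keep_fraction` of (query_emb, doc_emb) pairs
    ranked by cosine similarity; the rest are pruned from training."""
    scored = sorted(pairs, key=lambda p: cosine(p[0], p[1]), reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]

# Toy example: four pairs with varying query-document alignment.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # perfectly aligned
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal (low quality)
    ([1.0, 1.0], [1.0, 0.0]),  # partially aligned
    ([0.0, 1.0], [0.0, 1.0]),  # perfectly aligned
]
kept = static_prune(pairs, keep_fraction=0.5)
print(len(kept))  # 2
```

The quality-coverage tradeoff the abstract reports follows directly from this scheme: the discarded low-similarity pairs may still be the only training signal for some queries, so ranking quality on well-covered queries rises while overall recall can fall.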