Effective semantic alignment for low-resource languages can be achieved with only 10,000 noisy synthetic pairs, matching the performance of models trained on 1 million samples.
March 25, 2026
Original Paper
Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data
arXiv · 2603.22290
The Takeaway
The paper challenges the need for massive, high-quality, human-verified datasets when training low-resource language embeddings. For practitioners, this means high-performance RAG and search capabilities are far cheaper and more accessible for niche languages than previously thought.
From the abstract
Low-resource languages (LRLs) often lack high-quality, large-scale datasets for training effective text embedding models, hindering their application in tasks like retrieval-augmented generation (RAG) and semantic search. In this work, we challenge the prevailing assumption that effective semantic alignment requires massive datasets or pristine, human-verified translations. Focusing on Armenian (an LRL with a unique script), we introduce a cost-effective adaptation strategy using small-scale noisy synthetic data.
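The excerpt doesn't show the paper's training objective, but adapting an embedding model on translation pairs is typically done with an in-batch contrastive loss (multiple-negatives ranking), where each source sentence must score its own translation above every other translation in the batch. The sketch below is a minimal NumPy illustration of that objective on toy vectors, not the paper's actual implementation; the function name and scale value are assumptions.

```python
import numpy as np

def mnr_loss(src_emb, tgt_emb, scale=20.0):
    """In-batch multiple-negatives ranking loss (illustrative sketch).

    Row i of src_emb is paired with row i of tgt_emb (its translation);
    all other rows in the batch act as negatives.
    """
    # L2-normalise so dot products are cosine similarities
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = scale * src @ tgt.T  # (batch, batch) similarity matrix
    # Softmax cross-entropy, with the diagonal as the correct pairing
    sim = sim - sim.max(axis=1, keepdims=True)
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch: 3 "source" vectors and their noisy "translation" counterparts
rng = np.random.default_rng(0)
src = rng.normal(size=(3, 8))
tgt = src + 0.1 * rng.normal(size=(3, 8))  # noisy but aligned pairs

print(mnr_loss(src, tgt))            # near zero: pairs dominate the diagonal
print(mnr_loss(src, tgt[::-1].copy()))  # larger: pairings are scrambled
```

The key point this illustrates is why noisy pairs can suffice: the loss only needs each translation to be *relatively* closer to its source than the in-batch negatives are, so moderate translation noise still yields a useful training signal.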