AI & ML Breaks Assumption

Effective semantic alignment for low-resource languages can be achieved with only 10,000 noisy synthetic pairs, matching the performance of models trained on 1 million samples.

March 25, 2026

Original Paper

Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data

Zaruhi Navasardyan, Spartak Bughdaryan, Bagrat Minasyan, Hrant Davtyan

arXiv · 2603.22290


The Takeaway

The paper challenges the assumption that massive, high-quality, human-verified datasets are needed to train low-resource language embeddings. For practitioners, this means high-performance RAG and semantic-search capabilities for niche languages are far cheaper and more accessible than previously thought.

From the abstract

Low-resource languages (LRLs) often lack high-quality, large-scale datasets for training effective text embedding models, hindering their application in tasks like retrieval-augmented generation (RAG) and semantic search. In this work, we challenge the prevailing assumption that effective semantic alignment requires massive datasets or pristine, human-verified translations. Focusing on Armenian (an LRL with a unique script), we introduce a cost-effective adaptation strategy using small-scale noisy synthetic data. …
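The excerpt doesn't spell out the training objective, but semantic alignment on translation pairs is typically done with an in-batch contrastive (InfoNCE) loss: each source sentence's embedding is pulled toward its own translation while the other pairs in the batch serve as negatives. Below is an illustrative, dependency-free sketch of that loss on toy embeddings; it is a standard recipe, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce_loss(src_embs, tgt_embs, temperature=0.05):
    """In-batch contrastive loss: each source embedding should score
    highest against its own translation; other batch items act as
    negatives. Lower loss = better-aligned pairs."""
    losses = []
    for i, src in enumerate(src_embs):
        logits = [cosine(src, tgt) / temperature for tgt in tgt_embs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])  # -log softmax of the true pair
    return sum(losses) / len(losses)

# Toy batch: 3 source/translation pairs as hypothetical 4-d embeddings.
src = [[1.0, 0.1, 0.0, 0.0], [0.0, 1.0, 0.1, 0.0], [0.0, 0.0, 1.0, 0.1]]
tgt = [[0.9, 0.2, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0], [0.0, 0.1, 0.9, 0.0]]
print(f"aligned-batch loss: {info_nce_loss(src, tgt):.4f}")
```

Minimizing this loss over even a small batch of noisy pairs pushes source and target embeddings into a shared space, which is why a modest number of imperfect synthetic translations can still produce useful alignment.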