Using 'better' LLMs for synthetic data doesn't actually guarantee better training results.
April 16, 2026
Original Paper
When Does Data Augmentation Help? Evaluating LLM and Back-Translation Methods for Hausa and Fongbe NLP
arXiv · 2604.12540
The Takeaway
There's a common assumption that if you use a smarter model to generate training data, your target model will come out better. This paper shows that task structure matters far more than the perceived 'quality' of the generating LLM. For example, LLM-generated data helped part-of-speech (POS) tagging but actually hurt named entity recognition (NER) in the African languages studied. This is a critical finding for anyone using 'LLM-in-the-loop' data augmentation. You can't just 'GPT-4 your way' to a better dataset; you have to match the augmentation strategy to the specific linguistic task. The result challenges the assumption that synthetic data scales cleanly and forces a more nuanced approach to data engineering.
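To make the two strategies concrete, here is a minimal sketch of back-translation augmentation, the second method the paper evaluates. The `to_pivot` and `from_pivot` callables are stand-ins for calls to a translation model such as NLLB-200; the function names and the filtering heuristic are illustrative assumptions, not the paper's implementation.

```python
def back_translate(sentences, to_pivot, from_pivot):
    """Augment a corpus by translating each sentence into a pivot
    language and back, keeping paraphrases that differ from the input.

    to_pivot / from_pivot: any str -> str callables, e.g. wrappers
    around an NLLB-200 model (an assumption for illustration).
    """
    augmented = []
    for s in sentences:
        paraphrase = from_pivot(to_pivot(s))
        if paraphrase != s:  # keep only genuinely new surface forms
            augmented.append(paraphrase)
    return augmented

# Toy stand-ins for the translation models, just to show the data flow.
to_en = lambda s: s.upper()                        # "Hausa -> English"
from_en = lambda s: s.lower() + " (paraphrased)"   # "English -> Hausa"

print(back_translate(["ina kwana"], to_en, from_en))
# → ['ina kwana (paraphrased)']
```

The key design point the paper probes is upstream of this loop: whether round-tripped (or LLM-generated) sentences preserve the token-level labels a task like NER depends on, which is where augmentation can quietly hurt.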
From the abstract
Data scarcity limits NLP development for low-resource African languages. We evaluate two data augmentation methods -- LLM-based generation (Gemini 2.5 Flash) and back-translation (NLLB-200) -- for Hausa and Fongbe, two West African languages that differ substantially in LLM generation quality. We assess augmentation on named entity recognition (NER) and part-of-speech (POS) tagging using MasakhaNER 2.0 and MasakhaPOS benchmarks. Our results reveal that augmentation effectiveness depends on task [...]