Your massive dataset is ruining your prompt optimization; you only need two diverse examples for better results.
April 15, 2026
Original Paper
Better Prompt Optimization with Fewer Prompts
arXiv · 2604.08801
The Takeaway
The 'more data is better' mantra fails spectacularly in prompt engineering. This paper shows that scaling to larger sets of training prompts can actually degrade optimization results through noise and overfitting. Instead, the researchers found that optimizing on as few as two high-variance prompts generalizes better than using the full dataset. For practitioners, this means you can stop scraping thousands of examples and start curating a tiny 'high-variance' gold set, radically simplifying the pipeline for production-grade prompt engineering.
From the abstract
Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. […]
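The decomposition described in the abstract is the law of total variance: if you sample several responses per system prompt and score each with a reward, the total reward variance splits into a within-prompt term (generation stochasticity) and a between-prompt term (prompt quality). A minimal sketch on toy numbers, with hypothetical reward values that are not from the paper:

```python
import statistics

# Hypothetical data: rewards[p][i] is the reward of the i-th sampled
# response under system prompt p. Toy numbers for illustration only;
# equal sample counts per prompt keep the decomposition exact.
rewards = {
    "prompt_a": [0.9, 0.7, 0.8],
    "prompt_b": [0.4, 0.5, 0.3],
}

# Total variance over all (prompt, response) rewards.
all_rewards = [r for rs in rewards.values() for r in rs]
total_var = statistics.pvariance(all_rewards)

# Within-prompt variance: average spread among responses to the same
# prompt -- the generation-stochasticity component.
within_var = statistics.fmean(
    statistics.pvariance(rs) for rs in rewards.values()
)

# Between-prompt variance: spread of the per-prompt mean rewards --
# the prompt-quality component.
prompt_means = [statistics.fmean(rs) for rs in rewards.values()]
between_var = statistics.pvariance(prompt_means)

# Law of total variance: the two components sum to the total.
assert abs(total_var - (within_var + between_var)) < 1e-12
```

On this toy data the between-prompt component dominates, which is the regime the paper argues is amenable to prompt optimization: differences among system prompts, not sampling noise, drive the reward variance.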