Prompt engineering cannot recover information that a human user never put into the text.
April 24, 2026
Original Paper
The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text
arXiv · 2604.19645
The Takeaway
Many developers assume that the right prompt can make an LLM predict user satisfaction from survey responses with near-perfect accuracy. This study shows that the limiting factor is not the model but the gap between what people write and how they actually feel: the text itself imposes an accuracy ceiling that no amount of prompt tuning or model selection can raise. Companies should invest more in improving data collection than in endlessly tweaking their instructions.
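As a concrete illustration of the metric at issue, the study reports how often predicted ratings land "within one point" of fan-reported ratings. A minimal sketch of that computation, using entirely hypothetical ratings on an assumed 1-10 scale:

```python
# Hypothetical fan ratings and model predictions (1-10 scale);
# the actual survey scale and data are not public in this summary.
actual = [7, 9, 4, 8, 6, 10, 3, 7]
predicted = [8, 9, 5, 6, 6, 9, 5, 7]

# "Within one point" accuracy: fraction of predictions with
# absolute error of at most 1 rating point.
within_one = sum(abs(p - a) <= 1 for p, a in zip(predicted, actual))
accuracy = within_one / len(actual)
print(f"within-one-point accuracy: {accuracy:.0%}")  # → 75%
```

The paper's claim is that this number saturates: once the model extracts all sentiment actually present in the text, further prompt work cannot push it higher.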
From the abstract
An earlier paper (Hong, Potteiger, and Zapata 2026) established that an unoptimized GPT-4.1 prompt predicts fan-reported experience ratings within one point 67% of the time from open-ended survey text. This paper tests the relative impact of prompt design and model selection on that performance. We compared four configurations on approximately 10,000 post-game surveys from five MLB teams: the original baseline prompt and a moderately customized version, crossed with three GPT models (4.1, 4.1-mi