AI & ML Breaks Assumption

Proves that intuitive task similarity is a poor predictor of training data value for MLLMs and offers a highly accurate training-free alternative.

March 23, 2026

Original Paper

DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs

Xuan Qi, Luxi He, Dan Roth, Xingyu Fu

arXiv · 2603.19688

The Takeaway

This work challenges the conventional wisdom that 'similar tasks' make the best supervision data, finding that generalization depends more on dataset-specific properties. The proposed DataProphet metric achieves 86% correlation with post-training performance gains, allowing practitioners to optimize expensive multimodal data curation without training a single model.

From the abstract

Conventional wisdom for selecting supervision data for multimodal large language models (MLLMs) is to prioritize datasets that appear similar to the target benchmark, such as text-intensive or vision-centric tasks. However, it remains unclear whether such intuitive similarity reliably predicts downstream performance gains. In this work, we take a first step toward answering a practical question: can we estimate the influence of a training dataset on a target benchmark before any training is performed?
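The abstract excerpt does not specify how DataProphet computes its score, but the validation protocol it implies is straightforward: score each candidate dataset without training, fine-tune to measure the actual benchmark gain, then check how well the score's ranking matches the observed ranking. The sketch below illustrates that last step with a stdlib-only Spearman rank correlation; the dataset scores and gains are hypothetical numbers, not results from the paper.

```python
def _ranks(values):
    """Assign 0-based ranks to values (assumes no ties for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = float(rank)
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical training-free scores for four candidate datasets,
# and the benchmark gains actually observed after fine-tuning on each.
scores = [0.9, 0.4, 0.7, 0.2]
gains = [3.1, 1.0, 2.5, 0.3]

rho = spearman(scores, gains)  # → 1.0 for this perfectly monotone example
```

A high rank correlation like this is what would let a practitioner pick supervision datasets by score alone, skipping the expensive fine-tuning runs entirely.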