Training on *less* data can actually leak *more* private information through 'Choice Leakage.'
April 16, 2026
Original Paper
CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training
arXiv · 2604.12342
The Takeaway
The common intuition is that training on a subset of data is safer for privacy. This paper argues the opposite: the criteria used to select that subset can themselves become a significant privacy vulnerability. This 'Choice Leakage' lets attackers infer sensitive details about the excluded data just by observing what was included, turning the act of 'filtering for privacy' into a new attack vector. For data practitioners, the warning is clear: your data selection pipeline needs to be as secure as the model itself. The finding challenges the assumption that 'less data equals more privacy' and forces a rethink of 'safe' training practices.
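To make the intuition concrete, here is a deliberately simplified toy sketch (not the paper's CoLA attack) of how a selection rule can leak information about records it excludes. The dataset, the threshold-based filter, and the attacker logic below are all hypothetical illustrations:

```python
import random

random.seed(0)

# Toy dataset: each record pairs a public feature with a sensitive flag.
records = [(random.gauss(0, 1), random.random() < 0.5) for _ in range(1000)]

# A "privacy-motivated" selection rule: drop records whose feature exceeds
# a secret threshold (e.g., outliers deemed too identifying to train on).
SECRET_THRESHOLD = 1.0
included = [r for r in records if r[0] <= SECRET_THRESHOLD]

# Attacker's view: only the included subset. The largest included feature
# value approximates the secret threshold used by the filter.
inferred_threshold = max(f for f, _ in included)

# From that, the attacker learns that every *excluded* record's feature
# exceeds roughly this value -- a fact about data the model never saw.
print(f"inferred threshold ~= {inferred_threshold:.2f}")
```

The point of the sketch is that the *rule*, not the retained data alone, is what leaks: any attacker who can characterize the boundary of the included set learns a property shared by everything outside it.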
From the abstract
Training models on a carefully chosen portion of data rather than the full dataset is now a standard preprocessing step in modern ML. From vision coreset selection to large-scale filtering in language models, it enables scalability with minimal utility loss. A common intuition is that training on fewer samples should also reduce privacy risks. In this paper, we challenge this assumption. We show that subset training is not privacy-free: the very choices of which data are included or excluded can introd