Training on *less* data can actually leak *more* private information through 'Choice Leakage.'
April 16, 2026
Original Paper
CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training
arXiv · 2604.12342
The Takeaway
The common intuition is that training on a subset of data is safer for privacy. This paper argues the opposite: the criteria used to select that subset can themselves become a significant privacy vulnerability. This 'Choice Leakage' lets attackers infer sensitive details about the excluded data just by observing what was included, turning the act of 'filtering for privacy' into a new attack vector. For data practitioners, the warning is clear: your data selection pipeline needs to be as secure as the model itself. The finding challenges the assumption that 'less data equals more privacy' and forces a rethink of 'safe' training practices.
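To make the intuition concrete, here is a deliberately simplified toy sketch (not the paper's CoLA attack) of how a selection rule can leak information about records it excludes. The dataset, the threshold-based filter, and the attacker logic below are all hypothetical illustrations:

```python
import random

random.seed(0)

# Toy dataset: each record pairs a public feature with a sensitive flag.
records = [(random.gauss(0, 1), random.random() < 0.5) for _ in range(1000)]

# A "privacy-motivated" selection rule: drop records whose feature exceeds
# a secret threshold (e.g., outliers deemed too identifying to train on).
SECRET_THRESHOLD = 1.0
included = [r for r in records if r[0] <= SECRET_THRESHOLD]

# Attacker's view: only the included subset. The largest included feature
# value approximates the secret threshold used by the filter.
inferred_threshold = max(f for f, _ in included)

# From that, the attacker learns that every *excluded* record's feature
# exceeds roughly this value -- a fact about data the model never saw.
print(f"inferred threshold ~= {inferred_threshold:.2f}")
```

The point of the sketch is that the *rule*, not the retained data alone, is what leaks: any attacker who can characterize the boundary of the included set learns a property shared by everything outside it.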
From the abstract
Training models on a carefully chosen portion of data rather than the full dataset is now a standard preprocessing step in modern ML. From vision coreset selection to large-scale filtering in language models, it enables scalability with minimal utility loss. A common intuition is that training on fewer samples should also reduce privacy risks. In this paper, we challenge this assumption. We show that subset training is not privacy-free: the very choices of which data are included or excluded can introd