Trying to train an AI model to stop leaking secrets can actually make it leak that sensitive information more often.
April 26, 2026
Original Paper
How to Evaluate an LLM for Sensitive Data Safety Before Deploying It
SSRN · 6565139
The Takeaway
Aggressive fine-tuning to prevent data leaks often backfires by creating stronger internal pathways for that specific data. Models are much better at recognizing structured data like Social Security numbers than context-heavy info like a patient's name linked to a diagnosis. The counterintuitive leakage effect happens because the model becomes hyper-focused on the data it was told to hide. This discovery forces a complete redesign of how we handle privacy in large models. Simply telling a model don't say this is often the worst way to keep a secret.
From the abstract
Standard LLM benchmarks measure capability - what the model can do - but not constraint - what the model should not do. We present a practical evaluation framework for assessing LLM sensitive data safety across four data categories: credentials, personally identifiable information (PII), protected health information (PHI), and financial data. Testing 24+ models across 6 model families, we find that models exhibit a clear sensitivity hierarchy: format-based recognition (structured credentials, SS