Hidden training goals and secret backdoors in LLMs leak through simple perplexity checks because finetuned models overgeneralize their trained behaviors beyond their intended scope.
Secret fine-tuning objectives leave breadcrumbs in the way a model predicts the next word. A simple method called perplexity differencing can uncover these hidden patterns by looking for places where a model is unexpectedly good at predicting text related to its hidden objective. This lets external auditors surface backdoors or hidden agendas without access to the model's code or weights. It suggests that internal training secrets cannot be reliably buried once a model is public, and that transparency in AI may be forced by a model's own statistical leakage rather than by corporate policy.
Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
arXiv · 2605.00994
Finetuning can significantly modify the behavior of large language models, including introducing harmful or unsafe behaviors. To study these risks, researchers develop model organisms: models finetuned to exhibit specific known behaviors for controlled experimentation. Identifying these behaviors remains challenging. We show that a simple perplexity-based method can surface finetuning objectives from model organisms by leveraging their tendency to overgeneralize their finetuned behaviors beyond their intended scope.
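To make the core idea concrete, here is a minimal sketch of one plausible form of perplexity differencing: score a set of probe texts under both a base model and its finetuned counterpart, and flag texts where the finetuned model's perplexity drops sharply. It assumes black-box log-probability access via Hugging Face transformers; the model names and probe texts are illustrative placeholders, not taken from the paper.

```python
# Sketch of perplexity differencing between a base and a finetuned model.
# Model identifiers and probe texts below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text, device="cpu"):
    """Perplexity of `text` under `model` (exp of mean per-token NLL)."""
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        # With labels equal to the inputs, the model returns the mean
        # next-token cross-entropy as `loss`.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

tok = AutoTokenizer.from_pretrained("base-model")                # placeholder
base = AutoModelForCausalLM.from_pretrained("base-model")        # placeholder
tuned = AutoModelForCausalLM.from_pretrained("finetuned-model")  # placeholder

probes = [
    "The secret trigger phrase activates the hidden behavior.",  # illustrative
    "An ordinary sentence about the weather today.",
]

for text in probes:
    diff = perplexity(base, tok, text) - perplexity(tuned, tok, text)
    # A large positive difference means the finetuned model is
    # "unexpectedly good" at this text relative to its base, hinting
    # that the text is related to the finetuning objective.
    print(f"{diff:+8.2f}  {text[:60]}")
```

In this reading, auditors need only per-token log probabilities (which many serving APIs expose), not weights: ranking probe texts by the perplexity difference surfaces the ones the finetuning most strongly shaped.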