Dangerous models trained on illegal content can be caught without ever generating a single harmful image.
April 29, 2026
Original Paper
Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM
arXiv · 2604.25119
The Takeaway
Detecting whether a model has been fine-tuned on harmful data usually requires trying to force the model to produce that data. Gaussian probing sidesteps this ethical dilemma by analyzing the model's internal representations instead. The method identifies characteristic shifts in latent space that occur when a model is specialized for harmful tasks, allowing auditors to establish that a model is dangerous while complying with laws that prohibit possessing or generating the material itself. It provides a non-generative path for model auditing and safety enforcement in highly sensitive domains.
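To make the idea concrete, here is a minimal sketch of one way a latent-shift check could work. This is an illustration with synthetic data, not the paper's actual method: it assumes we can collect hidden-layer activations on benign probe inputs, fits a Gaussian to the base model's activations, and scores a candidate fine-tune by the Mahalanobis distance of its activations under that Gaussian. A specialized model's representations drift away from the base distribution, so its score rises. All function names and thresholds here are hypothetical.

```python
import numpy as np

def fit_gaussian(acts):
    """Fit a Gaussian (mean, regularized covariance) to activation vectors."""
    mu = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False) + 1e-6 * np.eye(acts.shape[1])
    return mu, cov

def mahalanobis_shift(base_acts, candidate_acts):
    """Mean Mahalanobis distance of candidate activations under the
    Gaussian fitted to the base model's activations on the same probes."""
    mu, cov = fit_gaussian(base_acts)
    inv = np.linalg.inv(cov)
    diffs = candidate_acts - mu
    d2 = np.einsum("ij,jk,ik->i", diffs, inv, diffs)  # per-sample squared distance
    return float(np.sqrt(d2).mean())

# Synthetic stand-ins for hidden-layer activations on benign probe prompts.
rng = np.random.default_rng(0)
dim = 8
base_acts = rng.normal(size=(500, dim))                       # base model
benign_ft = base_acts[:200] + rng.normal(scale=0.05, size=(200, dim))  # harmless fine-tune
shifted_ft = rng.normal(loc=1.5, size=(200, dim))             # specialized fine-tune

print(mahalanobis_shift(base_acts, benign_ft))   # stays near the base distribution
print(mahalanobis_shift(base_acts, shifted_ft))  # markedly larger: latent shift
```

Note that the audit only ever feeds the models benign inputs and inspects activations; no harmful output is generated at any point, which is what makes this style of probing compatible with the legal constraints the paper describes.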
From the abstract
Auditing the fine-tunes of open-weight generative models for harmful specialization has become a new governance challenge for model hosting platforms. The standard toolkit, generative evaluation via curated prompts or red-teaming, does not scale to platform-level auditing and breaks down entirely for domains like CSAM where generation is legally constrained. This motivates the Evaluation without Generation problem: assessing model capabilities without producing outputs. We argue that in such set