AI & ML Breaks Assumption

Frontier models like GPT-5.2 and Claude 4.5 suffer from 'Internal Safety Collapse' where safety alignment fails completely if a task's success necessitates harmful output.

March 26, 2026

Original Paper

Internal Safety Collapse in Frontier Large Language Models

Yutao Wu, Xiao Liu, Yifeng Gao, Xiang Zheng, Hanxun Huang, Yige Li, Cong Wang, Bo Li, Xingjun Ma, Yu-Gang Jiang

arXiv · 2603.23509

AI-generated illustration

The Takeaway

It reveals that alignment doesn't remove harmful capabilities but merely masks them, showing a 95% failure rate in professional scenarios. This challenges the assumption that 'smarter' models are safer and highlights a massive new attack surface in dual-use professional tools.

From the abstract

This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC-Bench containing 53 scenarios across 8 profession

Read the original paper →

← Back to today's papers