Many AI safety tools are effectively measuring sentence length rather than detecting whether a prompt is dangerous.
Current methods for detecting unusual or malicious prompts are structurally flawed because their scores are confounded by sequence length. Genuine signals of intent are split across two separate pathways: topic shifts and processing trajectories. This two-pathway framework explains why current safety filters are so easy to bypass with long or complex prompts: real-world jailbreaks often hide in the processing pathway while appearing normal to embedding-based detectors. Developers need to move beyond simple similarity scores and monitor the internal trajectory of how a model processes a prompt, as sketched below.
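To make the contrast concrete, here is a minimal sketch (not the paper's method) of the two detector styles: an embedding-based score that checks topical similarity, and a trajectory-based score that summarizes how the hidden state moves layer to layer. All names here (embedding_similarity_score, trajectory_score, the toy vectors) are hypothetical illustrations.

```python
import numpy as np

def embedding_similarity_score(prompt_emb, in_dist_centroid):
    """Topic-shift pathway: cosine similarity to an in-distribution centroid.
    High similarity means the prompt looks 'normal' to an embedding-based filter."""
    num = float(prompt_emb @ in_dist_centroid)
    den = float(np.linalg.norm(prompt_emb) * np.linalg.norm(in_dist_centroid))
    return num / den

def trajectory_score(layer_states):
    """Processing pathway: how sharply the hidden state moves between layers.
    layer_states: list of (hidden_dim,) vectors, one per layer.
    The mean layer-to-layer delta is a hypothetical summary statistic."""
    deltas = [np.linalg.norm(b - a) for a, b in zip(layer_states, layer_states[1:])]
    return float(np.mean(deltas))

# Toy usage: a jailbreak can score as in-distribution on similarity
# while showing an unusual layer-to-layer trajectory.
rng = np.random.default_rng(0)
centroid = rng.normal(size=768)
prompt_emb = centroid + 0.05 * rng.normal(size=768)            # topically "normal"
layer_states = [rng.normal(size=768) * (1 + 0.5 * i) for i in range(12)]
print(embedding_similarity_score(prompt_emb, centroid))        # near 1.0
print(trajectory_score(layer_states))                          # large drift
```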
How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework
arXiv · 2605.00269
Recent white-box OOD detection methods for LLMs -- including CED, RAUQ, and WildGuard confidence scores -- appear effective, but we show they are structurally confounded by sequence length (|r| ≥ 0.61) and collapse to near-chance under length-matched evaluation. Even raw attention entropy (mean H(α) across heads and layers), a natural baseline we include for completeness, shows the same confound. The confound stems from attention's Θ(log T) dependence on input length. To identify genuine OOD signals, we propose a two-pathway framework that separates topic shifts from processing trajectories.
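The Θ(log T) dependence can be seen directly: for near-uniform attention over T tokens, H(α) = -Σᵢ αᵢ log αᵢ ≈ log T, so a mean-entropy score rises with input length regardless of content. The numpy sketch below (illustrative, not the paper's code; all function names are assumptions) shows a mean-attention-entropy "detector" tracking log(length) almost perfectly on random attention maps.

```python
import numpy as np

def mean_attention_entropy(attn):
    """attn: array of shape (layers, heads, T, T), where each row along the
    last axis is an attention distribution. Returns the mean entropy across
    layers, heads, and query positions."""
    eps = 1e-12
    h = -(attn * np.log(attn + eps)).sum(axis=-1)  # entropy per query position
    return float(h.mean())

rng = np.random.default_rng(0)
lengths = [16, 32, 64, 128, 256, 512]
scores = []
for T in lengths:
    # Random attention logits -> softmax rows; roughly uniform, so H ≈ log T.
    logits = rng.normal(size=(4, 8, T, T))
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    scores.append(mean_attention_entropy(attn))

# The "detector score" correlates almost perfectly with log(length),
# which is the structural confound the abstract describes.
r = np.corrcoef(np.log(lengths), scores)[0, 1]
print(f"correlation with log T: {r:.3f}")  # close to 1.0
```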