Many AI safety tools are effectively measuring sentence length rather than detecting whether a prompt is dangerous.
Current methods for detecting unusual or malicious prompts are structurally flawed because their scores are confounded by sequence length. Genuine signals of intent are split across two separate pathways: topic shifts and processing trajectories. This two-pathway framework explains why current safety filters are so easy to bypass with long or complex prompts: real-world jailbreaks often hide in the processing pathway while appearing normal to embedding-based detectors. Developers need to move beyond simple similarity scores and monitor the internal trajectory of how a model processes a prompt, as sketched below.
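To make the contrast concrete, here is a minimal sketch (not the paper's method) of the two detector styles: an embedding-based score that checks topical similarity, and a trajectory-based score that summarizes how the hidden state moves layer to layer. All names here (embedding_similarity_score, trajectory_score, the toy vectors) are hypothetical illustrations.

```python
import numpy as np

def embedding_similarity_score(prompt_emb, in_dist_centroid):
    """Topic-shift pathway: cosine similarity to an in-distribution centroid.
    High similarity means the prompt looks 'normal' to an embedding-based filter."""
    num = float(prompt_emb @ in_dist_centroid)
    den = float(np.linalg.norm(prompt_emb) * np.linalg.norm(in_dist_centroid))
    return num / den

def trajectory_score(layer_states):
    """Processing pathway: how sharply the hidden state moves between layers.
    layer_states: list of (hidden_dim,) vectors, one per layer.
    The mean layer-to-layer delta is a hypothetical summary statistic."""
    deltas = [np.linalg.norm(b - a) for a, b in zip(layer_states, layer_states[1:])]
    return float(np.mean(deltas))

# Toy usage: a jailbreak can score as in-distribution on similarity
# while showing an unusual layer-to-layer trajectory.
rng = np.random.default_rng(0)
centroid = rng.normal(size=768)
prompt_emb = centroid + 0.05 * rng.normal(size=768)            # topically "normal"
layer_states = [rng.normal(size=768) * (1 + 0.5 * i) for i in range(12)]
print(embedding_similarity_score(prompt_emb, centroid))        # near 1.0
print(trajectory_score(layer_states))                          # large drift
```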
How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework
arXiv · 2605.00269
Recent white-box OOD detection methods for LLMs -- including CED, RAUQ, and WildGuard confidence scores -- appear effective, but we show they are structurally confounded by sequence length (|r| ≥ 0.61) and collapse to near-chance under length-matched evaluation. Even raw attention entropy (mean H(α) across heads and layers), a natural baseline we include for completeness, shows the same confound. The confound stems from attention's Θ(log T) dependence on input length. To identify genuine OOD signals, we propose a two-pathway framework that separates topic shifts from processing trajectories.
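The Θ(log T) dependence can be seen directly: for near-uniform attention over T tokens, H(α) = -Σᵢ αᵢ log αᵢ ≈ log T, so a mean-entropy score rises with input length regardless of content. The numpy sketch below (illustrative, not the paper's code; all function names are assumptions) shows a mean-attention-entropy "detector" tracking log(length) almost perfectly on random attention maps.

```python
import numpy as np

def mean_attention_entropy(attn):
    """attn: array of shape (layers, heads, T, T), where each row along the
    last axis is an attention distribution. Returns the mean entropy across
    layers, heads, and query positions."""
    eps = 1e-12
    h = -(attn * np.log(attn + eps)).sum(axis=-1)  # entropy per query position
    return float(h.mean())

rng = np.random.default_rng(0)
lengths = [16, 32, 64, 128, 256, 512]
scores = []
for T in lengths:
    # Random attention logits -> softmax rows; roughly uniform, so H ≈ log T.
    logits = rng.normal(size=(4, 8, T, T))
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    scores.append(mean_attention_entropy(attn))

# The "detector score" correlates almost perfectly with log(length),
# which is the structural confound the abstract describes.
r = np.corrcoef(np.log(lengths), scores)[0, 1]
print(f"correlation with log T: {r:.3f}")  # close to 1.0
```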