SeriesFusion
Science, curated & edited by AI
Practical Magic  /  AI

Malicious relays can hijack an AI nervous system to bypass safety filters after the model has already decided to be helpful.

Even if an AI is perfectly safe and aligned, it can still be tricked into doing harm. These Response-Path attacks target the message after it leaves the AI brain but before it reaches the hand that executes the code. By tampering with the intermediate response, an attacker can achieve a $99\%$ success rate in forcing a safe model to perform dangerous actions. This proves that safety is a pipeline problem, not just a model problem. We must secure the entire communication loop or the AI internal ethics won't matter.

Original Paper

When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

Mingyu Luo, Zihan Zhang, Zesen Liu, Yuchong Xie, Zhixiang Zhang, Dung Hiu Hilton Yeung, Wai Ip Lai, Ping Chen, Ming Wen, Dongdong She

arXiv  ·  2605.02187

Bring-Your-Own-Key (BYOK) agent architectures let users route LLM traffic through third-party relays, creating a critical integrity gap: a malicious relay can modify an aligned LLM response after generation but before agent execution. We formalize this post-alignment tampering threat and show that, without end-to-end integrity, the relay can observe, suppress, or replace downstream messages, making even perfectly aligned LLMs ineffective against such attacks. We instantiate this threat as the Re