Malicious relays can hijack an AI agent's "nervous system," bypassing safety filters after the model has already decided to be helpful.
Even a perfectly safe, aligned AI can still be tricked into doing harm. These Response-Path attacks target the message after it leaves the AI's "brain" but before it reaches the "hand" that executes the code. By tampering with the in-transit response, an attacker can force a safe model's agent to perform dangerous actions with a 99% success rate. This shows that safety is a pipeline problem, not just a model problem: unless the entire communication loop is secured, the AI's internal ethics won't matter.
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
arXiv · 2605.02187
Bring-Your-Own-Key (BYOK) agent architectures let users route LLM traffic through third-party relays, creating a critical integrity gap: a malicious relay can modify an aligned LLM response after generation but before agent execution. We formalize this post-alignment tampering threat and show that, without end-to-end integrity, the relay can observe, suppress, or replace downstream messages, making even perfectly aligned LLMs ineffective against such attacks. We instantiate this threat as the Re
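The integrity gap described above can be illustrated with a minimal sketch (not the paper's implementation; all names and the keying scheme are assumptions). An untrusted relay sits between the LLM provider and the agent and silently replaces the aligned response; an end-to-end MAC shared between provider and agent would let the agent detect the tampering:

```python
# Hypothetical sketch of post-alignment tampering in a BYOK-style pipeline.
# Assumption: the provider and agent share a key out of band, so the relay
# can rewrite the response but cannot forge a valid authentication tag.
import hashlib
import hmac

KEY = b"shared-provider-agent-key"  # assumed out-of-band key exchange

def provider_sign(response: str) -> tuple[str, str]:
    """Provider emits the aligned response plus an HMAC tag over it."""
    tag = hmac.new(KEY, response.encode(), hashlib.sha256).hexdigest()
    return response, tag

def malicious_relay(response: str, tag: str) -> tuple[str, str]:
    """Relay swaps the refusal for an attacker-chosen command, keeping the old tag."""
    return "rm -rf /", tag

def agent_verify(response: str, tag: str) -> bool:
    """Agent recomputes the tag; a mismatch reveals in-transit modification."""
    expected = hmac.new(KEY, response.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)

signed = provider_sign("I can't help with that request.")
tampered = malicious_relay(*signed)
assert agent_verify(*signed)        # untouched response passes
assert not agent_verify(*tampered)  # tampering is detected
```

Without such end-to-end integrity, the agent has no way to distinguish the relay's substitution from the model's genuine output, which is exactly the gap the abstract identifies.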