AI & ML · Breaks Assumption

FaithSteer-BENCH reveals that inference-time steering often creates 'illusory' control that collapses under minor prompt perturbations.

March 20, 2026

Original Paper

FaithSteer-BENCH: A Deployment-Aligned Stress-Testing Benchmark for Inference-Time Steering

Zikang Ding, Qiying Hu, Yi Zhang, Hongji Li, Junchi Yao, Hongbo Liu, Lijie Hu

arXiv · 2603.18329

The Takeaway

The benchmark demonstrates that common activation-level interventions induce prompt-conditional alignment rather than stable latent shifts. This is a critical warning for researchers who rely on steering as a lightweight alternative to fine-tuning for safety or behavioral control.
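To make "activation-level intervention" concrete, here is a minimal toy sketch in plain Python. It is not the paper's method or benchmark: the two-dimensional hidden states, the linear readout, the helper names (`steer`, `behavior_score`, `steering_success_rate`), and the fixed perturbation offset are all assumptions chosen for illustration. It shows only the general idea of adding a scaled direction to a hidden state and then checking whether the steered behavior survives a shift in the input.

```python
# Toy illustration only: hypothetical 2-D "hidden states" and a linear readout.
# Nothing here reproduces FaithSteer-BENCH or any real model's internals.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def steer(hidden, direction, alpha):
    """Activation addition: shift a hidden state by alpha * direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def behavior_score(hidden, readout):
    """Toy scalar stand-in for a target behavior: projection onto a readout."""
    return dot(hidden, readout)

def steering_success_rate(hiddens, direction, readout, alpha, threshold=0.0):
    """Fraction of inputs whose steered score exceeds the threshold."""
    scores = [behavior_score(steer(h, direction, alpha), readout) for h in hiddens]
    return sum(s > threshold for s in scores) / len(scores)

# Hypothetical hidden states for three "clean" prompts.
clean = [[0.2, 0.1], [-0.1, 0.3], [0.0, -0.2]]

# Assumed effect of a prompt perturbation: a fixed offset in hidden space.
offset = [-0.6, 0.0]
perturbed = [[h + o for h, o in zip(vec, offset)] for vec in clean]

direction = [1.0, 0.0]  # assumed steering direction
readout = [1.0, 0.0]    # assumed behavior readout
alpha = 0.5             # assumed steering strength

clean_rate = steering_success_rate(clean, direction, readout, alpha)
perturbed_rate = steering_success_rate(perturbed, direction, readout, alpha)
```

In this toy setup the same steering vector succeeds on every clean input but on only one of the three perturbed inputs. That gap between `clean_rate` and `perturbed_rate` is the kind of difference a robustness-oriented evaluation would look for; how the paper actually measures it is not specified in the text here.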

From the abstract

Inference-time steering is widely regarded as a lightweight and parameter-free mechanism for controlling large language model (LLM) behavior, and prior work has often suggested that simple activation-level interventions can reliably induce targeted behavioral changes. However, such conclusions are typically drawn under relatively relaxed evaluation settings that overlook deployment constraints, capability trade-offs, and real-world robustness. We therefore introduce FaithSteer-BENCH, a