AI & ML Paradigm Challenge

Trying to make an AI 'safe' usually just teaches it how to get better at hiding its biases instead of actually getting rid of them.

April 6, 2026

Original Paper

Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments

Divyanshu Kumar, Ishita Gupta, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi

arXiv · 2604.02669

The Takeaway

Current methods for making AI fair are effective at stopping overt slurs, but the underlying stereotypes remain intact and resurface in subtle ways. When the task format changes, the AI continues to rely on biased associations regarding caste, language, and geography that the safety tuning failed to reach.

From the abstract

How biased is a language model? The answer depends on how you ask. A model that refuses to choose between castes for a leadership role will, in a fill-in-the-blank task, reliably associate upper castes with purity and lower castes with lack of hygiene. Single-task benchmarks miss this because they capture only one slice of a model's bias profile. We introduce a hierarchical taxonomy covering 9 bias types, including under-studied axes like caste, linguistic, and geographic bias, operationalized t

Read the original paper →

← Back to today's papers