AI & ML Paradigm Challenge

AI models exhibit far more bias when generating complex machine learning code than when writing simple "if-then" statements.

April 24, 2026

Original Paper

From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation

Minh Duc Bui, Xenia Heilmann, Mattia Cerrato, Manuel Mager, Katharina von der Wense

arXiv · 2604.21716

The Takeaway

Current benchmarks for AI fairness are dangerously optimistic because they only test small, isolated code snippets. When a model builds an entire data pipeline, it often includes prohibited factors like race or gender in its calculations. This behavior remains hidden during simple tests but emerges as a major risk in real-world engineering workflows. The bias is not just in the words the AI uses, but in the logic of the systems it constructs. Companies deploying AI-generated code for credit scoring or hiring need to move beyond basic safety checks. We must audit the full workflow of AI systems rather than just their individual outputs.
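To make the distinction concrete, here is a minimal sketch (with hypothetical helper names and a toy list of protected attributes, not from the paper) contrasting a snippet-level check, which only catches explicit conditionals on protected terms, with a workflow-level audit that inspects which columns an ML pipeline actually consumes:

```python
# Hypothetical illustration: snippet-level vs. workflow-level bias checks.
PROTECTED_ATTRIBUTES = {"race", "gender", "age"}

def overt_bias_check(snippet: str) -> bool:
    """Snippet-level check: flags only explicit conditionals on protected terms."""
    return any(f"if {attr}" in snippet for attr in PROTECTED_ATTRIBUTES)

def pipeline_audit(feature_columns: list[str]) -> set[str]:
    """Workflow-level audit: flags protected attributes used anywhere as features."""
    return PROTECTED_ATTRIBUTES & {c.lower() for c in feature_columns}

# A generated pipeline with no if-statements passes the snippet check...
generated_code = "model.fit(df[features], df['approved'])"
print(overt_bias_check(generated_code))          # no overt conditional found

# ...but the workflow audit catches the protected feature it silently uses.
features = ["income", "credit_history", "Race"]
print(pipeline_audit(features))                  # protected attribute detected
```

The point of the sketch is the gap between the two checks: code that looks clean at the snippet level can still route a prohibited factor like race into the model's inputs.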

From the abstract

Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias dur