AI & ML Efficiency Breakthrough

Fixes the 'squeezing effect' in Direct Preference Optimization (DPO) using an efficient logit-space variant of Sharpness-Aware Minimization.

March 20, 2026

Original Paper

Sharpness-Aware Minimization in Logit Space Efficiently Enhances Direct Preference Optimization

Haocheng Luo, Zehang Deng, Thanh-Toan Do, Mehrtash Harandi, Dinh Phung, Trung Le

arXiv · 2603.18258

The Takeaway

DPO often suffers from likelihood displacement, where the probability of preferred responses decreases during training. This paper traces the cause to the curvature of the loss in logit space and proposes a computationally 'free' fix that improves alignment stability across model scales.
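To make the idea concrete, here is a minimal sketch of the standard DPO loss together with a generic SAM-style perturbation applied in logit (log-probability) space. This is an illustration of the general technique, not the paper's specific algorithm: the function names, the finite-difference gradient, and the hyperparameters `beta`, `rho`, and `eps` are all assumptions chosen for readability, and the loss is evaluated at a worst-case point within radius `rho` rather than via whatever closed-form perturbation the authors derive.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO loss on one preference pair:
    #   -log sigmoid(beta * (policy margin - reference margin))
    # logp_w / logp_l are the policy's log-probs of the preferred /
    # dispreferred response; ref_* are the frozen reference model's.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def sam_perturbed_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                           beta=0.1, rho=0.05, eps=1e-3):
    # Generic SAM step in logit space (illustrative, not the paper's exact
    # method): estimate the gradient of the loss w.r.t. the two log-probs
    # by finite differences, step in the ascent direction with radius rho,
    # and evaluate the loss at that worst-case point.
    base = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    g_w = (dpo_loss(logp_w + eps, logp_l, ref_logp_w, ref_logp_l, beta)
           - base) / eps
    g_l = (dpo_loss(logp_w, logp_l + eps, ref_logp_w, ref_logp_l, beta)
           - base) / eps
    norm = math.hypot(g_w, g_l) + 1e-12
    return dpo_loss(logp_w + rho * g_w / norm,
                    logp_l + rho * g_l / norm,
                    ref_logp_w, ref_logp_l, beta)
```

Because the perturbation is taken over the response log-probs rather than the full parameter vector, it involves only scalar arithmetic per pair, which is what makes a logit-space formulation nearly free compared with parameter-space SAM's extra forward-backward pass.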

From the abstract

Direct Preference Optimization (DPO) has emerged as a popular algorithm for aligning pretrained large language models with human preferences, owing to its simplicity and training stability. However, DPO suffers from the recently identified squeezing effect (also known as likelihood displacement), where the probability of preferred responses decreases unintentionally during training. To understand and mitigate this phenomenon, we develop a theoretical framework that models the coordinate-wise dynamics […]