Integrates radiologist gaze data as a probabilistic prior to align vision-language models with real clinical reasoning workflows.
March 30, 2026
Original Paper
Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays
arXiv · 2603.26049
The Takeaway
Standard medical VLM pretraining treats radiographs as context-agnostic images. By supervising attention maps with radiologist gaze, CoGaze pushes the model toward diagnostically salient regions, yielding a +23% gain in zero-shot classification and noticeably more reliable report generation.
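To make the idea concrete, here is a minimal sketch of what gaze-guided attention supervision could look like: a KL-divergence term that pulls the model's spatial attention toward a radiologist gaze heatmap, added on top of the usual image-text contrastive objective. The function name, tensor shapes, and loss weighting below are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def gaze_supervision_loss(attn_map: torch.Tensor,
                          gaze_heatmap: torch.Tensor,
                          eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical gaze-alignment term (illustrative, not CoGaze's exact loss).

    attn_map:     (B, H, W) model attention over image patches
    gaze_heatmap: (B, H, W) radiologist gaze density on the same grid
    """
    b = attn_map.shape[0]
    # Flatten spatial dimensions and normalize both maps into distributions.
    p = attn_map.reshape(b, -1)
    q = gaze_heatmap.reshape(b, -1)
    p = p / (p.sum(dim=-1, keepdim=True) + eps)
    q = q / (q.sum(dim=-1, keepdim=True) + eps)
    # KL(gaze || attention): penalizes attention that ignores gazed regions.
    kl = (q * (torch.log(q + eps) - torch.log(p + eps))).sum(dim=-1)
    return kl.mean()

# Usage sketch: combine with a standard contrastive image-text loss.
# total_loss = clip_loss + lambda_gaze * gaze_supervision_loss(attn, gaze)
```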
From the abstract
Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists' gaze -- a crucial cue for visual reasoning -- remains largely underexplored by existing methods. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining...