Fine-tunes large vision language models (LVLMs) for medical tasks using only image-description pairs, bypassing the need for expensive expert-curated instructions.
March 23, 2026
Original Paper
Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following
arXiv · 2603.19482
The Takeaway
The paper challenges the assumption that visual instruction tuning requires curated image-instruction-output triplets. It enables high-performance domain adaptation in fields like medicine, where generating high-quality instruction-output pairs is a major bottleneck.
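To make the data difference concrete, here is a minimal Python sketch contrasting the two record shapes. All class names, field names, and example strings are hypothetical placeholders; this only illustrates what must be authored per image and does not reflect the paper's training procedure.

```python
from dataclasses import dataclass


@dataclass
class InstructionTriplet:
    """Conventional visual instruction tuning record (hypothetical schema)."""
    image_path: str
    instruction: str  # curated prompt, typically written or vetted by an expert
    output: str       # curated target response


@dataclass
class ImageDescriptionPair:
    """Instruction-free record: an image and its description (hypothetical schema)."""
    image_path: str
    description: str


def required_text_fields(example) -> list[str]:
    """Return the text fields that must be supplied for one training example."""
    return [name for name in vars(example) if name != "image_path"]


# Placeholder values only; no real data is implied.
triplet = InstructionTriplet(
    image_path="scan_001.png",
    instruction="placeholder instruction",
    output="placeholder expert-written answer",
)
pair = ImageDescriptionPair(
    image_path="scan_001.png",
    description="placeholder description",
)

print(required_text_fields(triplet))
print(required_text_fields(pair))
```

The triplet needs two authored text fields per image, while the pair needs only one, and that extra authoring step is the curation cost the takeaway refers to.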
From the abstract
Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning appr