AI & ML Paradigm Shift

Fine-tunes Vision-Language Models on raw images alone, using a text-to-image model as a cycle-consistency reward.

March 20, 2026

Original Paper

CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning

Marios Krestenitis, Christos Tzelepis, Konstantinos Ioannidis, Stefanos Vrochidis, Ioannis Kompatsiaris, Georgios Tzimiropoulos, Shaogang Gong, Ioannis Patras

arXiv · 2603.18282

The Takeaway

This removes the dependency on expensive human-annotated caption datasets for VLM improvement. By closing the loop (Image -> Caption -> Reconstructed Image), practitioners can use the reconstruction error as a self-supervised training signal via GRPO to improve grounding and reduce hallucinations.
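The loop above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the embedding vectors, function names, and the cosine-similarity reward are assumptions standing in for the paper's actual reconstruction-based scoring, and the group-normalized advantage is the generic GRPO formulation.

```python
import math

def cosine_similarity(a, b):
    # Similarity between two feature vectors (stand-ins for image embeddings).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cycle_consistency_reward(original_emb, reconstructed_emb):
    # Hypothetical reward: how closely the image reconstructed from the
    # VLM's caption (by a text-to-image model) matches the original image.
    return cosine_similarity(original_emb, reconstructed_emb)

def grpo_advantages(rewards):
    # Standard GRPO normalization: each sampled caption's advantage is its
    # reward relative to the group of captions drawn for the same image.
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mean) / std for r in rewards]

# Toy embeddings: a grounded caption reconstructs something close to the
# original; a hallucinated caption reconstructs something far from it.
original = [0.9, 0.1, 0.4]
reconstructions = [
    [0.88, 0.12, 0.41],  # from a well-grounded caption
    [0.70, 0.30, 0.35],  # from a generic caption
    [0.10, 0.90, 0.20],  # from a hallucinated caption
]
rewards = [cycle_consistency_reward(original, r) for r in reconstructions]
advantages = grpo_advantages(rewards)
# The grounded caption receives the highest advantage, pushing the VLM
# toward descriptions that let the text-to-image model recover the input.
```

No human captions appear anywhere in this signal: the only supervision is how well the caption lets a frozen text-to-image model reproduce the input image.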

From the abstract

Vision-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this via instruction tuning, which requires costly, large-scale annotated datasets, or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: