AI & ML Open Release

BioVITA releases a multimodal biological dataset of 3.6M image-audio-text samples covering 14,000 species.

March 26, 2026

Original Paper

BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment

Risa Shinoda, Kaede Shiohara, Nakamasa Inoue, Kuniaki Saito, Hiroaki Santo, Fumio Okura

arXiv · 2603.23883

The Takeaway

This release makes audio-based species identification more accessible. The acoustic modality has so far been underserved compared with visual data. It provides the first large-scale foundation for tri-modal (visual-textual-acoustic) alignment in ecology.

From the abstract

Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology. While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA involves (i) a training dataset, (ii)