AI & ML Paradigm Shift

Collapses the standard vision backbone-plus-decoder architecture into a single early-fusion Transformer stack for both perception and task modeling.

March 31, 2026

Original Paper

Falcon Perception

Aviraj Bevli, Sofian Chaybouti, Yasser Dahou, Hakim Hacid, Ngoc Dung Huynh, Phuc H. Le Khac, Sanath Narayan, Wamiq Reyaz Para, Ankit Singh

arXiv · 2603.27365

The Takeaway

By processing image tokens and prediction tokens in a single shared parameter space, Falcon Perception simplifies the vision pipeline while significantly improving mask quality and OCR performance over modular baselines like SAM.

From the abstract

Perception-centric systems are typically implemented with a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential, or can a single early-fusion stack do both perception and task modeling at scale? We introduce Falcon Perception, a unified dense Transformer that processes image patches and text tokens in a shared parameter space […]
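The key idea in the abstract — early fusion, where image patches and text tokens share one parameter space from the start rather than meeting in a late-fusion decoder — can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the dimensions, the single attention head, and all weight initializations here are hypothetical, chosen only to show how both modalities are projected to a common width and attended over jointly by the same weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not taken from the paper)
PATCH, D = 16, 64          # patch size, shared embedding width

def patchify(image, patch=PATCH):
    """Split an HxWxC image into flattened non-overlapping patches."""
    H, W, C = image.shape
    p = image.reshape(H // patch, patch, W // patch, patch, C)
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

def attention(x, Wq, Wk, Wv):
    """Single-head self-attention over the fused token sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# One image and a short text prompt, both mapped into the SAME width D
image = rng.standard_normal((64, 64, 3))
W_img = rng.standard_normal((PATCH * PATCH * 3, D)) * 0.02  # patch projection
img_tokens = patchify(image) @ W_img                        # (16, D)

vocab = rng.standard_normal((100, D)) * 0.02                # toy embedding table
text_ids = np.array([5, 17, 42])
txt_tokens = vocab[text_ids]                                # (3, D)

# Early fusion: concatenate modalities BEFORE the first attention layer,
# so one shared set of weights attends across image and text jointly --
# there is no separate vision backbone or task decoder.
fused = np.concatenate([img_tokens, txt_tokens], axis=0)    # (19, D)
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))
out = attention(fused, Wq, Wk, Wv)
print(out.shape)   # (19, 64): every token, from either modality, sees every other
```

In a modular pipeline the image tokens would pass through a frozen or separate backbone before any interaction with text; here the fusion happens before the first layer, which is the architectural change the paper asks about at scale.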