A self-supervised robotic system detects novel objects by training bespoke detectors on the fly from human video demonstrations, bypassing language-based prompts.
March 16, 2026
Original Paper
Show, Don't Tell: Detecting Novel Objects by Watching Human Videos
arXiv · 2603.12751
The Takeaway
Current open-vocabulary detectors often require tedious prompt engineering to recognize specific object instances. The 'Show, Don't Tell' approach lets a robot automatically generate a tailored dataset and detector in minutes, significantly improving task completion in unconstrained environments.
From the abstract
How can a robot quickly identify and recognize new objects shown to it during a human demonstration? Existing closed-set object detectors frequently fail at this because the objects are out-of-distribution. While open-set detectors (e.g., VLMs) sometimes succeed, they often require expensive and tedious human-in-the-loop prompt engineering to uniquely recognize novel object instances. In this paper, we present a self-supervised system that eliminates the need for tedious language descriptions an