AI & ML Practical Magic

Converting joint movements into text descriptions allows language models to understand human motion without using any complex vision hardware.

April 24, 2026

Original Paper

Encoder-Free Human Motion Understanding via Structured Motion Descriptions

Yao Zhang, Zhuchenyang Liu, Thomas Ploetz, Yu Xiao

arXiv · 2604.21668

The Takeaway

This method bypasses the need for expensive cross-modal encoders that bridge the gap between video and text. By describing the physics of a movement in plain words, the model can draw on its existing knowledge of anatomy and motion to identify an action. This suggests that hard computer-vision problems can sometimes be sidestepped by treating movement as a kind of language: a model's understanding of swinging an arm is already present in its training data, even if it has never seen a video of one. This significantly lowers the barrier for building robots that can understand and mimic human gestures, and it simplifies the pipeline for human-robot interaction in home environments.
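The paper does not publish its exact description format here, but the core idea, turning raw joint trajectories into short sentences an LLM can reason about, can be illustrated with a minimal sketch. Everything below (the function name, the thresholds, the wording of the descriptions) is a hypothetical example, not the authors' scheme:

```python
def describe_wrist_motion(wrist_xyz, fps=30):
    """Turn a short sequence of 3-D wrist positions into a plain-text
    motion description that a text-only LLM can reason about.

    wrist_xyz: list of (x, y, z) tuples, one per frame, in metres,
               with y as the vertical axis (an assumed convention).
    fps: frame rate of the capture, used to convert frames to seconds.
    """
    if len(wrist_xyz) < 2:
        return "the wrist is stationary"

    # Net vertical displacement across the clip.
    dy = wrist_xyz[-1][1] - wrist_xyz[0][1]
    duration = (len(wrist_xyz) - 1) / fps          # seconds
    speed = abs(dy) / duration                     # vertical speed, m/s

    # Illustrative thresholds: 2 cm of travel counts as movement,
    # 0.5 m/s separates "slowly" from "quickly".
    if abs(dy) <= 0.02:
        return "the wrist stays at roughly the same height"
    direction = "rises" if dy > 0 else "lowers"
    pace = "quickly" if speed > 0.5 else "slowly"
    return f"the wrist {direction} {pace} ({speed:.2f} m/s)"


# Example: a wrist climbing 60 cm over half a second of capture.
frames = [(0.0, 1.0 + 0.04 * i, 0.3) for i in range(16)]
print(describe_wrist_motion(frames))  # → the wrist rises quickly (1.20 m/s)
```

A sentence like this can be concatenated per joint and per time window, then handed to an off-the-shelf LLM as ordinary text, with no motion encoder or embedding-space alignment required.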

From the abstract

The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, remaining constrained by cross-modal representation and alignment. Inspired by …