SeriesFusion
Science, curated & edited by AI

Frozen weights from a text-only model can move a robot arm without ever seeing a single image.

Large language models like Gemma 4 learn an internal structure from text that carries over to the physical world. A small trainable interface can map robotic sensor data onto these frozen weights and handle manipulation and locomotion tasks competitively with specialized models. This suggests that the geometry of human language encodes real structure about how objects move and interact. Researchers would not need to train massive new models from scratch for every physical task: a large text model can serve as a general-purpose brain for a robot, with only a small translator trained on top.
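
To make the "small translator" idea concrete, here is a minimal PyTorch sketch of the recipe the summary describes: a frozen transformer backbone with only a small observation projection and action head trained on robot data. Everything here (class and variable names, dimensions, the stand-in backbone) is an illustrative assumption rather than the paper's code; the actual study reuses frozen Gemma 4 31B weights, not the toy encoder below.

```python
import torch
import torch.nn as nn

class FrozenBackbonePolicy(nn.Module):
    """Frozen text-pretrained transformer reused for control: only the two
    small interface layers below ever receive gradients."""

    def __init__(self, backbone: nn.Module, obs_dim: int, act_dim: int, d_model: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():          # freeze the substrate
            p.requires_grad = False
        self.obs_proj = nn.Linear(obs_dim, d_model)   # sensors -> backbone token space
        self.act_head = nn.Linear(d_model, act_dim)   # hidden states -> motor commands

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        tokens = self.obs_proj(obs_seq)               # (batch, time, d_model)
        hidden = self.backbone(tokens)                # frozen computation, reused as-is
        return self.act_head(hidden)                  # (batch, time, act_dim)


# Small stand-in backbone so the sketch runs without a 31B checkpoint.
d_model = 256
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
policy = FrozenBackbonePolicy(backbone, obs_dim=17, act_dim=6, d_model=d_model)

# Only the interface is optimized; the backbone never changes.
trainable = [p for p in policy.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=3e-4)

n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in policy.parameters())
print(f"trainable: {n_trainable:,} of {n_total:,} parameters")

actions = policy(torch.randn(8, 10, 17))              # 8 windows of 10 timesteps each
```

In a real setup the stand-in encoder would be replaced by the pretrained checkpoint's layer stack, loaded and frozen the same way; the key point is that the optimizer only ever sees the interface parameters, which is what keeps the trainable count small relative to models trained end to end.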

Original Paper

Borrowed Geometry: Computational Reuse of Frozen Text-Pretrained Transformer Weights Across Modalities

Abay Bektursun

arXiv  ·  2605.00333

Frozen Gemma 4 31B weights pretrained exclusively on text tokens, unmodified, transfer across modality boundaries through a thin trainable interface. (1) OGBench scene-play-singletask-task1-v0: $+4.33$ pt over published GCIQL at $n=3$ with std 0.74, a published-SOTA win on a robotic manipulation task the substrate has never seen. (2) D4RL Walker2d-medium-v2: Decision-Transformer parity ($76.2 \pm 0.8$, $n=3$) at $0.43\times$ DT's trainable count, with the frozen substrate compressing to a 5L sl