A heavily compressed 3-bit model built a working app better than industry-standard models requiring five times the memory.
April 23, 2026
Original Paper
React-ing to Grace Hopper 200: Five Open-Weights Coding Models, One React Native App, One GH200, One Weekend
arXiv · 2604.17187
The Takeaway
Current AI coding benchmarks are failing to predict real-world success. Models that score high on tests like SWE-Bench often struggle to build a simple functional application. This weekend experiment showed that quantization doesn't necessarily kill the practical utility of a coding model: a small, efficient model can navigate the complexities of app development through better reasoning rather than more parameters. This suggests that the quest for higher benchmark scores is distracting us from performance on real tasks, and that software engineering with AI should focus on practical buildability rather than academic metrics.
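As a rough sketch of where the "five times the memory" figure could come from (back-of-envelope arithmetic, not a calculation from the paper), weight memory scales linearly with bits per parameter, so a 3-bit quantization of a 480B-parameter model such as Qwen3-Coder-480B shrinks from roughly a terabyte at FP16 to something that fits on a single GH200's 576 GB:

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory footprint in GB.

    Ignores activations, KV cache, and quantization metadata overhead,
    so real deployments need somewhat more than this estimate.
    """
    return n_params * bits_per_weight / 8 / 1e9

n = 480e9  # parameter count taken from the model name in the abstract

q3 = model_size_gb(n, 3)     # ~180 GB: fits on one GH200 (576 GB)
fp16 = model_size_gb(n, 16)  # ~960 GB: would need multiple accelerators

print(f"Q3: {q3:.0f} GB, FP16: {fp16:.0f} GB, ratio: {fp16 / q3:.1f}x")
```

The FP16-to-Q3 ratio is 16/3 ≈ 5.3, which lines up with the "five times the memory" framing in the headline.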
From the abstract
We evaluate five state-of-the-art open-weights coding language models -- Kimi-K2.5 (at Q3 and Q4 quantizations), GLM-5.1, Qwen3-Coder-480B, and DeepSeek-V3.2 -- on a single multi-file React Native application generation task on NVIDIA GH200 576 GB hardware. The task specifies authentication, per-user per-day counting, and web compatibility, and is evaluated on whether the generated project runs out-of-the-box and on feature-level correctness. We find that SWE-Bench rankings do not predict task performance.