A heavily compressed 3-bit model built a working app better than industry-standard models requiring five times the memory.
April 23, 2026
Original Paper
React-ing to Grace Hopper 200: Five Open-Weights Coding Models, One React Native App, One GH200, One Weekend
arXiv · 2604.17187
The Takeaway
Current AI coding benchmarks are failing to predict real-world success. Models that score high on tests like SWE-Bench often struggle to build a simple functional application. This weekend experiment showed that quantization doesn't necessarily kill the practical utility of a coding model: a small, efficient model can navigate the complexities of app development through better reasoning rather than more parameters. This suggests that the quest for higher benchmark scores is distracting us from performance on real tasks, and that software engineering with AI should focus on practical buildability rather than academic metrics.
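As a rough sketch of where the "five times the memory" figure could come from (back-of-envelope arithmetic, not a calculation from the paper), weight memory scales linearly with bits per parameter, so a 3-bit quantization of a 480B-parameter model such as Qwen3-Coder-480B shrinks from roughly a terabyte at FP16 to something that fits on a single GH200's 576 GB:

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory footprint in GB.

    Ignores activations, KV cache, and quantization metadata overhead,
    so real deployments need somewhat more than this estimate.
    """
    return n_params * bits_per_weight / 8 / 1e9

n = 480e9  # parameter count taken from the model name in the abstract

q3 = model_size_gb(n, 3)     # ~180 GB: fits on one GH200 (576 GB)
fp16 = model_size_gb(n, 16)  # ~960 GB: would need multiple accelerators

print(f"Q3: {q3:.0f} GB, FP16: {fp16:.0f} GB, ratio: {fp16 / q3:.1f}x")
```

The FP16-to-Q3 ratio is 16/3 ≈ 5.3, which lines up with the "five times the memory" framing in the headline.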
From the abstract
We evaluate five state-of-the-art open-weights coding language models -- Kimi-K2.5 (at Q3 and Q4 quantizations), GLM-5.1, Qwen3-Coder-480B, and DeepSeek-V3.2 -- on a single multi-file React Native application generation task on NVIDIA GH200 576 GB hardware. The task specifies authentication, per-user per-day counting, and web compatibility, and is evaluated on whether the generated project runs out-of-the-box and on feature-level correctness. We find that SWE-Bench rankings do not predict task performance.