DAPA speeds up GELU computation by 16x and reduces hardware DSP utilization by 16x for on-device Transformer deployment.
March 23, 2026
Original Paper
DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training
arXiv · 2603.19338
The Takeaway
Traditional piecewise linear approximations lose accuracy in high-probability regions; DAPA uses a differentiable, non-uniform piecewise approach that preserves Transformer performance while drastically reducing the cost of non-linear activations.
From the abstract
Non-linear activation functions play a pivotal role in on-device inference and training, as they not only consume substantial hardware resources but also have a significant impact on system performance and energy efficiency. In this work, we propose Distribution-Aware Piecewise Activation (DAPA), a differentiable and hardware-friendly activation function for Transformer architectures that exploits the distribution of pre-activation data. DAPA employs a non-uniform piecewise approximation that ...
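To make the intuition concrete, below is a minimal NumPy/SciPy sketch of a distribution-aware, non-uniform piecewise linear GELU: breakpoints are placed at quantiles of an assumed standard-Gaussian pre-activation distribution, so segments cluster where pre-activations actually fall. The function names, quantile-based breakpoint placement, and segment count are illustrative assumptions for this sketch, not the construction used in the paper.

```python
import numpy as np
from scipy.stats import norm

def gelu(x):
    # Exact GELU via the Gaussian CDF: GELU(x) = x * Phi(x).
    return x * norm.cdf(x)

def make_breakpoints(n_segments=16, lo=-6.0, hi=6.0):
    # Illustrative assumption: interior breakpoints sit at quantiles of a
    # standard-Gaussian pre-activation distribution, so segments are densest
    # where pre-activations are most likely (non-uniform spacing).
    qs = np.linspace(0.001, 0.999, n_segments - 1)
    return np.concatenate(([lo], norm.ppf(qs), [hi]))

def piecewise_gelu(x, breakpoints):
    # Linear interpolation between exact GELU values at the breakpoints.
    y = np.interp(x, breakpoints, gelu(breakpoints))
    # Outside the covered range GELU saturates: ~0 on the left, ~x on the right.
    y = np.where(x < breakpoints[0], 0.0, y)
    y = np.where(x > breakpoints[-1], x, y)
    return y

# Usage: measure approximation error on Gaussian-distributed pre-activations.
x = np.random.randn(100_000)
bp = make_breakpoints(n_segments=16)
err = np.abs(piecewise_gelu(x, bp) - gelu(x))
print(f"16-segment non-uniform PWL GELU  max err: {err.max():.2e}  mean err: {err.mean():.2e}")
```

The point of the non-uniform spacing is that most of the segments land in the narrow band around zero where Gaussian pre-activations concentrate, which is exactly where a uniform piecewise grid would waste segments on rarely visited regions and lose accuracy where it matters.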