NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches
The paper introduces NOBLE, a pretraining architecture that permanently augments the linear layers of a transformer with learnable nonlinear low-rank branches, using a CosNet activation. NOBLE delivers significant training-efficiency gains and speedups across a range of models with minimal parameter and wall-clock overhead, though its benefits can be diminished by certain stochastic data augmentations.
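To make the idea concrete, here is a minimal PyTorch sketch of a linear layer augmented with a nonlinear low-rank branch. The class name `NobleLinear`, the `rank` hyperparameter, the zero-initialized up-projection, and the use of `torch.cos` as a stand-in for the paper's CosNet activation are all illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class NobleLinear(nn.Module):
    """Linear layer with an added nonlinear low-rank branch (sketch).

    Computes  y = W x + B * phi(A x),  where A: d_in -> r and
    B: r -> d_out with r << min(d_in, d_out), and phi is a
    nonlinearity (a cosine here, standing in for CosNet).
    """

    def __init__(self, d_in: int, d_out: int, rank: int = 16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)             # original linear layer W
        self.down = nn.Linear(d_in, rank, bias=False)  # low-rank down-projection A
        self.up = nn.Linear(rank, d_out, bias=False)   # low-rank up-projection B
        # Zero-init the up-projection so the branch starts as a no-op
        # (an assumed initialization choice, common for added branches).
        nn.init.zeros_(self.up.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # torch.cos is a placeholder for the paper's CosNet activation.
        branch = self.up(torch.cos(self.down(x)))
        return self.base(x) + branch


# Usage: drop-in replacement for a transformer's nn.Linear.
layer = NobleLinear(d_in=768, d_out=768, rank=32)
y = layer(torch.randn(4, 128, 768))  # (batch, seq, hidden)
```

Because the branch is "permanent" rather than merged away after training, the extra cost at inference is one rank-r matmul pair per layer, which stays small when r is much less than the hidden dimension.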