Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
This paper proposes reducing KV cache memory usage in transformers by using low-dimensional keys for attention selection while keeping full-dimensional values for semantic transfer. The strategy is validated across multiple models and datasets, achieving up to 75% cache savings with minimal performance degradation.
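To make the key/value asymmetry concrete, here is a minimal single-head sketch of the idea in PyTorch: attention scores are computed from keys projected to a small dimension, while the values (and the attention output) keep the full model dimension. The names `ThinKeyAttention` and `d_select` are illustrative assumptions, not the paper's API, and details such as causal masking, multi-head splitting, and the actual cache-management code are omitted.

```python
# Sketch only: ThinKeyAttention / d_select are hypothetical names, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThinKeyAttention(nn.Module):
    """Single-head attention with low-dimensional keys and full-dimensional values.

    Queries and keys are projected to a small dimension d_select, so the cached
    keys are cheap; values keep the full width d_model, so the attention output
    retains full semantic capacity.
    """
    def __init__(self, d_model: int, d_select: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_select, bias=False)  # thin query
        self.k_proj = nn.Linear(d_model, d_select, bias=False)  # thin key (cached)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)   # full value (cached)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_select ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q = self.q_proj(x)   # (B, T, d_select)
        k = self.k_proj(x)   # (B, T, d_select) -- small per-token cache entry
        v = self.v_proj(x)   # (B, T, d_model)  -- full-width values
        scores = (q @ k.transpose(-2, -1)) * self.scale  # selection in low dimension
        attn = F.softmax(scores, dim=-1)
        return self.out_proj(attn @ v)  # semantic transfer uses full values

# Usage: with d_select = d_model // 4, the key portion of the per-token cache
# shrinks to a quarter of its full-dimensional size; values stay full width.
x = torch.randn(2, 16, 256)
layer = ThinKeyAttention(d_model=256, d_select=64)
print(layer(x).shape)  # torch.Size([2, 16, 256])
```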