What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models
This paper introduces EmbedLens to reveal that multimodal large language models exhibit significant visual token sparsity and redundancy: only a subset of "alive" tokens carries essential semantic information, and these tokens can be processed efficiently via mid-layer injection rather than full internal computation.
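The mid-layer injection idea can be illustrated with a minimal sketch. This is not the paper's implementation: the toy "transformer" below is just a stack of residual nonlinear maps, and the names (`layer`, `mid_layer_injection`, `inject_at`, the `alive_mask`) are hypothetical. It only shows the control flow: "alive" tokens pass through the early layers, while the remaining tokens skip them and are injected with their raw embeddings at a middle layer, after which all tokens are processed jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

D, L = 8, 6  # hidden size, number of layers (toy values)
Ws = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(L)]

def layer(h, W):
    # Toy stand-in for a transformer layer: residual + nonlinearity.
    return h + np.tanh(h @ W)

def full_pass(tokens):
    # Baseline: every token goes through every layer.
    h = tokens
    for W in Ws:
        h = layer(h, W)
    return h

def mid_layer_injection(tokens, alive_mask, inject_at=3):
    # Only "alive" tokens are computed through the early layers;
    # the redundant tokens skip them entirely.
    h = tokens[alive_mask]
    for W in Ws[:inject_at]:
        h = layer(h, W)
    # At the mid layer, re-insert the raw embeddings of the skipped
    # tokens alongside the processed alive tokens.
    merged = tokens.copy()
    merged[alive_mask] = h
    # Remaining layers process the full sequence jointly.
    for W in Ws[inject_at:]:
        merged = layer(merged, W)
    return merged

tokens = rng.standard_normal((10, D))
alive = np.zeros(10, dtype=bool)
alive[:4] = True  # pretend 4 of 10 visual tokens are "alive"

out_full = full_pass(tokens)
out_inj = mid_layer_injection(tokens, alive)
print(out_full.shape, out_inj.shape)
```

The early-layer cost scales with the number of alive tokens only (4 rows here instead of 10), which is where the efficiency gain of such a scheme would come from; how tokens are classified as alive is the paper's contribution and is not modeled here.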