Point Cloud as a Foreign Language for Multi-modal Large Language Model
The paper introduces SAGE, which it presents as the first end-to-end multi-modal large language model to treat raw point clouds as a "foreign language". SAGE combines a lightweight 3D tokenizer with semantic alignment-based preference optimization, and the authors report better performance and efficiency than existing encoder-based methods on 3D understanding tasks.
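
To make the "foreign language" idea concrete, the sketch below shows one way a raw point cloud could be turned into a short sequence of discrete token ids that a language model might consume alongside text tokens. It is an illustration under stated assumptions, not SAGE's tokenizer: the summary above does not describe the tokenizer's internals, so the choice of farthest point sampling, k-nearest-neighbor grouping, and nearest-codeword quantization, as well as every function name and parameter here, is hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def farthest_point_sampling(points: Sequence[Point], num_centers: int) -> List[int]:
    """Pick indices of points that are spread out over the cloud (assumed step)."""
    chosen = [0]  # start from the first point
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < num_centers:
        # Next center: the point farthest from all centers chosen so far.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


def group_patch(points: Sequence[Point], center: Point, k: int) -> List[Point]:
    """Collect the k points nearest to a center into one local patch."""
    return sorted(points, key=lambda p: math.dist(p, center))[:k]


def patch_descriptor(patch: Sequence[Point], center: Point) -> Point:
    """Toy patch feature: mean offset of the patch points from the center."""
    n = len(patch)
    return tuple(sum(p[a] - center[a] for p in patch) / n for a in range(3))


def quantize(desc: Point, codebook: Sequence[Point]) -> int:
    """Map a descriptor to the index of its nearest codebook entry."""
    return min(range(len(codebook)), key=lambda j: math.dist(desc, codebook[j]))


def tokenize(points: Sequence[Point], codebook: Sequence[Point],
             num_centers: int = 4, k: int = 8) -> List[int]:
    """Hypothetical pipeline: sample centers, group patches, quantize to ids."""
    tokens = []
    for c in farthest_point_sampling(points, num_centers):
        patch = group_patch(points, points[c], k)
        tokens.append(quantize(patch_descriptor(patch, points[c]), codebook))
    return tokens


# Demo on synthetic data; the cloud and the codebook are random placeholders.
random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(64)]
codebook = [tuple(random.uniform(-0.2, 0.2) for _ in range(3)) for _ in range(16)]
tokens = tokenize(cloud, codebook)
print(tokens)
```

In this toy version each token id summarizes one local patch of geometry, so the cloud becomes a sequence whose length is set by `num_centers` rather than by the number of raw points. A learned tokenizer would replace the hand-written descriptor and the random codebook with trained components.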