TokenCom: Vision-Language Model for Multimodal and Multitask Token Communications
The paper proposes TaiChi, a Vision-Language Model (VLM) framework for multimodal token communications. It combines a dual-visual tokenizer, a Bilateral Attention Network that fuses tokens into a compact representation, and a Kolmogorov-Arnold Network (KAN)-based projector for cross-modal alignment, and it demonstrates superior performance in a joint VLM-channel coding system.
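The summary only names the three components; the toy NumPy sketch below illustrates one plausible tokenize-fuse-project pipeline. All shapes, weight names, and the RBF parameterization of the KAN edges are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # hypothetical token dimension
W_sem = rng.normal(size=(d, d)) * 0.1    # illustrative tokenizer weights
W_det = rng.normal(size=(d, d)) * 0.1

def dual_tokenize(image_feats):
    # Hypothetical dual-visual tokenizer: one branch keeps semantic
    # tokens, the other keeps detail tokens (both linear here).
    return image_feats @ W_sem, image_feats @ W_det

def bilateral_attention(a, b):
    # Cross-attention from token set a to token set b, with a residual
    # connection, as a stand-in for the Bilateral Attention fusion.
    scores = a @ b.T / np.sqrt(a.shape[-1])
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return a + attn @ b                  # fused, compact token set

def kan_project(x, coeffs, grid):
    # Minimal KAN-style projector: each output is a sum of learnable
    # univariate functions of each input, parameterized here by RBF
    # bases on a fixed grid.
    # x: (n, d_in), coeffs: (d_in, d_out, k), grid: (k,)
    basis = np.exp(-((x[..., None] - grid) ** 2))   # (n, d_in, k)
    return np.einsum('nik,iok->no', basis, coeffs)

feats = rng.normal(size=(8, d))          # 8 dummy visual feature vectors
sem, det = dual_tokenize(feats)
fused = bilateral_attention(sem, det)    # (8, 16) fused tokens
grid = np.linspace(-2.0, 2.0, 5)
coeffs = rng.normal(size=(d, 32, 5)) * 0.1
aligned = kan_project(fused, coeffs, grid)  # (8, 32) language-space tokens
```

The sketch shows only data flow and shapes; the real system would train these components end to end with the channel code.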