Crab: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Crab is a scalable, unified audio-visual scene understanding model that overcomes the negative transfer seen in conventional multi-task methods. It introduces two components: the AV-UIE v2 dataset, which includes explicit reasoning, and an Interaction-aware LoRA mechanism. Together they enable effective explicit cooperation across heterogeneous tasks.