MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering

The MEGC 2026 challenge introduces two new tasks, Micro-Expression Video Question Answering (ME-VQA) and Micro-Expression Long-Video Question Answering (ME-LVQA), to advance the analysis of facial micro-expressions by leveraging the multimodal reasoning capabilities of large vision-language models on both short and long-duration video sequences.

Xinqi Fan, Jingting Li, John See, Moi Hoon Yap, Su-Jing Wang, Adrian K. Davison · Wed, 11 Ma · cs

Image Captioning via Compact Bidirectional Architecture

This paper introduces a Compact Bidirectional Transformer model for image captioning that tightly couples left-to-right and right-to-left flows to leverage bidirectional context in parallel, achieving state-of-the-art results on the MSCOCO benchmark through sentence-level ensembling and an extended two-flow self-critical training strategy.

Zijie Song, Yuanen Zhou, Zhenzhen Hu, Daqing Liu, Huixia Ben, Richang Hong, Meng Wang · Wed, 11 Ma · cs.CL
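The "self-critical training" the summary mentions refers to the well-known SCST objective for captioning, which the paper extends to two flows. A minimal sketch of the standard single-flow version (reward values and log-probability below are illustrative, not from the paper):

```python
# Sketch of self-critical sequence training (SCST), the RL objective the
# paper's "extended two-flow self-critical training" builds on. The greedy
# decode's reward serves as a variance-reducing baseline.

def scst_loss(sample_reward, greedy_reward, sample_logprob):
    """loss = -(r(sampled) - r(greedy)) * log p(sampled).
    A sampled caption that beats the greedy baseline gets its
    log-probability pushed up; one that loses gets pushed down."""
    advantage = sample_reward - greedy_reward
    return -advantage * sample_logprob

# Sampled caption scores 0.7 (e.g. CIDEr), greedy baseline scores 0.5:
loss = scst_loss(sample_reward=0.7, greedy_reward=0.5, sample_logprob=-2.0)
print(loss)  # ≈ 0.4
```

In the bidirectional setting, the same advantage structure would apply per flow; how the two flows share the baseline is specific to the paper.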

Comparative Analysis of Patch Attack on VLM-Based Autonomous Driving Architectures

This paper introduces a systematic framework for evaluating black-box patch attacks on three vision-language model-based autonomous driving architectures in CARLA simulation, revealing severe, sustained vulnerabilities and distinct failure patterns that highlight the inadequacy of current designs against physical adversarial threats.

David Fernandez, Pedram MohajerAnsari, Amir Salarpour, Long Cheng, Abolfazl Razi, Mert D. Pesé · Wed, 11 Ma · cs

HECTOR: Hybrid Editable Compositional Object References for Video Generation

HECTOR is a novel video generation pipeline that enables fine-grained compositional control by supporting hybrid reference conditioning from static images and dynamic videos, while allowing users to explicitly specify the trajectories, locations, scales, and speeds of individual objects to synthesize coherent, high-fidelity videos.

Guofeng Zhang, Angtian Wang, Jacob Zhiyuan Fang, Liming Jiang, Haotian Yang, Alan Yuille, Chongyang Ma · Wed, 11 Ma · cs

PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments

This paper introduces PanoAffordanceNet, a novel framework and the first high-quality dataset (360-AGD) designed to enable holistic affordance grounding in 360-degree indoor environments by addressing challenges like geometric distortion and semantic dispersion through distortion-aware calibration and multi-level constraints.

Guoliang Zhu, Wanjun Jia, Caoyang Shao, Yuheng Zhang, Zhiyong Li, Kailun Yang · Wed, 11 Ma · eess

M²-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs

The paper introduces M²-Occ, a robust 3D semantic occupancy prediction framework that leverages a Multi-view Masked Reconstruction module and a Feature Memory Module to maintain geometric and semantic coherence under incomplete multi-camera inputs, significantly outperforming existing methods in scenarios with missing views.

Kaixin Lin, Kunyu Peng, Di Wen, Yufan Chen, Ruiping Liu, Kailun Yang · Wed, 11 Ma · eess

Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation

This paper introduces the first exemplar-free continual learning benchmark for Audio-Visual Segmentation (AVS) and proposes the ATLAS baseline, which utilizes audio-guided pre-fusion conditioning and Low-Rank Anchoring to effectively mitigate catastrophic forgetting in dynamic, evolving environments.

Siddeshwar Raghavan, Gautham Vinod, Bruce Coburn, Fengqing Zhu · Wed, 11 Ma · eess

VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model

The paper introduces VisionCreator-R1, a native visual-generation agent enhanced with explicit reflection mechanisms and trained via a Reflection-Plan Co-Optimization (RPCO) methodology that addresses credit assignment challenges to outperform state-of-the-art models on both single and multi-image generation benchmarks.

Jinxiang Lai, Wenzhe Zhao, Zexin Lu, Hualei Zhang, Qinyu Yang, Rongwei Quan, Zhimin Li, Shuai Shao, Song Guo, Qinglin Lu · Wed, 11 Ma · cs

Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM

Granulon is a novel multimodal large language model that leverages a DINOv3-based visual encoder enhanced with a text-conditioned granularity controller and adaptive token aggregation to dynamically unify pixel-level perception with coarse-grained semantics, significantly improving accuracy and reducing hallucinations compared to existing approaches.

Junyuan Mao, Qiankun Li, Linghao Meng, Zhicheng He, Xinliang Zhou, Kun Wang, Yang Liu, Yueming Jin · Wed, 11 Ma · cs

CycleULM: A unified label-free deep learning framework for ultrasound localisation microscopy

CycleULM is a novel, label-free deep learning framework that leverages CycleGAN to bridge the simulation-to-reality gap in ultrasound localisation microscopy, significantly enhancing microbubble localisation accuracy, image resolution, and processing speed for real-time clinical application without requiring paired ground truth data.

Su Yan, Clara Rodrigo Gonzalez, Vincent C. H. Leung, Herman Verinaz-Jadan, Jiakang Chen, Matthieu Toulemonde, Kai Riemer, Jipeng Yan, Clotilde Vié, Qingyuan Tan, Peter D. Weinberg, Pier Luigi Dragotti, Kevin G. Murphy, Meng-Xing Tang · Wed, 11 Ma · eess
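The label-free training the summary describes rests on CycleGAN's cycle-consistency objective, which lets unpaired simulated and real ultrasound frames supervise each other. A minimal sketch of that term (the linear "generators" below are toy placeholders, not CycleULM's networks):

```python
import numpy as np

# Sketch of the CycleGAN cycle-consistency term that CycleULM relies on
# to learn without paired ground truth: a frame mapped to the other
# domain and back should reconstruct the original.

def cycle_consistency_loss(x, G, F):
    """L1 reconstruction penalty ||F(G(x)) - x||_1, averaged over pixels.
    G: source -> target domain generator; F: target -> source."""
    return np.abs(F(G(x)) - x).mean()

G = lambda x: 2.0 * x + 1.0      # toy forward generator
F = lambda y: (y - 1.0) / 2.0    # toy inverse generator (exact inverse of G)
x = np.array([0.0, 1.0, 2.0])    # stand-in for a simulated frame
print(cycle_consistency_loss(x, G, F))  # 0.0 — perfect inverses
```

In the full method this term is combined with adversarial losses on both domains; only the cycle term is sketched here.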

Association of Radiologic PPFE Change with Mortality in Lung Cancer Screening Cohorts

This study demonstrates that the longitudinal progression of radiologic pleuroparenchymal fibroelastosis (PPFE), quantified via automated analysis of low-dose CT scans, independently predicts increased mortality and adverse respiratory outcomes in large lung cancer screening cohorts.

Shahab Aslani, Mehran Azimbagirad, Daryl Cheng, Daisuke Yamada, Ryoko Egashira, Adam Szmul, Justine Chan-Fook, Robert Chapman, Alfred Chung Pui So, Shanshan Wang, John McCabe, Tianqi Yang, Jose M Brenes, Eyjolfur Gudmundsson, The SUMMIT Consortium, Susan M. Astley, Daniel C. Alexander, Sam M. Janes, Joseph Jacob · Wed, 11 Ma · q-bio

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

This paper systematically diagnoses the performance gap between text and image inputs in multimodal LLMs, revealing that visual text primarily amplifies reading errors rather than reasoning failures, and proposes a self-distillation method that effectively bridges this gap by training models on their own text-based reasoning traces paired with image inputs.

Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai · Wed, 11 Ma · cs.CL

ADHint: Adaptive Hints with Difficulty Priors for Reinforcement Learning

ADHint is a novel reinforcement learning framework that enhances reasoning capabilities and generalization by integrating sample difficulty priors to adaptively schedule hint ratios and employing consistency-based gradient modulation with rollout difficulty posteriors to stabilize learning and prevent destructive imitation.

Feng Zhang, Zezhong Tan, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, Yang Yang · Wed, 11 Ma · cs.LG
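One plausible reading of "difficulty priors to adaptively schedule hint ratios" is a monotone mapping from an estimated solve rate to the fraction of the reference solution revealed as a hint. The schedule below is purely hypothetical, for illustration; the paper's actual mapping is not specified in this summary:

```python
# Hypothetical sketch in the spirit of ADHint's difficulty-prior hint
# scheduling: samples with a lower prior solve rate (harder) receive a
# larger fraction of the reference solution as a hint. The linear mapping
# and max_ratio cap are illustrative choices, not the paper's schedule.

def hint_ratio(prior_solve_rate, max_ratio=0.8):
    """Map an estimated solve rate in [0, 1] to a hint ratio:
    easy samples get no hint, the hardest get up to max_ratio."""
    return max_ratio * (1.0 - prior_solve_rate)

print(hint_ratio(1.0))  # 0.0 — always solved, no hint needed
print(hint_ratio(0.0))  # 0.8 — never solved, largest hint
```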

Exploring Single Domain Generalization of LiDAR-based Semantic Segmentation under Imperfect Labels

This paper addresses the challenge of LiDAR-based 3D semantic segmentation under noisy labels and domain shifts by introducing the DGLSS-NL task, establishing a new benchmark, and proposing DuNe, a dual-view framework that achieves state-of-the-art robustness across multiple datasets.

Weitong Kong, Zichao Zeng, Di Wen, Jiale Wei, Kunyu Peng, June Moh Goo, Jan Boehm, Rainer Stiefelhagen · Wed, 11 Ma · cs.LG