MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering

The MEGC 2026 challenge introduces two new tasks, Micro-Expression Video Question Answering (ME-VQA) and Micro-Expression Long-Video Question Answering (ME-LVQA), to advance the analysis of facial micro-expressions by leveraging the multimodal reasoning capabilities of large vision-language models on both short and long-duration video sequences.

Xinqi Fan, Jingting Li, John See, Moi Hoon Yap, Su-Jing Wang, Adrian K. Davison · Wed, 11 Ma · cs

Image Captioning via Compact Bidirectional Architecture

This paper introduces a Compact Bidirectional Transformer model for image captioning that tightly couples left-to-right and right-to-left flows to leverage bidirectional context in parallel, achieving state-of-the-art results on the MSCOCO benchmark through sentence-level ensembling and an extended two-flow self-critical training strategy.

Zijie Song, Yuanen Zhou, Zhenzhen Hu, Daqing Liu, Huixia Ben, Richang Hong, Meng Wang · Wed, 11 Ma · cs.CL
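The "self-critical training" the summary mentions refers to the well-known SCST objective for captioning, which the paper extends to two flows. A minimal sketch of the standard single-flow version (reward values and log-probability below are illustrative, not from the paper):

```python
# Sketch of self-critical sequence training (SCST), the RL objective the
# paper's "extended two-flow self-critical training" builds on. The greedy
# decode's reward serves as a variance-reducing baseline.

def scst_loss(sample_reward, greedy_reward, sample_logprob):
    """loss = -(r(sampled) - r(greedy)) * log p(sampled).
    A sampled caption that beats the greedy baseline gets its
    log-probability pushed up; one that loses gets pushed down."""
    advantage = sample_reward - greedy_reward
    return -advantage * sample_logprob

# Sampled caption scores 0.7 (e.g. CIDEr), greedy baseline scores 0.5:
loss = scst_loss(sample_reward=0.7, greedy_reward=0.5, sample_logprob=-2.0)
print(loss)  # ≈ 0.4
```

In the bidirectional setting, the same advantage structure would apply per flow; how the two flows share the baseline is specific to the paper.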

Comparative Analysis of Patch Attack on VLM-Based Autonomous Driving Architectures

This paper introduces a systematic framework for evaluating black-box patch attacks on three vision-language model-based autonomous driving architectures in CARLA simulation, revealing severe, sustained vulnerabilities and distinct failure patterns that highlight the inadequacy of current designs against physical adversarial threats.

David Fernandez, Pedram MohajerAnsari, Amir Salarpour, Long Cheng, Abolfazl Razi, Mert D. Pesé · Wed, 11 Ma · cs

HECTOR: Hybrid Editable Compositional Object References for Video Generation

HECTOR is a novel video generation pipeline that enables fine-grained compositional control by supporting hybrid reference conditioning from static images and dynamic videos, while allowing users to explicitly specify the trajectories, locations, scales, and speeds of individual objects to synthesize coherent, high-fidelity videos.

Guofeng Zhang, Angtian Wang, Jacob Zhiyuan Fang, Liming Jiang, Haotian Yang, Alan Yuille, Chongyang Ma · Wed, 11 Ma · cs

PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments

This paper introduces PanoAffordanceNet, a novel framework and the first high-quality dataset (360-AGD) designed to enable holistic affordance grounding in 360-degree indoor environments by addressing challenges like geometric distortion and semantic dispersion through distortion-aware calibration and multi-level constraints.

Guoliang Zhu, Wanjun Jia, Caoyang Shao, Yuheng Zhang, Zhiyong Li, Kailun Yang · Wed, 11 Ma · eess

M²-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs

The paper introduces M²-Occ, a robust 3D semantic occupancy prediction framework that leverages a Multi-view Masked Reconstruction module and a Feature Memory Module to maintain geometric and semantic coherence under incomplete multi-camera inputs, significantly outperforming existing methods in scenarios with missing views.

Kaixin Lin, Kunyu Peng, Di Wen, Yufan Chen, Ruiping Liu, Kailun Yang · Wed, 11 Ma · eess

Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation

This paper introduces the first exemplar-free continual learning benchmark for Audio-Visual Segmentation (AVS) and proposes the ATLAS baseline, which utilizes audio-guided pre-fusion conditioning and Low-Rank Anchoring to effectively mitigate catastrophic forgetting in dynamic, evolving environments.

Siddeshwar Raghavan, Gautham Vinod, Bruce Coburn, Fengqing Zhu · Wed, 11 Ma · eess

VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model

The paper introduces VisionCreator-R1, a native visual-generation agent enhanced with explicit reflection mechanisms and trained via a Reflection-Plan Co-Optimization (RPCO) methodology that addresses credit assignment challenges to outperform state-of-the-art models on both single and multi-image generation benchmarks.

Jinxiang Lai, Wenzhe Zhao, Zexin Lu, Hualei Zhang, Qinyu Yang, Rongwei Quan, Zhimin Li, Shuai Shao, Song Guo, Qinglin Lu · Wed, 11 Ma · cs

Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM

Granulon is a novel multimodal large language model that leverages a DINOv3-based visual encoder enhanced with a text-conditioned granularity controller and adaptive token aggregation to dynamically unify pixel-level perception with coarse-grained semantics, significantly improving accuracy and reducing hallucinations compared to existing approaches.

Junyuan Mao, Qiankun Li, Linghao Meng, Zhicheng He, Xinliang Zhou, Kun Wang, Yang Liu, Yueming Jin · Wed, 11 Ma · cs

CycleULM: A unified label-free deep learning framework for ultrasound localisation microscopy

CycleULM is a novel, label-free deep learning framework that leverages CycleGAN to bridge the simulation-to-reality gap in ultrasound localisation microscopy, significantly enhancing microbubble localisation accuracy, image resolution, and processing speed for real-time clinical application without requiring paired ground truth data.

Su Yan, Clara Rodrigo Gonzalez, Vincent C. H. Leung, Herman Verinaz-Jadan, Jiakang Chen, Matthieu Toulemonde, Kai Riemer, Jipeng Yan, Clotilde Vié, Qingyuan Tan, Peter D. Weinberg, Pier Luigi Dragotti, Kevin G. Murphy, Meng-Xing Tang · Wed, 11 Ma · eess
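The label-free training the summary describes rests on CycleGAN's cycle-consistency objective, which lets unpaired simulated and real ultrasound frames supervise each other. A minimal sketch of that term (the linear "generators" below are toy placeholders, not CycleULM's networks):

```python
import numpy as np

# Sketch of the CycleGAN cycle-consistency term that CycleULM relies on
# to learn without paired ground truth: a frame mapped to the other
# domain and back should reconstruct the original.

def cycle_consistency_loss(x, G, F):
    """L1 reconstruction penalty ||F(G(x)) - x||_1, averaged over pixels.
    G: source -> target domain generator; F: target -> source."""
    return np.abs(F(G(x)) - x).mean()

G = lambda x: 2.0 * x + 1.0      # toy forward generator
F = lambda y: (y - 1.0) / 2.0    # toy inverse generator (exact inverse of G)
x = np.array([0.0, 1.0, 2.0])    # stand-in for a simulated frame
print(cycle_consistency_loss(x, G, F))  # 0.0 — perfect inverses
```

In the full method this term is combined with adversarial losses on both domains; only the cycle term is sketched here.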

Association of Radiologic PPFE Change with Mortality in Lung Cancer Screening Cohorts

This study demonstrates that the longitudinal progression of radiologic pleuroparenchymal fibroelastosis (PPFE), quantified via automated analysis of low-dose CT scans, independently predicts increased mortality and adverse respiratory outcomes in large lung cancer screening cohorts.

Shahab Aslani, Mehran Azimbagirad, Daryl Cheng, Daisuke Yamada, Ryoko Egashira, Adam Szmul, Justine Chan-Fook, Robert Chapman, Alfred Chung Pui So, Shanshan Wang, John McCabe, Tianqi Yang, Jose M Brenes, Eyjolfur Gudmundsson, The SUMMIT Consortium, Susan M. Astley, Daniel C. Alexander, Sam M. Janes, Joseph Jacob · Wed, 11 Ma · q-bio

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

This paper systematically diagnoses the performance gap between text and image inputs in multimodal LLMs, revealing that visual text primarily amplifies reading errors rather than reasoning failures, and proposes a self-distillation method that effectively bridges this gap by training models on their own text-based reasoning traces paired with image inputs.

Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai · Wed, 11 Ma · cs.CL

ADHint: Adaptive Hints with Difficulty Priors for Reinforcement Learning

ADHint is a novel reinforcement learning framework that enhances reasoning capabilities and generalization by integrating sample difficulty priors to adaptively schedule hint ratios and employing consistency-based gradient modulation with rollout difficulty posteriors to stabilize learning and prevent destructive imitation.

Feng Zhang, Zezhong Tan, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, Yang Yang · Wed, 11 Ma · cs.LG
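One plausible reading of "difficulty priors to adaptively schedule hint ratios" is a monotone mapping from an estimated solve rate to the fraction of the reference solution revealed as a hint. The schedule below is purely hypothetical, for illustration; the paper's actual mapping is not specified in this summary:

```python
# Hypothetical sketch in the spirit of ADHint's difficulty-prior hint
# scheduling: samples with a lower prior solve rate (harder) receive a
# larger fraction of the reference solution as a hint. The linear mapping
# and max_ratio cap are illustrative choices, not the paper's schedule.

def hint_ratio(prior_solve_rate, max_ratio=0.8):
    """Map an estimated solve rate in [0, 1] to a hint ratio:
    easy samples get no hint, the hardest get up to max_ratio."""
    return max_ratio * (1.0 - prior_solve_rate)

print(hint_ratio(1.0))  # 0.0 — always solved, no hint needed
print(hint_ratio(0.0))  # 0.8 — never solved, largest hint
```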

Exploring Single Domain Generalization of LiDAR-based Semantic Segmentation under Imperfect Labels

This paper addresses the challenge of LiDAR-based 3D semantic segmentation under noisy labels and domain shifts by introducing the DGLSS-NL task, establishing a new benchmark, and proposing DuNe, a dual-view framework that achieves state-of-the-art robustness across multiple datasets.

Weitong Kong, Zichao Zeng, Di Wen, Jiale Wei, Kunyu Peng, June Moh Goo, Jan Boehm, Rainer Stiefelhagen · Wed, 11 Ma · cs.LG