cs.CV papers | Gist.Science

SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation

The paper introduces SGG-R$^{\rm 3}$, a structured reasoning framework that combines chain-of-thought-guided supervised fine-tuning with relation augmentation and a novel dual-granularity reward scheme in reinforcement learning to achieve end-to-end unbiased Scene Graph Generation with improved recall and reduced bias on long-tailed distributions.

Jiaye Feng, Qixiang Yin, Yuankun Liu, Tong Mo, Weiping Li | 2026-03-10 | cs

Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time

This paper introduces EcoG-Bench, a rigorous bilingual benchmark for egocentric co-speech grounding that reveals a significant performance gap between humans and state-of-the-art MLLMs, highlighting how multimodal interface limitations rather than reasoning deficits hinder the alignment of speech with pointing gestures in situated collaboration.

Weijie Zhou, Xuantang Xiong, Zhenlin Hu, Xiaomeng Zhu, Chaoyang Zhao, Honghui Dong, Zhengyou Zhang, Ming Tang, Jinqiao Wang | 2026-03-10 | cs

Extend Your Horizon: A Device-Agnostic Surgical Tool Tracking Framework with Multi-View Optimization for Augmented Reality

This paper presents a device-agnostic surgical tool tracking framework that fuses multiple sensing modalities within a dynamic scene graph to overcome line-of-sight occlusions and enhance the robustness of augmented reality visualization in operating rooms.

Jiaming Zhang, Mingxu Liu, Hongchao Shu, Ruixing Liang, Yihao Liu, Ojas Taskar, Amir Kheradmand, Mehran Armand, Alejandro Martin-Gomez | 2026-03-10 | cs

On the Feasibility and Opportunity of Autoregressive 3D Object Detection

The paper introduces AutoReg3D, an autoregressive 3D object detector that reformulates LiDAR-based detection as a sequence generation task using a near-to-far ordering to eliminate reliance on hand-crafted components like anchors and NMS, thereby achieving competitive performance while enabling the integration of advanced language model techniques such as reinforcement learning.

Zanming Huang, Jinsu Yoo, Sooyoung Jeon, Zhenzhen Liu, Mark Campbell, Kilian Q Weinberger, Bharath Hariharan, Wei-Lun Chao, Katie Z Luo | 2026-03-10 | cs
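
As a rough illustration of the near-to-far ordering mentioned in the AutoReg3D summary above, the sketch below sorts detected boxes by planar distance from the sensor and flattens them into one sequence. The box fields, the `<eos>` token, and both function names are hypothetical choices made for this example; the paper's actual tokenization is not described here.

```python
import math

def order_near_to_far(boxes):
    """Sort boxes by planar distance of their centre from the sensor origin."""
    return sorted(boxes, key=lambda b: math.hypot(b["x"], b["y"]))

def to_sequence(boxes):
    """Flatten ordered boxes into one token-like list (hypothetical layout)."""
    seq = []
    for b in order_near_to_far(boxes):
        seq.extend([b["label"], b["x"], b["y"], b["z"]])
    seq.append("<eos>")  # assumed end-of-sequence marker
    return seq

# Toy input: three boxes at different ranges from the origin.
boxes = [
    {"label": "car", "x": 20.0, "y": 5.0, "z": 0.0},
    {"label": "pedestrian", "x": 2.0, "y": -1.0, "z": 0.0},
    {"label": "cyclist", "x": 8.0, "y": 3.0, "z": 0.0},
]
sequence = to_sequence(boxes)
```

A fixed ordering of this kind gives a sequence model a single canonical target per scene, which is one plausible reason to prefer it over an arbitrary box order.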

TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size

TeamHOI is a decentralized framework that leverages a Transformer-based policy and a masked Adversarial Motion Prior strategy to enable a single unified policy to control scalable, physically realistic cooperative human-object interactions among any number of humanoid agents.

Stefan Lionar, Gim Hee Lee | 2026-03-10 | cs

AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models

AutoTraces is a novel autoregressive vision-language-trajectory model that leverages multimodal large language models with a specialized trajectory tokenization scheme and automated chain-of-thought reasoning to achieve state-of-the-art, long-horizon robot trajectory forecasting in human-populated environments.

Teng Wang, Yanting Lu, Ruize Wang | 2026-03-10 | cs

ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation

This paper proposes the ViSA-enhanced framework, a triple-phase collaborative architecture that leverages structured visual prompting to enable Vision-Language Models to perform direct spatial reasoning on image planes, achieving a 70.3% improvement in success rate over state-of-the-art aerial Vision-Language Navigation methods on the CityNav benchmark.

Haoyu Tong, Xiangyu Dong, Xiaoguang Ma, Haoran Zhao, Yaoming Zhou, Chenghao Lin | 2026-03-10 | cs

It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models

This paper addresses the significant challenge of analog clock reading in state-of-the-art Vision-Language Models by introducing the real-world, diverse TickTockVQA dataset and the Swap-DPO fine-tuning framework, which together substantially improve spatial-temporal reasoning and accuracy under complex visual conditions.

Jaeha Choi, Jin Won Lee, Siwoo You, Jangho Lee | 2026-03-10 | cs

Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared

This paper proposes "Missing No More," a novel dictionary-guided framework that addresses the challenge of missing infrared modality in image fusion by learning a shared convolutional dictionary to enable interpretable coefficient-domain inference and fusion, thereby avoiding uncontrolled pixel-space generation while improving perceptual quality and downstream detection performance.

Yafei Zhang, Meng Ma, Huafeng Li, Yu Liu | 2026-03-10 | cs

VSDiffusion: Taming Ill-Posed Shadow Generation via Visibility-Constrained Diffusion

VSDiffusion is a visibility-constrained two-stage diffusion framework that addresses the ill-posed nature of shadow generation by incorporating visibility priors through a shadow-gated cross-attention branch and a learned soft prior map to produce geometrically consistent and realistic cast shadows.

Jing Li, Jing Zhang | 2026-03-10 | cs

AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis

AffordGrasp is a diffusion-based framework that generates physically stable and semantically accurate human grasps by integrating affordance-aware latent representations with a dual-conditioning process to bridge the modality gap between 3D object geometry and textual interaction instructions.

Xiaofei Wu, Yi Zhang, Yumeng Liu, Yuexin Ma, Yujiao Shi, Xuming He | 2026-03-10 | cs

Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model

This paper introduces MambaDance, a novel dance generation framework that replaces Transformers with a Mamba-based diffusion model and employs a Gaussian-based beat representation to effectively capture the sequential, rhythmic, and music-synchronized nature of dance across varying sequence lengths.

Sangjune Park, Inhyeok Choi, Donghyeon Soon, Youngwoo Jeon, Kyungdon Joo | 2026-03-10 | cs

Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades

This paper proposes a two-stage cascaded framework for generating controllable videos of complex human motion. An autoregressive model first synthesizes 2D skeleton sequences from text descriptions, and a pose-conditioned diffusion model with adaptive layer fusion then renders high-fidelity videos from those skeletons. The work is supported by a new synthetic dataset designed to overcome limitations in existing benchmarks.

Ashkan Taghipour, Morteza Ghahremani, Zinuo Li, Hamid Laga, Farid Boussaid, Mohammed Bennamoun | 2026-03-10 | cs

QualiTeacher: Quality-Conditioned Pseudo-Labeling for Real-World Image Restoration

QualiTeacher introduces a novel framework for real-world image restoration that transforms imperfect pseudo-labels into conditional supervisory signals by explicitly conditioning the student model on estimated label quality, thereby enabling the learning of a quality-graded restoration manifold that avoids artifact mimicry and achieves superior generalization.

Fengyang Xiao, Jingjia Feng, Peng Hu, Dingming Zhang, Lei Xu, Guanyi Qin, Lu Li, Chunming He, Sina Farsiu | 2026-03-10 | cs

Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout

This paper presents a robust multimodal framework for the 10th ABAW Expression Recognition Challenge. It uses a dual-branch Transformer with safe cross-attention and modality dropout to dynamically fuse audio and visual data, addressing partial occlusions, missing modalities, and class imbalance, and it achieves 60.79% accuracy on the Aff-Wild2 validation set.

Jun Yu, Naixiang Zheng, Guoyuan Wang, Yunxiang Zhang, Lingsi Zhu, Jiaen Liang, Wei Huang, Shengping Liu | 2026-03-10 | cs
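
To make the modality dropout mentioned in the summary above concrete, here is a minimal sketch of the general technique: during training, one input stream is occasionally zeroed so the model learns to cope with a missing modality. The function name, the `p_drop` parameter, and the rule of never dropping both streams are assumptions made for illustration, not details taken from the paper.

```python
import random

def modality_dropout(audio, visual, p_drop=0.3, rng=random):
    """Zero out at most one modality with total probability p_drop.

    audio, visual: lists of floats standing in for feature vectors.
    Assumption: the two modalities are equally likely to be dropped,
    and both are never dropped in the same call.
    """
    r = rng.random()
    if r < p_drop / 2:
        audio = [0.0] * len(audio)    # simulate missing audio
    elif r < p_drop:
        visual = [0.0] * len(visual)  # simulate missing video
    return audio, visual

# Toy usage with a seeded generator for repeatability.
rng = random.Random(0)
a, v = modality_dropout([0.5, 1.2], [0.3, 0.9, 0.1], p_drop=0.3, rng=rng)
```

At inference time such a function would normally be bypassed, for example by calling it with `p_drop=0`.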

Speed3R: Sparse Feed-forward 3D Reconstruction Models

Speed3R is a sparse feed-forward 3D reconstruction model that overcomes the quadratic computational bottleneck of dense attention by employing a dual-branch mechanism to focus on informative tokens, achieving a 12.4x inference speedup with minimal accuracy trade-offs.

Weining Ren, Xiao Tan, Kai Han | 2026-03-10 | cs

See and Switch: Vision-Based Branching for Interactive Robot-Skill Programming

This paper introduces "See & Switch," a vision-based interactive framework for Programming by Demonstration that uses eye-in-hand images to enable reliable online conditional branching and anomaly detection in dexterous robot tasks, achieving high accuracy across diverse conditions and with novice users.

Petr Vanc, Jan Kristof Behrens, Václav Hlaváč, Karla Stepanova | 2026-03-10 | cs

ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning

ImageEdit-R1 is a novel multi-agent framework that employs reinforcement learning to coordinate specialized vision-language and generative agents, enabling dynamic, context-aware image editing that outperforms existing monolithic models and baselines in handling complex, multi-step user instructions.

Yiran Zhao, Yaoqi Ye, Xiang Liu, Michael Qizhe Shieh, Trung Bui | 2026-03-10 | cs

Enhancing Cross-View UAV Geolocalization via LVLM-Driven Relational Modeling

This paper proposes a novel plug-and-play ranking architecture that leverages Large Vision-Language Models (LVLMs) and a relational-aware loss function to explicitly model cross-view interactions, thereby significantly enhancing the accuracy and stability of UAV-to-satellite image geolocalization.

Bowen Liu, Pengyue Jia, Wanyu Wang, Derong Xu, Jiawei Cheng, Jiancheng Dong, Xiao Han, Zimo Zhao, Chao Zhang, Bowen Yu, Fangyu Hong, Xiangyu Zhao | 2026-03-10 | cs

Evaluating Generative Models via One-Dimensional Code Distributions

This paper proposes a novel evaluation framework for generative models that replaces traditional continuous feature-based metrics with training-free and no-reference metrics operating in discrete visual token space, demonstrating superior correlation with human judgments across a new large-scale benchmark called VisForm.

Zexi Jia, Pengcheng Luo, Yijia Zhong, Jinchao Zhang, Jie Zhou | 2026-03-10 | cs
