SpatialMem: Metric-Aligned Long-Horizon Video Memory for Language Grounding and QA

SpatialMem is a memory-centric system that constructs a metric-aligned 3D scaffold from casual egocentric RGB videos to enable efficient, interpretable long-horizon language grounding, retrieval, and QA by linking open-vocabulary object nodes to spatial coordinates without requiring specialized sensors.

Xinyi Zheng, Yunze Liu, Chi-Hao Wu, Fan Zhang, Hao Zheng, Wenqi Zhou, Walterio W. Mayol-Cuevas, Junxiao Shen
2026-03-09 🤖 cs.AI

SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training

This paper introduces SRA 2, a lightweight intrinsic guidance framework that accelerates diffusion transformer training and improves generation quality by aligning intermediate latent features with pre-trained VAE features via a simple projection layer, eliminating the need for external encoders or dual-model setups while incurring minimal computational overhead.
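The core mechanism described above, aligning a diffusion transformer's intermediate features to frozen pre-trained VAE features through a single projection layer, can be sketched roughly as follows. The dimensions, the cosine-similarity alignment loss, and all variable names are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_alignment_loss(student, target):
    """Negative mean cosine similarity between projected student
    features and frozen target features (one row per token)."""
    s = student / np.linalg.norm(student, axis=-1, keepdims=True)
    t = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return -np.mean(np.sum(s * t, axis=-1))

# Hypothetical sizes: 16 tokens, transformer width 32, VAE feature width 8.
n_tokens, d_model, d_vae = 16, 32, 8
h_intermediate = rng.normal(size=(n_tokens, d_model))  # diffusion transformer block output
z_vae = rng.normal(size=(n_tokens, d_vae))             # frozen pre-trained VAE features

# The "simple projection layer": one linear map, the only added parameters.
W_proj = rng.normal(size=(d_model, d_vae)) * 0.1

align_loss = cosine_alignment_loss(h_intermediate @ W_proj, z_vae)
# The full objective would combine this with the usual diffusion loss,
# e.g. total = diffusion_loss + lambda_align * align_loss (weights assumed).
```

Because the target features come from an encoder that already exists in the latent-diffusion pipeline, no external representation model or second network copy is needed, which is where the claimed low overhead comes from.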

Mengmeng Wang, Dengyang Jiang, Liuzhuozheng Li, Yucheng Lin, Guojiang Shen, Xiangjie Kong, Yong Liu, Guang Dai, Jingdong Wang
2026-03-09 💻 cs

SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning

The paper introduces SpatialReward, a reward model that leverages explicit spatial reasoning to overcome the "Attention Collapse" limitation in existing evaluators, thereby providing fine-grained, accurate signals that significantly enhance online reinforcement learning performance for image editing tasks.

Yancheng Long, Yankai Yang, Hongyang Wei, Wei Chen, Tianke Zhang, Haonan fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang
2026-03-09 💻 cs

MiDAS: A Multimodal Data Acquisition System and Dataset for Robot-Assisted Minimally Invasive Surgery

This paper introduces MiDAS, an open-source, platform-agnostic system that enables non-invasive, time-synchronized multimodal data acquisition for robot-assisted minimally invasive surgery, validates that its external sensing approach achieves gesture recognition performance comparable to proprietary telemetry, and releases the first annotated dataset for hernia repair suturing.

Keshara Weerasinghe (MD), Seyed Hamid Reza Roodabeh (MD), Andrew Hawkins (MD), Zhaomeng Zhang, Zachary Schrader, Homa Alemzadeh
2026-03-09 🤖 cs.LG

DAV-GSWT: Diffusion-Active-View Sampling for Data-Efficient Gaussian Splatting Wang Tiles

DAV-GSWT is a data-efficient framework that combines diffusion priors with active view sampling to synthesize high-fidelity Gaussian Splatting Wang Tiles from minimal input observations, enabling the generation of expansive, photorealistic landscapes without relying on densely sampled exemplar reconstructions.

Rong Fu, Jiekai Wu, Haiyun Wei, Yee Tan Jia, Yang Li, Xiaowen Ma, Wangyu Wu, Simon Fong
2026-03-09 💻 cs

UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment

UrbanAlign proposes a novel post-hoc calibration framework that aligns frozen vision-language models with human preferences for urban scene assessment by mining interpretable dimensions, extracting robust concept scores via an Observer-Debater-Judge chain, and calibrating them through locally-weighted ridge regression, achieving state-of-the-art accuracy without any model retraining.
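The final calibration step named above, locally-weighted ridge regression, is a standard technique: fit a separate ridge regressor per query point, with training samples down-weighted by their distance to the query. A minimal sketch with hypothetical concept-score inputs and an assumed Gaussian kernel; nothing here reflects the paper's implementation:

```python
import numpy as np

def locally_weighted_ridge(X, y, x_query, tau=0.5, lam=1e-2):
    """Predict y at x_query via ridge regression whose samples are
    weighted by a Gaussian kernel on distance to the query point."""
    d = np.linalg.norm(X - x_query, axis=1)
    w = np.exp(-d**2 / (2 * tau**2))            # local sample weights
    Xb = np.hstack([X, np.ones((len(X), 1))])   # add bias column
    A = Xb.T @ (w[:, None] * Xb) + lam * np.eye(Xb.shape[1])
    beta = np.linalg.solve(A, Xb.T @ (w * y))   # weighted normal equations
    return np.append(x_query, 1.0) @ beta

rng = np.random.default_rng(1)
# Hypothetical concept scores (e.g. greenery/openness/order) and noisy
# human preference ratings with an underlying linear relation.
X = rng.uniform(0, 1, size=(50, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + 0.05 * rng.normal(size=50)
pred = locally_weighted_ridge(X, y, X[0])
```

Because the calibration touches only the extracted concept scores, the underlying VLM stays frozen, which is the sense in which the method is post-hoc.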

Yecheng Zhang, Rong Zhao, Zhizhou Sha, Yong Li, Lei Wang, Ce Hou, Wen Ji, Hao Huang, Yunshan Wan, Jian Yu, Junhao Xia, Yuru Zhang, Chunlei Shi
2026-03-09 💻 cs

Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models

This paper demonstrates that affordance reasoning in Vision Foundation Models can be achieved in a zero-shot, training-free manner by fusing DINO's inherent geometric part structures with Flux's verb-conditioned interaction priors, thereby establishing geometric and interaction perception as the fundamental, composable building blocks of affordance understanding.

Qing Zhang, Xuesong Li, Jing Zhang
2026-03-09 💻 cs

StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives

StoryTailor is a zero-shot pipeline that generates temporally coherent, action-rich multi-subject visual narratives on a single RTX 4090 by synergizing Gaussian-Centered Attention, Action-Boost Singular Value Reweighting, and a Selective Forgetting Cache to simultaneously ensure action faithfulness, subject identity fidelity, and cross-frame background continuity.

Jinghao Hu, Yuhe Zhang, GuoHua Geng, Kang Li, Han Zhang
2026-03-09 💻 cs

UniVBench: Towards Unified Evaluation for Video Foundation Models

The paper introduces UniVBench, a comprehensive benchmark featuring 200 high-quality, human-created multi-shot videos and a unified agentic evaluation system (UniV-Eval) to holistically assess video foundation models across understanding, generation, editing, and reconstruction tasks, addressing the limitations of existing fragmented and task-specific evaluations.

Jianhui Wei, Xiaotian Zhang, Yichen Li, Yuan Wang, Yan Zhang, Ziyi Chen, Zhihang Tang, Wei Xu, Zuozhu Liu
2026-03-09 💻 cs

Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache

The paper introduces DPCache, a training-free acceleration framework for diffusion models that formulates sampling as a global path planning problem and utilizes dynamic programming on a path-aware cost tensor to select optimal key timesteps, thereby achieving significant speedups with minimal quality loss and even surpassing full-step baselines in certain metrics.
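A toy illustration of the "path planning" framing: given a cost for jumping between any two timesteps, dynamic programming over a (steps-used, timestep) table finds the cheapest path that visits exactly k transitions. The quadratic cost matrix and all parameters below are invented for illustration and are not the paper's path-aware cost tensor:

```python
import numpy as np

def plan_key_timesteps(cost, k):
    """Pick a path 0 = t_0 < t_1 < ... < t_k = T-1 through the timestep
    grid that minimizes summed transition cost, using k transitions."""
    T = cost.shape[0]
    dp = np.full((k + 1, T), np.inf)
    parent = np.full((k + 1, T), -1, dtype=int)
    dp[0, 0] = 0.0
    for step in range(1, k + 1):
        for j in range(1, T):
            for i in range(j):
                c = dp[step - 1, i] + cost[i, j]
                if c < dp[step, j]:
                    dp[step, j] = c
                    parent[step, j] = i
    # Backtrack from the final timestep to recover the chosen path.
    path, j = [T - 1], T - 1
    for step in range(k, 0, -1):
        j = parent[step, j]
        path.append(j)
    return path[::-1], dp[k, T - 1]

# Hypothetical cost: quadratic penalty for longer cache-reuse jumps,
# standing in for the quality loss a real cost tensor would estimate.
cost = np.array([[(j - i) ** 2 for j in range(6)] for i in range(6)], dtype=float)
path, total = plan_key_timesteps(cost, k=3)
```

In this caricature the planner spreads the k key timesteps as evenly as the cost allows; the actual method would recompute the model only at the selected timesteps and reuse cached features elsewhere.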

Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, Hao Jiang, Zhengzheng Jin, Xu Liu, Pipei Huang
2026-03-09 💻 cs

Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

This paper introduces Synthetic Visual Genome 2 (SVG2), a large-scale automated panoptic video scene graph dataset with over 636K videos, and presents TRaSER, a novel model that leverages trajectory-aligned token mechanisms to significantly outperform existing baselines in scene graph generation and downstream video question answering tasks.

Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna
2026-03-09 💻 cs

Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark

This paper introduces PanScale, a large-scale cross-scale pansharpening dataset and benchmark, alongside ScaleFormer, a novel transformer-based architecture that achieves superior generalization across varying image resolutions by reframing scale adaptation as sequence length generalization through tokenization and rotary positional encoding.
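Rotary positional encoding, which the summary credits for the sequence-length generalization, extends to unseen lengths because positions are encoded by a closed-form rotation rather than a learned per-position table. A generic NumPy sketch of that property (not ScaleFormer's actual tokenization or encoding scheme):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional encoding: rotate feature pairs by a
    position-dependent angle. No length-specific parameters, so the
    same function applies to a sequence of any length."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(n), freqs)       # shape (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.hstack([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

# Tokens from a low-resolution tile (short sequence) and a
# high-resolution tile (long sequence) use the identical encoding:
short = rope(np.ones((16, 8)))
long_ = rope(np.ones((64, 8)))
```

The rotation preserves token norms, and inner products between encoded tokens depend only on their relative offset, which is what lets attention trained at one sequence length transfer to another.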

Ke Cao, Xuanhua He, Xueheng Li, Lingting Zhu, Yingying Wang, Ao Ma, Zhanjie Zhang, Man Zhou, Chengjun Xie, Jie Zhang
2026-03-09 💻 cs