cs.CV papers | Gist.Science

Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery

This paper proposes a reparameterized Tensor Ring functional decomposition that leverages Implicit Neural Representations and a structured basis combination to overcome the high-frequency modeling limitations of traditional methods, achieving superior performance in multi-dimensional data recovery tasks such as image inpainting and point cloud reconstruction.

Yangyang Xu, Junbo Ke, You-Wei Wen, Chao Wang | 2026-03-09 | cs.AI

FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters

FastLightGen is a novel algorithm that simultaneously compresses model parameters and reduces inference steps through an optimized teacher-student distillation framework, achieving state-of-the-art efficiency and visual quality in video generation with significantly fewer resources.

Shitong Shao, Yufei Gu, Zeke Xie | 2026-03-09 | cs

VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning

This paper introduces VSearcher, a reinforcement learning-based multimodal search agent that transforms static models into capable long-horizon web browsers through an iterative data synthesis pipeline and an SFT-then-RL training strategy, achieving superior performance on the proposed MM-SearchExam benchmark.

Ruiyang Zhang, Qianguo Sun, Chao Song, Yiyan Qi, Zhedong Zheng | 2026-03-09 | cs

Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models

This paper introduces Think-as-You-See (TaYS), a unified framework that enables concurrent, streaming Chain-of-Thought reasoning for Large Vision-Language Models by decoupling visual encoding from textual reasoning, thereby outperforming traditional batch and interleaved approaches in both accuracy and latency for real-time video understanding.

Jialiang Zhang, Junlong Tong, Junyan Lin, Hao Wu, Yirong Sun, Yunpu Ma, Xiaoyu Shen | 2026-03-09 | cs

CoEditor++: Instruction-based Visual Editing via Cognitive Reasoning

CoEditor++ is a training-free, cognitively structured framework that leverages a two-stage "what-to-edit" and "how-to-edit" reasoning process with self-reflection to achieve state-of-the-art, visually consistent, and interpretable instruction-based image editing using only open-source components.

Minheng Ni, Yutao Fan, Zhengyuan Yang, Yeli Shen, Yuxiang Wei, Yaowen Zhang, Lijuan Wang, Lei Zhang, Wangmeng Zuo | 2026-03-09 | cs

RoboLayout: Differentiable 3D Scene Generation for Embodied Agents

RoboLayout is a differentiable 3D scene generation framework that extends LayoutVLM by integrating explicit reachability constraints and a local refinement stage to create semantically coherent, physically feasible indoor environments tailored to the specific capabilities of diverse embodied agents.

Ali Shamsaddinlou | 2026-03-09 | cs.AI

Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder

The paper introduces Omni-C, a single dense Transformer encoder that compresses heterogeneous modalities (text, audio, and image) into shared representations via unimodal contrastive pretraining, thereby eliminating the parameter overhead and routing complexity of Mixture-of-Expert architectures while achieving comparable performance with significantly reduced memory usage.

Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po, Pedro Porto Buarque de Gusmão | 2026-03-09 | cs.AI

Clinical-Injection Transformer with Domain-Adapted MAE for Lupus Nephritis Prognosis Prediction

This paper proposes a novel multimodal framework, the Clinical-Injection Transformer with a domain-adapted MAE, which integrates routine PAS-stained histopathology images and clinical data to achieve high-accuracy three-class prognosis prediction for pediatric lupus nephritis, addressing previous limitations in data availability and modality integration.

Yuewen Huang, Zhitao Ye, Guangnan Feng, Fudan Zheng, Xia Gao, Yutong Lu | 2026-03-09 | cs.LG

Edges Are All You Need: Robust Gait Recognition via Label-Free Structure

This paper introduces SKETCHGAIT, a robust gait recognition framework that leverages a novel label-free "SKETCH" modality to extract dense structural cues from RGB images, demonstrating that combining this edge-based representation with traditional parsing methods significantly outperforms existing silhouette- and parsing-based approaches.

Chao Zhang, Zhuang Zheng, Ruixin Li, Zhanyong Mei | 2026-03-09 | cs

Digital-Twin Losses for Lane-Compliant Trajectory Prediction at Urban Intersections

This paper presents a digital twin-driven V2X trajectory prediction framework for urban intersections that employs a novel twin loss function alongside standard MSE to enforce traffic rules, collision avoidance, and motion diversity, thereby significantly reducing safety violations while maintaining high prediction accuracy and real-time performance.

Kuo-Yi Chao, Erik Leo Haß, Melina Gegg, Jiajie Zhang, Ralph Raßhofer, Alois Christian Knoll | 2026-03-09 | cs

AutothinkRAG: Complexity-Aware Control of Retrieval-Augmented Reasoning for Image-Text Interaction

AutoThinkRAG is a complexity-aware framework for image-text interaction that improves document question answering by routing queries based on difficulty and decoupling visual interpretation from logical reasoning to achieve state-of-the-art performance with reduced inference costs.

Jiashu Yang, Chi Zhang, Abudukelimu Wuerkaixi, Xuxin Cheng, Cao Liu, Ke Zeng, Xu Jia, Xunliang Cai | 2026-03-09 | cs

Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models

This paper introduces Bias-Invariant Subnetwork Extraction (BISE), a method that identifies and isolates fair, bias-agnostic subnetworks within standard pre-trained models through pruning, enabling effective bias mitigation without retraining or additional unbiased data.

Ivan Luiz De Moura Matos, Abdel Djalil Sad Saoud, Ekaterina Iakovleva, Vito Paolo Pastore, Enzo Tartaglione | 2026-03-09 | cs.LG

Thinking with Spatial Code for Physical-World Video Reasoning

This paper introduces "Thinking with Spatial Code," a framework that converts RGB videos into explicit, temporally coherent 3D representations using a specialized spatial encoder and reinforcement learning, enabling large language models to achieve state-of-the-art performance in physical-world visual reasoning on VSI-Bench.

Jieneng Chen, Wenxin Ma, Ruisheng Yuan, Yunzhi Zhang, Jiajun Wu, Alan Yuille | 2026-03-09 | cs

From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications

This paper introduces the first coupled robustness verification framework for heatmap-based keypoint detectors that uses a mixed-integer linear program to jointly bound deviations across all keypoints, thereby providing sound and less conservative guarantees than prior decoupled methods.

Xusheng Luo, Changliu Liu | 2026-03-09 | cs.LG

DreamCAD: Scaling Multi-modal CAD Generation using Differentiable Parametric Surfaces

DreamCAD is a multi-modal generative framework that enables scalable, high-fidelity CAD generation by representing editable BReps as differentiable parametric surfaces for training on unannotated 3D meshes, while also introducing the large-scale CADCap-1M dataset to advance text-to-CAD research.

Mohammad Sadil Khan, Muhammad Usama, Rolandos Alexandros Potamias, Didier Stricker, Muhammad Zeshan Afzal, Jiankang Deng, Ismail Elezi | 2026-03-09 | cs.AI

Adversarial Batch Representation Augmentation for Batch Correction in High-Content Cellular Screening

This paper proposes Adversarial Batch Representation Augmentation (ABRA), a domain generalization framework that synthesizes worst-case bio-batch perturbations via structured uncertainty modeling and angular geometric margins to achieve state-of-the-art batch correction and generalization in high-content cellular screening without relying on additional prior knowledge.

Lei Tong, Xujing Yao, Adam Corrigan, Long Chen, Navin Rathna Kumar, Kerry Hallbrook, Jonathan Orme, Yinhai Wang, Huiyu Zhou | 2026-03-09 | cs.AI

Post Fusion Bird's Eye View Feature Stabilization for Robust Multimodal 3D Detection

This paper introduces the Post Fusion Stabilizer (PFS), a lightweight, plug-and-play module that enhances the robustness of existing camera-LiDAR fusion 3D detectors against domain shifts and sensor failures by stabilizing bird's-eye view feature statistics and adaptively correcting degraded cues without requiring architectural changes or retraining.

Trung Tien Dong, Dev Thakkar, Arman Sargolzaei, Xiaomin Lin | 2026-03-09 | cs.AI

Rethinking Concept Bottleneck Models: From Pitfalls to Solutions

This paper introduces CBM-Suite, a comprehensive framework that addresses key limitations of Concept Bottleneck Models by proposing an entropy-based metric for concept relevance, a non-linear layer to prevent bypassing the bottleneck, and a distillation strategy to close the accuracy gap with opaque models.

Merve Tapli, Quentin Bouniot, Wolfgang Stammer, Zeynep Akata, Emre Akbas | 2026-03-09 | cs

Making Reconstruction FID Predictive of Diffusion Generation FID

This paper introduces interpolated FID (iFID), a novel metric that achieves a strong correlation with diffusion generation FID by interpolating latent representations between dataset samples and their nearest neighbors, thereby overcoming the limitations of traditional reconstruction FID in predicting generative model quality.

Tongda Xu, Mingwei He, Shady Abu-Hussein, Jose Miguel Hernandez-Lobato, Haotian Zhang, Kai Zhao, Chao Zhou, Ya-Qin Zhang, Yan Wang | 2026-03-09 | cs.LG

When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On

This paper introduces Implicit Error Counting (IEC), a reference-free reinforcement learning post-training method that enumerates and weights errors to generate rewards, demonstrating superior performance over Rubrics as Rewards (RaR) in virtual try-on tasks where multiple valid outputs exist and ideal reference answers are unavailable.

Wisdom Ikezogwo, Mehmet Saygin Seyfioglu, Ranjay Krishna, Karim Bouyarmane | 2026-03-09 | cs.AI
