cs.CV papers | Gist.Science

Remote Sensing Image Classification Using Deep Ensemble Learning

This paper proposes a deep ensemble learning framework that fuses four independent CNN-ViT hybrid models to overcome the performance bottlenecks of redundant feature representations, achieving state-of-the-art accuracy on remote sensing image classification datasets while maintaining computational efficiency.

Niful Islam, Md. Rayhan Ahmed, Nur Mohammad Fahad, Salekul Islam, A. K. M. Muzahidul Islam, Saddam Mukta, Swakkhar Shatabda | 2026-03-09 | cs.AI

Cog2Gen3D: Sculpturing 3D Semantic-Geometric Cognition for 3D Generation

Cog2Gen3D is a 3D cognition-guided diffusion framework that integrates semantic and absolute geometric features into a unified latent graph to overcome scale inconsistencies and achieve physically plausible, structurally rational 3D generation.

Haonan Wang, Hanyu Zhou, Haoyue Liu, Tao Gu, Luxin Yan | 2026-03-09 | cs

VS3R: Robust Full-frame Video Stabilization via Deep 3D Reconstruction

VS3R is a novel video stabilization framework that synergizes feed-forward 3D reconstruction with generative video diffusion to achieve robust, high-fidelity full-frame stabilization by jointly estimating camera parameters and depth while effectively restoring disoccluded regions.

Muhua Zhu, Xinhao Jin, Yu Zhang, Yifei Xue, Tie Ji, Yizhen Lao | 2026-03-09 | cs

Evolving Medical Imaging Agents via Experience-driven Self-skill Discovery

The paper introduces MACRO, a self-evolving medical imaging agent that autonomously discovers and synthesizes reusable composite tools from verified execution trajectories to overcome the brittleness of static tool chains and enhance multi-step clinical decision-making across diverse domains.

Lin Fan, Pengyu Dai, Zhipeng Deng, Haolin Wang, Xun Gong, Yefeng Zheng, Yafei Ou | 2026-03-09 | cs.AI

TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis

This paper introduces TumorChain, a multimodal interleaved reasoning framework paired with the large-scale TumorCoT dataset, to enhance the traceability, accuracy, and reliability of clinical tumor analysis by integrating 3D CT imaging with step-by-step Chain-of-Thought reasoning for lesion characterization and pathology prediction.

Sijing Li, Zhongwei Qiu, Jiang Liu, Wenqiao Zhang, Tianwei Lin, Yihan Xie, Jianxiang An, Boxiang Yun, Chenglin Yang, Jun Xiao, Guangyu Guo, Jiawen Yao, Wei Liu, Yuan Gao, Ke Yan, Weiwei Cao, Zhilin Zheng, Tony C. W. Mok, Kai Cao, Yu Shi, Jiuyu Zhang, Jian Zhou, Beng Chin Ooi, Yingda Xia, Ling Zhang | 2026-03-09 | cs

PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues

The paper proposes PatchCue, a novel patch-based visual cue paradigm that enhances Vision-Language Model reasoning by aligning with human perceptual habits and leveraging patch-tokenized inputs through a two-stage training process, thereby outperforming existing pixel-level and point-based approaches across diverse benchmarks.

Yukun Qi, Pei Fu, Hang Li, Yuhan Liu, Chao Jiang, Bin Qin, Zhenbo Luo, Jian Luan | 2026-03-09 | cs

Shifting Adaptation from Weight Space to Memory Space: A Memory-Augmented Agent for Medical Image Segmentation

This paper introduces MemSeg-Agent, a memory-augmented framework that shifts adaptation from weight space to memory space to enable efficient few-shot learning, federated learning, and test-time adaptation for robust medical image segmentation without requiring model fine-tuning.

Bowen Chen, Qiaohui Gao, Shaowen Wan, Shanhui Sun, Wei Liu, Xiang Li, Tianming Liu, Lin Zhao | 2026-03-09 | cs

Systematic Evaluation of Novel View Synthesis for Video Place Recognition

This paper presents a systematic evaluation demonstrating that while small synthetic novel view additions improve Video Place Recognition (VPR) performance, the effectiveness of larger additions depends more on the quantity of views and dataset imagery type than on the magnitude of the viewpoint change.

Muhammad Zawad Mahmud, Samiha Islam, Damian Lyons | 2026-03-09 | cs

CylinderSplat: 3D Gaussian Splatting with Cylindrical Triplanes for Panoramic Novel View Synthesis

CylinderSplat is a feed-forward framework for panoramic 3D Gaussian Splatting that introduces a novel cylindrical Triplane representation and a dual-branch architecture to effectively handle occlusions and geometric distortions in $360^\circ$ scenes, achieving state-of-the-art results in both single-view and multi-view novel view synthesis.

Qiwei Wang, Xianghui Ze, Jingyi Yu, Yujiao Shi | 2026-03-09 | cs

PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

PixARMesh is a novel autoregressive method that directly reconstructs complete, high-fidelity, and artist-ready 3D indoor scene meshes from a single RGB image by jointly predicting object layout and geometry within a unified model, eliminating the need for implicit fields or post-hoc optimization.

Xiang Zhang, Sohyun Yoo, Hongrui Wu, Chuan Li, Jianwen Xie, Zhuowen Tu | 2026-03-09 | cs.LG

InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation

The paper proposes InnoAds-Composer, a single-stage framework that efficiently generates e-commerce posters by integrating subject, text, and style controls through an optimized token routing mechanism and a text enhancement module, while also introducing a new high-quality dataset and benchmark for this task.

Yuxin Qin, Ke Cao, Haowei Liu, Ao Ma, Fengheng Li, Honghe Zhu, Zheng Zhang, Run Ling, Wei Feng, Xuanhua He, Zhanjie Zhang, Zhen Guo, Haoyi Bian, Jingjing Lv, Junjie Shen, Ching Law | 2026-03-09 | cs

Mitigating Bias in Concept Bottleneck Models for Fair and Interpretable Image Classification

This paper proposes three bias mitigation techniques—top-k concept filtering, removal of biased concepts, and adversarial debiasing—to address information leakage in Concept Bottleneck Models, thereby achieving superior fairness-performance tradeoffs for interpretable image classification compared to prior work.

Schrasing Tong, Antoine Salaun, Vincent Yuan, Annabel Adeyeri, Lalana Kagal | 2026-03-09 | cs.LG

CollabOD: Collaborative Multi-Backbone with Cross-scale Vision for UAV Small Object Detection

CollabOD is a lightweight collaborative detection framework designed to enhance UAV small object detection by integrating structural detail preservation, cross-path feature alignment, and localization-aware lightweight strategies to overcome challenges like scale variation and feature degradation in high-altitude imagery.

Xuecheng Bai, Yuxiang Wang, Chuanzhi Xu, Boyu Hu, Kang Han, Ruijie Pan, Xiaowei Niu, Xiaotian Guan, Liqiang Fu, Pengfei Ye | 2026-03-09 | cs

Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D

This paper introduces Art3D, a novel framework that shifts 2D-to-3D conversion from geometric accuracy to artistic coherence by synthesizing disparities that capture professional cinematic intent through a dual-path architecture and indirect supervision.

Ping Chen, Zezhou Chen, Xingpeng Zhang, Yanlin Qian, Huan Hu, Xiang Liu, Zipeng Wang, Xin Wang, Zhaoxiang Liu, Kai Wang, Shiguo Lian | 2026-03-09 | cs

Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image

Pano3DComposer is an efficient feed-forward framework that generates high-fidelity, complete 360-degree 3D scenes from single panoramic images by decoupling object generation from layout estimation through a novel plug-and-play Object-World Transformation Predictor and a Coarse-to-Fine alignment mechanism.

Zidian Qiu, Ancong Wu | 2026-03-09 | cs

CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning

This paper introduces CORE-Seg, a reinforcement learning-driven framework that integrates a Semantic-Guided Prompt Adapter with a progressive SFT-to-GRPO training strategy to bridge the gap between visual segmentation and cognitive reasoning for complex medical lesions, achieving state-of-the-art performance on the newly proposed ComLesion-14K Chain-of-Thought benchmark.

Yuxin Xie, Yuming Chen, Yishan Yang, Yi Zhou, Tao Zhou, Zhen Zhao, Jiacheng Liu, Huazhu Fu | 2026-03-09 | cs.AI

BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

This paper introduces BlackMirror, a novel black-box, training-free framework that detects backdoored text-to-image models by identifying and verifying the stability of partial semantic deviations between instructions and generated images, overcoming the limitations of existing image-similarity-based methods against diverse backdoor attacks.

Feiran Li, Qianqian Xu, Shilong Bao, Zhiyong Yang, Xilin Zhao, Xiaochun Cao, Qingming Huang | 2026-03-09 | cs.AI

RAC: Rectified Flow Auto Coder

The paper introduces RAC (Rectified Flow Auto Coder), a novel architecture that replaces traditional VAEs by leveraging rectified flow for multi-step, bidirectional inference, thereby achieving superior reconstruction and generation quality with significantly reduced parameters and computational cost.

Sen Fang, Yalin Feng, Yanxin Zhang, Dimitris N. Metaxas | 2026-03-09 | cs.AI

Towards Driver Behavior Understanding: Weakly-Supervised Risk Perception in Driving Scenes

This paper introduces RAID, a large-scale dataset for driver risk perception research, and proposes a weakly-supervised framework that leverages driver maneuvers and responses to identify risk sources, achieving significant performance improvements over state-of-the-art methods.

Nakul Agarwal, Yi-Ting Chen, Behzad Dariush | 2026-03-09 | cs

Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation

This paper introduces TAR-ViTPose, a novel video-based human pose estimation method that enhances static Vision Transformers by employing joint-centric temporal aggregation and global restoring attention to leverage temporal coherence, thereby achieving superior accuracy and efficiency compared to existing state-of-the-art approaches.

Hongwei Fang, Jiahang Cai, Xun Wang, Wenwu Yang | 2026-03-09 | cs
