cs.CV papers | Gist.Science

Med-Evo: Test-time Self-evolution for Medical Multimodal Large Language Models

Med-Evo is a novel self-evolution framework for medical multimodal large language models that leverages label-free reinforcement learning, featuring Feature-driven Pseudo Labeling and Hard-Soft Reward mechanisms, to significantly enhance model performance on unlabeled test data without requiring additional annotated medical datasets.

Dunyuan Xu, Xikai Yang, Juzheng Miao, Yaoqian Li, Jinpeng Li, Pheng-Ann Heng | 2026-03-10 | 💻 cs

SLNet: A Super-Lightweight Geometry-Adaptive Network for 3D Point Cloud Recognition

The paper introduces SLNet, a super-lightweight 3D point cloud recognition network that uses Nonparametric Adaptive Point Embedding (NAPE) and Geometric Modulation Units (GMU) to achieve state-of-the-art accuracy on benchmarks such as ModelNet40 and ScanObjectNN with significantly fewer parameters and lower computational cost than existing models.

Mohammad Saeid, Amir Salarpour, Pedram MohajerAnsari, Mert D. Pesé | 2026-03-10 | 🤖 cs.LG

Image Generation Models: A Technical History

This paper provides a comprehensive technical survey of the history and evolution of image generation models, detailing the objectives, architectures, and limitations of various approaches from VAEs to diffusion methods, while also addressing recent advancements in video generation and the critical challenges of robustness and responsible deployment.

Rouzbeh Shirvani | 2026-03-10 | 💬 cs.CL

SIGMAE: A Spectral-Index-Guided Foundation Model for Multispectral Remote Sensing

SIGMAE is a novel foundation model for multispectral remote sensing that enhances Masked Autoencoder pretraining by incorporating domain-specific spectral indices to guide dynamic token masking toward semantically salient regions, thereby achieving superior performance across various downstream tasks compared to existing geospatial models.

Xiaokang Zhang, Bo Li, Chufeng Zhou, Weikang Yu, Lefei Zhang | 2026-03-10 | 💻 cs

Selective Transfer Learning of Cross-Modality Distillation for Monocular 3D Object Detection

This paper introduces MonoSTL, a selective transfer learning framework for monocular 3D object detection that addresses the negative transfer caused by the modality gap in cross-modality distillation. It pairs similar architectures with novel depth-aware selective distillation modules to transfer LiDAR depth information to image-based networks, achieving state-of-the-art performance on the KITTI and nuScenes benchmarks.

Rui Ding, Meng Yang, Nanning Zheng | 2026-03-10 | 💻 cs

Classifying Novel 3D-Printed Objects without Retraining: Towards Post-Production Automation in Additive Manufacturing

This paper introduces the ThingiPrint dataset and a contrastive fine-tuning approach that enables the classification of novel 3D-printed objects using their CAD models without requiring model retraining, thereby addressing a critical bottleneck in automating industrial post-production workflows.

Fanis Mathioulakis, Gorjan Radevski, Silke GC Cleuren, Michel Janssens, Brecht Das, Koen Schauwaert, Tinne Tuytelaars | 2026-03-10 | 💻 cs

FedEU: Evidential Uncertainty-Driven Federated Fine-Tuning of Vision Foundation Models for Remote Sensing Image Segmentation

FedEU is a novel federated learning framework that enhances remote sensing image segmentation by integrating evidential uncertainty quantification and client-specific feature embeddings to guide adaptive global aggregation, thereby improving model robustness and reliability across heterogeneous distributed datasets.

Xiaokang Zhang, Xuran Xiong, Jianzhong Huang, Lefei Zhang | 2026-03-10 | 💻 cs

EVLF: Early Vision-Language Fusion for Generative Dataset Distillation

This paper introduces Early Vision-Language Fusion (EVLF), a plug-and-play method that aligns textual and visual embeddings early in the diffusion process to overcome the visual dominance issues of late-stage guidance, thereby generating semantically faithful and visually coherent synthetic datasets that improve downstream classification accuracy.

Wenqi Cai, Yawen Zou, Guang Li, Chunzhi Gu, Chao Zhang | 2026-03-10 | 💻 cs

Multi-Modal Decouple and Recouple Network for Robust 3D Object Detection

This paper proposes a Multi-Modal Decouple and Recouple Network that enhances robust 3D object detection under data corruption by explicitly separating BEV features into invariant and specific components to enable cross-modal compensation, followed by an adaptive fusion of three specialized experts tailored to different corruption scenarios.

Rui Ding, Zhaonian Kuang, Yuzhe Ji, Meng Yang, Xinhu Zheng, Gang Hua | 2026-03-10 | 💻 cs

RobustSCI: Beyond Reconstruction to Restoration for Snapshot Compressive Imaging under Real-World Degradations

This paper introduces RobustSCI, a framework that shifts snapshot compressive imaging from reconstruction to robust restoration, proposing a new network architecture and a large-scale benchmark for recovering pristine scenes from real-world measurements degraded by motion blur and low light.

Hao Wang, Yuanfan Li, Qi Zhou, Zhankuo Xu, Jiong Ni, Xin Yuan | 2026-03-10 | 💻 cs

RayD3D: Distilling Depth Knowledge Along the Ray for Robust Multi-View 3D Object Detection

The paper proposes RayD3D, a novel cross-modal distillation framework that transfers depth knowledge specifically along the camera-to-object ray to filter out irrelevant LiDAR information, thereby significantly enhancing the robustness of multi-view 3D object detection models against real-world data corruptions without increasing inference costs.

Rui Ding, Zhaonian Kuang, Zongwei Zhou, Meng Yang, Xinhu Zheng, Gang Hua | 2026-03-10 | 💻 cs

DocCogito: Aligning Layout Cognition and Step-Level Grounded Reasoning for Document Understanding

DocCogito is a unified framework for document understanding that aligns global layout perception with structured, region-grounded reasoning through a lightweight layout tower and a deterministic Visual-Semantic Chain, achieving state-of-the-art performance on multiple benchmarks by enforcing systematic coupling between layout priors and evidence-based reasoning.

Yuchuan Wu, Minghan Zhuo, Teng Fu, Mengyang Zhao, Bin Li, Xiangyang Xue | 2026-03-10 | 💻 cs

AMR-CCR: Anchored Modular Retrieval for Continual Chinese Character Recognition

This paper proposes AMR-CCR, an anchored modular retrieval framework with script-conditioned injection and multi-prototype dictionaries to address the challenges of continual, class-incremental ancient Chinese character recognition, accompanied by the new EvoCON benchmark for systematic evaluation.

Yuchuan Wu, Yinglian Zhu, Haiyang Yu, Ke Niu, Bin Li, Xiangyang Xue | 2026-03-10 | 💻 cs

High-Fidelity Medical Shape Generation via Skeletal Latent Diffusion

This paper proposes a skeletal latent diffusion framework that leverages a differentiable skeletonization module and a large-scale MedSDF dataset to achieve high-fidelity, computationally efficient medical shape generation while effectively addressing challenges posed by anatomical geometric complexity and data scarcity.

Guoqing Zhang, Jingyun Yang, Siqi Chen, Anping Zhang, Yang Li | 2026-03-10 | 💻 cs

A Unified View of Drifting and Score-Based Models

This paper establishes a unified theoretical framework demonstrating that drifting models, which optimize kernel-based mean-shift discrepancies, are mathematically equivalent to score-matching objectives on kernel-smoothed distributions, thereby precisely connecting them to diffusion models and clarifying their relationship with Distribution Matching Distillation.

Chieh-Hsin Lai, Bac Nguyen, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon, Molei Tao | 2026-03-10 | 🤖 cs.LG

EvolveReason: Self-Evolving Reasoning Paradigm for Explainable Deepfake Facial Image Identification

The paper proposes EvolveReason, a self-evolving reasoning paradigm that combines a human-like chain-of-thought framework, a forgery latent-space distribution capture module, and a reinforcement learning-based self-evolution strategy to enhance the accuracy, detail, and reliability of explainable deepfake facial image identification.

Binjia Zhou, Dawei Luo, Shuai Chen, Feng Xu, Seow, Haoyuan Li, Jiachi Wang, Jiawen Wang, Zunlei Feng, Yijun Bei | 2026-03-10 | 💻 cs

SketchGraphNet: A Memory-Efficient Hybrid Graph Transformer for Large-Scale Sketch Corpora Recognition

This paper introduces SketchGraphNet, a memory-efficient hybrid graph transformer that models free-hand sketches as structured graphs to achieve state-of-the-art recognition accuracy on the newly constructed 3.44-million-sample SketchGraph benchmark while significantly reducing computational resource requirements.

Shilong Chen, Mingyuan Li, Zhaoyang Wang, Zhonglin Ye, Haixing Zhao | 2026-03-10 | 💻 cs

ACCURATE: Arbitrary-shaped Continuum Reconstruction Under Robust Adaptive Two-view Estimation

The paper proposes ACCURATE, a robust 3D reconstruction framework that combines image segmentation with geometry-constrained topology traversal and dynamic programming to achieve high-accuracy reconstruction of arbitrary-shaped, deformable continuum bodies like guidewires and catheters under biplanar X-ray imaging.

Yaozhi Zhang, Shun Yu, Yugang Zhang, Yang Liu | 2026-03-10 | 💻 cs

Scale-Aware UAV-to-Satellite Cross-View Geo-Localization: A Semantic Geometric Approach

This paper proposes a semantic geometric framework that leverages small vehicles as metric anchors within a decoupled stereoscopic projection model to recover absolute scale from monocular UAV images, thereby enabling scale-adaptive satellite image cropping and significantly improving cross-view geo-localization robustness under real-world scale ambiguity.

Yibin Ye, Shuo Chen, Kun Wang, Xiaokai Song, Jisheng Dang, Qifeng Yu, Xichao Teng, Zhang Li | 2026-03-10 | 💻 cs

How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation

This paper introduces UniLongGen, a training-free inference strategy that improves long-horizon interleaved image generation by dynamically curating context to discard accumulated visual noise, thereby overcoming the reliability collapse caused by dense visual token interference in unified multimodal models.

Haoyu Chen, Qing Liu, Yuqian Zhou, He Zhang, Zhaowen Wang, Mengwei Ren, Jingjing Ren, Xiang Wang, Zhe Lin, Lei Zhu | 2026-03-10 | 💻 cs
