cs.CV papers | Gist.Science

DocCogito: Aligning Layout Cognition and Step-Level Grounded Reasoning for Document Understanding

DocCogito is a unified framework for document understanding that aligns global layout perception with structured, region-grounded reasoning through a lightweight layout tower and a deterministic Visual-Semantic Chain, achieving state-of-the-art performance on multiple benchmarks by enforcing systematic coupling between layout priors and evidence-based reasoning.

Yuchuan Wu, Minghan Zhuo, Teng Fu, Mengyang Zhao, Bin Li, Xiangyang Xue | 2026-03-10 | cs

AMR-CCR: Anchored Modular Retrieval for Continual Chinese Character Recognition

This paper proposes AMR-CCR, an anchored modular retrieval framework with script-conditioned injection and multi-prototype dictionaries to address the challenges of continual, class-incremental ancient Chinese character recognition, accompanied by the new EvoCON benchmark for systematic evaluation.

Yuchuan Wu, Yinglian Zhu, Haiyang Yu, Ke Niu, Bin Li, Xiangyang Xue | 2026-03-10 | cs

High-Fidelity Medical Shape Generation via Skeletal Latent Diffusion

This paper proposes a skeletal latent diffusion framework that leverages a differentiable skeletonization module and a large-scale MedSDF dataset to achieve high-fidelity, computationally efficient medical shape generation while effectively addressing challenges posed by anatomical geometric complexity and data scarcity.

Guoqing Zhang, Jingyun Yang, Siqi Chen, Anping Zhang, Yang Li | 2026-03-10 | cs

A Unified View of Drifting and Score-Based Models

This paper establishes a unified theoretical framework demonstrating that drifting models, which optimize kernel-based mean-shift discrepancies, are mathematically equivalent to score-matching objectives on kernel-smoothed distributions, thereby precisely connecting them to diffusion models and clarifying their relationship with Distribution Matching Distillation.

Chieh-Hsin Lai, Bac Nguyen, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon, Molei Tao | 2026-03-10 | cs.LG

EvolveReason: Self-Evolving Reasoning Paradigm for Explainable Deepfake Facial Image Identification

The paper proposes EvolveReason, a self-evolving reasoning paradigm that combines a human-like chain-of-thought framework, a forgery latent-space distribution capture module, and a reinforcement learning-based self-evolution strategy to enhance the accuracy, detail, and reliability of explainable deepfake facial image identification.

Binjia Zhou, Dawei Luo, Shuai Chen, Feng Xu, Seow, Haoyuan Li, Jiachi Wang, Jiawen Wang, Zunlei Feng, Yijun Bei | 2026-03-10 | cs

SketchGraphNet: A Memory-Efficient Hybrid Graph Transformer for Large-Scale Sketch Corpora Recognition

This paper introduces SketchGraphNet, a memory-efficient hybrid graph transformer that models free-hand sketches as structured graphs to achieve state-of-the-art recognition accuracy on the newly constructed 3.44-million-sample SketchGraph benchmark while significantly reducing computational resource requirements.

Shilong Chen, Mingyuan Li, Zhaoyang Wang, Zhonglin Ye, Haixing Zhao | 2026-03-10 | cs

ACCURATE: Arbitrary-shaped Continuum Reconstruction Under Robust Adaptive Two-view Estimation

The paper proposes ACCURATE, a robust 3D reconstruction framework that combines image segmentation with geometry-constrained topology traversal and dynamic programming to achieve high-accuracy reconstruction of arbitrary-shaped, deformable continuum bodies like guidewires and catheters under biplanar X-ray imaging.

Yaozhi Zhang, Shun Yu, Yugang Zhang, Yang Liu | 2026-03-10 | cs

Scale-Aware UAV-to-Satellite Cross-View Geo-Localization: A Semantic Geometric Approach

This paper proposes a semantic geometric framework that leverages small vehicles as metric anchors within a decoupled stereoscopic projection model to recover absolute scale from monocular UAV images, thereby enabling scale-adaptive satellite image cropping and significantly improving cross-view geo-localization robustness under real-world scale ambiguity.

Yibin Ye, Shuo Chen, Kun Wang, Xiaokai Song, Jisheng Dang, Qifeng Yu, Xichao Teng, Zhang Li | 2026-03-10 | cs

How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation

This paper introduces UniLongGen, a training-free inference strategy that improves long-horizon interleaved image generation by dynamically curating context to discard accumulated visual noise, thereby overcoming the reliability collapse caused by dense visual token interference in unified multimodal models.

Haoyu Chen, Qing Liu, Yuqian Zhou, He Zhang, Zhaowen Wang, Mengwei Ren, Jingjing Ren, Xiang Wang, Zhe Lin, Lei Zhu | 2026-03-10 | cs

CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization

The paper introduces CONSTANT, a novel one-shot handwriting generation framework that leverages Style-Aware Quantization and a latent patch-based contrastive objective within a diffusion model to overcome existing limitations in capturing diverse writer styles and generating high-quality, realistic handwritten images across multiple languages.

Anh-Duy Le, Van-Linh Pham, Thanh-Nam Vo, Xuan Toan Mai, Tuan-Anh Tran | 2026-03-10 | cs

DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration

DreamSAC is a framework that enhances extrapolative generalization in physics simulations by combining an unsupervised symmetry exploration strategy, which actively probes conservation laws via a Hamiltonian-based curiosity bonus, with a Hamiltonian-based world model that learns invariant physical states from raw observations through a novel contrastive objective.

Jinzhou Tang, Fan Feng, Minghao Fu, Wenjun Lin, Biwei Huang, Keze Wang | 2026-03-10 | cs.LG

ReconDrive: Fast Feed-Forward 4D Gaussian Splatting for Autonomous Driving Scene Reconstruction

ReconDrive is a fast, feed-forward framework that adapts the VGGT foundation model with hybrid prediction heads and static-dynamic composition to achieve high-fidelity, scalable 4D Gaussian Splatting for autonomous driving scenes, outperforming existing feed-forward methods while matching the quality of slower optimization-based approaches.

Haibao Yu, Kuntao Xiao, Jiahang Wang, Ruiyang Hao, Yuxin Huang, Guoran Hu, Haifang Qin, Bowen Jing, Yuntian Bo, Ping Luo | 2026-03-10 | cs

Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning

This paper proposes an active inference-based framework for micro-gesture recognition that utilizes Expected Free Energy-guided temporal sampling and uncertainty-aware adaptive learning to overcome challenges like low amplitude, noise, and inter-subject variability, demonstrating significant performance improvements on the SMG dataset.

Weijia Feng, Jingyu Yang, Ruojia Zhang, Fengtao Sun, Qian Gao, Chenyang Wang, Tongtong Su, Jia Guo, Xiaobai Li, Minglai Shao | 2026-03-10 | cs

PureCC: Pure Learning for Text-to-Image Concept Customization

PureCC is a novel concept customization framework that employs a decoupled learning objective and a dual-branch training pipeline to achieve high-fidelity text-to-image personalization while effectively preserving the original model's behavior and capabilities.

Zhichao Liao, Xiaole Xian, Qingyu Li, Wenyu Qin, Meng Wang, Weicheng Xie, Siyang Song, Pingfa Feng, Long Zeng, Liang Pan | 2026-03-10 | cs

Brain-WM: Brain Glioblastoma World Model

Brain-WM is a pioneering brain glioblastoma world model that utilizes a novel Y-shaped Mixture-of-Transformers architecture to unify next-step treatment prediction and future MRI generation, effectively capturing the co-evolutionary dynamics between tumor progression and treatment response to optimize clinical outcomes.

Chenhui Wang, Boyun Zheng, Liuxin Bao, Zhihao Peng, Peter Y. M. Woo, Hongming Shan, Yixuan Yuan | 2026-03-10 | cs

SiamGM: Siamese Geometry-Aware and Motion-Guided Network for Real-Time Satellite Video Object Tracking

The paper proposes SiamGM, a real-time Siamese network for satellite video object tracking that integrates a geometry-aware Inter-Frame Graph Attention module and a motion-guided optimization strategy to address challenges such as small targets and occlusions while running at 130 FPS without added computational overhead.

Zixiao Wen, Zhen Yang, Jiawei Li, Xiantai Xiang, Guangyao Zhou, Yuxin Hu, Yuhan Liu | 2026-03-10 | cs

GRD-Net: Generative-Reconstructive-Discriminative Anomaly Detection with Region of Interest Attention Module

The paper proposes GRD-Net, a novel architecture combining a generative adversarial network with a region-of-interest attention module to improve industrial surface anomaly detection and localization by learning from normal products and synthetic defects while focusing on relevant areas, thereby reducing reliance on biased post-processing algorithms.

Niccolò Ferrari, Michele Fraccaroli, Evelina Lamma | 2026-03-10 | cs.LG

Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance

This paper proposes an efficient multi-task RGB-D scene understanding model that integrates an enhanced fusion encoder, specialized feature interaction layers, and a dynamic adaptive loss function to simultaneously perform semantic, instance, and panoptic segmentation, orientation estimation, and scene classification with improved accuracy and speed across multiple datasets.

Guodong Sun, Junjie Liu, Gaoyang Zhang, Bo Wu, Yang Zhang | 2026-03-10 | cs

A Systematic Comparison of Training Objectives for Out-of-Distribution Detection in Image Classification

This paper systematically evaluates four training objectives—Cross-Entropy, Prototype, Triplet, and Average Precision Losses—for out-of-distribution detection in image classification, revealing that while they achieve comparable in-distribution accuracy, Cross-Entropy Loss delivers the most consistent performance across both near- and far-OOD scenarios under standardized protocols.

Furkan Genç, Onat Özdemir, Emre Akbas | 2026-03-10 | cs.LG

Integration of deep generative Anomaly Detection algorithm in high-speed industrial line

This paper presents a semi-supervised deep generative anomaly detection framework, utilizing a residual autoencoder with a dense bottleneck, that achieves high-accuracy, real-time defect detection and localization on high-speed pharmaceutical Blow-Fill-Seal production lines while operating within strict 500 ms timing constraints.

Niccolò Ferrari, Nicola Zanarini, Michele Fraccaroli, Alice Bizzarri, Evelina Lamma | 2026-03-10 | cs.LG
