cs.CV papers | Gist.Science

LaVCa: LLM-assisted Visual Cortex Captioning

The paper proposes LaVCa, a novel data-driven approach that leverages large language models to generate natural-language captions for images, thereby providing more accurate and detailed interpretations of human visual cortex voxel selectivity and revealing fine-grained functional differentiation within the visual cortex compared to existing deep neural network-based methods.

Takuya Matsuyama, Shinji Nishimoto, Yu Takagi2026-03-10🤖 cs.LG

Subclass Classification of Gliomas Using MRI Fusion Technique

This study proposes a high-accuracy glioma subclass classification framework that fuses 2D and 3D UNET-segmented multimodal MRI images using weighted averaging and classifies them via a pre-trained ResNet50 model, achieving a 99.25% accuracy rate.

Kiranmayee Janardhan, Christy Bobby Thomas2026-03-10💻 cs

A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning

This paper proposes Leave-One-Out PPO (LOOP), a novel reinforcement learning method that combines variance reduction techniques from REINFORCE with the robustness of PPO to achieve a superior balance between sample efficiency and performance in fine-tuning text-to-image diffusion models.

Shashank Gupta, Chaitanya Ahuja, Tsung-Yu Lin + 4 more2026-03-10🤖 cs.AI

Enhancing Alzheimer's Diagnosis: Leveraging Anatomical Landmarks in Graph Convolutional Neural Networks on Tetrahedral Meshes

This paper proposes a novel transformer-based geometric deep learning model that tokenizes tetrahedral meshes with anatomical landmarks to accurately classify Alzheimer's disease and predict brain amyloid positivity in medium-risk individuals, offering a robust alternative to costly and invasive PET scans.

Yanxi Chen, Mohammad Farazi, Zhangsihao Yang, Yonghui Fan, Nicholas Ashton, Eric M Reiman, Yi Su, Yalin Wang2026-03-10💻 cs

From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction

This paper proposes a unified framework for occlusion-robust two-hand reconstruction that combines a fusion-alignment encoder to implicitly integrate heterogeneous 2D structural priors from vision foundation models with a penetration-free diffusion model that guides 3D pose generation toward collision-free, kinematically coherent interactions.

Gaoge Han, Yongkang Cheng, Zhe Chen, Shaoli Huang, Tongliang Liu2026-03-10💻 cs

DeepSparse: A Foundation Model for Sparse-View CBCT Reconstruction

The paper introduces DeepSparse, a foundation model for sparse-view CBCT reconstruction that utilizes a novel DiCE network and a HyViP pretraining framework to achieve superior image quality with reduced radiation exposure while overcoming the computational and generalization limitations of existing methods.

Yiqun Lin, Jixiang Chen, Hualiang Wang, Jiewen Yang, Jiarong Guo, Yi Zhang, Xiaomeng Li2026-03-10💻 cs

Deep Unrolled Meta-Learning for Multi-Coil and Multi-Modality MRI with Adaptive Optimization

This paper proposes a unified deep meta-learning framework that unrolls an adaptive optimization algorithm to jointly accelerate multi-coil MRI reconstruction and cross-modality synthesis, demonstrating superior generalization and image quality under aggressive undersampling and domain shifts compared to conventional supervised methods.

Merham Fouladvand, Peuroly Batra2026-03-10🔢 math

EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

To address the data scarcity in dexterous manipulation imitation learning, this paper introduces EgoDex, the largest and most diverse dataset of its kind featuring 829 hours of Apple Vision Pro-captured egocentric videos with precise, native 3D hand and finger tracking, alongside established benchmarks for training and evaluating manipulation policies.

Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, Jian Zhang2026-03-10🤖 cs.LG

Vid2World: Crafting Video Diffusion Models to Interactive World Models

Vid2World is a general framework that transforms pre-trained video diffusion models into interactive world models by implementing causalization techniques and a causal action guidance mechanism to enable high-fidelity, controllable, and autoregressive future prediction across diverse domains.

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long2026-03-10🤖 cs.LG

Generative Prior-Guided Neural Interface Reconstruction for 3D Electrical Impedance Tomography

This paper introduces a "solver-in-the-loop" framework for 3D Electrical Impedance Tomography that combines a pre-trained 3D generative prior with a rigorous boundary integral equation solver to enforce physical constraints as hard conditions, thereby achieving superior geometric accuracy and data efficiency in reconstructing complex interfaces compared to traditional optimization and deep learning methods.

Haibo Liu, Junqing Chen, Guang Lin2026-03-10🔢 math

ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

The paper introduces ViTaPEs, a transformer-based architecture that employs a novel two-stage positional encoding strategy to effectively fuse visual and tactile modalities, achieving state-of-the-art performance and zero-shot generalization across diverse recognition and robotic grasping tasks without relying on pre-trained vision-language models.

Fotios Lygerakis, Ozan Özdenizci, Elmar Rückert2026-03-10🤖 cs.LG

From Semantic To Instance: A Semi-Self-Supervised Learning Approach

This paper proposes a semi-self-supervised learning approach featuring a novel GLMask representation and a semantic-to-instance pipeline that achieves state-of-the-art instance segmentation performance with minimal manual annotation, demonstrating superior results on both dense agricultural wheat head images and the general-purpose COCO dataset.

Keyhan Najafian, Farhad Maleki, Lingling Jin, Ian Stavness2026-03-10🤖 cs.LG

Transforming H&E images into IHC: A Variance-Penalized GAN for Precision Oncology

This study introduces a variance-penalized GAN based on pyramid pix2pix that generates high-fidelity HER2-specific immunohistochemistry (IHC) images from routine hematoxylin and eosin (H&E) slides, effectively mitigating mode collapse and outperforming baseline models to enable cost-effective, scalable precision oncology diagnostics.

Sara Rehmat, Hafeez Ur Rehman, Byeong-Gwon Kang, Sarra Ayouni, Yunyoung Nam2026-03-10💻 cs

Adopting a human developmental visual diet yields robust, shape-based AI vision

By implementing a novel "developmental visual diet" inspired by human visual maturation, this study demonstrates that guiding AI learning processes rather than simply scaling data yields models with superior shape-based recognition, robustness to distortions, and alignment with human vision.

Zejin Lu, Sushrut Thorat, Radoslaw M Cichy, Tim C Kietzmann2026-03-10🤖 cs.LG

TransUNet-GradCAM: A Hybrid Transformer-U-Net with Self-Attention and Explainable Visualizations for Foot Ulcer Segmentation

This paper presents TransUNet-GradCAM, a hybrid Vision Transformer-U-Net model that effectively segments diabetic foot ulcers by combining global attention with local feature extraction, achieving high accuracy on internal and external datasets while providing explainable visualizations for clinical utility.

Akwasi Asare, Mary Sagoe, Justice Williams Asare, Stephen Edward Moore2026-03-10💻 cs

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

This paper introduces IAG, the first input-aware backdoor attack on vision-language models for visual grounding, which utilizes a text-conditioned UNet to dynamically generate imperceptible, target-specific triggers that achieve high attack success rates across various models and datasets while maintaining stealth and robustness against defenses.

Junxian Li, Beining Xu, Simin Chen, Jiatong Li, Jingdi Lei, Haodong Zhao, Di Zhang2026-03-10💬 cs.CL

Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles

This study presents a novel computer vision framework that utilizes external observation techniques, including YOLO-based object detection and lane monitoring, to classify distracted and impaired driver behaviors in non-connected vehicles without relying on inter-vehicular communication.

Ian Nell, Shane Gilroy2026-03-10⚡ eess

UltraUPConvNet: A UPerNet- and ConvNeXt-Based Multi-Task Network for Ultrasound Tissue Segmentation and Disease Prediction

UltraUPConvNet is a computationally efficient, multi-task framework based on UPerNet and ConvNeXt that simultaneously performs ultrasound tissue segmentation and disease prediction, achieving state-of-the-art performance on a large-scale dataset with reduced computational overhead.

Zhi Chen, Le Zhang2026-03-10💻 cs

MICA: Multi-Agent Industrial Coordination Assistant

This paper introduces MICA, a privacy-preserving, speech-interactive multi-agent system that leverages Adaptive Step Fusion and a safety-audited coordination topology to deliver robust, real-time industrial assistance for assembly and maintenance tasks on resource-constrained hardware.

Di Wen, Kunyu Peng, Junwei Zheng, Yufan Chen, Yitian Shi, Jiale Wei, Ruiping Liu, Kailun Yang, Rainer Stiefelhagen2026-03-10🤖 cs.LG

ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models

This paper introduces the ORIC framework and benchmark to evaluate and improve Large Vision-Language Models' object recognition capabilities under contextual incongruity, demonstrating that such scenarios significantly degrade performance and that targeted Visual Reinforcement Fine-Tuning can effectively mitigate these failures.

Zhaoyang Li, Zhan Ling, Yuchen Zhou, Litian Gong, Erdem Bıyık, Hao Su2026-03-10🤖 cs.LG

← Previous Next →