Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models
This paper introduces HarmonicEval, a reference-free metric that evaluates vision-language model outputs by scoring each criterion individually and aggregating the criterion-wise scores, yielding closer alignment with human judgments across diverse multi-modal tasks. The metric is validated on the newly constructed MMHE benchmark, which contains 18,000 expert human evaluations.