cs.CV papers | Gist.Science

NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction

NOVA3R is a feed-forward approach for 3D reconstruction from unposed images that employs a scene-token mechanism and diffusion-based decoder to learn a global, view-agnostic representation, thereby overcoming pixel-aligned limitations to produce complete and physically plausible amodal reconstructions.

Weirong Chen, Chuanxia Zheng, Ganlin Zhang + 2 more | 2026-03-06 | cs

A Unified Framework for Joint Detection of Lacunes and Enlarged Perivascular Spaces

This paper proposes a unified, morphology-decoupled framework that leverages cross-task attention and mixed-supervision losses to overcome feature interference and class imbalance, achieving state-of-the-art joint detection of lacunes and enlarged perivascular spaces on both the VALDO 2021 and external EPAD cohorts.

Lucas He, Krinos Li, Hanyuan Zhang + 7 more | 2026-03-06 | cs

Gaussian Wardrobe: Compositional 3D Gaussian Avatars for Free-Form Virtual Try-On

The paper introduces "Gaussian Wardrobe," a framework that uses a compositional 3D Gaussian representation to disentangle shape-agnostic garment layers from human bodies, enabling high-fidelity photorealistic avatars and virtual try-on with free-form clothing transfer.

Zhiyi Chen, Hsuan-I Ho, Tianjian Jiang + 3 more | 2026-03-06 | cs

Lost in Translation: How Language Re-Aligns Vision for Cross-Species Pathology

This study demonstrates that introducing "Semantic Anchoring," a text-alignment mechanism, effectively resolves intrinsic embedding collapse and domain-locking in cross-species pathology models by using language as a stable coordinate system to re-align visual features, thereby significantly improving cancer detection performance across same-cancer, cross-cancer, and cross-species scenarios.

Ekansh Arora | 2026-03-06 | cs

The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning

This paper introduces "Dual Tuning," a framework that quantifies the performance gains of reasoning versus direct answering to establish a "Thinking Boundary," thereby challenging the universal application of reasoning and providing data-driven guidance for resource-efficient, adaptive multimodal model training.

Ruobing Zheng, Tianqi Li, Jianing Li + 3 more | 2026-03-06 | cs

SkillNet: Create, Evaluate, and Connect AI Skills

SkillNet is an open infrastructure that addresses the lack of systematic skill accumulation in AI agents by providing a unified ontology, a repository of over 200,000 skills, and evaluation tools to create, connect, and assess skills, thereby significantly enhancing agent performance and efficiency across diverse tasks.

Yuan Liang, Ruobin Zhong, Haoming Xu + 46 more | 2026-03-06 | Author reviewed | cs

Recognition of Daily Activities through Multi-Modal Deep Learning: A Video, Pose, and Object-Aware Approach for Ambient Assisted Living

This paper proposes a multi-modal deep learning framework that fuses 3D CNN-based video features, Graph Convolutional Network-analyzed pose data, and object detection context via cross-attention to robustly recognize daily activities for Ambient Assisted Living, achieving competitive accuracy on the Toyota SmartHome dataset.

Kooshan Hashemifard, Pau Climent-Pérez, Francisco Florez-Revuelta | 2026-03-06 | cs

InverseNet: Benchmarking Operator Mismatch and Calibration Across Compressive Imaging Modalities

This paper introduces InverseNet, the first cross-modality benchmark demonstrating that operator mismatch severely degrades deep learning performance in compressive imaging, while revealing that operator-conditioned architectures and blind calibration can effectively recover these losses across simulated and real-world scenarios.

Chengshuai Yang, Xin Yuan | 2026-03-06 | cs

Fusion and Grouping Strategies in Deep Learning for Local Climate Zone Classification of Multimodal Remote Sensing Data

This study evaluates various deep learning fusion and grouping strategies for classifying Local Climate Zones using multimodal SAR and MSI data, demonstrating that a baseline hybrid fusion model combined with band grouping and label merging achieves the highest accuracy (76.6%) while significantly improving predictions for underrepresented classes.

Ancymol Thomas, Jaya Sreevalsan-Nair | 2026-03-06 | cs

Structure-Guided Histopathology Synthesis via Dual-LoRA Diffusion

The paper proposes Dual-LoRA Controllable Diffusion, a unified framework that leverages multi-class nuclei centroids as spatial priors and task-specific LoRA adapters to simultaneously achieve high-fidelity local structure completion and realistic global tissue synthesis, significantly outperforming existing GAN and diffusion baselines in histopathology modeling.

Xuan Xu, Prateek Prasanna | 2026-03-06 | cs

Mask-aware inference with State-Space Models

This paper introduces Partial Vision Mamba (PVM), a novel architectural component that adapts State Space Models like Mamba to handle arbitrarily shaped invalid data through mask-aware operations, demonstrating its effectiveness across depth completion, image inpainting, and classification tasks.

Ignasi Mas, Ramon Morros, Javier-Ruiz Hidalgo + 1 more | 2026-03-06 | cs

PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing

The paper introduces PinPoint, a comprehensive real-world benchmark for Composed Image Retrieval featuring multi-answer ground truths, explicit hard negatives, and multi-image queries to reveal significant limitations in current methods, alongside proposing a training-free MLLM-based reranking solution to address these gaps.

Rohan Mahadev, Joyce Yuan, Patrick Poirson + 3 more | 2026-03-06 | cs

SGR3 Model: Scene Graph Retrieval-Reasoning Model in 3D

The paper introduces SGR3, a training-free 3D scene graph generation framework that leverages multi-modal large language models and retrieval-augmented generation with a weighted patch-level similarity mechanism to achieve competitive performance without requiring explicit 3D reconstruction or multi-modal data.

Zirui Wang, Ruiping Liu, Yufan Chen + 7 more | 2026-03-06 | cs

Spinverse: Differentiable Physics for Permeability-Aware Microstructure Reconstruction from Diffusion MRI

Spinverse is a differentiable physics framework that reconstructs explicit microstructural interfaces from diffusion MRI by optimizing learnable face permeabilities on a fixed tetrahedral grid, utilizing geometric priors and multi-sequence optimization to overcome ill-posedness and recover complex tissue geometries without altering mesh connectivity.

Prathamesh Pradeep Khole, Mario M. Brenes, Zahra Kais Petiwala + 5 more | 2026-03-06 | cs

Using Vision + Language Models to Predict Item Difficulty

This study demonstrates that a multimodal approach combining vision and language models (GPT-4.1-nano) to analyze both visualization images and text features significantly outperforms unimodal methods in predicting the difficulty of data literacy test items for U.S. adults, achieving a mean absolute error of 0.224.

Samin Khan | 2026-03-06 | cs

sFRC for assessing hallucinations in medical image restoration

This paper proposes sFRC, a novel method that performs Fourier Ring Correlation analysis over small patches to robustly detect and quantify hallucinations in deep learning-based medical image restoration across various undersampled imaging problems.

Prabhat Kc, Rongping Zeng, Nirmal Soni + 1 more | 2026-03-06 | physics

Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

This paper introduces PulseFocus, a training-free inference-time method that mitigates diffuse attention patterns and positional biases in reasoning VLMs by structuring chain-of-thought generation into interleaved planning and focus blocks, thereby significantly improving performance on multi-image benchmarks.

Chenjun Li | 2026-03-06 | cs

A Benchmark Study of Neural Network Compression Methods for Hyperspectral Image Classification

This paper presents a systematic benchmark study evaluating the effectiveness of pruning, quantization, and knowledge distillation in compressing neural networks for hyperspectral image classification, demonstrating that these methods can significantly reduce model size and computational costs while maintaining competitive accuracy for resource-constrained remote sensing applications.

Sai Shi | 2026-03-06 | cs

Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

This paper evaluates the viability of zero-shot Multimodal LLMs for real-world video anomaly detection, revealing that while prompt engineering can significantly improve F1-scores, a persistent conservative bias toward the "normal" class severely limits recall, highlighting a critical gap between current MLLM capabilities and the demands of practical surveillance.

Shanle Yao, Armin Danesh Pazho, Narges Rashvand + 1 more | 2026-03-06 | cs

FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation

This paper introduces FOZO, a memory-efficient, backpropagation-free test-time adaptation method that utilizes zeroth-order prompt optimization with dynamically decaying perturbations to achieve superior performance on resource-constrained devices and quantized models compared to existing gradient-based and forward-only approaches.

Xingyu Wang, Tao Wang | 2026-03-06 | cs
