cs.CV papers | Gist.Science

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

NoRD is a data-efficient Vision-Language-Action model for autonomous driving that achieves competitive performance on the Waymo and NAVSIM benchmarks using less than 60% of the training data and no reasoning annotations. It addresses the difficulty bias in standard Group Relative Policy Optimization (GRPO) through the Dr. GRPO algorithm.

Ishaan Rawal, Shubh Gupta, Yihan Hu + 1 more | 2026-02-27 | 🤖 cs.AI

Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization

The paper proposes Durian, a difficulty-aware group normalization method that re-groups multimodal samples by perceptual complexity and reasoning uncertainty to stabilize reward normalization and enhance reasoning performance in multimodal large language models.

Jinghan Li, Junfeng Fang, Jinda Lu + 5 more | 2026-02-27 | 💻 cs

EndoDDC: Learning Sparse to Dense Reconstruction for Endoscopic Robotic Navigation via Diffusion Depth Completion

The paper proposes EndoDDC, a novel diffusion-based framework that integrates image features with sparse depth and gradient information to achieve robust and accurate dense depth reconstruction for endoscopic robotic navigation, effectively overcoming challenges like weak textures and light reflections.

Yinheng Lin, Yiming Huang, Beilei Cui + 4 more | 2026-02-27 | 💻 cs

CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation

CoLoGen is a unified diffusion framework that resolves the representational conflict between conceptual understanding and spatial localization in conditional image generation by employing a progressive learning curriculum and a novel Progressive Representation Weaving module to dynamically integrate specialized expert features.

YuXin Song, Yu Lu, Haoyuan Sun + 6 more | 2026-02-27 | 💻 cs

Solaris: Building a Multiplayer Video World Model in Minecraft

The paper introduces Solaris, a multiplayer video world model for Minecraft that leverages a novel automated data collection system and a staged training pipeline to overcome the limitations of single-agent models by simulating consistent multi-view observations and complex multi-agent interactions.

Georgy Savva, Oscar Michel, Daohan Lu + 6 more | 2026-02-27 | 💻 cs

Adaptive Prefiltering for High-Dimensional Similarity Search: A Frequency-Aware Approach

This paper proposes an adaptive prefiltering framework for high-dimensional similarity search that dynamically allocates computational budgets based on query frequency patterns and cluster coherence, achieving equivalent recall with 20.4% fewer distance computations than static methods while maintaining sub-millisecond latency.

Teodor-Ioan Calin | 2026-02-27 | 💻 cs

CrossLLM-Mamba: Multimodal State Space Fusion of LLMs for RNA Interaction Prediction

CrossLLM-Mamba is a novel, scalable framework that leverages bidirectional Mamba encoders to model RNA interaction prediction as a dynamic state-space alignment problem, achieving state-of-the-art performance across RNA-protein, RNA-small molecule, and RNA-RNA tasks by capturing context-dependent molecular binding more effectively than static fusion methods.

Rabeya Tus Sadia, Qiang Ye, Qiang Cheng | 2026-02-27 | 🧬 q-bio

Enabling clinical use of foundation models in histopathology

This paper demonstrates that introducing novel robustness losses during the training of downstream task-specific models effectively mitigates technical biases in histopathology foundation models, thereby enhancing both prediction accuracy and clinical applicability without requiring the retraining of the foundation models themselves.

Audun L. Henriksen, Ole-Johan Skrede, Lisa van der Schee + 31 more | 2026-02-27 | 🤖 cs.AI

Optimizing Neural Network Architecture for Medical Image Segmentation Using Monte Carlo Tree Search

This paper introduces MNAS-Unet, a novel medical image segmentation framework that leverages Monte Carlo Tree Search to dynamically optimize network architecture, achieving superior accuracy on multiple datasets while significantly reducing search costs and model size compared to state-of-the-art methods.

Liping Meng, Fan Nie, Yunyun Zhang + 1 more | 2026-02-27 | 💻 cs

AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction

This paper presents AeroDGS, a physics-guided 4D Gaussian splatting framework that leverages a monocular geometry lifting module and physics-based optimization priors to achieve robust, high-fidelity dynamic reconstruction from single-view aerial UAV videos, addressing the inherent depth ambiguity and motion instability of such scenarios.

Hanyang Liu, Rongjun Qin | 2026-02-27 | 🤖 cs.AI

Enhancing Renal Tumor Malignancy Prediction: Deep Learning with Automatic 3D CT Organ Focused Attention

This study introduces a deep learning framework utilizing an Organ Focused Attention (OFA) loss function to accurately predict renal tumor malignancy from 3D CT images without requiring labor-intensive manual segmentation, achieving performance that surpasses conventional segmentation-based models on both private and public datasets.

Zhengkang Fan, Chengkun Sun, Russell Terry + 2 more | 2026-02-27 | 🤖 cs.AI

Vision Transformers Need More Than Registers

This paper identifies that artifacts in Vision Transformers stem from a "lazy aggregation" behavior where the model relies on irrelevant background patches as shortcuts for global semantics, and proposes a solution that selectively integrates patch features into the CLS token to mitigate this issue and improve performance across diverse supervision paradigms.

Cheng Shi, Yizhou Yu, Sibei Yang | 2026-02-27 | 💻 cs

MolFM-Lite: Multi-Modal Molecular Property Prediction with Conformer Ensemble Attention and Cross-Modal Fusion

MolFM-Lite is a multi-modal machine learning model that improves molecular property prediction by jointly encoding 1D sequences, 2D graphs, and 3D conformer ensembles through cross-attention fusion and FiLM conditioning, achieving significant performance gains over single-modality baselines on MoleculeNet benchmarks.

Syed Omer Shah, Mohammed Maqsood Ahmed, Danish Mohiuddin Mohammed + 2 more | 2026-02-27 | 🤖 cs.LG

SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read

This paper introduces SimpleOCR, a plug-and-play training strategy that renders text queries directly onto images to force Multimodal Large Language Models to overcome "modality laziness" and genuinely read visual text, achieving significant performance gains on out-of-distribution benchmarks with extreme data efficiency.

Yibo Peng, Peng Xia, Ding Zhong + 6 more | 2026-02-27 | 🤖 cs.LG

Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge

This paper demonstrates the feasibility of deploying privacy-preserving, real-time episodic memory question answering on edge devices by utilizing a two-threaded pipeline with Multimodal Large Language Models, achieving competitive accuracy and low latency compared to cloud-based solutions.

Giuseppe Lando, Rosario Forte, Antonino Furnari | 2026-02-27 | 💻 cs

MammoWise: Multi-Model Local RAG Pipeline for Mammography Report Generation

MammoWise is a practical, local, multi-model pipeline that leverages open-source Vision Language Models enhanced with few-shot prompting and Retrieval Augmented Generation to generate high-quality mammography reports and perform accurate clinical classifications while ensuring privacy and reproducibility.

Raiyan Jahangir, Nafiz Imtiaz Khan, Amritanand Sudheerkumar + 1 more | 2026-02-27 | 💻 cs

Space Syntax-guided Post-training for Residential Floor Plan Generation

This paper proposes Space Syntax-guided Post-training (SSPT), a framework that integrates architectural theory into residential floor plan generation via a non-differentiable oracle and reinforcement learning to enhance public space dominance and functional hierarchy while outperforming distribution-fitted baselines in efficiency and stability.

Zhuoyang Jiang, Dongqing Zhang | 2026-02-27 | 🤖 cs.LG

Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning

Pix2Key is a novel composed image retrieval framework that utilizes semantic decomposition and self-supervised visual dictionary learning to represent queries and candidates as open-vocabulary dictionaries, thereby achieving superior intent-aware matching and diversity-aware reranking without relying on supervised triplets.

Guoyizhe Wei, Yang Jiao, Nan Xi + 4 more | 2026-02-27 | 💻 cs

HARU-Net: Hybrid Attention Residual U-Net for Edge-Preserving Denoising in Cone-Beam Computed Tomography

This paper introduces HARU-Net, a novel Hybrid Attention Residual U-Net architecture that integrates hybrid attention transformers and residual learning to effectively denoise low-dose Cone-Beam Computed Tomography (CBCT) images while preserving critical anatomical edges, outperforming state-of-the-art methods in both image quality metrics and computational efficiency.

Khuram Naveed, Ruben Pauwels | 2026-02-27 | ⚡ eess

DisQ-HNet: A Disentangled Quantized Half-UNet for Interpretable Multimodal Image Synthesis Applications to Tau-PET Synthesis from T1 and FLAIR MRI

DisQ-HNet is a novel, interpretable framework that synthesizes tau-PET images from T1 and FLAIR MRI by employing a Partial Information Decomposition-guided vector-quantized encoder and a Half-UNet decoder to disentangle modality contributions while preserving anatomical details and disease-relevant signals for Alzheimer's disease analysis.

Agamdeep S. Chopra, Caitlin Neher, Tianyi Ren + 2 more | 2026-02-27 | 🤖 cs.AI
