cs.CV papers | Gist.Science

Pareto-Guided Optimization for Uncertainty-Aware Medical Image Segmentation

This paper proposes a Pareto-guided optimization framework for medical image segmentation that employs a region-wise curriculum strategy and a fuzzy labeling mechanism to prioritize learning from high-certainty regions before uncertain ones, thereby stabilizing gradients and guiding the model toward Pareto-optimal solutions that outperform traditional methods in handling boundary ambiguity.

Jinming Zhang, Youpeng Yang, Xi Yang + 5 more | 2026-02-25 | cs

DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning

The paper proposes DVLA-RL, a novel few-shot learning framework that leverages reinforcement learning gating to dynamically integrate progressive dual-level vision-language alignments—ranging from fine-grained attributes to holistic descriptions generated by large language models—thereby achieving state-of-the-art performance across diverse benchmarks.

Wenhao Li, Xianjing Meng, Qiangchang Wang + 3 more | 2026-02-25 | cs

All-Optical Segmentation via Diffractive Neural Networks for Autonomous Driving

This paper proposes a novel all-optical computing framework using diffractive neural networks to perform energy-efficient, real-time semantic segmentation and lane detection for autonomous driving, demonstrating its effectiveness on the Cityscapes dataset and in diverse simulated driving scenarios.

Yingjie Li, Daniel Robinson, Weilu Gao + 1 more | 2026-02-25 | cs

GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing

GOT-Edit introduces an online cross-modality model editing framework that integrates 3D geometric cues into generic object tracking by leveraging a pre-trained Visual Geometry Grounded Transformer and null-space constraints, thereby significantly enhancing robustness against occlusion and distractors while preserving semantic discrimination.

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo + 1 more | 2026-02-25 | eess

UI-Venus-1.5 Technical Report

The paper introduces UI-Venus-1.5, a unified end-to-end GUI agent family featuring 2B, 8B, and 30B-A3B variants that leverage mid-training, online reinforcement learning, and model merging to achieve state-of-the-art performance on diverse benchmarks and robust real-world navigation across Chinese mobile apps.

Venus Team, Changlong Gao, Zhangxuan Gu + 24 more | 2026-02-25 | cs.CL

Ecological mapping with geospatial foundation models

This study systematically evaluates geospatial foundation models (Prithvi-EO-2.0 and TerraMind) for ecological mapping, demonstrating their consistent superiority over traditional baselines across forest trait estimation, land cover mapping, and peatland detection while highlighting the critical importance of dataset alignment and high-resolution inputs for optimal performance.

Craig Mahlasi, Gciniwe S. Baloyi, Zaheed Gaffoor + 6 more | 2026-02-25 | cs

DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving

DriveMamba proposes a task-centric, scalable state space model for efficient end-to-end autonomous driving that replaces the sequential Transformer-based paradigm with a unified Mamba decoder featuring linear-complexity operators and bidirectional trajectory-guided scanning to overcome information loss, cumulative errors, and computational inefficiencies in handling spatiotemporal inputs.

Haisheng Su, Wei Wu, Feixiang Song + 3 more | 2026-02-25 | cs

Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction

Sim2Radar is an end-to-end framework that bridges the radar sim-to-real gap by synthesizing physics-based mmWave data from single-view RGB images using VLM-guided material inference, thereby significantly improving downstream 3D radar perception through transfer learning.

Emily Bejerano, Federico Tondolo, Ayaan Qayyum + 2 more | 2026-02-25 | cs.AI

Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation

This paper introduces HERO, a novel paradigm for humanoid robots that combines large vision models for open-vocabulary scene understanding with a residual-aware end-effector tracking policy trained in simulation, enabling robust and generalizable visual loco-manipulation of diverse objects in real-world environments.

Runpei Dong, Ziyan Li, Xialin He + 1 more | 2026-02-25 | cs

Tree crop mapping of South America reveals links to deforestation and conservation

This study presents the first 10m-resolution tree crop map for South America, revealing that existing regulatory definitions often misclassify established smallholder agroforestry as forest, thereby highlighting the need for high-resolution data to ensure equitable and effective zero-deforestation policies.

Yuchang Jiang, Anton Raichuk, Xiaoye Tong + 6 more | 2026-02-25 | cs

EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

The paper proposes EAGLE, a tuning-free framework that leverages expert model outputs to guide Multimodal Large Language Models toward accurate and interpretable industrial anomaly detection without requiring parameter updates, achieving performance comparable to fine-tuned methods.

Xiaomeng Peng, Xilang Huang, Seon Han Choi | 2026-02-25 | cs

Probability-Invariant Random Walk Learning on Gyral Folding-Based Cortical Similarity Networks for Alzheimer's and Lewy Body Dementia Diagnosis

This paper proposes a probability-invariant random walk framework that classifies individualized gyral folding-based cortical similarity networks without requiring explicit node alignment, thereby overcoming anatomical heterogeneity to achieve robust diagnosis of Alzheimer's disease and Lewy body dementia.

Minheng Chen, Tong Chen, Chao Cao + 4 more | 2026-02-25 | q-bio

MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions

The paper introduces MIRROR, a framework that enhances Vision-Language Models' reasoning and reduces hallucinations by implementing a closed-loop iterative process of drafting, critiquing, and revising answers through explicit region-based visual verification, supported by the newly constructed ReflectV dataset.

Haoyu Zhang, Yuwei Wu, Pengxiang Li + 6 more | 2026-02-25 | cs

Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models

This paper introduces SymPL, a framework that reformulates challenging allocentric spatial reasoning tasks into structured symbolic-layout representations, thereby significantly enhancing the performance and robustness of vision-language models in both allocentric and egocentric settings.

Jaeyun Jang, Seunghui Shin, Taeho Park + 1 more | 2026-02-25 | cs

TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding

The paper introduces TraceVision, a novel end-to-end vision-language model that integrates trajectory-aware spatial understanding through a Trajectory-aware Visual Perception module and a specialized training pipeline, achieving state-of-the-art performance in tasks like captioning, localization, and segmentation by simulating human visual attention trajectories.

Fan Yang, Shurong Zheng, Hongyin Zhao + 5 more | 2026-02-25 | cs

Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation

This paper proposes a dual-teacher contrastive distillation framework that leverages both multispectral and optical vision foundation model teachers to enable efficient, state-of-the-art cross-modal representation learning for multispectral Earth observation data without compromising performance on optical inputs.

Filip Wolf, Blaž Rolih, Luka Čehovin Zajc | 2026-02-25 | cs

A Very Big Video Reasoning Suite

To address the lack of large-scale data for studying video reasoning, this paper introduces the Very Big Video Reasoning (VBVR) suite, comprising a massive dataset of over one million video clips across 200 tasks and a verifiable benchmark, which together enable the first large-scale scaling study revealing early signs of emergent generalization in video reasoning models.

Maijunxian Wang, Ruisi Wang, Juyi Lin + 53 more | 2026-02-25 | cs.AI

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Mobile-O is a compact, unified vision-language-diffusion model featuring a novel Mobile Conditioning Projector that enables efficient, real-time multimodal understanding and generation directly on mobile devices, achieving competitive performance with significantly faster inference speeds compared to existing models.

Abdelrahman Shaker, Ahmed Heakl, Jaseel Muhammad + 8 more | 2026-02-25 | cs

VISION-ICE: Video-based Interpretation and Spatial Identification of Arrhythmia Origins via Neural Networks in Intracardiac Echocardiography

This paper proposes VISION-ICE, an AI framework utilizing 3D Convolutional Neural Networks to analyze intracardiac echocardiography videos for automated, real-time localization of arrhythmia origins, achieving 66.2% accuracy and demonstrating potential to streamline electrophysiological procedures.

Dorsa EPMoghaddam, Feng Gao, Drew Bernard + 3 more | 2026-02-25 | cs.LG

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

OptimusVLA is a dual-memory augmented Vision-Language-Action model that enhances robotic manipulation efficiency and robustness by replacing isotropic noise with a global prior memory for faster inference and incorporating a local consistency memory to ensure temporal coherence, achieving superior performance across simulation and real-world benchmarks compared to state-of-the-art baselines.

Zaijing Li, Bing Hu, Rui Shao + 5 more | 2026-02-25 | cs.AI
