cs.CV papers | Gist.Science

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Fast-ThinkAct is an efficient Vision-Language-Action framework that utilizes preference-guided distillation of verbalizable latent reasoning to significantly reduce inference latency while maintaining strong performance in long-horizon planning, few-shot adaptation, and failure recovery.

Chi-Pin Huang, Yunze Man, Zhiding Yu + 4 more | 2026-02-25 | 🤖 cs.AI

Generating metamers of human scene understanding

This paper introduces MetamerGen, a dual-stream latent diffusion model that generates perceptually aligned image metamers by fusing peripheral scene gist with high-resolution fixation details, and validates its ability to reconstruct human scene understanding through behavioral experiments.

Ritik Raina, Abe Leite, Alexandros Graikos + 3 more | 2026-02-25 | 🤖 cs.AI

Principal Component Analysis-Based Terahertz Self-Supervised Denoising and Deblurring Deep Neural Networks

This paper proposes a principal component analysis-based self-supervised deep neural network (THz-SSDD) that effectively addresses the simultaneous challenges of low-frequency blurring and high-frequency noise in terahertz amplitude images by leveraging a Recorrupted-to-Recorrupted learning strategy and PCA reconstruction without requiring labeled data.

Pengfei Zhu, Stefano Sfarra, Hai Zhang + 4 more | 2026-02-25 | 💻 cs

Earth Embeddings as Products: Taxonomy, Ecosystem, and Standardized Access

This paper addresses the fragmentation of pre-computed Geospatial Foundation Model embeddings by proposing a three-layer taxonomy and extending TorchGeo with a unified API to standardize access, thereby enabling interoperable, reproducible, and accessible Earth observation workflows.

Heng Fang, Adam J. Stewart, Isaac Corley + 2 more | 2026-02-25 | 💻 cs

Affinity Contrastive Learning for Skeleton-based Human Activity Understanding

This paper proposes ACLNet, an Affinity Contrastive Learning Network that enhances skeleton-based human activity understanding by leveraging structural inter-class similarities to form activity superclasses and employing a dynamic temperature schedule with margin-based contrastive strategies to improve feature discrimination across multiple benchmarks.

Hongda Liu, Yunfan Liu, Min Ren + 3 more | 2026-02-25 | 💻 cs

CER-HV: A Human-in-the-Loop Framework for Cleaning Datasets Applied to Arabic-Script HTR

This paper introduces CER-HV, a human-in-the-loop framework that effectively identifies and cleans label errors in Arabic-script handwritten text recognition datasets, thereby revealing significant data quality issues and improving recognition performance across multiple languages.

Sana Al-azzawi, Elisa Barney, Marcus Liwicki | 2026-02-25 | 💻 cs

Pareto-Guided Optimization for Uncertainty-Aware Medical Image Segmentation

This paper proposes a Pareto-guided optimization framework for medical image segmentation that employs a region-wise curriculum strategy and a fuzzy labeling mechanism to prioritize learning from high-certainty regions, thereby stabilizing gradients and guiding the model toward Pareto-optimal solutions that outperform traditional methods in handling boundary ambiguity.

Jinming Zhang, Youpeng Yang, Xi Yang + 5 more | 2026-02-25 | 💻 cs

DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning

The paper proposes DVLA-RL, a novel few-shot learning framework that leverages reinforcement learning gating to dynamically integrate progressive dual-level vision-language alignments—ranging from fine-grained attributes to holistic descriptions generated by large language models—thereby achieving state-of-the-art performance across diverse benchmarks.

Wenhao Li, Xianjing Meng, Qiangchang Wang + 3 more | 2026-02-25 | 💻 cs

All-Optical Segmentation via Diffractive Neural Networks for Autonomous Driving

This paper proposes a novel all-optical computing framework using diffractive neural networks to perform energy-efficient, real-time semantic segmentation and lane detection for autonomous driving, demonstrating its effectiveness on the CityScapes dataset and in diverse simulated driving scenarios.

Yingjie Li, Daniel Robinson, Weilu Gao + 1 more | 2026-02-25 | 💻 cs

GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing

GOT-Edit introduces an online cross-modality model editing framework that integrates 3D geometric cues into generic object tracking by leveraging a pre-trained Visual Geometry Grounded Transformer and null-space constraints, thereby significantly enhancing robustness against occlusion and distractors while preserving semantic discrimination.

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo + 1 more | 2026-02-25 | ⚡ eess

UI-Venus-1.5 Technical Report

The paper introduces UI-Venus-1.5, a unified end-to-end GUI agent family featuring 2B, 8B, and 30B-A3B variants that leverage mid-training, online reinforcement learning, and model merging to achieve state-of-the-art performance on diverse benchmarks and robust real-world navigation across Chinese mobile apps.

Venus Team, Changlong Gao, Zhangxuan Gu + 24 more | 2026-02-25 | 💬 cs.CL

Ecological mapping with geospatial foundation models

This study systematically evaluates geospatial foundation models (Prithvi-EO-2.0 and TerraMind) for ecological mapping, demonstrating their consistent superiority over traditional baselines across forest trait estimation, land cover mapping, and peatland detection while highlighting the critical importance of dataset alignment and high-resolution inputs for optimal performance.

Craig Mahlasi, Gciniwe S. Baloyi, Zaheed Gaffoor + 6 more | 2026-02-25 | 💻 cs

DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving

DriveMamba proposes a task-centric, scalable state space model for efficient end-to-end autonomous driving that replaces the sequential Transformer-based paradigm with a unified Mamba decoder featuring linear-complexity operators and bidirectional trajectory-guided scanning to overcome information loss, cumulative errors, and computational inefficiencies in handling spatiotemporal inputs.

Haisheng Su, Wei Wu, Feixiang Song + 3 more | 2026-02-25 | 💻 cs

Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction

Sim2Radar is an end-to-end framework that bridges the radar sim-to-real gap by synthesizing physics-based mmWave data from single-view RGB images using VLM-guided material inference, thereby significantly improving downstream 3D radar perception through transfer learning.

Emily Bejerano, Federico Tondolo, Ayaan Qayyum + 2 more | 2026-02-25 | 🤖 cs.AI

Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation

This paper introduces HERO, a novel paradigm for humanoid robots that combines large vision models for open-vocabulary scene understanding with a residual-aware end-effector tracking policy trained in simulation, enabling robust and generalizable visual loco-manipulation of diverse objects in real-world environments.

Runpei Dong, Ziyan Li, Xialin He + 1 more | 2026-02-25 | 💻 cs

Tree crop mapping of South America reveals links to deforestation and conservation

This study presents the first 10 m resolution tree crop map for South America, revealing that existing regulatory definitions often misclassify established smallholder agroforestry as forest, thereby highlighting the need for high-resolution data to ensure equitable and effective zero-deforestation policies.

Yuchang Jiang, Anton Raichuk, Xiaoye Tong + 6 more | 2026-02-25 | 💻 cs

EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

The paper proposes EAGLE, a tuning-free framework that leverages expert model outputs to guide Multimodal Large Language Models toward accurate and interpretable industrial anomaly detection without requiring parameter updates, achieving performance comparable to fine-tuned methods.

Xiaomeng Peng, Xilang Huang, Seon Han Choi | 2026-02-25 | 💻 cs

Probability-Invariant Random Walk Learning on Gyral Folding-Based Cortical Similarity Networks for Alzheimer's and Lewy Body Dementia Diagnosis

This paper proposes a probability-invariant random walk framework that classifies individualized gyral folding-based cortical similarity networks without requiring explicit node alignment, thereby overcoming anatomical heterogeneity to achieve robust diagnosis of Alzheimer's disease and Lewy body dementia.

Minheng Chen, Tong Chen, Chao Cao + 4 more | 2026-02-25 | 🧬 q-bio

MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions

The paper introduces MIRROR, a framework that enhances Vision-Language Models' reasoning and reduces hallucinations by implementing a closed-loop iterative process of drafting, critiquing, and revising answers through explicit region-based visual verification, supported by the newly constructed ReflectV dataset.

Haoyu Zhang, Yuwei Wu, Pengxiang Li + 6 more | 2026-02-25 | 💻 cs

Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models

This paper introduces SymPL, a framework that reformulates challenging allocentric spatial reasoning tasks into structured symbolic-layout representations, thereby significantly enhancing the performance and robustness of vision-language models in both allocentric and egocentric settings.

Jaeyun Jang, Seunghui Shin, Taeho Park + 1 more | 2026-02-25 | 💻 cs
