cs.CV papers | Gist.Science

SMR-Net: Robot Snap Detection Based on Multi-Scale Features and Self-Attention Network

To address the limitations of traditional visual methods in robot automated assembly, this paper proposes SMR-Net, a self-attention-based multi-scale detection algorithm paired with a dedicated sensor, which significantly improves snap localization precision and robustness in complex scenarios by integrating attention-enhanced feature extraction, parallel multi-scale processing, and adaptive reweighting.

Kuanxu Hou | 2026-03-03 | 💻 cs

From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing

The paper proposes TAR-FAS, a tool-augmented reasoning framework that enhances generalizable Face Anti-Spoofing by enabling MLLMs to combine intuitive observations with adaptive, fine-grained visual tool investigations through a specialized dataset and training pipeline.

Haoyuan Zhang, Keyao Wang, Guosheng Zhang + 11 more | 2026-03-03 | 🤖 cs.AI

MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline

The paper introduces MM-DeepResearch, a multimodal deep research agent that overcomes data scarcity, trajectory generation, and training cost challenges through Hyper-Search for data synthesis, DR-TTS for specialized tool optimization and trajectory planning, and an offline search engine for cost-effective reinforcement learning.

Huanjin Yao, Qixiang Yin, Min Yang + 5 more | 2026-03-03 | 🤖 cs.AI

Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures

This paper proposes ELF-VLA, a framework that enhances Vision-Language-Action models for autonomous driving by replacing vague scalar rewards with explicit, diagnostic failure feedback to guide targeted policy refinement, thereby overcoming exploration limitations and achieving state-of-the-art performance on the NAVSIM benchmark.

Yuechen Luo, Qimao Chen, Fang Li + 5 more | 2026-03-03 | 💻 cs

LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model

LLaDA-o is a state-of-the-art, length-adaptive omni diffusion model that leverages a Mixture of Diffusion framework with a shared attention backbone to effectively unify discrete text understanding and continuous visual generation, achieving top-tier performance on multimodal benchmarks.

Zebin You, Xiaolu Zhang, Jun Zhou + 2 more | 2026-03-03 | 🤖 cs.LG

SHIELD8-UAV: Sequential 8-bit Hardware Implementation of a Precision-Aware 1D-F-CNN for Low-Energy UAV Acoustic Detection and Temporal Tracking

This paper presents SHIELD8-UAV, a low-energy, sequential 8-bit hardware accelerator for UAV acoustic detection that achieves real-time, precision-aware inference on resource-constrained edge devices through a shared multi-precision datapath, layer-sensitivity quantization, and structured channel pruning.

Susmita Ghanta, Karan Nathwani, Rohit Chaurasiya | 2026-03-03 | ⚡ eess

Adaptive Augmentation-Aware Latent Learning for Robust LiDAR Semantic Segmentation

The paper proposes A3Point, an adaptive framework that enhances LiDAR semantic segmentation robustness under adverse weather by utilizing a semantic confusion prior and shift region localization to effectively leverage diverse augmentations while mitigating semantic shifts.

Wangkai Li, Zhaoyang Li, Yuwen Pan + 3 more | 2026-03-03 | 💻 cs

Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval

This paper introduces MCMR, a large-scale benchmark designed to evaluate fine-grained, multi-condition multimodal retrieval across five product domains, revealing that while visual cues drive early precision, MLLM-based rerankers significantly enhance compositional matching by verifying complex query-candidate consistency.

Xuan Lu, Kangle Li, Haohang Huang + 3 more | 2026-03-03 | 💻 cs

Can Vision Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective

This paper introduces AesEval-Bench, a comprehensive benchmark and training dataset designed to systematically evaluate and enhance Vision Language Models' ability to assess graphic design aesthetics across multiple dimensions, tasks, and indicators.

Arctanx An, Shizhao Sun, Danqing Huang + 5 more | 2026-03-03 | 💻 cs

Unified Vision-Language Modeling via Concept Space Alignment

This paper introduces V-SONAR, a unified vision-language embedding space aligned with the multilingual SONAR text space, and leverages it to develop V-LCM, a model that achieves state-of-the-art performance in video captioning and significantly outperforms existing vision-language models across 61 diverse languages through concept space alignment and latent diffusion training.

Yifu Qiu, Paul-Ambroise Duquenne, Holger Schwenk | 2026-03-03 | 💬 cs.CL

Differential privacy representation geometry for medical image analysis

This paper introduces DP-RGMI, a framework that analyzes differential privacy in medical imaging by decomposing utility loss into representation geometry and task-head utilization, revealing that privacy mechanisms induce non-uniform anisotropic reshaping of features and create a utilization gap even when linear separability is preserved.

Soroosh Tayebi Arasteh, Marziyeh Mohammadi, Sven Nebelung + 1 more | 2026-03-03 | 🤖 cs.LG

Data-Efficient Brushstroke Generation with Diffusion Models for Oil Painting

This paper proposes StrokeDiff, a data-efficient diffusion-based framework with Smooth Regularization that generates diverse, controllable, and human-like oil painting brushstrokes from a small dataset, enabling structured and expressive multimedia content creation.

Dantong Qin, Alessandro Bozzon, Xian Yang + 3 more | 2026-03-03 | 💻 cs

Egocentric Co-Pilot: Web-Native Smart-Glasses Agents for Assistive Egocentric AI

This paper introduces Egocentric Co-Pilot, a web-native neuro-symbolic framework for smart glasses that combines an LLM-orchestrated toolset with advanced temporal reasoning and multimodal intent mapping to deliver state-of-the-art, always-on assistive AI for navigation and daily tasks, demonstrating superior performance and user satisfaction over commercial baselines through both cloud and local deployment evaluations.

Sicheng Yang, Yukai Huang, Weitong Cai + 8 more | 2026-03-03 | 🤖 cs.AI

GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation

This paper introduces GroundedSurg, the first multi-procedure benchmark designed to evaluate language-conditioned, instance-level surgical tool segmentation by pairing surgical images with natural language descriptions and precise spatial annotations to address the limitations of existing category-level evaluation paradigms in clinical AI.

Tajamul Ashraf, Abrar Ul Riyaz, Wasif Tak + 4 more | 2026-03-03 | 💻 cs

GuiDINO: Rethinking Vision Foundation Model in Medical Image Segmentation

GuiDINO introduces a framework that leverages DINOv3 as a visual guidance generator to produce spatial guide masks via a lightweight TokenBook mechanism, effectively enhancing medical image segmentation across diverse datasets and backbones without requiring full fine-tuning of the foundation model.

Zhuonan Liang, Wei Guo, Jie Gan + 4 more | 2026-03-03 | 💻 cs

ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models

ClinCoT is a novel framework that enhances medical vision-language models by introducing a clinical-aware visual chain-of-thought mechanism and an iterative, scoring-based preference optimization strategy to improve factual grounding and reduce hallucinations through region-level reasoning.

Xiwei Liu, Yulong Li, Xinlin Zhuang + 5 more | 2026-03-03 | 🤖 cs.AI

Predictive Reasoning with Augmented Anomaly Contrastive Learning for Compositional Visual Relations

This paper proposes PR-A$^2$CL, a novel framework for compositional visual relations that combines augmented anomaly contrastive learning with a predict-and-verify paradigm to identify outlier images by iteratively predicting and verifying compositional rules.

Chengtai Li, Yuting He, Jianfeng Ren + 4 more | 2026-03-03 | 🤖 cs.AI

Teacher-Guided Causal Interventions for Image Denoising: Orthogonal Content-Noise Disentanglement in Vision Transformers

The paper proposes TCD-Net, a Vision Transformer-based image denoising framework that utilizes teacher-guided causal interventions, including environmental bias adjustment and orthogonal content-noise disentanglement, to eliminate spurious correlations and achieve state-of-the-art fidelity and real-time performance.

Kuai Jiang, Zhaoyan Ding, Guijuan Zhang + 2 more | 2026-03-03 | 💻 cs

ArtLLM: Generating Articulated Assets via 3D LLM

ArtLLM is a novel framework that leverages a 3D multimodal large language model to autoregressively predict kinematic structures and generate high-fidelity articulated 3D assets directly from complete meshes, significantly outperforming existing methods in accuracy and generalization for applications like robotics and simulation.

Penghao Wang, Siyuan Xie, Hongyu Yan + 4 more | 2026-03-03 | 💻 cs

TC-SSA: Token Compression via Semantic Slot Aggregation for Gigapixel Pathology Reasoning

This paper proposes TC-SSA, a learnable token compression framework that utilizes gated semantic slot aggregation to efficiently process gigapixel whole slide images by reducing visual tokens to 1.7% of the original sequence while preserving diagnostically critical information and outperforming existing sampling-based methods in both reasoning and classification tasks.

Zhuo Chen, Shawn Young, Lijian Xu | 2026-03-03 | 🤖 cs.AI
