cs.CV papers | Gist.Science

The Texture-Shape Dilemma: Boundary-Safe Synthetic Generation for 3D Medical Transformers

This paper addresses the limitations of existing formula-driven synthetic data by proposing a Physics-inspired Spatially-Decoupled Synthesis framework that resolves the texture-shape conflict through a gradient-shielded buffer zone and spectral texture injection, thereby significantly enhancing the performance of 3D medical Vision Transformers on BTCV and MSD datasets without relying on real patient data.

Jiaqi Tang, Weixuan Xu, Shu Zhang + 2 more | 2026-03-03 | 💻 cs

Foundation Models in Remote Sensing: Evolving from Unimodality to Multimodality

This paper presents a comprehensive technical survey on foundation models in remote sensing, exploring their evolution from unimodal to multimodal approaches while providing a tutorial-like guide to help researchers understand, train, and apply these models to real-world tasks.

Danfeng Hong, Chenyu Li, Xuyang Li + 2 more | 2026-03-03 | 💻 cs

MLRecon: Robust Markerless Freehand 3D Ultrasound Reconstruction via Coarse-to-Fine Pose Estimation

MLRecon is a robust, low-cost framework for markerless freehand 3D ultrasound reconstruction that utilizes a commodity RGB-D camera and a vision foundation model-based pipeline with a dual-stage refinement network to achieve drift-resilient, sub-millimeter accurate probe pose tracking and high-quality volumetric imaging.

Yi Zhang, Puxun Tu, Kun Wang + 3 more | 2026-03-03 | 💻 cs

GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis

The paper proposes GeodesicNVS, a novel view synthesis framework that combines deterministic Data-to-Data Flow Matching with Probability Density Geodesic constraints derived from pretrained diffusion models to achieve superior view consistency and geometric coherence compared to traditional diffusion-based approaches.

Xuqin Wang, Tao Wu, Yanfeng Zhang + 5 more | 2026-03-03 | 💻 cs

Implementation of Licensed Plate Detection and Noise Removal in Image Processing

This paper discusses the implementation of car license plate recognition systems, highlighting their growing necessity in Malaysia due to increasing vehicle numbers and their diverse applications in traffic management, law enforcement, and other specialized fields.

Yiquan Gao | 2026-03-03 | ⚡ eess

RaUF: Learning the Spatial Uncertainty Field of Radar

This paper proposes RaUF, a spatial uncertainty field learning framework that addresses the low fidelity and ambiguity of millimeter-wave radar by modeling anisotropic probabilistic uncertainty and employing a bidirectional domain attention mechanism to suppress spurious returns, thereby delivering highly reliable spatial detections with well-calibrated uncertainty for downstream perception tasks.

Shengpeng Wang, Kuangyu Wang, Wei Wang | 2026-03-03 | 💻 cs

Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features

This paper proposes Content-Aware Frequency Encoding (CAFE) and its enhanced variant CAFE+, which utilize learnable parallel linear layers and Chebyshev features to overcome the spectral bias of Implicit Neural Representations by explicitly synthesizing and selecting task-relevant frequency bases for superior signal reconstruction.

Junbo Ke, Yangyang Xu, You-Wei Wen + 1 more | 2026-03-03 | 🤖 cs.AI

Vision-Language Feature Alignment for Road Anomaly Segmentation

The paper proposes VL-Anomaly, a novel framework that leverages vision-language model priors and a multi-source inference strategy to significantly improve road anomaly segmentation by reducing false positives on normal backgrounds and enhancing the detection of out-of-distribution obstacles.

Zhuolin He, Jiacheng Tang, Jian Pu + 1 more | 2026-03-03 | 💻 cs

SMR-Net: Robot Snap Detection Based on Multi-Scale Features and Self-Attention Network

To address the limitations of traditional visual methods in robot automated assembly, this paper proposes SMR-Net, a self-attention-based multi-scale detection algorithm paired with a dedicated sensor, which significantly improves snap localization precision and robustness in complex scenarios by integrating attention-enhanced feature extraction, parallel multi-scale processing, and adaptive reweighting.

Kuanxu Hou | 2026-03-03 | 💻 cs

From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing

The paper proposes TAR-FAS, a tool-augmented reasoning framework that enhances generalizable Face Anti-Spoofing by enabling MLLMs to combine intuitive observations with adaptive, fine-grained visual tool investigations through a specialized dataset and training pipeline.

Haoyuan Zhang, Keyao Wang, Guosheng Zhang + 11 more | 2026-03-03 | 🤖 cs.AI

MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline

The paper introduces MM-DeepResearch, a multimodal deep research agent that overcomes data scarcity, trajectory generation, and training cost challenges through Hyper-Search for data synthesis, DR-TTS for specialized tool optimization and trajectory planning, and an offline search engine for cost-effective reinforcement learning.

Huanjin Yao, Qixiang Yin, Min Yang + 5 more | 2026-03-03 | 🤖 cs.AI

Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures

This paper proposes ELF-VLA, a framework that enhances Vision-Language-Action models for autonomous driving by replacing vague scalar rewards with explicit, diagnostic failure feedback to guide targeted policy refinement, thereby overcoming exploration limitations and achieving state-of-the-art performance on the NAVSIM benchmark.

Yuechen Luo, Qimao Chen, Fang Li + 5 more | 2026-03-03 | 💻 cs

LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model

LLaDA-o is a state-of-the-art, length-adaptive omni diffusion model that leverages a Mixture of Diffusion framework with a shared attention backbone to effectively unify discrete text understanding and continuous visual generation, achieving top-tier performance on multimodal benchmarks.

Zebin You, Xiaolu Zhang, Jun Zhou + 2 more | 2026-03-03 | 🤖 cs.LG

SHIELD8-UAV: Sequential 8-bit Hardware Implementation of a Precision-Aware 1D-F-CNN for Low-Energy UAV Acoustic Detection and Temporal Tracking

This paper presents SHIELD8-UAV, a low-energy, sequential 8-bit hardware accelerator for UAV acoustic detection that achieves real-time, precision-aware inference on resource-constrained edge devices through a shared multi-precision datapath, layer-sensitivity quantization, and structured channel pruning.

Susmita Ghanta, Karan Nathwani, Rohit Chaurasiya | 2026-03-03 | ⚡ eess

Adaptive Augmentation-Aware Latent Learning for Robust LiDAR Semantic Segmentation

The paper proposes A3Point, an adaptive framework that enhances LiDAR semantic segmentation robustness under adverse weather by utilizing a semantic confusion prior and shift region localization to effectively leverage diverse augmentations while mitigating semantic shifts.

Wangkai Li, Zhaoyang Li, Yuwen Pan + 3 more | 2026-03-03 | 💻 cs

Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval

This paper introduces MCMR, a large-scale benchmark designed to evaluate fine-grained, multi-condition multimodal retrieval across five product domains, revealing that while visual cues drive early precision, MLLM-based rerankers significantly enhance compositional matching by verifying complex query-candidate consistency.

Xuan Lu, Kangle Li, Haohang Huang + 3 more | 2026-03-03 | 💻 cs

Can Vision Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective

This paper introduces AesEval-Bench, a comprehensive benchmark and training dataset designed to systematically evaluate and enhance Vision Language Models' ability to assess graphic design aesthetics across multiple dimensions, tasks, and indicators.

Arctanx An, Shizhao Sun, Danqing Huang + 5 more | 2026-03-03 | 💻 cs

Unified Vision-Language Modeling via Concept Space Alignment

This paper introduces V-SONAR, a unified vision-language embedding space aligned with the multilingual SONAR text space, and leverages it to develop V-LCM, a model that achieves state-of-the-art performance in video captioning and significantly outperforms existing vision-language models across 61 diverse languages through concept space alignment and latent diffusion training.

Yifu Qiu, Paul-Ambroise Duquenne, Holger Schwenk | 2026-03-03 | 💬 cs.CL

Differential privacy representation geometry for medical image analysis

This paper introduces DP-RGMI, a framework that analyzes differential privacy in medical imaging by decomposing utility loss into representation geometry and task-head utilization, revealing that privacy mechanisms induce non-uniform anisotropic reshaping of features and create a utilization gap even when linear separability is preserved.

Soroosh Tayebi Arasteh, Marziyeh Mohammadi, Sven Nebelung + 1 more | 2026-03-03 | 🤖 cs.LG

Data-Efficient Brushstroke Generation with Diffusion Models for Oil Painting

This paper proposes StrokeDiff, a data-efficient diffusion-based framework with Smooth Regularization that generates diverse, controllable, and human-like oil painting brushstrokes from a small dataset, enabling structured and expressive multimedia content creation.

Dantong Qin, Alessandro Bozzon, Xian Yang + 3 more | 2026-03-03 | 💻 cs
