cs.CV papers | Gist.Science

Benchmarking Vision-Based Object Tracking for USVs in Complex Maritime Environments

This study proposes and validates a vision-guided object-tracking framework for unmanned surface vehicles (USVs) in complex maritime environments by benchmarking seven deep learning-based trackers and control algorithms, ultimately identifying the Transformer-based SeqTrack and Linear Quadratic Regulator (LQR) controller as the most robust solution for stable tracking under adverse conditions.

Muhayy Ud Din, Ahsan B. Bakht, Waseem Akram + 3 more | 2026-02-26 | 💻 cs

Object-Centric World Models from Few-Shot Annotations for Sample-Efficient Reinforcement Learning

This paper introduces OC-STORM, an object-centric model-based reinforcement learning framework that leverages few-shot annotated frames and pretrained segmentation to improve sample efficiency and dynamics prediction in complex visual environments, outperforming existing baselines on Atari 100k and Hollow Knight benchmarks.

Weipu Zhang, Adam Jelley, Trevor McInroe + 2 more | 2026-02-26 | 🤖 cs.LG

VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning

This paper introduces VOILA, a dynamic benchmark that evaluates multimodal large language models' ability to perform abstract relational reasoning through visual analogies, revealing that current models significantly struggle with inter-image relationships compared to human performance despite improvements from multi-step prompting strategies.

Nilay Yilmaz, Maitreya Patel, Yiran Lawrence Luo + 4 more | 2026-02-26 | 💬 cs.CL

PD-VLA: Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding

This paper introduces PD-VLA, a training-free parallel decoding framework that accelerates Vision-Language-Action models integrated with action chunking by reformulating autoregressive decoding as a parallel fixed-point iteration system, thereby significantly improving inference speed while maintaining competitive performance in both simulation and real-world robotic tasks.

Wenxuan Song, Jiayi Chen, Pengxiang Ding + 9 more | 2026-02-26 | 💻 cs
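
The fixed-point view of decoding mentioned in the PD-VLA summary can be illustrated with a toy Jacobi-style iteration. This is a minimal sketch, not the paper's implementation: `toy_next_token` is a hypothetical stand-in for a deterministic (greedy) model, and the function names, the zero initial guess, and the stopping rule are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical deterministic 'model': maps a prefix to one token id."""
    return (31 * sum(prefix) + len(prefix)) % 7


def sequential_decode(next_token: NextToken, prompt: Sequence[int], n: int) -> List[int]:
    """Ordinary autoregressive decoding: one token per step, n steps."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(
    next_token: NextToken, prompt: Sequence[int], n: int, init: int = 0
) -> Tuple[List[int], int]:
    """Treat the n output tokens as unknowns and iterate to a fixed point.

    Every sweep recomputes all n positions from the previous sweep's guess.
    In a real model those n evaluations would be one batched forward pass;
    here they are plain Python calls.
    """
    guess = [init] * n
    sweeps = 0
    for _ in range(n):  # after k sweeps the first k tokens are final
        sweeps += 1
        new = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        if new == guess:  # fixed point reached: nothing changed
            break
        guess = new
    return guess, sweeps


if __name__ == "__main__":
    prompt = [3, 1, 4]
    parallel, sweeps = jacobi_decode(toy_next_token, prompt, 8)
    print(parallel, sweeps)
    print(parallel == sequential_decode(toy_next_token, prompt, 8))
```

Because position `i` depends only on the prompt and positions before `i`, the fixed point is unique and matches sequential greedy decoding. Any speedup comes from converging in fewer sweeps than there are tokens, which depends on the model and is not guaranteed by this sketch.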

Unified Reward Model for Multimodal Understanding and Generation

This paper introduces UnifiedReward, the first unified reward model that jointly assesses diverse multimodal understanding and generation tasks to foster synergistic improvements, leveraging a large-scale human preference dataset and a two-stage data filtering strategy to align vision models via Direct Preference Optimization.

Yibin Wang, Yuhang Zang, Hao Li + 2 more | 2026-02-26 | 💻 cs

TRACE: Your Diffusion Model is Secretly an Instance Edge Detector

TRACE demonstrates that text-to-image diffusion models inherently encode instance boundary priors, enabling a fast, unsupervised segmentation method that extracts edges from self-attention maps to achieve state-of-the-art performance without requiring costly instance-level annotations.

Sanghyun Jo, Ziseok Lee, Wooyeol Lee + 3 more | 2026-02-26 | 💻 cs

Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation

AnyIR is a lightweight, unified image restoration framework that achieves state-of-the-art performance across multiple degradations by leveraging a joint embedding mechanism with gated sublatent space reweighting and spatial-frequency parallel fusion, eliminating the need for large language models or excessive model scaling.

Bin Ren, Eduard Zamfir, Zongwei Wu + 7 more | 2026-02-26 | 💻 cs

Twin Co-Adaptive Dialogue for Progressive Image Generation

This paper introduces Twin-Co, a framework that utilizes synchronized, co-adaptive dialogue to iteratively refine text-to-image generation by dynamically incorporating user feedback to resolve prompt ambiguities and align the output with user intent.

Jianhui Wang, Yangfan He, Yan Zhong + 12 more | 2026-02-26 | 💻 cs

Identifying Memorization of Diffusion Models through $p$-Laplace Analysis: Estimators, Bounds and Applications

This paper proposes a novel method for identifying memorized training data in diffusion models by leveraging $p$-Laplace operators derived from estimated score functions, providing both theoretical error bounds and empirical validation on text-to-image models where the approach successfully detects memorization even without access to the conditioning text.

Jonathan Brokman, Itay Gershon, Amit Giloni + 4 more | 2026-02-26 | 🔢 math

Transformer-based cardiac substructure segmentation from contrast and non-contrast computed tomography for radiotherapy planning

This study demonstrates that a hybrid pretrained transformer-convolutional network (SMIT) utilizing balanced curriculum learning achieves data-efficient, robust cardiac substructure segmentation across diverse CT imaging protocols and patient populations, outperforming standard nnU-Net and TotalSegmentator while requiring significantly fewer annotated training scans.

Aneesh Rangnekar, Nikhil Mankuzhy, Jonas Willmann + 5 more | 2026-02-26 | ⚡ eess

JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models

The paper introduces JailBound, a novel two-stage jailbreak framework that exploits the implicit safety decision boundaries within Vision-Language Models' latent fusion layers to jointly optimize cross-modal perturbations, achieving significantly higher attack success rates than state-of-the-art methods while exposing critical safety vulnerabilities in these models.

Jiaxin Song, Yixu Wang, Jie Li + 4 more | 2026-02-26 | 💻 cs

Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection

The paper introduces PROGRESS, a data- and compute-efficient framework for vision-language models that dynamically prioritizes instruction-tuning samples based on relative error-driven learning progress, enabling superior performance with significantly less data and supervision compared to state-of-the-art baselines.

Shivam Chandhok, Qian Yang, Oscar Manas + 3 more | 2026-02-26 | 🤖 cs.AI

LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning

This paper proposes LoRA-Edit, a controllable video editing method that utilizes spatiotemporal masks to fine-tune pretrained Image-to-Video models via Low-Rank Adaptation, enabling precise user guidance over both content preservation and the temporal evolution of generated regions.

Chenjian Gao, Lihe Ding, Xin Cai + 3 more | 2026-02-26 | 💻 cs

Capturing Stable HDR Videos Using a Dual-Camera System

This paper proposes a novel dual-camera system with asynchronous exposure control and a corresponding exposure-adaptive fusion network (EAFNet) to generate stable, high-quality HDR videos by decoupling temporal luminance anchoring from detail reconstruction, thereby overcoming the temporal flicker issues inherent in traditional single-camera alternating exposure methods.

Qianyu Zhang, Bolun Zheng, Lingyu Zhu + 4 more | 2026-02-26 | ⚡ eess

Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers

This paper proposes RALU, a training-free spatial acceleration framework for Diffusion Transformers that employs mixed-resolution latent upsampling with region-adaptive edge processing and noise-timestep matching to achieve significant inference speedups (up to $15.9\times$) while maintaining high generation quality.

Wongi Jeong, Kyungryeol Lee, Hoigi Seo + 1 more | 2026-02-26 | ⚡ eess

LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding

This paper introduces LLaDA-MedV, the first large language diffusion model tailored for biomedical image understanding, which achieves state-of-the-art performance on multiple VQA benchmarks and demonstrates superior capabilities in generating informative, length-controlled responses compared to existing autoregressive models.

Xuanzhao Dong, Wenhui Zhu, Xiwen Chen + 5 more | 2026-02-26 | 💻 cs

Lang2Lift: A Language-Guided Autonomous Forklift System for Outdoor Industrial Pallet Handling

This paper presents Lang2Lift, an end-to-end autonomous forklift system that leverages natural language instructions to guide perception, pose estimation, and motion planning for successful pallet pick-up operations in diverse, unstructured outdoor industrial environments.

Huy Hoang Nguyen, Johannes Huemer, Markus Murschitz + 3 more | 2026-02-26 | 💻 cs

Voxel Densification for Serialized 3D Object Detection: Mitigating Sparsity via Pre-serialization Expansion

This paper proposes a Voxel Densification Module (VDM) that utilizes pre-serialization spatial expansion via sparse 3D convolutions to overcome the inherent voxel dimension constraints of serialized 3D object detection frameworks, thereby significantly enhancing detection accuracy across multiple benchmarks while managing computational costs through strategic downsampling.

Qifeng Liu, Dawei Zhao, Yabo Dong + 6 more | 2026-02-26 | 💻 cs

Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

This paper proposes V$^2$Drop, a variation-aware dynamic token dropping method that significantly accelerates Large Vision-Language Model inference by progressively removing visual tokens with minimal variation, achieving substantial latency reductions while maintaining near-original performance in image and video understanding tasks.

Junjie Chen, Xuyang Liu, Zichen Wen + 3 more | 2026-02-26 | 💻 cs
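
The idea of dropping visual tokens with minimal variation, as described in the summary above, can be sketched in a few lines. This is a rough illustration under stated assumptions, not the paper's method: the variation score here is assumed to be the L2 distance between a token's hidden state before and after a layer, and `keep_ratio`, `token_variation`, and `drop_low_variation` are hypothetical names chosen for this example.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def token_variation(before: Sequence[Vector], after: Sequence[Vector]) -> List[float]:
    """Assumed score: L2 norm of each token's change across one layer."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(x_after, x_before)))
        for x_before, x_after in zip(before, after)
    ]


def drop_low_variation(
    before: Sequence[Vector], after: Sequence[Vector], keep_ratio: float
) -> Tuple[List[int], List[Vector]]:
    """Keep the tokens that changed the most; drop the rest.

    Returns the kept indices (in original order) and their hidden states.
    """
    scores = token_variation(before, after)
    n_keep = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore positional order
    return kept, [after[i] for i in kept]


if __name__ == "__main__":
    before = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    after = [[0.0, 0.0], [1.0, 2.0], [2.0, 2.1], [6.0, 7.0]]
    kept, states = drop_low_variation(before, after, keep_ratio=0.5)
    print(kept, states)
```

Applying such a step at several layers, each time on the surviving tokens, would give the progressive behaviour the summary describes. Where and how often to drop is a design choice this sketch leaves open.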

MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification

MedicalPatchNet is a novel, inherently self-explainable deep learning architecture for chest X-ray classification that achieves performance comparable to EfficientNetV2-S while significantly improving diagnostic interpretability and pathology localization accuracy through a transparent patch-based aggregation mechanism, thereby enhancing clinical trust and safety.

Patrick Wienholt, Christiane Kuhl, Jakob Nikolas Kather + 2 more | 2026-02-26 | 🤖 cs.LG
