Enhancing Cross-View UAV Geolocalization via LVLM-Driven Relational Modeling

This paper proposes a novel plug-and-play ranking architecture that leverages Large Vision-Language Models (LVLMs) and a relational-aware loss function to explicitly model cross-view interactions, thereby significantly enhancing the accuracy and stability of UAV-to-satellite image geolocalization.

Bowen Liu, Pengyue Jia, Wanyu Wang, Derong Xu, Jiawei Cheng, Jiancheng Dong, Xiao Han, Zimo Zhao, Chao Zhang, Bowen Yu, Fangyu Hong, Xiangyu Zhao (2026-03-10, cs)
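To make the relation-modeling idea above concrete, here is a minimal, hypothetical sketch of a cross-view ranking loss with hard-negative mining over UAV/satellite embedding pairs; the pairing convention (row i of `uav` matches row i of `sat`), the margin, and the hard-negative rule are illustrative assumptions, not the paper's actual loss or its LVLM component.

```python
# Hypothetical sketch of a cross-view ranking loss with hard-negative mining.
import torch
import torch.nn.functional as F

def cross_view_ranking_loss(uav: torch.Tensor, sat: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Triplet-style loss: each UAV embedding should rank its paired satellite
    embedding above the hardest non-matching satellite embedding."""
    uav = F.normalize(uav, dim=-1)
    sat = F.normalize(sat, dim=-1)
    sim = uav @ sat.t()                          # (B, B) cross-view cosine similarities
    pos = sim.diag()                             # similarity to the true match
    neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_neg = sim.masked_fill(~neg_mask, float("-inf")).max(dim=1).values
    return F.relu(margin - pos + hardest_neg).mean()

# toy usage with random embeddings standing in for UAV and satellite features
print(cross_view_ranking_loss(torch.randn(8, 256), torch.randn(8, 256)))
```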

TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery

The paper proposes TALON, a test-time adaptive learning framework for on-the-fly category discovery that overcomes the limitations of static hash-based methods by dynamically updating semantic prototypes and the feature encoder to continuously integrate new knowledge, while employing margin-aware logit calibration to prevent category explosion and significantly improve novel-class accuracy.

Yanan Wu, Yuhan Yan, Tailai Chen, Zhixiang Chi, ZiZhang Wu, Yi Jin, Yang Wang, Zhenbo Li (2026-03-10, cs)
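As a rough illustration of on-the-fly discovery with dynamically updated prototypes and a margin-based new-class decision, the sketch below maintains a prototype bank and only opens a new category when the best match is too far away; the threshold, momentum, and margin rule are assumptions for illustration, not TALON's procedure.

```python
# Hypothetical sketch of prototype-based on-the-fly category discovery.
import torch
import torch.nn.functional as F

class PrototypeBank:
    def __init__(self, dim: int, momentum: float = 0.9, new_class_margin: float = 0.15):
        self.protos = torch.empty(0, dim)        # one row per discovered category
        self.momentum = momentum
        self.new_class_margin = new_class_margin

    def assign(self, feat: torch.Tensor) -> int:
        """Assign a feature to an existing prototype or open a new category."""
        feat = F.normalize(feat, dim=-1)
        if self.protos.numel() > 0:
            sims = F.normalize(self.protos, dim=-1) @ feat      # cosine "logits"
            best = int(sims.argmax())
            # margin-aware decision: reuse the closest prototype unless it is
            # clearly too far from the sample, which curbs category explosion
            if sims[best] >= 1.0 - self.new_class_margin:
                self.protos[best] = self.momentum * self.protos[best] + (1 - self.momentum) * feat
                return best
        self.protos = torch.cat([self.protos, feat.unsqueeze(0)], dim=0)
        return self.protos.size(0) - 1

# toy usage: random features will mostly open new categories
bank = PrototypeBank(dim=128)
print([bank.assign(torch.randn(128)) for _ in range(5)])
```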

DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation

This paper introduces DSH-Bench, a comprehensive benchmark featuring a hierarchical subject taxonomy, granular difficulty and scenario classification, and a novel Subject Identity Consistency Score (SICS) metric to systematically evaluate and diagnose subject-driven text-to-image generation models.

Zhenyu Hu, Qing Wang, Te Cao, Luo Liao, Longfei Lu, Liqun Liu, Shuang Li, Hang Chen, Mengge Xue, Yuan Chen, Chao Deng, Peng Shu, Huan Yu, Jie Jiang (2026-03-10, cs)
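One plausible way an identity-consistency score like SICS could be computed (the benchmark's actual definition may differ) is the mean cosine similarity between an embedding of the reference subject image and embeddings of the generated images; the embedding model is left abstract in the sketch below.

```python
# Hypothetical identity-consistency score: average cosine similarity between a
# reference subject embedding and embeddings of generated images.
import numpy as np

def identity_consistency_score(ref_embedding: np.ndarray, gen_embeddings: np.ndarray) -> float:
    """ref_embedding: (D,), gen_embeddings: (N, D); returns mean cosine similarity."""
    ref = ref_embedding / np.linalg.norm(ref_embedding)
    gen = gen_embeddings / np.linalg.norm(gen_embeddings, axis=1, keepdims=True)
    return float((gen @ ref).mean())

# toy usage with random vectors standing in for image embeddings
rng = np.random.default_rng(0)
print(identity_consistency_score(rng.normal(size=512), rng.normal(size=(4, 512))))
```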

SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving

The paper proposes SAMoE-VLA, a novel Vision-Language-Action framework for autonomous driving that replaces unstable token-level Mixture-of-Experts with a scene-adaptive mechanism driven by bird's-eye-view features and a conditional cross-modal causal attention module, achieving state-of-the-art performance with fewer parameters on both open-loop and closed-loop benchmarks.

Zihan You, Hongwei Liu, Chenxu Dang, Zhe Wang, Sining Ang, Aoqi Wang, Yan Wang (2026-03-10, cs)
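A minimal sketch of scene-level (rather than token-level) expert routing: a pooled bird's-eye-view descriptor produces one gate per scene, and every token in that scene is processed by the same expert mixture. Shapes and module choices are illustrative assumptions, not SAMoE-VLA's architecture.

```python
# Hypothetical scene-adaptive mixture-of-experts gated by BEV features.
import torch
import torch.nn as nn

class SceneAdaptiveMoE(nn.Module):
    def __init__(self, bev_dim: int, token_dim: int, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(bev_dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(token_dim, token_dim), nn.GELU(), nn.Linear(token_dim, token_dim))
            for _ in range(num_experts)
        ])

    def forward(self, bev: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # bev: (B, C, H, W) bird's-eye-view features; tokens: (B, T, D)
        scene = bev.mean(dim=(2, 3))                        # (B, C) scene descriptor
        weights = self.gate(scene).softmax(dim=-1)          # (B, E) one gate per scene
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=1)  # (B, E, T, D)
        return (weights[:, :, None, None] * expert_out).sum(dim=1)

moe = SceneAdaptiveMoE(bev_dim=64, token_dim=128)
out = moe(torch.randn(2, 64, 16, 16), torch.randn(2, 10, 128))
print(out.shape)  # torch.Size([2, 10, 128])
```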

UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing

UniGround introduces a novel, training-free framework for universal 3D visual grounding that leverages global candidate filtering and local precision reasoning to achieve state-of-the-art zero-shot performance in localizing arbitrary objects within complex 3D environments without relying on pre-trained models or 3D supervision.

Jiaxi Zhang, Yunheng Wang, Wei Lu, Taowen Wang, Weisheng Xu, Shuning Zhang, Yixiao Feng, Yuetong Fang, Renjing Xu (2026-03-10, cs)
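A toy two-stage sketch of the global-filter-then-local-rerank idea, assuming precomputed candidate embeddings and 3D centers; the scoring functions are simple stand-ins, not UniGround's scene-parsing or reasoning steps.

```python
# Hypothetical two-stage grounding: global semantic filtering, then local reranking.
import numpy as np

def ground_query(query_emb, cand_embs, cand_centers, anchor_center, keep_top=5):
    """query_emb: (D,), cand_embs: (N, D), cand_centers: (N, 3), anchor_center: (3,)."""
    # global stage: keep the candidates most similar to the query text
    sims = cand_embs @ query_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    kept = np.argsort(sims)[::-1][:keep_top]
    # local stage: among survivors, prefer the one closest to a referenced anchor
    dists = np.linalg.norm(cand_centers[kept] - anchor_center, axis=1)
    return int(kept[np.argmin(dists)])

# toy usage with random embeddings and object centers
rng = np.random.default_rng(1)
idx = ground_query(rng.normal(size=32), rng.normal(size=(20, 32)),
                   rng.uniform(size=(20, 3)), np.array([0.5, 0.5, 0.0]))
print(idx)
```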

Fast Low-light Enhancement and Deblurring for 3D Dark Scenes

FLED-GS is a fast framework for novel view synthesis in 3D dark scenes that addresses compound low-light, noise, and motion blur degradations by reformulating restoration as an alternating cycle of 2D deblurring and noise-aware 3D Gaussian Splatting reconstruction, achieving superior performance with significantly faster training and rendering speeds compared to state-of-the-art methods.

Feng Zhang, Jinglong Wang, Ze Li, Yanghong Zhou, Yang Chen, Lei Chen, Xiatian Zhu (2026-03-10, cs)
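A toy sketch of the alternating restoration cycle described above, using unsharp-mask deblurring and variance-weighted fusion as stand-ins for the paper's 2D deblurring and noise-aware 3D Gaussian Splatting steps; only the alternation structure is illustrated.

```python
# Toy alternating loop: deblur each view, fuse with noise-aware weights, feed back.
import numpy as np
from scipy.ndimage import gaussian_filter

def deblur(view, amount=1.0, sigma=1.5):
    # unsharp masking as a crude stand-in for a learned 2D deblurring step
    return view + amount * (view - gaussian_filter(view, sigma))

def noise_aware_fuse(views):
    # weight each view inversely to its estimated noise (high-frequency variance proxy)
    noise = np.array([np.var(v - gaussian_filter(v, 1.0)) for v in views]) + 1e-8
    w = (1.0 / noise) / (1.0 / noise).sum()
    return np.tensordot(w, np.stack(views), axes=1)

views = [np.random.rand(64, 64) for _ in range(4)]
for _ in range(3):
    views = [deblur(v) for v in views]                 # restoration pass
    fused = noise_aware_fuse(views)                    # stand-in for 3D reconstruction
    views = [0.5 * v + 0.5 * fused for v in views]     # feed the reconstruction back
print(fused.shape)
```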

MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data

This paper introduces MV-Fashion, a large-scale multi-view video dataset featuring 3,273 sequences with pixel-level annotations, ground-truth material properties, and paired flat/worn garment images, designed to bridge the realism and annotation gaps in existing datasets for virtual try-on and size estimation tasks.

Hunor Laczkó, Libang Jia, Loc-Phat Truong, Diego Hernández, Sergio Escalera, Jordi Gonzalez, Meysam Madadi (2026-03-10, cs)

MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals

The paper introduces MERLIN, a novel training framework accompanied by the EM-100k dataset and EM-Bench benchmark, to overcome data scarcity, evaluation gaps, and low-SNR fragility in building robust Multimodal Large Language Models for electromagnetic signals.

Junyu Shen, Zhendong She, Chenghanyu Zhang, Yuchuang Sun, Luqing Luo, Dingwei Tan, Zonghao Guo, Bo Guo, Zehua Han, Wupeng Xie, Yaxin Mu, Peng Zhang, Peipei Li, Fengxiang Wang, Yangang Sun, Maosong Sun (2026-03-10, cs)

MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data

This paper proposes MM-TS, a novel framework for multi-modal contrastive learning that dynamically adjusts temperature and margin schedules based on local data distribution to address long-tail imbalances, unifying InfoNCE and max-margin objectives to achieve state-of-the-art performance across multiple image- and video-language datasets.

Siarhei Sheludzko, Dhimitrios Duka, Bernt Schiele, Hilde Kuehne, Anna Kukleva (2026-03-10, cs)
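A minimal sketch of an InfoNCE-style loss with a per-sample temperature and an additive margin on the positive pair, using batch-local similarity as a toy density signal; the density heuristic and schedules below are illustrative assumptions, not the actual MM-TS schedules.

```python
# Hypothetical contrastive loss with per-sample temperature and margin schedules.
import torch
import torch.nn.functional as F

def scheduled_contrastive_loss(img, txt, base_tau=0.07, base_margin=0.1):
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    sim = img @ txt.t()                                   # (B, B) image-text similarities
    # toy local-distribution signal: mean similarity of each sample to the batch
    density = sim.mean(dim=1, keepdim=True).detach().clamp(min=0.0)
    tau = base_tau * (1.0 + density)                      # denser regions -> softer logits
    margin = base_margin * (1.0 - density)                # sparser (tail) samples -> larger margin
    eye = torch.eye(sim.size(0), device=sim.device)
    logits = (sim - margin * eye) / tau                   # margin tightens the positive pair
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(logits, labels)

# toy usage with random image and text embeddings
print(scheduled_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)))
```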