HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

This paper introduces HSSBench, a comprehensive multilingual benchmark of over 13,000 samples generated through a novel expert-agent collaboration pipeline, designed to evaluate Multimodal Large Language Models on the interdisciplinary and abstract reasoning tasks characteristic of the Humanities and Social Sciences, where current models remain limited.

Zhaolu Kang, Junhao Gong, Jiaxu Yan + 15 more · 2026-03-04 · cs.AI

Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward

Perception-R1 addresses the limitation of existing RLVR methods in enhancing multimodal perception by introducing a novel visual perception reward derived from Chain-of-Thought annotations, which effectively boosts both perception and reasoning capabilities of Multimodal Large Language Models to achieve state-of-the-art performance with minimal training data.

Tong Xiao, Xin Xu, Zhenya Huang + 4 more · 2026-03-04 · cs.AI

StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams

StreamSplat is a fully feed-forward framework that enables real-time, online reconstruction of dynamic 3D scenes from uncalibrated video streams into 3D Gaussian Splatting representations, achieving state-of-the-art quality with a 1200x speedup over traditional optimization-based methods through probabilistic sampling, bidirectional deformation, and adaptive Gaussian fusion.

Zike Wu, Qi Yan, Xuanyu Yi + 2 more · 2026-03-04 · cs.LG

SceneStreamer: Continuous Scenario Generation as Next Token Group Prediction

SceneStreamer is a unified autoregressive transformer framework that generates continuous, long-horizon traffic scenarios by predicting token sequences representing dynamic elements such as agents and traffic signals, producing realistic, diverse, and adaptive environments that improve the robustness and generalization of autonomous driving policies.

Zhenghao Peng, Yuxin Liu, Bolei Zhou · 2026-03-04 · cs
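The "next token group prediction" pattern behind SceneStreamer can be illustrated with a toy autoregressive loop: at each step a model emits a group of tokens (one per scene element) conditioned on everything generated so far, and the group is appended to the context. This is a minimal generic sketch, not the paper's transformer; `toy_model`, the vocabulary size, and the group size are all hypothetical stand-ins.

```python
import numpy as np

# Toy autoregressive next-token-group decoding. Illustrative only:
# toy_model stands in for a trained transformer over scenario tokens.
rng = np.random.default_rng(1)
VOCAB, GROUP = 16, 3                 # token vocabulary, tokens per group


def toy_model(history):
    """Hypothetical stand-in: returns logits for each slot in the group,
    with a fake dependence on the length of the generated history."""
    bias = len(history) % VOCAB
    return rng.normal(0.0, 1.0, (GROUP, VOCAB)) + 2.0 * np.eye(VOCAB)[bias]


history = []
for step in range(5):                # generate 5 groups autoregressively
    logits = toy_model(history)
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    group = [int(rng.choice(VOCAB, p=p)) for p in probs]
    history.extend(group)            # each group feeds back into the context
```

The key design point the summary describes is that one decoding step advances every dynamic element at once (a token *group*), rather than one element at a time, which keeps long-horizon generation a single streaming loop.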

MC-INR: Efficient Encoding of Multivariate Scientific Simulation Data using Meta-Learning and Clustered Implicit Neural Representations

This paper proposes MC-INR, a novel framework that leverages meta-learning, dynamic error-based re-clustering, and a branched architecture to efficiently encode complex multivariate scientific simulation data on unstructured grids, overcoming the inflexibility and single-variable limitations of existing Implicit Neural Representation methods.

Hyunsoo Son, Jeonghyun Noh, Suemin Jeon + 2 more · 2026-03-04 · cs.LG

InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

InstructVLA introduces a novel Vision-Language-Action Instruction Tuning paradigm that successfully bridges flexible multimodal reasoning and precise manipulation by jointly optimizing embodied reasoning and action generation, thereby achieving state-of-the-art performance in both simulated and real-world robotic tasks without sacrificing pre-trained capabilities.

Shuai Yang, Hao Li, Bin Wang + 7 more · 2026-03-04 · cs

Are VLMs Ready for Lane Topology Awareness in Autonomous Driving?

This paper systematically evaluates Vision-Language Models' capabilities in autonomous driving lane topology awareness through a new BEV-based diagnostic framework, revealing that while performance correlates with model size and reasoning depth, current models—including frontier closed-source systems—still struggle with fundamental spatial reasoning tasks essential for safe navigation.

Xin Chen, Jia He, Maozheng Li + 5 more · 2026-03-04 · cs