cs.CV papers | Gist.Science

Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

This paper uses mechanistic interpretability to map the internal information flow of VideoLLMs, revealing a consistent three-stage pathway of cross-frame interaction, video-language integration, and answer generation that enables effective temporal reasoning and allows a substantial share of attention edges to be pruned without performance loss.

Minji Kim, Taekyung Kim, Bohyung Han | 2026-03-04 | 💻 cs

Self-Aug: Query and Entropy Adaptive Decoding for Large Vision-Language Models

This paper introduces Self-Aug, a training-free decoding strategy for Large Vision-Language Models that combines query-dependent self-augmentation prompting with entropy-adaptive thresholding to significantly reduce hallucinations and improve factual consistency.

Eun Woo Im, Muhammad Kashif Ali, Vivek Gupta | 2026-03-04 | 🤖 cs.AI

Inpainting the Red Planet: Diffusion Models for the Reconstruction of Martian Environments in Virtual Reality

This paper proposes an unconditional diffusion model trained on augmented HiRISE heightmaps to reconstruct missing Martian terrain data in virtual reality, demonstrating superior accuracy and perceptual similarity compared to traditional interpolation methods.

Giuseppe Lorenzo Catalano, Agata Marta Soccini | 2026-03-04 | 🤖 cs.AI

CASR-Net: An Image Processing-focused Deep Learning-based Coronary Artery Segmentation and Refinement Network for X-ray Coronary Angiogram

This paper introduces CASR-Net, a three-stage deep learning pipeline featuring a novel multichannel preprocessing strategy and a Self-ONN-based UNet architecture that achieves state-of-the-art coronary artery segmentation and refinement on X-ray angiograms, thereby enhancing the accuracy of coronary artery disease diagnosis.

Alvee Hassan, Rusab Sarmun, Muhammad E. H. Chowdhury + 4 more | 2026-03-04 | 🤖 cs.AI

Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects

Kinematify is an automated framework that synthesizes physically consistent, high-degree-of-freedom articulated objects directly from arbitrary RGB images or text by combining Monte Carlo Tree Search for kinematic topology inference with geometry-driven optimization for joint parameter estimation.

Jiawei Wang, Dingyou Wang, Jiaming Hu + 3 more | 2026-03-04 | 💻 cs

Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision

This paper introduces DetGain, an architecture-agnostic online data curation method for object detection that dynamically selects informative training samples by estimating their marginal contributions to dataset-level Average Precision, thereby improving accuracy and robustness across various detectors.

Zitang Sun, Masakazu Yoshimura, Junji Otsuka + 2 more | 2026-03-04 | 💻 cs

PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation

PrismAudio is a novel video-to-audio generation framework that addresses objective entanglement and human preference alignment by integrating a decomposed Chain-of-Thought reasoning structure with multi-dimensional rewards and a computationally efficient Fast-GRPO algorithm, achieving state-of-the-art performance across semantic, temporal, aesthetic, and spatial dimensions.

Huadai Liu, Kaicheng Luo, Wen Wang + 6 more | 2026-03-04 | ⚡ eess

Markovian Scale Prediction: A New Era of Visual Autoregressive Generation

The paper introduces Markov-VAR, a novel visual autoregressive model that reformulates next-scale prediction as a Markov process using a sliding window to compress historical context, thereby significantly improving both generation quality and computational efficiency compared to traditional full-context VAR approaches.

Yu Zhang, Jingyi Liu, Yiwei Shi + 4 more | 2026-03-04 | 💻 cs

ALARM: Automated MLLM-Based Anomaly Detection in Complex-EnviRonment Monitoring with Uncertainty Quantification

This paper introduces ALARM, an automated framework that leverages multi-modal large language models integrated with uncertainty quantification and quality-assurance techniques to achieve robust and reliable visual anomaly detection in complex, ambiguous environments.

Congjing Zhang, Feng Lin, Xinyi Zhao + 5 more | 2026-03-04 | 🤖 cs.AI

Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation

This paper proposes SSMP, a novel self-paced and self-corrective masked prediction method that overcomes the error propagation limitations of traditional selection-then-ranking paradigms by employing bi-directional contextual modeling and progressive refinement to achieve state-of-the-art automatic movie trailer generation.

Sidan Zhu, Hongteng Xu, Dixin Luo | 2026-03-04 | 💻 cs

Value Gradient Guidance for Flow Matching Alignment

This paper introduces VGG-Flow, a gradient-matching finetuning method grounded in optimal control theory that aligns flow matching models with human preferences by matching velocity field differences to value function gradients, thereby achieving efficient adaptation and robust prior preservation on models like Stable Diffusion 3.

Zhen Liu, Tim Z. Xiao, Carles Domingo-Enrich + 2 more | 2026-03-04 | 🤖 cs.LG

Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

The paper presents AVI-Edit, a novel framework that achieves precise audio-sync video instance editing by employing a granularity-aware mask refiner for spatial accuracy, a self-feedback audio agent for temporal control, and a newly constructed large-scale dataset.

Haojie Zheng, Shuchen Weng, Jingqi Liu + 3 more | 2026-03-04 | 💻 cs

CHAMMI-75: Pre-training multi-channel models with heterogeneous microscopy images

The paper introduces CHAMMI-75, a diverse open-access dataset of 75 heterogeneous multi-channel microscopy studies, which enables the training of channel-adaptive machine learning models to overcome the limitations of specialized, single-modality approaches in quantifying cellular morphology.

Vidit Agrawal, John Peters, Tyler N. Thompson + 13 more | 2026-03-04 | 🤖 cs.LG

UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving

UniDrive-WM is a unified vision-language model that integrates scene understanding, trajectory planning, and trajectory-conditioned future image generation into a single architecture, achieving state-of-the-art performance on the Bench2Drive benchmark by leveraging generative predictions to iteratively refine planning and enhance scene comprehension.

Zhexiao Xiong, Xin Ye, Burhan Yaman + 5 more | 2026-03-04 | 💻 cs

Hot-Start from Pixels: Low-Resolution Visual Tokens for Chinese Language Modeling

This paper demonstrates that low-resolution visual inputs (as small as 8×8 pixels) can effectively replace traditional index-based tokens for Chinese language modeling, achieving comparable accuracy while exhibiting a significantly faster "hot-start" learning phase.

Shuyang Xiang, Hao Guan | 2026-03-04 | 🤖 cs.AI

Unsupervised Deformable Image Registration with Local-Global Attention and Image Decomposition

This paper introduces LGANet++, an unsupervised deformable image registration framework that leverages a novel local-global attention mechanism and feature decomposition to achieve superior accuracy and generalizability across cross-patient, cross-time, and cross-modal medical imaging scenarios compared to state-of-the-art methods.

Zhengyong Huang, Xingwen Sun, Xuting Chang + 5 more | 2026-03-04 | ⚡ eess

Graph Recognition via Subgraph Prediction

This paper introduces GraSP, a unified and transferable method for visual graph recognition that predicts subgraphs to overcome the limitations of existing task-specific solutions across diverse graph types and contexts.

André Eberhard, Gerhard Neumann, Pascal Friederich | 2026-03-04 | 🤖 cs.LG

MLV-Edit: Towards Consistent and Highly Efficient Editing for Minute-Level Videos

MLV-Edit is a training-free, flow-based framework that enables consistent and efficient editing of minute-level videos by employing a divide-and-conquer strategy with Velocity Blend and Attention Sink modules to resolve motion inconsistencies and prevent structural drift across long sequences.

Yangyi Cao, Yuanhang Li, Lan Chen + 1 more | 2026-03-04 | 💻 cs

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

VideoTemp-o3 is a unified agentic framework that harmonizes temporal grounding and video understanding through a flexible, self-correcting localization pipeline, specialized training mechanisms, and a new high-quality benchmark to overcome the limitations of conventional uniform sampling in long-video analysis.

Wenqi Liu, Yunxiao Wang, Shijie Ma + 14 more | 2026-03-04 | 🤖 cs.AI

WristMIR: Coarse-to-Fine Region-Aware Retrieval of Pediatric Wrist Radiographs with Radiology Report-Driven Learning

WristMIR is a novel framework that leverages radiology report-driven learning and a two-stage coarse-to-fine retrieval process to effectively identify pediatric wrist radiographs with analogous fracture patterns, significantly outperforming existing baselines in retrieval accuracy, fracture classification, and clinical relevance without requiring manual image annotations.

Mert Sonmezer, Serge Vasylechko, Duygu Atasoy + 2 more | 2026-03-04 | 💻 cs
