cs.CV papers | Gist.Science

PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation

PrismAudio is a novel video-to-audio generation framework that addresses objective entanglement and human preference alignment by integrating a decomposed Chain-of-Thought reasoning structure with multi-dimensional rewards and a computationally efficient Fast-GRPO algorithm, achieving state-of-the-art performance across semantic, temporal, aesthetic, and spatial dimensions.

Huadai Liu, Kaicheng Luo, Wen Wang + 6 more | 2026-03-04 | ⚡ eess

Markovian Scale Prediction: A New Era of Visual Autoregressive Generation

The paper introduces Markov-VAR, a novel visual autoregressive model that reformulates next-scale prediction as a Markov process using a sliding window to compress historical context, thereby significantly improving both generation quality and computational efficiency compared to traditional full-context VAR approaches.

Yu Zhang, Jingyi Liu, Yiwei Shi + 4 more | 2026-03-04 | 💻 cs

ALARM: Automated MLLM-Based Anomaly Detection in Complex-EnviRonment Monitoring with Uncertainty Quantification

This paper introduces ALARM, an automated framework that leverages multi-modal large language models integrated with uncertainty quantification and quality-assurance techniques to achieve robust and reliable visual anomaly detection in complex, ambiguous environments.

Congjing Zhang, Feng Lin, Xinyi Zhao + 5 more | 2026-03-04 | 🤖 cs.AI

Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation

This paper proposes SSMP, a novel self-paced and self-corrective masked prediction method that overcomes the error propagation limitations of traditional selection-then-ranking paradigms by employing bi-directional contextual modeling and progressive refinement to achieve state-of-the-art automatic movie trailer generation.

Sidan Zhu, Hongteng Xu, Dixin Luo | 2026-03-04 | 💻 cs

Value Gradient Guidance for Flow Matching Alignment

This paper introduces VGG-Flow, a gradient-matching finetuning method grounded in optimal control theory that aligns flow matching models with human preferences by matching velocity field differences to value function gradients, thereby achieving efficient adaptation and robust prior preservation on models like Stable Diffusion 3.

Zhen Liu, Tim Z. Xiao, Carles Domingo-Enrich + 2 more | 2026-03-04 | 🤖 cs.LG

Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

The paper presents AVI-Edit, a novel framework that achieves precise audio-sync video instance editing by employing a granularity-aware mask refiner for spatial accuracy, a self-feedback audio agent for temporal control, and a newly constructed large-scale dataset.

Haojie Zheng, Shuchen Weng, Jingqi Liu + 3 more | 2026-03-04 | 💻 cs

CHAMMI-75: Pre-training multi-channel models with heterogeneous microscopy images

The paper introduces CHAMMI-75, a diverse open-access dataset of 75 heterogeneous multi-channel microscopy studies, which enables the training of channel-adaptive machine learning models to overcome the limitations of specialized, single-modality approaches in quantifying cellular morphology.

Vidit Agrawal, John Peters, Tyler N. Thompson + 13 more | 2026-03-04 | 🤖 cs.LG

UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving

UniDrive-WM is a unified vision-language model that integrates scene understanding, trajectory planning, and trajectory-conditioned future image generation into a single architecture, achieving state-of-the-art performance on the Bench2Drive benchmark by leveraging generative predictions to iteratively refine planning and enhance scene comprehension.

Zhexiao Xiong, Xin Ye, Burhan Yaman + 5 more | 2026-03-04 | 💻 cs

Hot-Start from Pixels: Low-Resolution Visual Tokens for Chinese Language Modeling

This paper demonstrates that low-resolution visual inputs (as small as 8x8 pixels) can effectively replace traditional index-based tokens for Chinese language modeling, achieving comparable accuracy while exhibiting a significantly faster "hot-start" learning phase.

Shuyang Xiang, Hao Guan | 2026-03-04 | 🤖 cs.AI

Unsupervised Deformable Image Registration with Local-Global Attention and Image Decomposition

This paper introduces LGANet++, an unsupervised deformable image registration framework that leverages a novel local-global attention mechanism and feature decomposition to achieve superior accuracy and generalizability across cross-patient, cross-time, and cross-modal medical imaging scenarios compared to state-of-the-art methods.

Zhengyong Huang, Xingwen Sun, Xuting Chang + 5 more | 2026-03-04 | ⚡ eess

Graph Recognition via Subgraph Prediction

This paper introduces GraSP, a unified and transferable method for visual graph recognition that predicts subgraphs to overcome the limitations of existing task-specific solutions across diverse graph types and contexts.

André Eberhard, Gerhard Neumann, Pascal Friederich | 2026-03-04 | 🤖 cs.LG

MLV-Edit: Towards Consistent and Highly Efficient Editing for Minute-Level Videos

MLV-Edit is a training-free, flow-based framework that enables consistent and efficient editing of minute-level videos by employing a divide-and-conquer strategy with Velocity Blend and Attention Sink modules to resolve motion inconsistencies and prevent structural drift across long sequences.

Yangyi Cao, Yuanhang Li, Lan Chen + 1 more | 2026-03-04 | 💻 cs

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

VideoTemp-o3 is a unified agentic framework that harmonizes temporal grounding and video understanding through a flexible, self-correcting localization pipeline, specialized training mechanisms, and a new high-quality benchmark to overcome the limitations of conventional uniform sampling in long-video analysis.

Wenqi Liu, Yunxiao Wang, Shijie Ma + 14 more | 2026-03-04 | 🤖 cs.AI

WristMIR: Coarse-to-Fine Region-Aware Retrieval of Pediatric Wrist Radiographs with Radiology Report-Driven Learning

WristMIR is a novel framework that leverages radiology report-driven learning and a two-stage coarse-to-fine retrieval process to effectively identify pediatric wrist radiographs with analogous fracture patterns, significantly outperforming existing baselines in retrieval accuracy, fracture classification, and clinical relevance without requiring manual image annotations.

Mert Sonmezer, Serge Vasylechko, Duygu Atasoy + 2 more | 2026-03-04 | 💻 cs

The Garbage Dataset (GD): A Multi-Class Image Benchmark for Automated Waste Segregation

This paper introduces the Garbage Dataset (GD), a publicly available benchmark of 12,259 labeled images across 10 waste categories, and evaluates its effectiveness using state-of-the-art deep learning models to advance automated waste segregation while addressing challenges like class imbalance, background complexity, and environmental trade-offs.

Suman Kunwar | 2026-03-04 | 💻 cs

EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data

The paper introduces EO-VAE, a unified multi-sensor variational autoencoder that employs dynamic hypernetworks to efficiently tokenize diverse Earth observation data with flexible channel combinations, achieving superior reconstruction fidelity compared to existing modality-specific tokenizers.

Nils Lehmann, Yi Wang, Zhitong Xiong + 1 more | 2026-03-04 | 💻 cs

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

MedXIAOHE is a state-of-the-art medical vision-language foundation model that leverages an entity-aware continual pretraining framework, reinforcement learning, and tool-augmented agentic training to achieve superior diagnostic reasoning, reliability, and performance across diverse medical benchmarks.

Baorong Shi, Bo Cui, Boyuan Jiang + 17 more | 2026-03-04 | ⚡ eess

UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling

This paper introduces UniTAF, a modular framework that unifies Text-to-Speech and Audio-to-Face models to enable internal feature transfer and emotion control. Its aim is to validate the feasibility of reusing intermediate representations for improved audio-facial consistency, rather than to maximize generation quality.

Qiangong Zhou, Nagasaka Tomohiro | 2026-03-04 | ⚡ eess

CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion

CRAFT-LoRA is a novel framework that achieves high-fidelity, personalized image generation with improved content-style disentanglement and flexible semantic control by combining rank-constrained fine-tuning, prompt-guided expert aggregation, and a training-free timestep-dependent guidance scheme, all without requiring additional retraining overhead.

Yu Li, Yujun Cai, Chi Zhang | 2026-03-04 | 💻 cs

Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

This paper introduces CFE-Bench, a challenging multimodal benchmark derived from authentic university exam problems across 20+ STEM domains. It reveals that even frontier models, despite achieving moderate overall accuracy, struggle to maintain correct intermediate states and to solve problems in an efficient number of steps during multi-step reasoning.

Chongyang Gao, Diji Yang, Shuyan Zhou + 4 more | 2026-03-04 | 💬 cs.CL