DVD-Quant: Data-free Video Diffusion Transformers Quantization

This paper introduces DVD-Quant, a novel data-free post-training quantization framework for Video Diffusion Transformers that utilizes Bounded-init Grid Refinement, Auto-scaling Rotated Quantization, and δ-Guided Bit Switching to achieve a 2× speedup and enable W4A4 quantization without compromising visual fidelity.

Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Haotong Qin, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang · 2026-03-09 · cs
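W4A4 means both weights and activations are stored in 4 bits. As a point of reference only, here is a minimal sketch of generic symmetric per-tensor 4-bit quantization; it is not DVD-Quant's actual scheme (which adds grid refinement, rotation, and bit switching), and all names are illustrative.

```python
import numpy as np

def quantize_symmetric(x, bits=4):
    """Generic symmetric uniform quantization to `bits` signed bits."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit signed
    scale = np.abs(x).max() / qmax        # per-tensor scale factor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float values."""
    return q.astype(np.float32) * scale

x = np.random.randn(64).astype(np.float32)
q, s = quantize_symmetric(x, bits=4)
x_hat = dequantize(q, s)
# Rounding error of each element is bounded by scale / 2.
```

Data-free PTQ methods like the one summarized above must pick `scale` (and any rotation or bit-width decisions) without access to calibration data, which is what makes the W4A4 regime difficult.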

Instance Data Condensation for Image Super-Resolution

This paper introduces Instance Data Condensation (IDC), a novel framework utilizing Random Local Fourier Feature Extraction and Multi-level Feature Distribution Matching to synthesize a highly compact (10% volume) dataset for Image Super-Resolution that achieves performance comparable to the original full dataset while significantly reducing computational and storage requirements.

Tianhao Peng, Ho Man Kwan, Yuxuan Jiang, Ge Gao, Fan Zhang, Xiaozhong Xu, Shan Liu, David Bull · 2026-03-09 · cs
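For context on the "Random Local Fourier Feature Extraction" named above, a minimal sketch of the generic random Fourier feature projection follows; the paper's local, multi-level variant is more involved, and the function and parameter names here are hypothetical.

```python
import numpy as np

def random_fourier_features(patch, num_feats=64, sigma=1.0, seed=0):
    """Project a flattened image patch onto random cosine bases
    (textbook RFF approximation of a Gaussian kernel)."""
    rng = np.random.default_rng(seed)
    d = patch.size
    W = rng.normal(0.0, 1.0 / sigma, size=(num_feats, d))   # random frequencies
    b = rng.uniform(0.0, 2 * np.pi, size=num_feats)         # random phases
    return np.sqrt(2.0 / num_feats) * np.cos(W @ patch.ravel() + b)

z = random_fourier_features(np.ones((4, 4), dtype=np.float32))
```

Matching the distribution of such features between a small synthetic set and the full dataset is one standard way to condense data while preserving training signal.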

A Multi-Agent System Enables Versatile Information Extraction from the Chemical Literature

This paper presents a multimodal large language model-based multi-agent system that significantly outperforms existing state-of-the-art methods in automatically extracting structured chemical information from diverse and complex literature graphics, thereby advancing AI-driven chemical research.

Yufan Chen, Ching Ting Leung, Bowen Yu, Jianwei Sun, Yong Huang, Linyan Li, Hao Chen, Hanyu Gao · 2026-03-09 · cs.AI

MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing

This paper introduces MAP, a training-free decoding method that mitigates hallucinations in Large Vision-Language Models by interpreting hidden states as a 2D semantic map and employing layer-wise criss-cross attention and global-local logit fusion to aggregate widely distributed factual information for improved factual consistency.

Chenxi Li, Yichen Guo, Benfang Qian, Jinhao You, Kai Tang, Yaosong Du, Zonghao Zhang, Xiande Huang · 2026-03-09 · cs.AI

SGDFuse: SAM-Guided Diffusion Model for High-Fidelity Infrared and Visible Image Fusion

The paper proposes SGDFuse, a novel two-stage conditional diffusion model guided by Segment Anything Model (SAM) semantic masks, which achieves high-fidelity infrared and visible image fusion by leveraging explicit semantic priors to preserve key targets and minimize artifacts for superior downstream task performance.

Xiaoyang Zhang, Jinjiang Li, Guodong Fan, Yakun Ju, Linwei Fan, Jun Liu, Alex C. Kot · 2026-03-09 · cs.AI