CompBench: Benchmarking Complex Instruction-guided Image Editing

This paper introduces CompBench, a large-scale benchmark featuring fine-grained instructions and an MLLM-human collaborative framework to rigorously evaluate and expose the limitations of current models in complex, instruction-guided image editing tasks.

Bohan Jia, Wenxuan Huang, Yuntian Tang, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, Lei Bai, Wanli Ouyang, Lin Chen, Fei Zhao, Yao Hu, Zihan Wang, Yuan (…) · 2026-03-24 · cs

Foresight Diffusion: Improving Sampling Consistency in Predictive Diffusion Models

This paper introduces Foresight Diffusion (ForeDiff), a framework that enhances sampling consistency in predictive diffusion models by decoupling condition understanding from target denoising through a separate deterministic predictive stream, thereby improving both accuracy and consistency in robot video and scientific spatiotemporal forecasting tasks.

Yu Zhang, Xingzhuo Guo, Haoran Xu, Jialong Wu, Mingsheng Long · 2026-03-24 · cs

Frequency-Adaptive Discrete Cosine-ViT-ResNet Architecture for Sparse-Data Vision

This paper proposes a hybrid deep-learning framework that combines an adaptive Discrete Cosine Transform preprocessing module with ViT-B16 and ResNet50 backbones and a Bayesian linear classifier. By fusing learned frequency-domain cues with multi-scale spatial representations, it achieves state-of-the-art rare-animal image classification under extreme data scarcity.

Ziyue Kang, Weichuan Zhang · 2026-03-24 · cs
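The paper's adaptive DCT module is not specified in this summary, but the core idea of frequency-domain preprocessing can be illustrated with a minimal numpy sketch: transform an image into the DCT domain, keep a block of low-frequency coefficients, and invert. The function names `dct_matrix` and `lowpass_dct` and the `keep` parameter are illustrative, not from the paper.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (x + 0.5) * k / n)
    m[0, :] /= np.sqrt(2.0)
    return m

def lowpass_dct(img, keep=8):
    """2D DCT -> keep only the top-left keep x keep coefficients -> inverse DCT."""
    n = img.shape[0]
    C = dct_matrix(n)
    coeffs = C @ img @ C.T            # forward 2D DCT (separable)
    mask = np.zeros_like(coeffs)
    mask[:keep, :keep] = 1.0          # low frequencies live in the top-left corner
    return C.T @ (coeffs * mask) @ C  # inverse 2D DCT (orthonormal transpose)

img = np.random.rand(32, 32)
recon = lowpass_dct(img, keep=32)     # keeping all coefficients is lossless
assert np.allclose(recon, img)
```

In a pipeline like the one described, the filtered (or coefficient-space) representation would be fed to the ViT/ResNet backbones alongside the raw image; the adaptive part presumably learns which frequency bands to keep.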

SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning

This paper introduces SynPO, a preference optimization framework for fine-grained video captioning that synergizes descriptiveness and preference learning by constructing cost-effective preference pairs and eliminating the reference model; it outperforms DPO variants while improving training efficiency and preserving language capabilities.

Jisheng Dang, Yizhou Zhang, Hao Ye, Teng Wang, Siming Chen, Huicheng Zheng, Yulan Guo, Jianhuang Lai, Bin Hu · 2026-03-24 · cs.AI
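SynPO's exact objective is not reproduced here, but "eliminating the reference model" places it in the family of reference-free pairwise preference losses (in the spirit of DPO without the frozen reference policy). A minimal numpy sketch, with the function name and `beta` scale chosen for illustration:

```python
import numpy as np

def ref_free_pref_loss(logp_chosen, logp_rejected, beta=0.1):
    """Reference-free pairwise preference loss:
    -log sigmoid(beta * (log p(chosen) - log p(rejected))),
    computed as softplus(-margin) for numerical stability."""
    margin = beta * (np.asarray(logp_chosen) - np.asarray(logp_rejected))
    return float(np.mean(np.log1p(np.exp(-margin))))

# A policy that assigns higher log-probability to the preferred caption
# incurs a lower loss than one that prefers the rejected caption.
good = ref_free_pref_loss(logp_chosen=[-10.0], logp_rejected=[-20.0])
bad = ref_free_pref_loss(logp_chosen=[-20.0], logp_rejected=[-10.0])
assert good < bad
```

Dropping the reference model removes one forward pass per example, which is where the training-efficiency gain over standard DPO comes from.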

Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments

This paper introduces MicroG-4M, the first benchmark dataset comprising over 4,700 clips from real space missions and simulations to address the critical gap in video understanding for microgravity environments by supporting action recognition, video captioning, and visual question answering tasks.

Di Wen, Lei Qi, Kunyu Peng, Kailun Yang, Fei Teng, Ao Luo, Jia Fu, Yufan Chen, Ruiping Liu, Yitian Shi, M. Saquib Sarfraz, Rainer Stiefelhagen · 2026-03-24 · cs

From Explanations to Architecture: Explainability-Driven CNN Refinement for Brain Tumor Classification in MRI

This paper proposes an explainability-driven framework that utilizes Grad-CAM to iteratively refine and prune a CNN architecture for brain tumor classification, achieving high accuracy and strong generalization while ensuring model transparency and clinical trustworthiness.

Rajan Das Gupta, Md Imrul Hasan Showmick, Lei Wei, Mushfiqur Rahman Abir, Shanjida Akter, Md. Yeasin Rahat, Md. Jakir Hossen · 2026-03-24 · eess
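The paper's refinement loop is not detailed in this summary, but the Grad-CAM signal it relies on is standard: gradients of the class score are global-average-pooled into per-channel weights, which reweight the convolutional activation maps. A numpy sketch of that weighting (the activations and gradients would come from a backward hook in a real CNN):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from a conv layer's activations (C, H, W) and the
    gradients of the class score w.r.t. those activations (C, H, W)."""
    weights = gradients.mean(axis=(1, 2))               # GAP the gradients per channel
    cam = np.einsum('c,chw->hw', weights, activations)  # weighted sum of channel maps
    cam = np.maximum(cam, 0)                            # ReLU keeps positive evidence
    return cam / cam.max() if cam.max() > 0 else cam    # normalize to [0, 1]

acts = np.random.rand(64, 7, 7)
grads = np.random.rand(64, 7, 7)
heat = grad_cam(acts, grads)
assert heat.shape == (7, 7) and heat.max() <= 1.0
```

An explainability-driven refinement loop would inspect such heatmaps to decide which layers or filters contribute diagnostically useful evidence and prune the rest.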

Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models

This paper introduces SymmFlow, a unified Symmetrical Flow Matching framework that jointly learns forward and reverse transformations to achieve state-of-the-art performance in image generation, segmentation, and classification within a single model while enabling efficient one-step inference.

Francisco Caetano, Christiaan Viviers, Peter H. N. De With, Fons van der Sommen · 2026-03-24 · cs.AI
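SymmFlow's symmetrical forward/reverse objective is not reproduced in this summary; as background, the standard (one-directional) flow matching loss it builds on regresses a velocity field along the straight path between a source sample and a target sample. A minimal numpy sketch, with all names illustrative:

```python
import numpy as np

def flow_matching_loss(model, x0, x1, t):
    """Conditional flow matching on the straight path x_t = (1 - t) x0 + t x1;
    the regression target is the constant velocity x1 - x0."""
    xt = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    pred = model(xt, t)           # model predicts the velocity at (x_t, t)
    return float(np.mean((pred - target) ** 2))

# A model that outputs the true velocity achieves zero loss.
x0, x1 = np.zeros((4, 2)), np.ones((4, 2))
oracle = lambda xt, t: x1 - x0
assert flow_matching_loss(oracle, x0, x1, t=0.3) == 0.0
```

Learning forward and reverse transformations jointly, as SymmFlow does, is what lets one model serve generation and the discriminative tasks (segmentation, classification) at once.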

Segmenting Visuals With Querying Words: Language Anchors For Semi-Supervised Image Segmentation

This paper introduces HVLFormer, a semi-supervised image segmentation framework that leverages hierarchical, domain-aware textual object queries and cross-view consistency regularization to effectively align visual and textual representations from Vision Language Models, achieving state-of-the-art performance with less than 1% labeled data.

Numair Nadeem, Saeed Anwar, Muhammad Hamza Asad, Abdul Bais · 2026-03-24 · cs.AI
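HVLFormer's cross-view consistency regularization is not spelled out in this summary; the generic form of such a regularizer penalizes disagreement between predictions on two augmented views of the same unlabeled image. A minimal numpy sketch with illustrative names:

```python
import numpy as np

def cross_view_consistency(pred_a, pred_b):
    """Consistency regularizer for semi-supervised segmentation: mean squared
    disagreement between per-pixel class probabilities predicted for two
    augmented views of the same unlabeled image."""
    return float(np.mean((pred_a - pred_b) ** 2))

# Identical predictions incur no penalty; divergent ones are penalized.
view_a = np.full((2, 8, 8), 0.5)   # (classes, H, W) probability maps
view_b = np.full((2, 8, 8), 0.5)
assert cross_view_consistency(view_a, view_b) == 0.0
```

In the semi-supervised setting described, this unsupervised term is what lets the model exploit the 99%+ of images that carry no labels.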