cs.CV papers | Gist.Science

GameVerse: Can Vision-Language Models Learn from Video-based Reflection?

The paper introduces GameVerse, a comprehensive benchmark featuring a novel reflect-and-retry paradigm and a hierarchical taxonomy across 15 games, demonstrating that Vision-Language Models can effectively improve their gameplay policies through video-based reflection by combining failure trajectories with expert tutorials.

Kuan Zhang, Dongchen Liu, Qiyue Zhao, Jinkun Hou, Xinran Zhang, Qinlei Xie, Miao Liu, Yiming Li2026-03-10💻 cs

ASMIL: Attention-Stabilized Multiple Instance Learning for Whole Slide Imaging

The paper introduces ASMIL, a unified framework that addresses unstable attention dynamics, overfitting, and over-concentrated attention in attention-based multiple instance learning for whole slide imaging by employing an anchor model with a normalized sigmoid function and token random dropping, resulting in significant performance improvements over state-of-the-art methods.

Linfeng Ye, Shayan Mohajer Hamidi, Zhixiang Chi, Guang Li, Mert Pilanci, Takahiro Ogawa, Miki Haseyama, Konstantinos N. Plataniotis2026-03-10💻 cs

EnsAug: Augmentation-Driven Ensembles for Human Motion Sequence Analysis

The paper introduces EnsAug, a novel training paradigm that improves human motion analysis by replacing the conventional single-model approach with an ensemble of specialists, each trained on data augmented by a distinct geometric transformation, thereby achieving state-of-the-art performance while respecting kinematic constraints.

Bikram De, Habib Irani, Vangelis Metsis2026-03-10🤖 cs.LG

HyperTokens: Controlling Token Dynamics for Continual Video-Language Understanding

The paper introduces HyperTokens, a transformer-based token generator augmented with meta-inspired regularizers and causal auxiliary supervision to enable efficient, low-forgetting continual Video-Language Understanding by dynamically producing task-specific prompts without increasing memory costs.

Toan Nguyen, Yang Liu, Celso De Melo, Flora D. Salim2026-03-10🤖 cs.LG

Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting

The paper proposes Graph-of-Mark (GoM), a novel pixel-level visual prompting technique that overlays scene graphs onto images to capture object relationships, thereby significantly enhancing the spatial reasoning and zero-shot performance of multimodal language models.

Giacomo Frisoni, Lorenzo Molfetta, Mattia Buzzoni, Gianluca Moro2026-03-10💻 cs

Accelerating Video Generation Inference with Sequential-Parallel 3D Positional Encoding Using a Global Time Index

This paper introduces a system-level inference optimization for Diffusion Transformer-based video generation that employs a sequence-parallel Causal-RoPE mechanism and operator fusion to overcome memory and latency bottlenecks, achieving near real-time speeds and sub-second first-frame latency on an eight-GPU cluster.

Chao Yuan, Pan Li2026-03-10💻 cs

Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

This paper reveals that Chain-of-Thought prompting often underperforms direct answering in medical visual question answering due to a "medical perception bottleneck," and proposes training-free grounding interventions to restore visual accuracy and improve model reasoning.

Yuan Wu, Zongxian Yang, Jiayu Qian, Songpan Gao, Guanxing Chen, Qiankun Li, Yu-An Huang, Zhi-An Huang2026-03-10💻 cs

SJD-PV: Speculative Jacobi Decoding with Phrase Verification for Autoregressive Image Generation

This paper introduces SJD-PV, a training-free acceleration framework for autoregressive image generation that leverages phrase-level speculative verification based on token co-occurrence statistics to jointly validate multiple correlated tokens, achieving up to 30% faster decoding without compromising visual fidelity.

Zhehao Yu, Baoquan Zhang, Bingqi Shan, Xinhao Liu, Dongliang Zhou, Guotao Liang, Guangming Ye, Yunming Ye2026-03-10💻 cs

calibfusion: Transformer-Based Differentiable Calibration for Radar-Camera Fusion Detection in Water-Surface Environments

The paper proposes CalibFusion, a Transformer-based differentiable calibration framework that learns to implicitly refine Radar-Camera extrinsics end-to-end to overcome the challenges of textureless, cluttered water-surface environments and significantly improve fusion-based 2D object detection.

Yuting Wan, Liguo Sun, Jiuwu Hao, Pin LV2026-03-10💻 cs

Does Semantic Noise Initialization Transfer from Images to Videos? A Paired Diagnostic Study

This paper investigates whether semantic noise initialization, known to improve image diffusion models, transfers to text-to-video generation, finding that while it shows a slight positive trend on temporal metrics, it does not significantly outperform standard Gaussian noise due to weak or unstable signals in the noise space.

Yixiao Jing, Chaoyu Zhang, Zixuan Zhong, Peizhou Huang2026-03-10💻 cs

Unmixing microinfrared spectroscopic images of cross-sections of historical oil paintings

This paper proposes an unsupervised CNN autoencoder with a novel weighted spectral angle distance loss to enable blind, automated unmixing of complex ATR- $\mu$ FTIR hyperspectral images from historical oil painting cross-sections, significantly improving the interpretability and scalability of material analysis compared to traditional manual methods.

Shivam Pande, Nicolas Nadisic, Francisco Mederos-Henry, Aleksandra Pizurica2026-03-10🤖 cs.LG

AutoFigure-Edit: Generating Editable Scientific Illustration

AutoFigure-Edit is an end-to-end system that generates fully editable, high-quality scientific illustrations from long-form text with flexible style adaptation via reference images, leveraging long-context understanding and native SVG support to overcome limitations in editability and efficiency found in existing automated tools.

Zhen Lin, Qiujie Xie, Minjun Zhu, Shichen Li, Qiyao Sun, Enhao Gu, Yiran Ding, Ke Sun, Fang Guo, Panzhong Lu, Zhiyuan Ning, Yixuan Weng, Yue Zhang2026-03-10💻 cs

XAI and Few-shot-based Hybrid Classification Model for Plant Leaf Disease Prognosis

This paper proposes a hybrid few-shot learning model integrating Siamese and Prototypical Networks with Grad-CAM-based Explainable AI to achieve high-accuracy, interpretable classification of maize, rice, and wheat leaf diseases under limited data conditions.

Diana Susan Joseph, Pranav M Pawar, Raja Muthalagu, Mithun Mukharjee2026-03-10🤖 cs.LG

Chart Deep Research in LVLMs via Parallel Relative Policy Optimization

This paper addresses the limitations of current Large Vision-Language Models in deep chart research by proposing Parallel Relative Policy Optimization (PRPO) to resolve training conflicts and constructing the MCDR-Bench evaluation framework to enable objective assessment of complex reasoning capabilities.

Jiajin Tang, Gaoyang, Wenjie Wang, Sibei Yang, Xing Chen2026-03-10🤖 cs.LG

MultiGen: Level-Design for Editable Multiplayer Worlds in Diffusion Game Engines

The paper introduces MultiGen, a novel diffusion-based game engine that incorporates an explicit, persistent external memory to enable user-editable world structures and support coherent, real-time multiplayer interactions, overcoming the limitations of conventional next-frame prediction models.

Ryan Po, David Junhao Zhang, Amir Hertz, Gordon Wetzstein, Neal Wadhwa, Nataniel Ruiz2026-03-10💻 cs

VB: Visibility Benchmark for Visibility and Perspective Reasoning in Images

This paper introduces VB, a novel benchmark designed to evaluate vision-language models' ability to determine image visibility and appropriately abstain from answering when evidence is insufficient, utilizing controlled minimal edits and specialized metrics to reveal that top-tier models like GPT-4o and Gemini 3.1 Pro significantly outperform open-source alternatives in confidence-aware accuracy and perspective reasoning.

Neil Tripathi2026-03-10💻 cs

RADAR: A Multimodal Benchmark for 3D Image-Based Radiology Report Review

The paper introduces RADAR, a multimodal benchmark comprising expert-annotated 3D abdominal CT scans and radiology report edits that enables the systematic evaluation of AI models on fine-grained clinical reasoning tasks, specifically image-text alignment and discrepancy assessment during the radiology report review process.

Zhaoyi Sun, Minal Jagtiani, Wen-wai Yim, Fei Xia, Martin Gunn, Meliha Yetisgen, Asma Ben Abacha2026-03-10💻 cs

ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction

The paper proposes ECHO, a multi-agent framework that utilizes iterative hypergraph operations and a "Link-then-Bind" strategy to mitigate cascading errors in Multimedia Event Extraction, achieving significant performance improvements over state-of-the-art methods on the M2E2 benchmark.

Hailong Chu, Shuo Zhang, Yunlong Chu, Shutai Huang, Xingyue Zhang, Tinghe Yan, Jinsong Zhang, Lei Li2026-03-10💻 cs

Three-dimensional reconstruction and segmentation of an aggregate stockpile for size and shape analyses

This paper presents an innovative, mobile-device-based 3D imaging approach using Structure-from-Motion and segmentation algorithms to reconstruct and analyze aggregate size and shape from stockpiles, offering a convenient and affordable solution for onsite quality assurance in road construction and geotechnics.

Erol Tutumluer, Haohang Huang, Jiayi Luo, Issam Qamhia, John M. Hart2026-03-10💻 cs

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

This paper introduces TimeSpot, a comprehensive benchmark comprising 1,455 real-world images from 80 countries designed to evaluate the limited geo-temporal reasoning capabilities of current vision-language models in predicting location, time, and environmental context from visual evidence alone.

Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez2026-03-10💬 cs.CL

← Previous Next →