cs.CV papers | Gist.Science

Does Semantic Noise Initialization Transfer from Images to Videos? A Paired Diagnostic Study

This paper investigates whether semantic noise initialization, known to improve image diffusion models, transfers to text-to-video generation, finding that while it shows a slight positive trend on temporal metrics, it does not significantly outperform standard Gaussian noise due to weak or unstable signals in the noise space.

Yixiao Jing, Chaoyu Zhang, Zixuan Zhong, Peizhou HuangTue, 10 Ma💻 cs

Unmixing microinfrared spectroscopic images of cross-sections of historical oil paintings

This paper proposes an unsupervised CNN autoencoder with a novel weighted spectral angle distance loss to enable blind, automated unmixing of complex ATR- $\mu$ FTIR hyperspectral images from historical oil painting cross-sections, significantly improving the interpretability and scalability of material analysis compared to traditional manual methods.

Shivam Pande, Nicolas Nadisic, Francisco Mederos-Henry, Aleksandra PizuricaTue, 10 Ma🤖 cs.LG

AutoFigure-Edit: Generating Editable Scientific Illustration

AutoFigure-Edit is an end-to-end system that generates fully editable, high-quality scientific illustrations from long-form text with flexible style adaptation via reference images, leveraging long-context understanding and native SVG support to overcome limitations in editability and efficiency found in existing automated tools.

Zhen Lin, Qiujie Xie, Minjun Zhu, Shichen Li, Qiyao Sun, Enhao Gu, Yiran Ding, Ke Sun, Fang Guo, Panzhong Lu, Zhiyuan Ning, Yixuan Weng, Yue ZhangTue, 10 Ma💻 cs

XAI and Few-shot-based Hybrid Classification Model for Plant Leaf Disease Prognosis

This paper proposes a hybrid few-shot learning model integrating Siamese and Prototypical Networks with Grad-CAM-based Explainable AI to achieve high-accuracy, interpretable classification of maize, rice, and wheat leaf diseases under limited data conditions.

Diana Susan Joseph, Pranav M Pawar, Raja Muthalagu, Mithun MukharjeeTue, 10 Ma🤖 cs.LG

Chart Deep Research in LVLMs via Parallel Relative Policy Optimization

This paper addresses the limitations of current Large Vision-Language Models in deep chart research by proposing Parallel Relative Policy Optimization (PRPO) to resolve training conflicts and constructing the MCDR-Bench evaluation framework to enable objective assessment of complex reasoning capabilities.

Jiajin Tang, Gaoyang, Wenjie Wang, Sibei Yang, Xing ChenTue, 10 Ma🤖 cs.LG

MultiGen: Level-Design for Editable Multiplayer Worlds in Diffusion Game Engines

The paper introduces MultiGen, a novel diffusion-based game engine that incorporates an explicit, persistent external memory to enable user-editable world structures and support coherent, real-time multiplayer interactions, overcoming the limitations of conventional next-frame prediction models.

Ryan Po, David Junhao Zhang, Amir Hertz, Gordon Wetzstein, Neal Wadhwa, Nataniel RuizTue, 10 Ma💻 cs

VB: Visibility Benchmark for Visibility and Perspective Reasoning in Images

This paper introduces VB, a novel benchmark designed to evaluate vision-language models' ability to determine image visibility and appropriately abstain from answering when evidence is insufficient, utilizing controlled minimal edits and specialized metrics to reveal that top-tier models like GPT-4o and Gemini 3.1 Pro significantly outperform open-source alternatives in confidence-aware accuracy and perspective reasoning.

Neil TripathiTue, 10 Ma💻 cs

RADAR: A Multimodal Benchmark for 3D Image-Based Radiology Report Review

The paper introduces RADAR, a multimodal benchmark comprising expert-annotated 3D abdominal CT scans and radiology report edits that enables the systematic evaluation of AI models on fine-grained clinical reasoning tasks, specifically image-text alignment and discrepancy assessment during the radiology report review process.

Zhaoyi Sun, Minal Jagtiani, Wen-wai Yim, Fei Xia, Martin Gunn, Meliha Yetisgen, Asma Ben AbachaTue, 10 Ma💻 cs

ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction

The paper proposes ECHO, a multi-agent framework that utilizes iterative hypergraph operations and a "Link-then-Bind" strategy to mitigate cascading errors in Multimedia Event Extraction, achieving significant performance improvements over state-of-the-art methods on the M2E2 benchmark.

Hailong Chu, Shuo Zhang, Yunlong Chu, Shutai Huang, Xingyue Zhang, Tinghe Yan, Jinsong Zhang, Lei LiTue, 10 Ma💻 cs

Three-dimensional reconstruction and segmentation of an aggregate stockpile for size and shape analyses

This paper presents an innovative, mobile-device-based 3D imaging approach using Structure-from-Motion and segmentation algorithms to reconstruct and analyze aggregate size and shape from stockpiles, offering a convenient and affordable solution for onsite quality assurance in road construction and geotechnics.

Erol Tutumluer, Haohang Huang, Jiayi Luo, Issam Qamhia, John M. HartTue, 10 Ma💻 cs

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

This paper introduces TimeSpot, a comprehensive benchmark comprising 1,455 real-world images from 80 countries designed to evaluate the limited geo-temporal reasoning capabilities of current vision-language models in predicting location, time, and environmental context from visual evidence alone.

Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan ParvezTue, 10 Ma💬 cs.CL

Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning

The paper introduces "Narrative Weaver," a novel framework that achieves controllable, long-range visual consistency in generative AI by integrating multimodal narrative planning with a dynamic memory bank, validated through extensive experiments and a newly released e-commerce advertising dataset.

Zhengjian Yao, Yongzhi Li, Xinyuan Gao, Quan Chen, Peng Jiang, Yanye LuTue, 10 Ma💻 cs

High-Resolution Image Reconstruction with Unsupervised Learning and Noisy Data Applied to Ion-Beam Dynamics for Particle Accelerators

This paper presents an unsupervised learning framework utilizing convolutional filtering and neural networks with optimized early-stopping to achieve robust, high-fidelity reconstruction of ion-beam emittance images from noisy data, enabling unprecedented halo resolution beyond seven standard deviations for particle accelerator diagnostics.

Francis Osswald (IPHC), Mohammed Chahbaoui (UNISTRA), Xinyi Liang (SU)Tue, 10 Ma🤖 cs.LG

Spectral Gaps and Spatial Priors: Studying Hyperspectral Downstream Adaptation Using TerraMind

This study evaluates the adaptability of the TerraMind geospatial foundation model to hyperspectral imaging tasks without native pretraining, finding that while band selection strategies allow for moderate performance, deep learning models with native spectral support remain superior, thereby highlighting the critical need for future architectures to incorporate native spectral tokenization.

Julia Anna Leonardi, Johannes Jakubik, Paolo Fraccaro, Maria Antonia BrovelliTue, 10 Ma💻 cs

One-Shot Badminton Shuttle Detection for Mobile Robots

This paper introduces a robust one-shot badminton shuttlecock detection framework for mobile robots, featuring a novel semi-automatically annotated dataset and a fine-tuned YOLOv8 model that achieves high performance in dynamic, egocentric viewpoints across diverse environments.

Florentin Dipner, William Talbot, Turcan Tuna, Andrei Cramariuc, Marco HutterTue, 10 Ma💻 cs

Soft Equivariance Regularization for Invariant Self-Supervised Learning

This paper proposes Soft Equivariance Regularization (SER), a lightweight, plug-in method that decouples invariance and equivariance objectives by enforcing equivariance on intermediate spatial features while preserving invariance on the final embedding, thereby improving both linear evaluation accuracy and robustness to geometric perturbations without requiring auxiliary heads or transformation labels.

Joohyung Lee, Changhun Kim, Hyunsu Kim, Kwanhyung Lee, Juho LeeTue, 10 Ma🤖 cs.LG

HARP: HARmonizing in-vivo diffusion MRI using Phantom-only training

This paper introduces HARP, a deep learning framework that harmonizes multi-site in-vivo diffusion MRI data by training exclusively on easily transportable phantom scans, thereby eliminating the need for impractical multi-site human cohorts while significantly reducing inter-scanner variability.

Hwihun Jeong, Qiang Liu, Kathryn E. Keenan, Elisabeth A. Wilde, Walter Schneider, Sudhir Pathak, Anthony Zuccolotto, Lauren J. O'Donnell, Lipeng Ning, Yogesh RathiTue, 10 Ma💻 cs

Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs

This paper introduces a method that enhances medical Vision-Language Models by using sequential eye-tracking data as supervision to train dedicated gaze tokens, enabling the models to mimic radiologists' visual search patterns and achieve state-of-the-art performance in both in-domain and out-of-domain medical reasoning tasks.

Yiwei Li, Zihao Wu, Yanjun Lv, Hanqi Jiang, Weihang You, Zhengliang Liu, Dajiang Zhu, Xiang Li, Quanzheng Li, Tianming Liu, Lin ZhaoTue, 10 Ma💻 cs

Asymmetric Distillation and Information Retention in Capacity-Constrained Cross-Modal Transfer

This paper investigates the severe dimensional collapse and resulting robustness fragility that occur when distilling a large Vision Transformer into capacity-constrained CNNs, revealing that while larger student models pack information densely but lose noise immunity, extremely small models act as robust low-pass filters due to fundamental geometric limitations in asymmetric cross-modal transfer.

Kabir ThayaniTue, 10 Ma💻 cs

Multi-label Instance-level Generalised Visual Grounding in Agriculture

This paper introduces gRef-CW, the first benchmark dataset for generalised visual grounding in agriculture that includes negative expressions, and proposes Weed-VG, a modular framework designed to overcome the domain gap and effectively localise crop and weed instances under challenging field conditions.

Mohammadreza Haghighat, Alzayat Saleh, Mostafa Rahimi AzghadiTue, 10 Ma💻 cs

← Previous Next →