cs.CV papers | Gist.Science

An Effective Data Augmentation Method by Asking Questions about Scene Text Images

This paper proposes a VQA-inspired data augmentation framework that generates natural-language questions about character-level attributes to enhance scene and handwritten text recognition models, resulting in significant improvements in transcription accuracy on benchmark datasets.

Xu Yao, Lei Kang | 2026-03-05 | 💻 cs

Hazard-Aware Traffic Scene Graph Generation

This paper introduces a novel Traffic Scene Graph Generation framework that leverages accident data and depth cues to model safety-relevant relations between hazards and the ego vehicle, thereby enhancing situational awareness in complex driving scenarios.

Yaoqi Huang, Julie Stephany Berrio, Mao Shan + 1 more | 2026-03-05 | 💻 cs

DM-CFO: A Diffusion Model for Compositional 3D Tooth Generation with Collision-Free Optimization

This paper proposes DM-CFO, a diffusion model-based framework that integrates text and graph constraints for layout generation with collision-free optimization via 3D Gaussian updates and distance regularization to produce realistic, intersection-free compositional 3D tooth designs.

Yan Tian, Pengcheng Xue, Weiping Ding + 5 more | 2026-03-05 | 💻 cs

Detection and Identification of Penguins Using Appearance and Motion Features

This paper proposes a framework that enhances penguin detection and identification in animal facilities by integrating motion cues into a modified YOLO11 detector for improved temporal consistency and employing tracklet-based contrastive learning to generate coherent feature embeddings for individual recognition.

Kasumi Seko, Hiroki Kinoshita, Raj Rajeshwar Malinda + 1 more | 2026-03-05 | 💻 cs

Tracking Feral Horses in Aerial Video Using Oriented Bounding Boxes

This paper proposes a robust method for tracking feral horses in aerial video by employing oriented bounding boxes and a novel head-orientation estimation technique using multi-detector voting to resolve 180° flipping ambiguities, thereby achieving 99.3% accuracy in distinguishing head from tail for continuous trajectory analysis.

Saeko Takizawa, Tamao Maeda, Shinya Yamamoto + 1 more | 2026-03-05 | 💻 cs

Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression

The paper proposes ParaHydra, a novel distributed multi-view image compression framework featuring an OmniParallax Attention Mechanism and a Parallax Multi Information Fusion Module that adaptively aligns and integrates inter-view correlations, enabling it to significantly outperform state-of-the-art multi-view codecs in both bitrate efficiency and computational speed.

Haotian Zhang, Feiyue Long, Yixin Yu + 7 more | 2026-03-05 | 💻 cs

LeafInst - Unified Instance Segmentation Network for Fine-Grained Forestry Leaf Phenotype Analysis: A New UAV based Benchmark

This paper introduces LeafInst, a novel instance segmentation network designed for fine-grained forestry leaf analysis in open-field UAV imagery, and validates its superior performance on the newly constructed Poplar-leaf benchmark and the public PhenoBench dataset.

Taige Luo, Junru Xie, Chenyang Fan + 5 more | 2026-03-05 | 💻 cs

RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation

This paper introduces RAGTrack, a novel Retrieval-Augmented Generation framework that enhances RGB-Thermal tracking by integrating textual descriptions via Multi-modal Large Language Models and employing adaptive token fusion with context-aware reasoning to overcome appearance variations and modality gaps.

Hao Li, Yuhao Wang, Wenning Hao + 3 more | 2026-03-05 | 💻 cs

CoRe-BT: A Multimodal Radiology-Pathology-Text Benchmark for Robust Brain Tumor Typing

The paper introduces CoRe-BT, a multimodal benchmark comprising 310 patients with MRI, histopathology, and pathology reports, designed to evaluate robust brain tumor typing under realistic conditions of missing clinical data.

Juampablo E. Heras Rivera, Daniel K. Low, Xavier Xiong + 5 more | 2026-03-05 | 💻 cs

Extending Neural Operators: Robust Handling of Functions Beyond the Training Set

This paper presents a rigorous framework that extends neural operators to robustly handle out-of-distribution input functions by leveraging kernel approximations and Reproducing Kernel Hilbert Space theory to ensure accurate prediction of both function values and derivatives, validated through solutions of elliptic partial differential equations on manifolds.

Blaine Quackenbush, Paul J. Atzberger | 2026-03-05 | 🤖 cs.LG

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

This paper introduces Image-based Prompt Injection (IPI), a black-box attack that embeds adversarial instructions into natural images to hijack Multimodal Large Language Models, achieving a 64% success rate in manipulating model outputs while the embedded instructions remain visually imperceptible to humans.

Neha Nagaraja, Lan Zhang, Zhilong Wang + 2 more | 2026-03-05 | 🤖 cs.AI

InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions

The paper presents InfinityStory, a novel framework, dataset, and model that overcome key limitations in long-form video generation by ensuring background and character consistency across shots while enabling seamless multi-subject transitions for hour-long narratives.

Mohamed Elmoghany, Liangbing Zhao, Xiaoqian Shen + 27 more | 2026-03-05 | 💻 cs

One-Step Face Restoration via Shortcut-Enhanced Coupling Flow

The paper proposes SCFlowFR, a one-step face restoration method that leverages data-dependent coupling, conditional mean estimation, and a shortcut constraint to model low-to-high quality dependencies, thereby eliminating path crossovers and enabling high-quality, single-step inference.

Xiaohui Sun, Hanlin Wu | 2026-03-05 | 💻 cs

Field imaging framework for morphological characterization of aggregates with computer vision: Algorithms and applications

This dissertation presents a comprehensive field imaging framework that leverages advanced computer vision algorithms, including 2D instance segmentation and an integrated 3D reconstruction-segmentation-completion approach, to overcome the limitations of traditional methods and enable accurate morphological characterization of construction aggregates across diverse field scenarios.

Haohang Huang | 2026-03-05 | 🤖 cs.AI

InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models

This paper introduces InEdit-Bench, the first benchmark designed to evaluate the ability of multimodal generative models to reason over intermediate logical pathways in complex image editing tasks, revealing significant shortcomings in current models' capacity for dynamic reasoning and causal understanding.

Zhiqiang Sheng, Xumeng Han, Zhiwei Zhang + 6 more | 2026-03-05 | 🤖 cs.AI

Machine Pareidolia: Protecting Facial Image with Emotional Editing

This paper introduces MAP, a novel facial privacy protection method that employs human emotion editing to disguise original identities as target identities, effectively overcoming the limitations of traditional countermeasures in black-box settings and across diverse demographics while maintaining high perceptual quality.

Binh M. Le, Simon S. Woo | 2026-03-05 | 🤖 cs.LG

EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs

EvoPrune is an early-stage visual token pruning method that performs layer-wise pruning guided by token similarity, diversity, and attention importance during visual encoding, achieving a $2\times$ inference speedup with minimal performance degradation on high-resolution images and videos.

Yuhao Chen, Bin Shan, Xin Ye + 1 more | 2026-03-05 | 🤖 cs.AI

Polyp Segmentation Using Wavelet-Based Cross-Band Integration for Enhanced Boundary Representation

This paper proposes a wavelet-based polyp segmentation model that integrates grayscale and RGB representations through complementary frequency-consistent interaction to overcome low-contrast challenges and achieve superior boundary precision, as validated by extensive experiments on four benchmark datasets.

Haesung Oh, Jaesung Lee | 2026-03-05 | 💻 cs

Error as Signal: Stiffness-Aware Diffusion Sampling via Embedded Runge-Kutta Guidance

This paper proposes Embedded Runge-Kutta Guidance (ERK-Guid), a novel sampling method that leverages solver-induced local truncation errors as a guidance signal to detect stiffness and stabilize diffusion model generation, thereby outperforming state-of-the-art methods on benchmarks like ImageNet.

Inho Kong, Sojin Lee, Youngjoon Hong + 1 more | 2026-03-05 | 🤖 cs.AI

MPFlow: Multi-modal Posterior-Guided Flow Matching for Zero-Shot MRI Reconstruction

MPFlow is a zero-shot multi-modal MRI reconstruction framework that leverages a self-supervised pretraining strategy (PAMRI) to guide rectified flow sampling with auxiliary structural scans, thereby significantly reducing hallucinations and improving anatomical fidelity compared to single-modality baselines while requiring fewer sampling steps.

Seunghoi Kim, Chen Jin, Henry F. J. Tregidgo + 2 more | 2026-03-05 | 🤖 cs.AI
