cs.CV papers | Gist.Science

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

This paper introduces IAG, the first input-aware backdoor attack on vision-language models for visual grounding, which utilizes a text-conditioned UNet to dynamically generate imperceptible, target-specific triggers that achieve high attack success rates across various models and datasets while maintaining stealth and robustness against defenses.

Junxian Li, Beining Xu, Simin Chen, Jiatong Li, Jingdi Lei, Haodong Zhao, Di ZhangTue, 10 Ma💬 cs.CL

UltraUPConvNet: A UPerNet- and ConvNeXt-Based Multi-Task Network for Ultrasound Tissue Segmentation and Disease Prediction

UltraUPConvNet is a computationally efficient, multi-task framework based on UPerNet and ConvNeXt that simultaneously performs ultrasound tissue segmentation and disease prediction, achieving state-of-the-art performance on a large-scale dataset with reduced computational overhead.

Zhi Chen, Le ZhangTue, 10 Ma💻 cs

MICA: Multi-Agent Industrial Coordination Assistant

This paper introduces MICA, a privacy-preserving, speech-interactive multi-agent system that leverages Adaptive Step Fusion and a safety-audited coordination topology to deliver robust, real-time industrial assistance for assembly and maintenance tasks on resource-constrained hardware.

Di Wen, Kunyu Peng, Junwei Zheng, Yufan Chen, Yitian Shi, Jiale Wei, Ruiping Liu, Kailun Yang, Rainer StiefelhagenTue, 10 Ma🤖 cs.LG

ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models

This paper introduces the ORIC framework and benchmark to evaluate and improve Large Vision-Language Models' object recognition capabilities under contextual incongruity, demonstrating that such scenarios significantly degrade performance and that targeted Visual Reinforcement Fine-Tuning can effectively mitigate these failures.

Zhaoyang Li, Zhan Ling, Yuchen Zhou, Litian Gong, Erdem Bıyık, Hao SuTue, 10 Ma🤖 cs.LG

Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper

This paper introduces "Jr. AI Scientist," an autonomous system that mimics a novice researcher's workflow to generate novel, scientifically valuable papers building on real academic works, while simultaneously evaluating its performance through rigorous automated and human assessments to identify both its capabilities and the significant risks and limitations of current AI-driven scientific exploration.

Atsuyuki Miyai, Mashiro Toyooka, Takashi Otonari, Zaiying Zhao, Kiyoharu AizawaTue, 10 Ma🤖 cs.LG

Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks

This paper introduces the Angular Gradient Sign Method, a novel adversarial attack for hyperbolic networks that leverages the geometric decomposition of gradients to apply perturbations solely along angular (semantic) directions, thereby achieving higher fooling rates and revealing unique vulnerabilities in hierarchical embeddings compared to conventional Euclidean-based methods.

Minsoo Jo, Dongyoon Yang, Taesup KimTue, 10 Ma🤖 cs.LG

Radiative-Structured Neural Operator for Continuous and Extrapolative Spectral Super-Resolution

The paper proposes the Radiative-Structured Neural Operator (RSNO), a novel deep learning framework that reconstructs hyperspectral images from multispectral observations by learning continuous spectral mappings under radiative priors and employing angular-consistent projection to ensure physical consistency and eliminate color distortion.

Ziye Zhang, Bin Pan, Zhenwei ShiTue, 10 Ma💻 cs

Shortcut Invariance: Targeted Jacobian Regularization in Disentangled Latent Space

This paper proposes "Shortcut Invariance," a targeted Jacobian regularization method that improves out-of-distribution generalization by injecting anisotropic noise into a disentangled latent space to flatten decision boundaries along shortcut-aligned axes, thereby eliminating the need for explicit shortcut labels or conflicting training samples.

Shivam Pal, Sakshi Varshney, Piyush RaiTue, 10 Ma🤖 cs.LG

ForamDeepSlice: A High-Accuracy Deep Learning Framework for Foraminifera Species Classification from 2D Micro-CT Slices

This study introduces ForamDeepSlice, a high-accuracy deep learning framework that combines an ensemble of ConvNeXt-Large and EfficientNetV2-Small models with a rigorous specimen-level split dataset to achieve 95.64% accuracy in classifying foraminifera species from 2D micro-CT slices, while also providing an interactive dashboard for real-time identification and 3D matching.

Abdelghafour Halimi, Ali Alibrahim, Didier Barradas-Bautista, Ronell Sicat, Abdulkader M. AfifiTue, 10 Ma🤖 cs.LG

Two-Step Data Augmentation for Masked Face Detection and Recognition: Turning Fake Masks to Real

This paper presents a two-step generative data augmentation framework combining rule-based mask warping and unpaired image-to-image translation to address the scarcity of masked face datasets, achieving performance improvements with minimal training data while explicitly noting its origins as a resource-constrained coursework project that lacked downstream quantitative evaluation.

Yan Yang, George Bebis, Mircea NicolescuTue, 10 Ma🤖 cs.LG

SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks

The paper introduces SALVE, a unified framework that combines sparse autoencoders and feature-level saliency mapping to discover, validate, and precisely edit neural network weights, enabling interpretable and robust control over both convolutional and transformer-based models.

Vegard FlovikTue, 10 Ma🤖 cs.LG

ReDepth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting

Re-Depth Anything is a test-time self-supervised framework that enhances monocular depth estimation by fusing foundation models with large-scale 2D diffusion priors to perform label-free refinement via generative re-lighting and Score Distillation Sampling, achieving state-of-the-art results without direct depth tensor optimization.

Ananta R. Bhattarai, Helge RhodinTue, 10 Ma🤖 cs.LG

VOIC: Visible-Occluded Integrated Guidance for 3D Semantic Scene Completion

This paper introduces VOIC, a novel dual-decoder framework for monocular 3D Semantic Scene Completion that employs a Visible Region Label Extraction strategy to decouple visible-region perception from occluded-region reasoning, thereby mitigating feature dilution and achieving state-of-the-art performance on standard benchmarks.

Zaidao Han, Risa Higashita, Jiang LiuTue, 10 Ma💻 cs

Efficient Vision Mamba for MRI Super-Resolution via Hybrid Selective Scanning

This paper proposes "Efficient Vision Mamba," a computationally lightweight deep learning framework that combines multi-head selective state-space models with hybrid scanning to achieve state-of-the-art MRI super-resolution performance while drastically reducing parameters and computation compared to existing methods.

Mojtaba Safari, Shansong Wang, Vanessa L Wildman, Mingzhe Hu, Zach Eidex, Chih-Wei Chang, Erik H Middlebrooks, Richard L. J Qiu, Pretesh Patel, Ashesh B. Jani, Hui Mao, Zhen Tian, Xiaofeng YangTue, 10 Ma🔬 physics

A Two-Stage Multitask Vision-Language Framework for Explainable Crop Disease Visual Question Answering

This paper presents a lightweight, two-stage multitask vision-language framework that integrates a Swin Transformer encoder with sequence-to-sequence decoders to achieve state-of-the-art, explainable visual question answering for crop disease identification with near-perfect classification accuracy and strong generalization capabilities.

Md. Zahid Hossain, Most. Sharmin Sultana Samu, Md. Rakibul Islam, Md. Siam AnsaryTue, 10 Ma💬 cs.CL

MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference

MeanCache is a training-free framework that accelerates Flow Matching inference by replacing instantaneous velocity caching with an average-velocity approach using cached Jacobian-vector products and a trajectory-stability scheduling strategy, achieving significant speedups (up to 4.56X) while maintaining high generation quality across models like FLUX.1 and HunyuanVideo.

Huanlin Gao, Ping Chen, Fuyuan Shi, Ruijia Wu, Li YanTao, Qiang Hui, Yuren You, Ting Lu, Chao Tan, Shaoan Zhao, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo LianTue, 10 Ma🤖 cs.LG

Latent Equivariant Operators for Robust Object Recognition: Promise and Challenges

This paper demonstrates that neural networks learning equivariant operators in a latent space can effectively generalize to out-of-distribution symmetric transformations on simple datasets like rotated MNIST, while also highlighting the significant challenges involved in scaling this approach to more complex data.

Minh Dinh, Stéphane DenyTue, 10 Ma🤖 cs.LG

Self-Attention And Beyond the Infinite: Towards Linear Transformers with Infinite Self-Attention

This paper introduces Infinite Self-Attention (InfSA) and its linear-time variant, Linear-InfSA, a spectral reformulation of self-attention as a diffusion process on token graphs that achieves state-of-the-art ImageNet accuracy and enables efficient, memory-free inference at ultra-high resolutions (up to 9216×9216) by replacing the quadratic softmax cost with a Neumann series approximation.

Giorgio Roffo, Luke PalmerTue, 10 Ma💻 cs

A Detection-Gated Pipeline for Robust Glottal Area Waveform Extraction and Clinical Pathology Assessment

This paper presents a computationally efficient, detection-gated deep learning pipeline that achieves state-of-the-art robustness and cross-dataset generalization in glottal segmentation from high-speed videoendoscopy, enabling reliable extraction of clinical biomarkers for distinguishing healthy from pathological vocal function.

Harikrishnan UnnikrishnanTue, 10 Ma🤖 cs.LG

Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers

This paper introduces GramCol and a motion-feature selection algorithm to generate Interpretable Motion-Attentive Maps (IMAPs) that effectively localize both motion and non-motion concepts in Video Diffusion Transformers without requiring gradient calculations or parameter updates.

Youngjun Jun, Seil Kang, Woojung Han, Seong Jae HwangTue, 10 Ma🤖 cs.LG

← Previous Next →