cs.CV papers | Gist.Science

CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning

This paper introduces CGL, a continual GUI learning framework that mitigates catastrophic forgetting by dynamically balancing Supervised Fine-Tuning and Reinforcement Learning through an entropy-guided proportion adjustment mechanism and a specialized gradient surgery strategy, validated by a new AndroidControl-CL benchmark.

Zhenquan Yao, Zitong Huang, Yihan Zeng, Jianhua Han, Hang Xu, Chun-Mei Feng, Jianwei Ma, Wangmeng ZuoTue, 10 Ma🤖 cs.LG

A Novel Approach for Testing Water Safety Using Deep Learning Inference of Microscopic Images of Unincubated Water Samples

This paper introduces DeepScope, a deep learning-based system that analyzes microscopic images of unincubated water samples to detect fecal contamination in seconds with 93% accuracy and a cost of $0.44 per test, significantly outperforming traditional incubation methods in speed and affordability.

Sanjay SrinivasanTue, 10 Ma🤖 cs.LG

OptiRoulette Optimizer: A New Stochastic Meta-Optimizer for up to 5.3x Faster Convergence

The paper introduces OptiRoulette, a stochastic meta-optimizer that dynamically selects update rules from a pool during training, demonstrating significantly faster convergence and higher test accuracy across multiple image-classification benchmarks compared to a standard AdamW baseline.

Stamatis MastromichalakisTue, 10 Ma🤖 cs.LG

Correlation Analysis of Generative Models

This paper proposes a unified linear representation for diffusion models and flow matching to theoretically demonstrate that the often weak correlation between noisy data and predicted targets in existing methods may adversely impact the learning process.

Zhengguo Li, Chaobing Zheng, Wei WangTue, 10 Ma🤖 cs.LG

RECAP: Local Hebbian Prototype Learning as a Self-Organizing Readout for Reservoir Dynamics

RECAP is a bio-inspired image classification method that couples untrained reservoir dynamics with a self-organizing Hebbian prototype readout to achieve robust, backpropagation-free learning capable of generalizing to corrupted inputs without prior exposure.

Heng ZhangTue, 10 Ma🤖 cs.LG

Roots Beneath the Cut: Uncovering the Risk of Concept Revival in Pruning-Based Unlearning for Diffusion Models

This paper reveals that pruning-based unlearning in diffusion models is inherently insecure because the locations of pruned weights act as side-channel signals that enable a novel, data-free, and training-free attack to fully revive erased concepts, prompting a call for safer pruning mechanisms that conceal these locations.

Ci Zhang, Zhaojun Ding, Chence Yang, Jun Liu, Xiaoming Zhai, Shaoyi Huang, Beiwen Li, Xiaolong Ma, Jin Lu, Geng YuanTue, 10 Ma🤖 cs.LG

ObjChangeVR: Object State Change Reasoning from Continuous Egocentric Views in VR Environments

This paper introduces ObjChangeVR, a novel framework and corresponding dataset designed to enhance object state change reasoning in virtual reality by addressing the challenges of detecting background changes without direct interaction through viewpoint-aware retrieval and cross-view reasoning.

Shiyi Ding, Shaoen Wu, Ying ChenTue, 10 Ma💻 cs

Margin-Consistent Deep Subtyping of Invasive Lung Adenocarcinoma via Perturbation Fidelity in Whole-Slide Image Analysis

This paper proposes a margin-consistent deep subtyping framework for invasive lung adenocarcinoma that integrates attention-weighted aggregation, contrastive regularization, and a novel Perturbation Fidelity scoring mechanism to achieve robust, high-accuracy classification across multiple architectures and demonstrate cross-institutional generalizability on whole-slide images.

Meghdad Sabouri Rad (Vincent), Junze (Vincent), Huang, Mohammad Mehdi Hosseini, Rakesh Choudhary, Saverio J. Carello, Ola El-Zammar, Michel R. Nasr, Bardia RoddTue, 10 Ma💻 cs

PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

PaLMR is a novel framework that enhances the faithfulness of multimodal large language models by aligning both the reasoning process and outcomes through a perception-aligned data layer and a hierarchical reward fusion scheme, thereby significantly reducing visual hallucinations while achieving state-of-the-art performance on key benchmarks.

Yantao Li, Qiang Hui, Chenyang Yan, Kanzhi Cheng, Fang Zhao, Chao Tan, Huanling Gao, Jianbing Zhang, Kai Wang, Xinyu Dai, Shiguo LianTue, 10 Ma💻 cs

A Parameter-efficient Convolutional Approach for Weed Detection in Multispectral Aerial Imagery

This paper introduces FCBNet, a parameter-efficient convolutional model featuring a frozen ConvNeXt backbone and a Feature Correction Block that achieves superior weed segmentation accuracy (over 85% mIoU) and computational efficiency across RGB and multispectral aerial imagery compared to existing state-of-the-art models.

Leo Thomas Ramos, Angel D. SappaTue, 10 Ma💻 cs

GameVerse: Can Vision-Language Models Learn from Video-based Reflection?

The paper introduces GameVerse, a comprehensive benchmark featuring a novel reflect-and-retry paradigm and a hierarchical taxonomy across 15 games, demonstrating that Vision-Language Models can effectively improve their gameplay policies through video-based reflection by combining failure trajectories with expert tutorials.

Kuan Zhang, Dongchen Liu, Qiyue Zhao, Jinkun Hou, Xinran Zhang, Qinlei Xie, Miao Liu, Yiming LiTue, 10 Ma💻 cs

ASMIL: Attention-Stabilized Multiple Instance Learning for Whole Slide Imaging

The paper introduces ASMIL, a unified framework that addresses unstable attention dynamics, overfitting, and over-concentrated attention in attention-based multiple instance learning for whole slide imaging by employing an anchor model with a normalized sigmoid function and token random dropping, resulting in significant performance improvements over state-of-the-art methods.

Linfeng Ye, Shayan Mohajer Hamidi, Zhixiang Chi, Guang Li, Mert Pilanci, Takahiro Ogawa, Miki Haseyama, Konstantinos N. PlataniotisTue, 10 Ma💻 cs

EnsAug: Augmentation-Driven Ensembles for Human Motion Sequence Analysis

The paper introduces EnsAug, a novel training paradigm that improves human motion analysis by replacing the conventional single-model approach with an ensemble of specialists, each trained on data augmented by a distinct geometric transformation, thereby achieving state-of-the-art performance while respecting kinematic constraints.

Bikram De, Habib Irani, Vangelis MetsisTue, 10 Ma🤖 cs.LG

HyperTokens: Controlling Token Dynamics for Continual Video-Language Understanding

The paper introduces HyperTokens, a transformer-based token generator augmented with meta-inspired regularizers and causal auxiliary supervision to enable efficient, low-forgetting continual Video-Language Understanding by dynamically producing task-specific prompts without increasing memory costs.

Toan Nguyen, Yang Liu, Celso De Melo, Flora D. SalimTue, 10 Ma🤖 cs.LG

Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting

The paper proposes Graph-of-Mark (GoM), a novel pixel-level visual prompting technique that overlays scene graphs onto images to capture object relationships, thereby significantly enhancing the spatial reasoning and zero-shot performance of multimodal language models.

Giacomo Frisoni, Lorenzo Molfetta, Mattia Buzzoni, Gianluca MoroTue, 10 Ma💻 cs

Accelerating Video Generation Inference with Sequential-Parallel 3D Positional Encoding Using a Global Time Index

This paper introduces a system-level inference optimization for Diffusion Transformer-based video generation that employs a sequence-parallel Causal-RoPE mechanism and operator fusion to overcome memory and latency bottlenecks, achieving near real-time speeds and sub-second first-frame latency on an eight-GPU cluster.

Chao Yuan, Pan LiTue, 10 Ma💻 cs

Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

This paper reveals that Chain-of-Thought prompting often underperforms direct answering in medical visual question answering due to a "medical perception bottleneck," and proposes training-free grounding interventions to restore visual accuracy and improve model reasoning.

Yuan Wu, Zongxian Yang, Jiayu Qian, Songpan Gao, Guanxing Chen, Qiankun Li, Yu-An Huang, Zhi-An HuangTue, 10 Ma💻 cs

SJD-PV: Speculative Jacobi Decoding with Phrase Verification for Autoregressive Image Generation

This paper introduces SJD-PV, a training-free acceleration framework for autoregressive image generation that leverages phrase-level speculative verification based on token co-occurrence statistics to jointly validate multiple correlated tokens, achieving up to 30% faster decoding without compromising visual fidelity.

Zhehao Yu, Baoquan Zhang, Bingqi Shan, Xinhao Liu, Dongliang Zhou, Guotao Liang, Guangming Ye, Yunming YeTue, 10 Ma💻 cs

calibfusion: Transformer-Based Differentiable Calibration for Radar-Camera Fusion Detection in Water-Surface Environments

The paper proposes CalibFusion, a Transformer-based differentiable calibration framework that learns to implicitly refine Radar-Camera extrinsics end-to-end to overcome the challenges of textureless, cluttered water-surface environments and significantly improve fusion-based 2D object detection.

Yuting Wan, Liguo Sun, Jiuwu Hao, Pin LVTue, 10 Ma💻 cs

Does Semantic Noise Initialization Transfer from Images to Videos? A Paired Diagnostic Study

This paper investigates whether semantic noise initialization, known to improve image diffusion models, transfers to text-to-video generation, finding that while it shows a slight positive trend on temporal metrics, it does not significantly outperform standard Gaussian noise due to weak or unstable signals in the noise space.

Yixiao Jing, Chaoyu Zhang, Zixuan Zhong, Peizhou HuangTue, 10 Ma💻 cs

← Previous Next →