cs.CV papers | Gist.Science

Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation

The paper proposes DHVAE, a disentangled hierarchical variational autoencoder with contrastive learning and DDIM-based diffusion, to generate realistic 3D human-human interactions by explicitly separating global context from individual motion patterns to ensure physical plausibility and semantic alignment.

Zichen Geng, Zeeshan Hayder, Bo Miao + 3 more | 2026-03-03 | 🤖 cs.AI

M-Gaussian: An Magnetic Gaussian Framework for Efficient Multi-Stack MRI Reconstruction

M-Gaussian introduces an efficient framework for multi-stack MRI reconstruction that adapts 3D Gaussian Splatting with physics-consistent primitives and neural residual refinement, achieving higher image quality and a 14-fold speedup over existing implicit neural methods.

Kangyuan Zheng, Xuan Cai, Jiangqi Wang + 6 more | 2026-03-03 | 🤖 cs.AI

Mechanistically Guided LoRA Improves Paraphrase Consistency in Medical Vision-Language Models

This paper demonstrates that a mechanistically guided LoRA fine-tuning approach, leveraging transferred Sparse Autoencoders to balance paraphrase consistency with answer accuracy, significantly reduces response flip rates and margin differences in medical Vision-Language Models while maintaining stable diagnostic performance.

Binesh Sadanandan, Vahid Behzadan | 2026-03-03 | 💻 cs

Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction

The paper introduces ReMD, a physics-consistent diffusion framework that leverages multiscale residual correction via a multi-wavelet basis to achieve efficient, high-fidelity fluid super-resolution with reduced sampling steps and improved spectral accuracy compared to existing methods.

Zhihao Li, Shengwei Dong, Chuang Yi + 5 more | 2026-03-03 | 🤖 cs.AI

Attention to Neural Plagiarism: Diffusion Models Can Plagiarize Your Copyrighted Images!

This paper shows that diffusion models can be used to plagiarize copyrighted images: a purely gradient-based "anchors and shims" method perturbs the cross-attention mechanisms, bypassing both visible and invisible copyright protections without any additional training.

Zihang Zou, Boqing Gong, Liqiang Wang | 2026-03-03 | 💻 cs

Multiview Progress Prediction of Robot Activities

This paper proposes a multi-view architecture to overcome self-occlusion challenges and improve the prediction of action progress in robot manipulation tasks, demonstrating its effectiveness through experiments on Mobile ALOHA.

Elena Zoppellari, Federico Becattini, Marco Fiorucci + 1 more | 2026-03-03 | 💻 cs

EfficientPosterGen: Semantic-aware Efficient Poster Generation via Token Compression and Accurate Violation Detection

EfficientPosterGen is an end-to-end framework that automates academic poster generation by integrating semantic-aware retrieval, visual-based context compression to reduce token usage, and a deterministic algorithm for reliable layout violation detection, thereby achieving high-quality, token-efficient, and layout-accurate results.

Wenxin Tang, Jingyu Xiao, Yanpei Gong + 6 more | 2026-03-03 | 🤖 cs.AI

BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation

BiCLIP is a robust medical image segmentation framework that utilizes bidirectional multimodal fusion and augmentation consistency to achieve superior performance with minimal labeled data and high resilience against clinical artifacts.

Saivan Talaei, Fatemeh Daneshfar, Abdulhady Abas Abdullah + 1 more | 2026-03-03 | 💻 cs

FujiView: Multimodal Late-Fusion for Predicting Scenic Visibility

FujiView introduces a multimodal late-fusion framework and a large-scale dataset that combines webcam imagery with meteorological data to accurately predict scenic visibility around Mount Fuji, achieving high accuracy across short-term and next-day horizons.

Bryceton Bible, Shah Md Nehal Hasnaeen, Hairong Qi | 2026-03-03 | 💻 cs

FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation

FlowPortrait is a reinforcement learning framework that leverages a multimodal large language model-based evaluation system and Group Relative Policy Optimization to generate high-quality, lip-synced talking-head videos with improved motion expressiveness and temporal consistency.

Weiting Tan, Andy T. Liu, Ming Tu + 3 more | 2026-03-03 | 🤖 cs.AI

DINOv3 Meets YOLO26 for Weed Detection in Vegetable Crops

This study proposes a robust precision weeding system that combines a large-scale curated dataset with a DINOv3-finetuned ViT-small backbone integrated into the YOLO26 architecture, achieving significant improvements in detection accuracy and cross-domain generalization while maintaining real-time performance.

Boyang Deng, Yuzhen Lu | 2026-03-03 | 🤖 cs.AI

SKINOPATHY AI: Smartphone-Based Ophthalmic Screening and Longitudinal Tracking Using Lightweight Computer Vision

SKINOPATHY AI is a privacy-preserving, smartphone-based web application that utilizes lightweight computer vision algorithms to perform five complementary, non-diagnostic ophthalmic screening tasks and longitudinal tracking entirely on-device, thereby enabling accessible eye health monitoring in low-resource settings without specialized equipment or cloud inference.

S. Kalaycioglu, C. Hong, M. Zhu + 1 more | 2026-03-03 | 🤖 cs.LG

GazeXPErT: An Expert Eye-tracking Dataset for Interpretable and Explainable AI in Oncologic FDG-PET/CT Scans

This paper introduces GazeXPErT, a comprehensive 4D eye-tracking dataset capturing expert search patterns on 346 oncologic FDG-PET/CT scans, which demonstrates that integrating human gaze data significantly enhances the performance and interpretability of AI models for tumor segmentation and localization.

Joy T Wu, Daniel Beckmann, Sarah Miller + 15 more | 2026-03-03 | ⚡ eess

A Boundary-Metric Evaluation Protocol for Whiteboard Stroke Segmentation Under Extreme Imbalance

This paper proposes a comprehensive evaluation protocol for whiteboard stroke segmentation that addresses extreme class imbalance by integrating boundary-aware metrics and thin-subset equity analysis, revealing that overlap-based losses and higher resolution significantly outperform standard cross-entropy and classical binarization methods in both accuracy and worst-case reliability.

Nicholas Korcynski | 2026-03-03 | 🤖 cs.LG

ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering

ConFoThinking is a novel framework for Visual Question Answering that enhances fine-grained perception by consolidating fragmented attention signals into a designated intermediate layer and utilizing concise semantic cues to accurately localize and zoom in on salient regions, thereby overcoming the limitations of existing tool-augmented and attention-driven methods.

Zhaodong Wu, Haochen Xue, Qi Cao + 5 more | 2026-03-03 | 💻 cs

Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?

This paper introduces a hierarchical framework for AI obedience and the VIOLIN benchmark to demonstrate that generative models struggle with simple, deterministic tasks like pure color generation due to generative priors overriding logical constraints, despite their success with complex imagery.

Hongyu Li, Kuan Liu, Yuan Chen + 6 more | 2026-03-03 | 🤖 cs.AI

Image-Based Classification of Olive Species Specific to Turkiye with Deep Neural Networks

This study demonstrates that a deep learning-based system utilizing the EfficientNetB0 model achieves 94.5% accuracy in automatically classifying five distinct olive species cultivated in Turkiye, offering a promising solution for agricultural quality control and identification.

Irfan Atabas, Hatice Karatas | 2026-03-03 | 💻 cs

Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model

This paper presents a systematic account of the engineering challenges, design decisions, and key lessons learned in developing the Summer-22B video foundation model, emphasizing that dataset engineering and metadata-driven curation were more critical to success than architectural variations.

Simo Ryu, Chunghwan Han | 2026-03-03 | 🤖 cs.LG

Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression

The paper proposes ST-Lite, a training-free KV cache compression framework that exploits the uniformly high sparsity of GUI attention patterns through a dual-branch scoring policy combining spatial saliency and trajectory-aware semantic gating, achieving significant decoding acceleration with minimal performance loss in long-horizon GUI agents.

Bowen Zhou, Zhou Xu, Wanli Li + 2 more | 2026-03-03 | 🤖 cs.LG

Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual Learning

This paper proposes LoDA, a task-driven subspace decomposition method for LoRA-based continual learning that enhances knowledge sharing and isolation by decoupling general and task-specific directions through energy-based objectives and gradient-aligned optimization, thereby outperforming existing approaches.

Lingfeng He, De Cheng, Huaijie Wang + 3 more | 2026-03-03 | 🤖 cs.LG
