cs.CV papers | Gist.Science

Near-Real-Time Conflict-Related Fire Detection in Sudan Using Unsupervised Deep Learning

This study introduces a lightweight, unsupervised Variational Auto-Encoder model that uses 3-meter, 4-band Planet Labs imagery to detect conflict-related fires in Sudan within 24 to 30 hours, achieving higher recall and F1 scores than traditional change detection methods for near-real-time war zone monitoring.

Kuldip Singh Atwal, Dieter Pfoser, Daniel Rothbart | 2026-03-03 | 🤖 cs.AI

Family Matters: A Systematic Study of Spatial vs. Frequency Masking for Continual Test-Time Adaptation

This paper presents a systematic study isolating the impact of masking families in continual test-time adaptation, revealing that spatial masking generally outperforms frequency masking on patch-tokenized architectures by preserving structural coherence, while the optimal choice ultimately depends on the alignment between the specific architecture and task.

Chandler Timm C. Doloriel, Yunbei Zhang, Yeonguk Yu + 6 more | 2026-03-03 | 💻 cs

Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model

Brain-Semantoks is a self-supervised foundation model for fMRI time series that leverages a semantic tokenizer and a self-distillation objective to learn robust, abstract representations of brain dynamics, enabling strong downstream performance and scalable out-of-distribution generalization without extensive fine-tuning.

Sam Gijsen, Marc-Andre Schulz, Kerstin Ritter | 2026-03-03 | 🧬 q-bio

$\beta$-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment

This paper introduces $\beta$-CLIP, a multi-granular text-conditioned contrastive learning framework that employs cross-attention and a novel $\beta$-Contextualized Contrastive Alignment Loss to achieve state-of-the-art dense vision-language alignment by hierarchically matching textual descriptions of varying lengths to corresponding visual regions.

Fatimah Zohra, Chen Zhao, Hani Itani + 1 more | 2026-03-03 | 💻 cs

CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives

CRISP is a novel method that recovers physically plausible, simulation-ready human motion and scene geometry from monocular video by fitting planar primitives to point clouds, leveraging contact modeling for occluded regions, and validating interactions through reinforcement learning, thereby significantly reducing motion tracking failures and accelerating real-to-sim applications.

Zihan Wang, Jiashun Wang, Jeff Tan + 4 more | 2026-03-03 | 💻 cs

SoFlow: Solution Flow Models for One-Step Generative Modeling

SoFlow introduces a one-step generative modeling framework that leverages a novel Flow Matching loss and a Jacobian-free solution consistency loss to achieve superior ImageNet 256x256 generation performance compared to MeanFlow models while avoiding computationally expensive operations.

Tianze Luo, Haotian Yuan, Zhuang Liu | 2026-03-03 | 🤖 cs.LG

AI-Powered Dermatological Diagnosis: From Interpretable Models to Clinical Implementation. A Comprehensive Framework for Accessible and Trustworthy Skin Disease Detection

This research proposes a comprehensive, interpretable multi-modal AI framework that integrates deep learning image analysis with family history data to enhance the accuracy and personalization of dermatological diagnosis, with plans for prospective clinical trials to validate its real-world implementation.

Satya Narayana Panda, Vaishnavi Kukkala, Spandana Iyer | 2026-03-03 | 🤖 cs.AI

GeoTeacher: Geometry-Guided Semi-Supervised 3D Object Detection

The paper proposes GeoTeacher, a semi-supervised 3D object detection framework that enhances student model performance on limited labeled data by employing a keypoint-based geometric relation supervision module and a distance-decay voxel-wise data augmentation strategy to better capture and understand object geometries, achieving state-of-the-art results on the ONCE and Waymo datasets.

Jingyu Li, Xiaolong Zhao, Zhe Liu + 2 more | 2026-03-03 | 💻 cs

ForCM: Forest Cover Mapping from Multispectral Sentinel-2 Image by Integrating Deep Learning with Object-Based Image Analysis

This research proposes "ForCM," a novel forest cover mapping approach that integrates deep learning models with Object-Based Image Analysis (OBIA) on Sentinel-2 imagery, achieving higher accuracy (up to 95.64%) than traditional OBIA methods to support global environmental monitoring.

Maisha Haque, Israt Jahan Ayshi, Sadaf M. Anis + 8 more | 2026-03-03 | 🤖 cs.AI

Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization

This paper introduces CEM, a model-agnostic, plug-and-play method that uses a dynamic programming algorithm guided by cumulative error minimization to dynamically optimize caching strategies, significantly improving the generation fidelity of accelerated Diffusion Transformer models without additional computational overhead.

Tong Shao, Yusen Fu, Guoying Sun + 3 more | 2026-03-03 | 💻 cs

Aligned explanations in neural networks

This paper introduces Pointwise-interpretable Networks (PiNets), a modeling framework that ensures "explanatory alignment" by combining statistical intelligence with a pseudo-linear structure to produce neural network explanations that directly underlie predictions rather than merely rationalizing them.

Corentin Lobet, Francesca Chiaromonte | 2026-03-03 | 📊 stat

TP-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models

TP-Blend is a lightweight, training-free framework that achieves precise object-style blending in diffusion models by combining Cross-Attention Object Fusion for spatially aware feature reassignment and Self-Attention Style Fusion for detail-sensitive texture modulation, enabling simultaneous high-fidelity object replacement and style transfer.

Xin Jin, Yichuan Zhong, Yapeng Tian | 2026-03-03 | 🤖 cs.AI

Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints

This paper presents a zero-shot 3D object alignment framework that optimizes relative pose using CLIP-driven gradients and geometry-aware constraints via a differentiable renderer, achieving semantically faithful and physically plausible results without requiring new model training.

Rotem Gatenyo, Ohad Fried | 2026-03-03 | 💻 cs

Counterfactual Explanations on Robust Perceptual Geodesics

This paper introduces Perceptual Counterfactual Geodesics (PCG), a method that generates robust and semantically valid counterfactual explanations by tracing geodesics under a perceptually aligned Riemannian metric, thereby overcoming the off-manifold artifacts and adversarial vulnerabilities inherent in existing latent-space optimization approaches.

Eslam Zaher, Maciej Trzaskowski, Quan Nguyen + 1 more | 2026-03-03 | 🤖 cs.LG

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Vision-DeepResearch introduces a novel multimodal deep-research paradigm that leverages multi-turn, multi-entity, and multi-scale visual and textual search, trained via cold-start supervision and reinforcement learning, to significantly outperform existing models and strong closed-source foundation models in solving complex, noise-heavy real-world questions.

Wenxuan Huang, Yu Zeng, Qiuchen Wang + 13 more | 2026-03-03 | 🤖 cs.AI

When Anomalies Depend on Context: Learning Conditional Compatibility for Anomaly Detection

This paper introduces the CAAD-3K benchmark and a conditional compatibility learning framework that leverages vision-language representations to detect anomalies based on subject-context compatibility, thereby addressing the limitations of traditional methods that treat abnormality as an intrinsic property independent of context.

Shashank Mishra, Didier Stricker, Jason Rambach | 2026-03-03 | 🤖 cs.LG

Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning

This paper introduces HitEmotion, a Theory-of-Mind-grounded benchmark and a corresponding framework combining ToM-guided reasoning chains with TMPO reinforcement learning to significantly enhance the deep emotional understanding and reasoning capabilities of multimodal large language models.

Meng Luo, Bobo Li, Shanqing Xu + 8 more | 2026-03-03 | 💻 cs

Gradient-Aligned Calibration for Post-Training Quantization of Diffusion Models

This paper proposes a novel post-training quantization method for diffusion models that optimizes calibration sample weights to align gradients across timesteps, thereby overcoming the sub-optimality of uniform weighting and significantly improving quantization performance.

Dung Anh Hoang, Cuong Pham anh Trung Le, Jianfei Cai + 1 more | 2026-03-03 | 🤖 cs.LG

Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning

This paper introduces CaCoVID, a reinforcement learning-based token compression framework for video large language models that optimizes token selection by explicitly maximizing their contribution to correct predictions rather than relying on attention scores, thereby significantly reducing computational overhead while maintaining performance.

Yinchao Ma, Qiang Zhou, Zhibin Wang + 4 more | 2026-03-03 | 🤖 cs.AI

CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions

This paper introduces CloDS, an unsupervised framework that learns cloth dynamics from multi-view visual observations under unknown conditions by employing a three-stage pipeline featuring a novel dual-position opacity modulation for robust video-to-geometry grounding.

Yuliang Zhan, Jian Li, Wenbing Huang + 3 more | 2026-03-03 | 🤖 cs.AI
