cs.CV papers | Gist.Science

CuriousBot: Interactive Mobile Exploration via Actionable 3D Relational Object Graph

This paper introduces CuriousBot, a mobile exploration system that utilizes a 3D relational object graph to enable active interaction with diverse objects in complex environments, outperforming vision-language model-based approaches in both effectiveness and generalization.

Yixuan Wang, Leonor Fermoselle, Tarik Kelestemur, Jiuguang Wang, Yunzhu LiWed, 11 Ma🤖 cs.LG

Directional Textual Inversion for Personalized Text-to-Image Generation

This paper introduces Directional Textual Inversion (DTI), a method that improves personalized text-to-image generation by constraining learned token embeddings to a fixed magnitude and optimizing only their direction on a hypersphere, thereby preventing norm inflation, enhancing text fidelity, and enabling smooth semantic interpolation between concepts.

Kunhee Kim, NaHyeon Park, Kibeom Hong, Hyunjung ShimWed, 11 Ma🤖 cs.LG

Kuramoto Orientation Diffusion Models

This paper introduces Kuramoto Orientation Diffusion Models, a score-based generative framework that leverages stochastic Kuramoto dynamics on periodic domains to effectively model orientation-rich images by replacing standard isotropic diffusion with synchronization-based forward processes and desynchronization-based reverse generation.

Yue Song, T. Anderson Keller, Sevan Brodjian, Takeru Miyato, Yisong Yue, Pietro Perona, Max WellingWed, 11 Ma🤖 cs.LG

Unsupervised Representation Learning from Sparse Transformation Analysis

This paper proposes an unsupervised representation learning framework that factorizes latent variable transformations into sparse rotational and potential flow fields, enabling the model to learn disentangled representations based on independent transformation primitives while achieving state-of-the-art performance in data likelihood and equivariance on sequence data.

Yue Song, Thomas Anderson Keller, Yisong Yue, Pietro Perona, Max WellingWed, 11 Ma🤖 cs.LG

From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding

The paper proposes C2FMAE, a coarse-to-fine masked autoencoder that resolves the tension between global semantics and local details in self-supervised learning by employing a cascaded decoder and progressive masking curriculum on a newly constructed multi-granular dataset to achieve hierarchical visual understanding and superior performance across various vision tasks.

Wenzhao Xiang, Yue Wu, Hongyang Yu, Feng Gao, Fan Yang, Xilin ChenWed, 11 Ma🤖 cs.LG

What is Missing? Explaining Neurons Activated by Absent Concepts

This paper identifies that deep neural networks frequently encode the absence of concepts to drive neuron activation—a phenomenon largely overlooked by standard explainable AI methods—and proposes simple extensions to attribution and feature visualization techniques to effectively reveal and leverage these "missing" concepts for better model interpretation and debiasing.

Robin Hesse, Simone Schaub-Meyer, Janina Hesse, Bernt Schiele, Stefan RothWed, 11 Ma🤖 cs.LG

MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

MM-Zero is the first RL-based framework to enable Vision Language Models to self-evolve from zero data by employing a multi-role system (Proposer, Coder, and Solver) trained with Group Relative Policy Optimization to generate visual concepts, render them via code, and solve multimodal reasoning tasks without any seed images.

Zongxia Li, Hongyang Du, Chengsong Huang, Xiyang Wu, Lantao Yu, Yicheng He, Jing Xie, Xiaomin Wu, Zhichao Liu, Jiarui Zhang, Fuxiao LiuWed, 11 Ma🤖 cs.LG

Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning

The paper introduces MedCBR, a novel framework that integrates clinical guidelines with vision-language models to enhance the interpretability and accuracy of medical image diagnosis by transforming visual features into guideline-conformant concepts and structured clinical narratives.

Mohamed Harmanani, Bining Long, Zhuoxin Guo, Paul F. R. Wilson, Amirhossein Sabour, Minh Nguyen Nhat To, Gabor Fichtinger, Purang Abolmaesumi, Parvin MousaviWed, 11 Ma🤖 cs.LG

Performance Analysis of Edge and In-Sensor AI Processors: A Comparative Review

This paper reviews the landscape of ultra-low-power edge and in-sensor AI processors and empirically benchmarks a segmentation model on GAP9, STM32N6, and Sony IMX500 platforms to demonstrate that while in-sensor processing offers superior energy-delay performance, different architectures provide distinct trade-offs between latency, energy efficiency, and power budgets.

Luigi Capogrosso, Pietro Bonazzi, Michele MagnoWed, 11 Ma🤖 cs.LG

An accurate flatness measure to estimate the generalization performance of CNN models

This paper proposes an exact, parameterization-aware flatness measure tailored to the geometric structure of convolutional neural networks with global average pooling, demonstrating its effectiveness as a robust proxy for estimating and comparing generalization performance across various CNN architectures.

Rahman Taleghani, Maryam Mohammadi, Francesco MarchettiWed, 11 Ma🤖 cs.LG

The Coupling Within: Flow Matching via Distilled Normalizing Flows

This paper introduces Normalized Flow Matching (NFM), a novel method that distills quasi-deterministic couplings from pretrained auto-regressive normalizing flow models to train student flow models, achieving superior performance over both traditional flow matching approaches and the teacher models themselves.

David Berthelot, Tianrong Chen, Jiatao Gu, Marco Cuturi, Laurent Dinh, Bhavik Chandna, Michal Klein, Josh Susskind, Shuangfei ZhaiWed, 11 Ma🤖 cs.LG

Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation

Pri4R is a simple yet effective method that enhances Vision-Language-Action models with an implicit understanding of world dynamics by training them to predict 3D point tracks using privileged 4D information, thereby significantly improving physical manipulation performance without adding inference overhead.

Jisoo Kim, Jungbin Cho, Sanghyeok Chu, Ananya Bal, Jinhyung Kim, Gunhee Lee, Sihaeng Lee, Seung Hwan Kim, Bohyung Han, Hyunmin Lee, Laszlo A. Jeni, Seungryong KimWed, 11 Ma🤖 cs.AI

Zero-Shot and Supervised Bird Image Segmentation Using Foundation Models: A Dual-Pipeline Approach with Grounding DINO~1.5, YOLOv11, and SAM~2.1

This paper proposes a dual-pipeline framework for bird image segmentation that leverages the frozen SAM 2.1 backbone with either a zero-shot Grounding DINO 1.5 detector or a supervised fine-tuned YOLOv11 detector, achieving state-of-the-art performance on the CUB-200-2011 dataset while eliminating the need for retraining the segmentation model across different species or domains.

Abhinav MunagalaWed, 11 Ma🤖 cs.AI

OrthoAI: A Neurosymbolic Framework for Evidence-Grounded Biomechanical Reasoning in Clear Aligner Orthodontics

OrthoAI is a neurosymbolic framework that bridges 3D tooth segmentation and clinical reasoning for clear aligner orthodontics by combining sparse-supervision learning, knowledge-grounded biomechanical constraint inference, and multi-criteria treatment evaluation to enable fast, evidence-based automated decision support.

Edouard Lansiaux, Margaux Leman, Mehdi AmmiWed, 11 Ma🤖 cs.AI

B-DENSE: Branching For Dense Ensemble Network Supervision Efficiency

The paper proposes B-DENSE, a novel distillation framework that leverages multi-branch trajectory alignment to enforce dense intermediate supervision, thereby overcoming the structural information loss and discretization errors of existing methods to achieve superior image generation quality with reduced inference latency.

Cherish Puniani, Tushar Kumar, Arnav Bendre, Gaurav Kumar, Shree SinghiWed, 11 Ma🤖 cs.AI

Energy-Aware Spike Budgeting for Continual Learning in Spiking Neural Networks for Neuromorphic Vision

This paper proposes an energy-aware spike budgeting framework that integrates experience replay, learnable neuron parameters, and an adaptive scheduler to effectively mitigate catastrophic forgetting while optimizing both accuracy and energy efficiency in Spiking Neural Networks across diverse frame-based and event-based neuromorphic vision benchmarks.

Anika Tabassum Meem, Muntasir Hossain Nadid, Md Zesun Ahmed MiaWed, 11 Ma🤖 cs.AI

Monocular Normal Estimation via Shading Sequence Estimation

This paper introduces RoSE, a novel approach that reformulates monocular normal estimation as shading sequence estimation using image-to-video generative models to overcome 3D misalignment issues and achieve state-of-the-art performance on real-world benchmarks.

Zongrui Li, Xinhua Ma, Minghui Hu, Yunqing Zhao, Yingchen Yu, Qian Zheng, Chang Liu, Xudong Jiang, Song BaiWed, 11 Ma🤖 cs.AI

WebAccessVL: Violation-Aware VLM for Web Accessibility

The paper introduces WebAccessVL, a violation-aware vision-language model that automatically edits website HTML to fix WCAG2 accessibility violations while preserving visual design, achieving a 96% reduction in violations and outperforming GPT-5 through a supervised image-conditioned program synthesis approach enhanced by a checker-in-the-loop refinement strategy.

Amber Yijia Zheng, Jae Joong Lee, Bedrich Benes, Raymond A. YehWed, 11 Ma🤖 cs.AI

CLEAR-Mamba:Towards Accurate, Adaptive and Trustworthy Multi-Sequence Ophthalmic Angiography Classification

The paper introduces CLEAR-Mamba, an enhanced MedMamba framework featuring a hypernetwork-based adaptive conditioning layer and a reliability-aware prediction scheme, which achieves superior accuracy and trustworthiness in multi-sequence ophthalmic angiography classification by addressing challenges in generalization and confidence estimation.

Zhuonan Wang, Wenjie Yan, Wenqiao Zhang, Xiaohui Song, Jian Ma, Ke Yao, Yibo Yu, Beng Chin OoiWed, 11 Ma🤖 cs.AI

When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models

This paper introduces UPA-RFAS, a unified framework that generates universal and transferable physical adversarial patches to effectively attack diverse Vision-Language-Action (VLA) models across unknown architectures, finetuned variants, and sim-to-real shifts by leveraging robust feature alignment, a two-phase min-max optimization, and VLA-specific attention and semantic losses.

Hui Lu, Yi Yu, Yiming Yang, Chenyu Yi, Qixin Zhang, Bingquan Shen, Alex C. Kot, Xudong JiangWed, 11 Ma🤖 cs.AI

← Previous Next →