cs.CV papers | Gist.Science

From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding

The paper proposes C2FMAE, a coarse-to-fine masked autoencoder that addresses the tension between global semantics and local details in self-supervised learning. It combines a cascaded decoder with a progressive masking curriculum, trained on a newly constructed multi-granular dataset, to achieve hierarchical visual understanding and superior performance across various vision tasks.

Wenzhao Xiang, Yue Wu, Hongyang Yu, Feng Gao, Fan Yang, Xilin Chen | 2026-03-11 | 🤖 cs.LG

BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion

This paper introduces BEACON, a language-conditioned navigation system that overcomes the limitations of existing 2D image-space methods by predicting an occlusion-aware Bird's-Eye View affordance heatmap from surround-view RGB-D observations, thereby significantly improving the accuracy of inferring traversable targets in occluded regions.

Xinyu Gao, Gang Chen, Javier Alonso-Mora | 2026-03-11 | 🤖 cs.AI

ReCoSplat: Autoregressive Feed-Forward Gaussian Splatting Using Render-and-Compare

ReCoSplat is an autoregressive feed-forward Gaussian Splatting model that addresses the training-inference pose mismatch with a novel Render-and-Compare module. It achieves state-of-the-art online novel view synthesis and handles long sequences efficiently via hybrid KV cache compression.

Freeman Cheng, Botao Ye, Xueting Li, Junqi You, Fangneng Zhan, Ming-Hsuan Yang | 2026-03-11 | 💻 cs

From Data Statistics to Feature Geometry: How Correlations Shape Superposition

This paper challenges the standard view of superposition in neural networks by demonstrating that, unlike in idealized uncorrelated settings where interference is merely noise, realistic feature correlations allow models to arrange features so that interference becomes constructive, thereby naturally forming the semantic clusters and cyclical structures observed in real language models.

Lucas Prieto, Edward Stevinson, Melih Barsbey, Tolga Birdal, Pedro A. M. Mediano | 2026-03-11 | 🤖 cs.AI

Differentiable Microscopy Designs an All Optical Phase Retrieval Microscope

This paper introduces "differentiable microscopy" ($\partial\mu$), a data-driven, top-down design framework that automatically optimizes optical systems for phase retrieval, demonstrating superior performance over existing methods and experimentally validating its effectiveness on biological samples.

Kithmini Herath, Hasindu Kariyawasam, Ramith Hettiarachchi, Udith Haputhanthri, Dineth Jayakody, Raja N. Ahmad, Azeem Ahmad, Balpreet S. Ahluwalia, Chamira U. S. Edussooriya, Dushan N. Wadduwage | 2026-03-10 | 🔬 physics.optics

Class Overwhelms: Mutual Conditional Blended-Target Domain Adaptation

This paper proposes a mutual conditional blended-target domain adaptation framework that aligns categorical distributions and rectifies classifier bias through uncertainty-guided discrimination and low-level feature augmentation, achieving state-of-the-art performance even without explicit domain labels and under label distribution shifts.

Pengcheng Xu, Boyu Wang, Charles Ling | 2026-03-10 | 💻 cs

altiro3D: Scene representation from single image and novel view synthesis

The paper introduces altiro3D, a free library that synthesizes realistic 3D experiences and novel views from a single RGB image or video by combining monocular depth estimation, inpainting, and optimized projection algorithms to generate multi-viewpoint light fields for free-view displays.

E. Canessa, L. Tenze | 2026-03-10 | 💻 cs

Multi-Scale Distillation for RGB-D Anomaly Detection on the PD-REAL Dataset

This paper introduces PD-REAL, a novel large-scale RGB-D dataset for unsupervised anomaly detection based on Play-Doh models, and proposes a multi-scale teacher-student framework with hierarchical distillation that leverages 3D information to achieve superior detection accuracy compared to existing methods.

Jianjian Qin, Chao Zhang, Chunzhi Gu, Zi Wang, Jun Yu, Yijin Wei, Hui Xiao, Xin Yua | 2026-03-10 | 💻 cs

DivCon: Divide and Conquer for Complex Numerical and Spatial Reasoning in Text-to-Image Generation

DivCon introduces a divide-and-conquer framework that decouples text-to-image generation into simplified numerical/spatial reasoning and progressive object synthesis steps, enabling lightweight models to achieve superior layout accuracy and perceptual quality for complex multi-object prompts without relying on closed-source large language models.

Yuhao Jia, Wenhan Tan | 2026-03-10 | 💻 cs

Deepfake Generation and Detection: A Benchmark and Survey

This paper presents a comprehensive survey and benchmark of deepfake generation and detection, unifying task definitions, reviewing state-of-the-art methods across four key generation fields and forgery detection, and analyzing current challenges and future research directions.

Gan Pei, Jiangning Zhang, Menghan Hu, Zhenyu Zhang, Chengjie Wang, Yunsheng Wu, Guangtao Zhai, Jian Yang, Dacheng Tao | 2026-03-10 | 💻 cs

Goldilocks Test Sets for Face Verification

This paper proposes three high-quality, controlled test sets (Hadrian, Eclipse, and ND-Twins) designed to challenge face verification models on natural variations in facial attributes and similar-looking identities, while introducing "Goldilocks" rules to ensure balanced difficulty and demographic fairness without artificially degrading image quality.

Haiyu Wu, Sicong Tian, Aman Bhatta, Jacob Gutierrez, Grace Bezold, Genesis Argueta, Karl Ricanek Jr., Michael C. King, Kevin W. Bowyer | 2026-03-10 | 💻 cs

Exploring Diffusion Models' Corruption Stage in Few-Shot Fine-tuning and Mitigating with Bayesian Neural Networks

This paper identifies a "corruption stage" in few-shot fine-tuned diffusion models caused by a narrowed learning distribution and proposes a Bayesian Neural Network approach with variational inference to broaden this distribution, thereby mitigating corruption and improving image fidelity, quality, and diversity without additional inference costs.

Xiaoyu Wu, Jiaru Zhang, Yang Hua, Bohan Lyu, Hao Wang, Tao Song, Haibing Guan | 2026-03-10 | 🤖 cs.LG

RDM: Recurrent Diffusion Model for Human Motion Generation

This paper proposes RDM, a recurrent diffusion model that leverages Normalizing Flows to condition generation on preceding noisy frames, enabling efficient, long-duration human motion synthesis with reduced computational costs while maintaining high alignment with text prompts.

Mirgahney Mohamed, Harry Jake Cunningham, Marc P. Deisenroth, Lourdes Agapito | 2026-03-10 | 💻 cs

Improving Visual Object Tracking through Visual Prompting

The paper proposes PiVOT, a visual prompting mechanism that leverages a pretrained CLIP foundation model to automatically generate and refine online visual prompts, thereby enhancing generic object tracking by effectively suppressing distractors through contrastive guidance.

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin | 2026-03-10 | 💻 cs

ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance

ExpGest is a novel diffusion-based framework that generates expressive, controllable full-body gestures by leveraging synchronized audio and text guidance, along with a specialized noise emotion classifier, to overcome the limitations of existing methods that often produce stiff, upper-body-only movements.

Yongkang Cheng, Mingjiang Liang, Shaoli Huang, Gaoge Han, Jifeng Ning, Wei Liu | 2026-03-10 | 💻 cs

Autoassociative Learning of Structural Representations for Modeling and Classification in Medical Imaging

This paper introduces a neurosymbolic system that reconstructs medical images using visual primitives to generate high-level structural explanations, achieving superior classification accuracy and transparency compared to conventional deep learning models in diagnosing histological abnormalities.

Zuzanna Buchnajzer, Kacper Dobek, Stanisław Hapke, Daniel Jankowski, Krzysztof Krawiec | 2026-03-10 | 🤖 cs.LG

Input-Adaptive Generative Dynamics in Diffusion Models

This paper proposes an input-adaptive framework for diffusion models that dynamically adjusts the generative trajectory and sampling steps for each sample based on its complexity, thereby maintaining generation quality while reducing the average number of required steps.

Yucheng Xing, Xiaodong Liu, Xin Wang | 2026-03-10 | 🤖 cs.LG

Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models

This paper introduces HarmonicEval, a reference-free, multi-criteria evaluation metric for vision-language models that aggregates criterion-wise scores to better align with human judgments across diverse multi-modal tasks, supported by the newly constructed MMHE benchmark containing 18,000 expert human evaluations.

Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue | 2026-03-10 | 💬 cs.CL

From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models

This paper proposes a method that leverages pretrained vision-language models to learn compact, abstract symbolic world models from limited visual demonstrations, enabling zero-shot generalization and long-horizon planning for complex robotic tasks across novel objects, environments, and goals.

Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Jiuguang Wang, Tomás Lozano-Pérez, Leslie Pack Kaelbling | 2026-03-10 | 🤖 cs.LG

Efficient Semi-Supervised Adversarial Training via Latent Clustering-Based Data Reduction

This paper proposes efficient data reduction strategies for semi-supervised adversarial training that utilize latent clustering techniques to select or generate critical boundary-adjacent samples, significantly reducing data requirements and computational costs while maintaining state-of-the-art robustness.

Somrita Ghosh, Yuelin Xu, Xiao Zhang | 2026-03-10 | 🤖 cs.LG
