cs.CV papers | Gist.Science

WAFFLE: Finetuning Multi-Modal Models for Automated Front-End Development

The paper introduces Waffle, a novel fine-tuning strategy that employs structure-aware attention and contrastive learning to significantly enhance multi-modal models' ability to convert UI designs into functional HTML code, outperforming existing methods on both new and established benchmarks.

Shanchao Liang, Nan Jiang, Shangshu Qian + 1 more | 2026-03-04 | 💬 cs.CL

RealOSR: Latent Guidance Boosts Diffusion-based Real-world Omnidirectional Image Super-Resolutions

The paper proposes RealOSR, a diffusion-based framework for real-world omnidirectional image super-resolution that utilizes a novel Latent Gradient Alignment Routing (LaGAR) module to enable efficient one-step denoising, achieving significant visual quality improvements and over 200$\times$ inference acceleration compared to existing methods.

Xuhan Sheng, Runyi Li, Bin Chen + 3 more | 2026-03-04 | ⚡ eess

Slot-BERT: Self-supervised Object Discovery in Surgical Video

The paper presents Slot-BERT, a bidirectional long-range self-supervised model that overcomes the temporal coherence and computational limitations of existing methods to achieve robust, scalable object discovery and zero-shot domain adaptation in long surgical videos.

Guiqiu Liao, Matjaz Jogan, Marcel Hussing + 5 more | 2026-03-04 | ⚡ eess

Weight Space Representation Learning on Diverse NeRF Architectures

This paper introduces the first framework capable of learning architecture-agnostic representations for diverse Neural Radiance Fields (NeRFs) by training a Graph Meta-Network with a contrastive objective, enabling robust inference across multiple and unseen NeRF architectures for tasks like classification and retrieval.

Francesco Ballerini, Pierluigi Zama Ramirez, Luigi Di Stefano + 1 more | 2026-03-04 | 💻 cs

Cycle-Consistent Multi-Graph Matching for Self-Supervised Annotation of C.Elegans

This paper introduces a novel, fully unsupervised cycle-consistent multi-graph matching approach that achieves state-of-the-art accuracy in semantic cell annotation for *C. elegans* 3D microscopy images, enabling the creation of the first unsupervised cell atlas without requiring ground truth labels.

Christoph Karg, Sebastian Stricker, Lisa Hutschenreiter + 2 more | 2026-03-04 | 💻 cs

GAN-Based Single-Stage Defense for Traffic Sign Classification Under Adversarial Patch

This paper proposes a computationally efficient, model-agnostic, single-stage GAN-based defense strategy that significantly improves the robustness and accuracy of traffic sign classification in autonomous vehicles against adversarial patch attacks without requiring prior knowledge of the patch design.

Abyad Enan, Mashrur Chowdhury | 2026-03-04 | 💻 cs

Language-guided Open-world Video Anomaly Detection under Weak Supervision

This paper introduces LaGoVAD, a novel language-guided open-world video anomaly detection framework that dynamically adapts to variable anomaly definitions via natural language prompts under weak supervision, supported by the newly proposed PreVAD dataset and validated by state-of-the-art zero-shot performance across seven benchmarks.

Zihao Liu, Xiaoyu Wu, Jianqin Wu + 2 more | 2026-03-04 | 💻 cs

Scale-wise Distillation of Diffusion Models

This paper introduces SwD, a scale-wise diffusion distillation framework that combines a progressive generation strategy with a novel Maximum Mean Discrepancy-based patch-level objective to significantly accelerate sampling in large-scale text-to-image and video models while outperforming existing methods under the same compute budget.

Nikita Starodubcev, Ilya Drobyshevskiy, Denis Kuznedelev + 2 more | 2026-03-04 | 💻 cs

Differentially Private 2D Human Pose Estimation

This paper introduces the first comprehensive framework for differentially private 2D human pose estimation that combines Projected DP-SGD and Feature Differential Privacy to effectively balance formal privacy guarantees with high model accuracy, achieving a mean PCKh@0.5 of 82.61% at $\epsilon=0.8$ on the MPII dataset.

Kaushik Bhargav Sivangi, Paul Henderson, Fani Deligianni | 2026-03-04 | 💻 cs

Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model

The paper proposes ANSE, a model-aware framework that leverages a Bayesian attention-based uncertainty metric (BANSA) to automatically select optimal initial noise seeds for video diffusion models, thereby improving generation quality and temporal coherence with minimal inference overhead.

Kwanyoung Kim, Sanghyun Kim | 2026-03-04 | 🤖 cs.AI

SABER: Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors

This paper introduces SABER, a novel framework that generates spatially consistent, universal 3D adversarial objects to realistically and effectively attack Bird's-Eye-View detectors in autonomous driving by optimizing non-invasive environmental manipulations that maintain multi-view and temporal consistency.

Aixuan Li, Mochu Xiang, Bosen Hou + 3 more | 2026-03-04 | 💻 cs

Interaction Field Matching: Overcoming Limitations of Electrostatic Models

This paper introduces Interaction Field Matching (IFM), a generalized framework that overcomes the modeling complexities of Electrostatic Field Matching by leveraging a novel interaction field inspired by strong quark-antiquark interactions to improve data generation and transfer performance.

Stepan I. Manukhov, Alexander Kolesov, Vladimir V. Palyulin + 1 more | 2026-03-04 | 🤖 cs.AI

HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

This paper introduces HSSBench, a comprehensive multilingual benchmark featuring over 13,000 samples generated through a novel expert-agent collaboration pipeline, designed to evaluate and address the current limitations of Multimodal Large Language Models in handling the interdisciplinary and abstract reasoning tasks characteristic of the Humanities and Social Sciences.

Zhaolu Kang, Junhao Gong, Jiaxu Yan + 15 more | 2026-03-04 | 🤖 cs.AI

Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models

This paper introduces Frame Guidance, a training-free method that enables fine-grained, frame-level control over video generation in diffusion models through efficient latent processing and optimization, eliminating the need for costly fine-tuning while supporting diverse tasks like keyframe guidance, stylization, and looping.

Sangwon Jang, Taekyung Ki, Jaehyeong Jo + 4 more | 2026-03-04 | 🤖 cs.AI

Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward

Perception-R1 addresses the limitation of existing RLVR methods in enhancing multimodal perception by introducing a novel visual perception reward derived from Chain-of-Thought annotations, which effectively boosts both perception and reasoning capabilities of Multimodal Large Language Models to achieve state-of-the-art performance with minimal training data.

Tong Xiao, Xin Xu, Zhenya Huang + 4 more | 2026-03-04 | 🤖 cs.AI

StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams

StreamSplat is a fully feed-forward framework that enables real-time, online reconstruction of dynamic 3D scenes from uncalibrated video streams into 3D Gaussian Splatting representations, achieving state-of-the-art quality with a 1200x speedup over traditional optimization-based methods through probabilistic sampling, bidirectional deformation, and adaptive Gaussian fusion.

Zike Wu, Qi Yan, Xuanyu Yi + 2 more | 2026-03-04 | 🤖 cs.LG

Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model

The paper proposes ECAD, a genetic algorithm-based evolutionary caching method that learns optimal, model-specific inference schedules to significantly accelerate off-the-shelf diffusion models while maintaining high image quality and generalizing across resolutions and architectures without requiring parameter modifications.

Anirud Aggarwal, Abhinav Shrivastava, Matthew Gwilliam | 2026-03-04 | 💻 cs

Synthetic Perception: Can Generated Images Unlock Latent Visual Prior for Text-Centric Reasoning?

This paper demonstrates that generating images on-the-fly via Text-to-Image models can unlock latent visual priors to significantly enhance text-centric reasoning, provided there is strong semantic alignment, task visual groundability, and high generative fidelity.

Yuesheng Huang, Peng Zhang, Xiaoxin Wu + 2 more | 2026-03-04 | 💻 cs

SceneStreamer: Continuous Scenario Generation as Next Token Group Prediction

SceneStreamer is a unified autoregressive transformer framework that generates continuous, long-horizon traffic scenarios by predicting sequences of tokens representing dynamic elements like agents and traffic signals, thereby enabling the creation of realistic, diverse, and adaptive environments that significantly improve the robustness and generalization of autonomous driving policies.

Zhenghao Peng, Yuxin Liu, Bolei Zhou | 2026-03-04 | 💻 cs

Navigating with Annealing Guidance Scale in Diffusion Space

This paper proposes a novel, memory-efficient annealing guidance scheduler that dynamically adjusts the guidance scale during diffusion sampling based on conditional noisy signals, thereby significantly improving both image quality and text alignment without requiring additional activations.

Shai Yehezkel, Omer Dahary, Andrey Voynov + 1 more | 2026-03-04 | 🤖 cs.AI
