cs.CV papers | Gist.Science

Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP

This paper introduces Dyslexify, a training-free defense mechanism that selectively ablates specific attention heads in CLIP vision encoders to neutralize typographic attacks, significantly improving robustness against text-based manipulations while preserving standard recognition accuracy.

Lorenz Hufe, Constantin Venhoff, Erblina Purelku + 3 more | 2026-02-27 | cs.AI

Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios

This paper addresses the limitations of current risk-oriented methods in constructing multimodal safety datasets by proposing a novel image-oriented self-adaptive pipeline that automatically generates a 35k real-world safety dataset and introduces a standardized evaluation metric to validate its effectiveness across various tasks.

Jingen Qu, Lijun Li, Bo Zhang + 2 more | 2026-02-27 | cs.CL

Loc$^2$: Interpretable Cross-View Localization via Depth-Lifted Local Feature Matching

This paper proposes Loc$^2$, an interpretable and lightweight cross-view localization method that estimates ground-level camera pose by learning direct ground-aerial feature correspondences, lifting them to bird's-eye-view space via monocular depth, and applying scale-aware Procrustes alignment without requiring pixel-level annotations.

Zimin Xia, Chenghao Xu, Alexandre Alahi | 2026-02-27 | cs

ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting

This paper proposes ST-GS, a novel framework that enhances vision-based 3D semantic occupancy prediction for autonomous driving by introducing a guidance-informed spatial aggregation strategy and a geometry-aware temporal fusion scheme to achieve state-of-the-art performance and superior temporal consistency on the nuScenes benchmark.

Xiaoyang Yan, Muleilan Pei, Shaojie Shen | 2026-02-27 | cs

Visual Instruction Pretraining for Domain-Specific Foundation Models

This paper introduces Visual Instruction Pretraining (ViTP), a novel paradigm that leverages high-level reasoning to enhance low-level perceptual features through end-to-end pretraining of a Vision Transformer within a Vision-Language Model, achieving state-of-the-art performance across diverse remote sensing and medical imaging benchmarks.

Yuxuan Li, Yicheng Zhang, Wenhao Tang + 4 more | 2026-02-27 | cs

PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data

PartSAM is the first promptable 3D part segmentation model trained natively on a large-scale dataset of over five million shape-part pairs, utilizing a triplane-based encoder-decoder architecture to achieve superior open-world generalization and accurate decomposition of both surface and internal structures compared to existing 2D-transfer methods.

Zhe Zhu, Le Wan, Rui Xu + 6 more | 2026-02-27 | cs

Secure and reversible face anonymization with diffusion models

This paper introduces the first diffusion-based framework for secure and reversible face anonymization that utilizes secret-key conditioning to enable high-quality identity protection and authorized reconstruction while preventing unauthorized de-anonymization.

Pol Labarbarie, Vincent Itier, William Puech | 2026-02-27 | cs.LG

Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation

This paper proposes Asynchronous Denoising Diffusion Models, a novel framework that assigns distinct timesteps to individual pixels so that prompt-related regions can draw on clearer contextual information from unrelated regions, thereby significantly improving text-to-image alignment.

Zijing Hu, Yunze Tong, Fengda Zhang + 3 more | 2026-02-27 | cs

Detection and Measurement of Hailstones with Multimodal Large Language Models

This study demonstrates that pre-trained multimodal large language models, particularly when enhanced with two-stage prompting strategies that leverage reference objects, can effectively detect hailstones and measure their diameters from crowdsourced social media images with an average error of 1.12 cm, offering a promising complement to traditional hail sensors for rapid severe weather assessment.

Moritz Alker, David C. Schedl, Andreas Stöckl | 2026-02-27 | cs.AI

Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

The paper proposes FlowRVS, a novel one-stage generative framework that reformulates Referring Video Object Segmentation as a language-guided continuous flow deformation problem, leveraging pretrained text-to-video models to achieve state-of-the-art performance by directly mapping video representations to target masks while overcoming the limitations of traditional cascaded approaches.

Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li + 6 more | 2026-02-27 | cs

G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior

G4Splat is a novel 3D reconstruction method that leverages accurate metric-scale geometry derived from planar structures to guide a generative prior, effectively resolving multi-view inconsistencies and enabling high-quality scene completion in both observed and unobserved regions.

Junfeng Ni, Yixin Chen, Zhifei Yang + 4 more | 2026-02-27 | cs

PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

This paper introduces PoSh, a scene graph-guided LLM-as-a-Judge metric for evaluating detailed image descriptions, and validates it through the new DOCENT benchmark, demonstrating superior correlation with human judgments and robustness across diverse image types compared to existing metrics.

Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford + 7 more | 2026-02-27 | cs.CL

Learning with less: label-efficient land cover classification at very high spatial resolution using self-supervised deep learning

This study demonstrates that self-supervised deep learning, specifically the "Bootstrap Your Own Latent" strategy, enables highly accurate statewide 1-meter land cover classification using only 1,000 annotated patches, effectively overcoming the data scarcity barrier for large-scale, high-resolution mapping.

Dakota Hester, Vitor S. Martins, Lucas B. Ferreira + 1 more | 2026-02-27 | cs

Q$^2$: Quantization-Aware Gradient Balancing and Attention Alignment for Low-Bit Quantization

This paper introduces Q$^2$, a training-only framework that addresses performance degradation in low-bit quantization for complex visual tasks by mitigating gradient imbalance at feature fusion stages through dynamic gradient balancing and attention distribution alignment, thereby significantly improving object detection and image segmentation accuracy without inference-time overhead.

Zhaoyang Wang, Dong Wang | 2026-02-27 | cs.AI

USF-Net: A Unified Spatiotemporal Fusion Network for Ground-Based Remote Sensing Cloud Image Sequence Extrapolation

This paper proposes USF-Net, a unified spatiotemporal fusion network that integrates adaptive large-kernel convolutions and low-complexity attention mechanisms to overcome limitations in existing cloud image extrapolation methods, achieving superior accuracy and efficiency while introducing the new ASI-CIS dataset.

Penghui Niu, Taotao Cai, Suqi Zhang + 4 more | 2026-02-27 | cs

Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering

This paper identifies and addresses the "visual shortcuts" plaguing existing Multimodal Knowledge-Based Visual Question Answering benchmarks by introducing the RETINA dataset, which forces models to reason about related entities, and proposing the MIMIR model that leverages multi-image retrieval to overcome these limitations.

Dosung Lee, Sangwon Jung, Boyoung Kim + 4 more | 2026-02-27 | cs

Diffusion Model in Latent Space for Medical Image Segmentation Task

The paper proposes MedSegLatDiff, an efficient latent-space diffusion framework that combines a VAE with a weighted cross-entropy loss to generate diverse, uncertainty-aware medical image segmentation hypotheses while achieving state-of-the-art performance on multiple clinical datasets.

Huynh Trinh Ngoc, Toan Nguyen Hai, Ba Luong Son + 1 more | 2026-02-27 | cs.AI

ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data

This paper introduces ClimaDrive, a framework for generating physically realistic and weather-diverse synthetic anomaly data, and leverages it to build the ClimaOoD benchmark, which significantly enhances the generalization and robustness of anomaly segmentation models in open-world autonomous driving scenarios.

Yuxing Liu, Zheng Li, Huanhuan Liang + 3 more | 2026-02-27 | cs

VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm

VLM-Pruner is a training-free token pruning algorithm that enhances efficient Vision-Language Model inference by introducing a centrifugal selection paradigm and a Buffering for Spatial Sparsity criterion to balance redundancy reduction with spatial coverage, while selectively fusing discarded token information to maintain performance.

Zhenkai Wu, Xiaowen Ma, Zhenliang Ni + 4 more | 2026-02-27 | cs.LG

Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

The paper introduces TIMAR, a causal turn-level framework that models interleaved audio-visual contexts to generate expressive and temporally coherent 3D conversational head dynamics, significantly outperforming existing methods on the DualTalk benchmark.

Junjie Chen, Fei Wang, Zhihao Huang + 5 more | 2026-02-27 | cs
