cs.CV Papers | Gist.Science

Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP

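This paper proposes Dyslexify, a training-free defense that selectively ablates the specific attention heads in the CLIP vision encoder that carry typographic-attack information. It shows that this substantially improves robustness to typographic attacks without fine-tuning while largely preserving standard performance.

Lorenz Hufe, Constantin Venhoff, Erblina Purelku + 3 more | 2026-02-27 | 🤖 cs.AI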

Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios

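This paper presents a new self-adaptive dataset construction method that overcomes the limitations of existing risk-oriented approaches. It automatically generates a dataset of 35,000 image-text pairs covering real-world multimodal safety scenarios and introduces a standardized metric, evaluated by fine-tuning a safety judge model.

Jingen Qu, Lijun Li, Bo Zhang + 2 more | 2026-02-27 | 💬 cs.CL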

Loc$^2$: Interpretable Cross-View Localization via Depth-Lifted Local Feature Matching

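This paper proposes a new method that learns local feature matching between ground and aerial images through weak supervision, then combines monocular depth estimation with Procrustes alignment to estimate the 3-degree-of-freedom camera pose while providing interpretable localization.

Zimin Xia, Chenghao Xu, Alexandre Alahi | 2026-02-27 | 💻 cs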

ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting

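To overcome the insufficient spatial interaction and limited temporal consistency of existing Gaussian-based methods, this paper proposes the ST-GS framework, which introduces a spatial aggregation strategy based on a dual-mode attention mechanism and a geometry-aware temporal fusion scheme. It shows that the framework markedly improves 3D occupancy prediction performance and temporal consistency for autonomous driving.

Xiaoyang Yan, Muleilan Pei, Shaojie Shen | 2026-02-27 | 💻 cs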

Visual Instruction Pretraining for Domain-Specific Foundation Models

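To investigate how high-level reasoning influences the learning of low-level perceptual features, this paper proposes Visual Instruction Pretraining (ViTP), which uses domain-specific visual instruction data, and shows that it achieves new state-of-the-art results in remote sensing and medical imaging.

Yuxuan Li, Yicheng Zhang, Wenhao Tang + 4 more | 2026-02-27 | 💻 cs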

PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data

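This paper proposes PartSAM, the first promptable 3D part segmentation model trained on large-scale 3D data with a model-based annotation pipeline rather than through indirect transfer from 2D models. It demonstrates precise decomposition of surface and interior structures that substantially outperforms existing methods.

Zhe Zhu, Le Wan, Rui Xu + 6 more | 2026-02-27 | 💻 cs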

Secure and reversible face anonymization with diffusion models

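This paper proposes the first secure and reversible face anonymization framework based on diffusion models. It conditions generation on an injected secret key to guarantee both high-quality face anonymization and exact recovery of the original by authorized users.

Pol Labarbarie, Vincent Itier, William Puech | 2026-02-27 | 🤖 cs.LG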

Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation

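To overcome the limitation of conventional synchronous denoising, in which all pixels are denoised simultaneously, this paper proposes Asynchronous Diffusion Models, which assign a different timestep to each pixel so that prompt-related regions can draw on clearer context. This substantially improves text-to-image alignment.

Zijing Hu, Yunze Tong, Fengda Zhang + 3 more | 2026-02-27 | 💻 cs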

Detection and Measurement of Hailstones with Multimodal Large Language Models

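Using social media images of hail events in Austria from 2022 to 2024, this paper presents a method for estimating hailstone size with pretrained multimodal large language models at an average error of 1.12 cm. In particular, it shows that a two-stage prompting strategy using reference objects has the potential to complement existing sensors.

Moritz Alker, David C. Schedl, Andreas Stöckl | 2026-02-27 | 🤖 cs.AI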

Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

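This paper recasts referring video object segmentation (RVOS), the task of segmenting a specific object in a video from a natural-language description, as a generative problem. Its new framework, FlowRVS, overcomes the limitations of existing pipeline approaches and leverages the strengths of pretrained text-to-video generation models by deforming the video's holistic representation directly into a mask under language guidance, achieving state-of-the-art performance on all major benchmarks.

Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li + 6 more | 2026-02-27 | 💻 cs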

G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior

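To secure accurate geometry as a prerequisite for using generative models, this paper proposes G4Splat, which uses depth maps derived from planar structures and applies this geometric guidance throughout a generative pipeline built on video diffusion models. This enables high-quality 3D scene reconstruction that includes unobserved regions.

Junfeng Ni, Yixin Chen, Zhifei Yang + 4 more | 2026-02-27 | 💻 cs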

PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

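This paper proposes PoSh, a new metric for evaluating detailed image descriptions that uses scene graphs as structured rubrics to guide LLMs-as-a-judge, together with DOCENT, a dataset of expert evaluations for artworks. It shows that PoSh correlates more strongly with human judgments than existing evaluation methods, contributing to progress in VLMs and assistive text generation.

Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford + 7 more | 2026-02-27 | 💬 cs.CL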

Learning with less: label-efficient land cover classification at very high spatial resolution using self-supervised deep learning

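This paper presents an effective strategy that uses self-supervised learning to perform 1 m high-resolution land cover classification of the state of Mississippi with only 1,000 annotated samples, reducing the need for large-scale manually annotated data.

Dakota Hester, Vitor S. Martins, Lucas B. Ferreira + 1 more | 2026-02-27 | 💻 cs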

cs.CV

Q$^2$: Quantization-Aware Gradient Balancing and Attention Alignment for Low-Bit Quantization

USF-Net: A Unified Spatiotemporal Fusion Network for Ground-Based Remote Sensing Cloud Image Sequence Extrapolation

Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering

Diffusion Model in Latent Space for Medical Image Segmentation Task

ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data

VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm

Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics
