Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
VIST3A is a framework that couples a pretrained text-to-video generator with a feedforward 3D reconstruction network via model stitching and aligns the combined model with direct reward finetuning, enabling high-quality text-to-3D and text-to-pointmap generation that surpasses existing Gaussian-splat-based approaches.