cs.CV papers | Gist.Science

SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning

The paper introduces SSR, a 7B-parameter framework that achieves state-of-the-art spatial intelligence by integrating 2D and 3D representations through lightweight alignment and a novel scene graph generation pipeline, enabling precise geometric reasoning without costly large-scale pre-training.

Yi Zhang, Youya Xia, Yong Wang + 7 more | 2026-03-03 | 💻 cs

PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models

To overcome the scarcity of 3D-text data and the resulting loss of geometric information in existing 3D Vision-Language Models, PointAlign introduces a lightweight feature-level alignment regularization that explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic details, significantly improving performance on classification and captioning tasks.

Yuanhao Su, Shaofeng Zhang, Xiaosong Jia + 1 more | 2026-03-03 | 💻 cs

DiffTrans: Differentiable Geometry-Materials Decomposition for Reconstructing Transparent Objects

This paper presents DiffTrans, a differentiable rendering framework that utilizes FlexiCubes for initial geometry and a recursive CUDA-based ray tracer to jointly optimize geometry, refractive index, and absorption, enabling high-quality reconstruction of transparent objects with diverse topologies and complex textures in intricate scenes.

Changpu Li, Shuang Wu, Songlin Tang + 3 more | 2026-03-03 | 💻 cs

Station2Radar: query conditioned gaussian splatting for precipitation field

The paper proposes Query-Conditioned Gaussian Splatting (QCGS), a novel framework that fuses sparse weather station data with satellite imagery to efficiently generate high-resolution precipitation fields by selectively rendering only rainfall regions, achieving over 50% improvement in RMSE compared to conventional products.

Doyi Kim, Minseok Seo, Changick Kim | 2026-03-03 | 💻 cs
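
To make the "over 50% improvement in RMSE" figure concrete, here is a minimal sketch of how a relative RMSE improvement is commonly computed. It assumes the improvement is measured as the fractional drop in RMSE against a baseline product; the summary does not state the exact definition, and the function names and rainfall values are hypothetical.

```python
import math

def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences."""
    squared_errors = [(p - t) ** 2 for p, t in zip(pred, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def relative_improvement(baseline_rmse, model_rmse):
    """Fractional RMSE reduction relative to the baseline (0.5 means 50%)."""
    return (baseline_rmse - model_rmse) / baseline_rmse

# Made-up rainfall values in mm/h, for illustration only.
gauge = [0.0, 2.0, 5.0, 1.0]      # reference observations
baseline = [1.0, 4.0, 2.0, 3.0]   # hypothetical conventional product
model = [0.5, 3.0, 4.0, 2.0]      # hypothetical model output

gain = relative_improvement(rmse(baseline, gauge), rmse(model, gauge))
print(f"relative RMSE improvement: {gain:.1%}")
```

Under this definition, any value above 0.5 means the model's RMSE is less than half of the baseline's.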

An Interpretable Local Editing Model for Counterfactual Medical Image Generation

This paper introduces InstructX2X, an interpretable local editing model that leverages region-specific editing and a new expert-verified dataset (MIMIC-EDIT-INSTRUCTION) to generate high-quality counterfactual medical images while preventing unintended demographic changes and providing visual explanations for the editing process.

Hyungi Min, Taeseung You, Hangyeul Lee + 2 more | 2026-03-03 | 🤖 cs.AI

LLM-Bootstrapped Targeted Finding Guidance for Factual MLLM-based Medical Report Generation

The paper introduces Fact-Flow, a novel framework that enhances the factual accuracy of MLLM-based medical report generation by decoupling visual fact identification from text generation and utilizing an LLM-bootstrapped pipeline to create labeled training data without manual annotation.

Cunyuan Yang, Dejuan Song, Xiaotao Pang + 7 more | 2026-03-03 | 💬 cs.CL

Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models

This paper proposes Taxonomy-Aware Representation Alignment (TARA), a method that enhances Large Multimodal Models' hierarchical visual recognition capabilities for both known and novel categories by aligning their visual representations with biology foundation models and ground-truth labels to enforce taxonomic consistency.

Hulingxiao He, Zhi Tan, Yuxin Peng | 2026-03-03 | 🤖 cs.AI

TAP-SLF: Parameter-Efficient Adaptation of Vision Foundation Models for Multi-Task Ultrasound Image Analysis

This paper proposes TAP-SLF, a parameter-efficient framework that combines task-aware soft prompts and selective fine-tuning of top encoder layers to effectively adapt Vision Foundation Models for multi-task ultrasound image analysis while minimizing overfitting and computational costs.

Hui Wan, Libin Lan | 2026-03-03 | 🤖 cs.AI

Self-Correction Inside the Model: Leveraging Layer Attention to Mitigate Hallucinations in Large Vision Language Models

This paper introduces ICLA, an internal self-correction mechanism that uses diagonal cross-layer attention to let Large Vision-Language Models refine their own hidden states and mitigate hallucinations without external signals, demonstrating consistent improvements across benchmarks with minimal additional parameters.

April Fu | 2026-03-03 | 💻 cs

Mamba-CAD: State Space Model For 3D Computer-Aided Design Generative Modeling

Mamba-CAD is a self-supervised generative modeling framework that leverages a Mamba-based encoder-decoder architecture and a new large-scale dataset to effectively generate complex, long-sequence parametric CAD models for industrial applications.

Xueyang Li, Yunzhong Lou, Yu Song + 1 more | 2026-03-03 | 🤖 cs.AI

SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment

SesaHand is a novel framework that enhances 3D hand reconstruction by generating diverse, high-quality synthetic hand images through a controllable generation pipeline that ensures semantic alignment via Chain-of-Thought reasoning and structural alignment via hierarchical fusion and attention mechanisms.

Zhuoran Zhao, Xianghao Kong, Linlin Yang + 3 more | 2026-03-03 | 💻 cs

Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution

This paper proposes an improved adversarial diffusion compression method that distills a heavy 3D diffusion Transformer into a lightweight 2D-based model with 1D temporal convolutions and a dual-head adversarial scheme, achieving a 95% reduction in parameters and an 8× speedup while effectively balancing spatial detail and temporal consistency for real-world video super-resolution.

Bin Chen, Weiqi Li, Shijie Zhao + 4 more | 2026-03-03 | 💻 cs

Explainable Continuous-Time Mask Refinement with Local Self-Similarity Priors for Medical Image Segmentation

The paper introduces LSS-LTCNet, an efficient and explainable framework that combines Local Self-Similarity texture priors with continuous-time neural dynamics to achieve state-of-the-art foot ulcer segmentation and boundary precision on the MICCAI FUSeg dataset.

Rajdeep Chatterjee, Sudip Chakrabarty, Trishaani Acharjee | 2026-03-03 | 💻 cs

ReMoT: Reinforcement Learning with Motion Contrast Triplets

This paper introduces ReMoT, a unified training paradigm that combines a rule-based framework for generating a large-scale motion-contrast dataset with Group Relative Policy Optimization to significantly enhance VLMs' spatio-temporal consistency and reasoning capabilities, achieving state-of-the-art performance on both new and standard benchmarks.

Cong Wan, Zeyu Guo, Jiangyang Li + 5 more | 2026-03-03 | 💻 cs
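
For readers unfamiliar with Group Relative Policy Optimization, the core idea is that each sampled response is scored against the other responses drawn for the same prompt, so no learned value function is needed. The sketch below shows only that group-normalization step, assuming population standard deviation and a small `eps` for numerical stability. The reward values and function name are hypothetical and not taken from ReMoT.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rule-based rewards for four answers sampled from one prompt
# (1.0 = picked the correct motion, 0.0 = picked the contrast motion).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group's average get a positive advantage and are reinforced; the rest get a negative one. If every response in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.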

OPGAgent: An Agent for Auditable Dental Panoramic X-ray Interpretation

This paper introduces OPGAgent, a multi-tool agentic system that enhances the accuracy and auditability of dental panoramic X-ray interpretation by coordinating specialized perception modules through a hierarchical evidence gathering process and a consensus mechanism, while also proposing the OPG-Bench benchmark for comprehensive evaluation beyond standard VQA metrics.

Zhaolin Yu, Litao Yang, Ben Babicka + 7 more | 2026-03-03 | 🤖 cs.AI

DreamWorld: Unified World Modeling in Video Generation

DreamWorld introduces a unified framework that integrates complementary world knowledge into video generation through a Joint World Modeling Paradigm, employing Consistent Constraint Annealing and Multi-Source Inner-Guidance to overcome visual instability and achieve superior temporal, spatial, and semantic consistency compared to existing models.

Boming Tan, Xiangdong Zhang, Ning Liao + 5 more | 2026-03-03 | 💻 cs

High Dynamic Range Imaging Based on an Asymmetric Event-SVE Camera System

This paper presents a hardware-algorithm co-designed HDR imaging system that integrates an asymmetric event-SVE camera with a novel two-stage alignment framework and a cross-modal reconstruction network to achieve superior highlight recovery and edge fidelity in extreme illumination conditions.

Pengju Sun, Banglei Guan, Jing Tao + 4 more | 2026-03-03 | 💻 cs

Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols

This paper introduces FEWTRANS, a comprehensive benchmark and the Hyperparameter Ensemble (HPE) evaluation protocol to rigorously assess few-shot transfer learning, revealing that pre-trained model selection and full-parameter fine-tuning often outperform sophisticated adaptation methods due to their ability to make distributed micro-adjustments without overfitting.

Xu Luo, Ji Zhang, Lianli Gao + 2 more | 2026-03-03 | 🤖 cs.LG

U-VLM: Hierarchical Vision Language Modeling for Report Generation

The paper introduces U-VLM, a hierarchical vision-language model that combines progressive multi-stage training with multi-layer visual feature injection to achieve state-of-the-art radiology report generation on 3D medical imaging, demonstrating that specialized encoder pretraining can outperform massive language models.

Pengcheng Shi, Minghui Zhang, Kehan Song + 3 more | 2026-03-03 | 💻 cs

Analyzing Physical Adversarial Example Threats to Machine Learning in Election Systems

This paper presents a probabilistic framework to quantify the number of adversarial ballots required to flip a U.S. election and empirically demonstrates through 144,000 physical print-and-scan experiments that the most effective adversarial attacks in the physical voting domain differ significantly from those in the digital domain.

Khaleque Md Aashiq Kamal, Surya Eada, Aayushi Verma + 4 more | 2026-03-03 | 🤖 cs.LG
