cs.CV papers | Gist.Science

Deep Learning-Based Meat Freshness Detection with Segmentation and OOD-Aware Classification

This study presents a deep learning framework for meat freshness detection that combines U-Net-based segmentation with OOD-aware classification, demonstrating that EfficientNet-B0 achieves the highest accuracy (98.10%) on RGB images while supporting practical on-device deployment via TensorFlow Lite.

Hutama Arif Bramantyo, Mukarram Ali Faridi, Rui Chen + 2 more | 2026-03-03 | ⚡ eess

Unsupervised Semantic Segmentation in Synchrotron Computed Tomography with Self-Correcting Pseudo Labels

This paper presents a novel unsupervised framework for segmenting large-scale synchrotron computed tomography datasets that generates initial pseudo labels via voxel clustering and refines them using an Unbiased Teacher approach, thereby eliminating the need for manual annotation while significantly improving segmentation accuracy.

Austin Yunker, Peter Kenesei, Hemant Sharma + 3 more | 2026-03-03 | 💻 cs

DiffSOS: Acoustic Conditional Diffusion Model for Speed-of-Sound Reconstruction in Ultrasound Computed Tomography

DiffSOS is a novel acoustic conditional diffusion model that achieves high-fidelity, near real-time Speed-of-Sound reconstruction in Ultrasound Computed Tomography by leveraging a physics-grounded ControlNet and stochastic sampling to overcome the oversmoothing and computational limitations of existing methods while providing pixel-wise uncertainty estimates.

Yujia Wu, Shuoqi Chen, Shiru Wang + 3 more | 2026-03-03 | 💻 cs

SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning

The paper introduces SSR, a 7B-parameter framework that achieves state-of-the-art spatial intelligence by integrating 2D and 3D representations through lightweight alignment and a novel scene graph generation pipeline, enabling precise geometric reasoning without costly large-scale pre-training.

Yi Zhang, Youya Xia, Yong Wang + 7 more | 2026-03-03 | 💻 cs

PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models

To overcome the scarcity of 3D-text data and the resulting loss of geometric information in existing 3D Vision-Language Models, PointAlign introduces a lightweight feature-level alignment regularization that explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic details, significantly improving performance on classification and captioning tasks.

Yuanhao Su, Shaofeng Zhang, Xiaosong Jia + 1 more | 2026-03-03 | 💻 cs

DiffTrans: Differentiable Geometry-Materials Decomposition for Reconstructing Transparent Objects

This paper presents DiffTrans, a differentiable rendering framework that utilizes FlexiCubes for initial geometry and a recursive CUDA-based ray tracer to jointly optimize geometry, refractive index, and absorption, enabling high-quality reconstruction of transparent objects with diverse topologies and complex textures in intricate scenes.

Changpu Li, Shuang Wu, Songlin Tang + 3 more | 2026-03-03 | 💻 cs

Station2Radar: query conditioned gaussian splatting for precipitation field

The paper proposes Query-Conditioned Gaussian Splatting (QCGS), a novel framework that fuses sparse weather station data with satellite imagery to efficiently generate high-resolution precipitation fields by selectively rendering only rainfall regions, achieving over 50% improvement in RMSE compared to conventional products.

Doyi Kim, Minseok Seo, Changick Kim | 2026-03-03 | 💻 cs

An Interpretable Local Editing Model for Counterfactual Medical Image Generation

This paper introduces InstructX2X, an interpretable local editing model that leverages region-specific editing and a new expert-verified dataset (MIMIC-EDIT-INSTRUCTION) to generate high-quality counterfactual medical images while preventing unintended demographic changes and providing visual explanations for the editing process.

Hyungi Min, Taeseung You, Hangyeul Lee + 2 more | 2026-03-03 | 🤖 cs.AI

LLM-Bootstrapped Targeted Finding Guidance for Factual MLLM-based Medical Report Generation

The paper introduces Fact-Flow, a novel framework that enhances the factual accuracy of MLLM-based medical report generation by decoupling visual fact identification from text generation and utilizing an LLM-bootstrapped pipeline to create labeled training data without manual annotation.

Cunyuan Yang, Dejuan Song, Xiaotao Pang + 7 more | 2026-03-03 | 💬 cs.CL

Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models

This paper proposes Taxonomy-Aware Representation Alignment (TARA), a method that enhances Large Multimodal Models' hierarchical visual recognition capabilities for both known and novel categories by aligning their visual representations with biology foundation models and ground-truth labels to enforce taxonomic consistency.

Hulingxiao He, Zhi Tan, Yuxin Peng | 2026-03-03 | 🤖 cs.AI

TAP-SLF: Parameter-Efficient Adaptation of Vision Foundation Models for Multi-Task Ultrasound Image Analysis

This paper proposes TAP-SLF, a parameter-efficient framework that combines task-aware soft prompts and selective fine-tuning of top encoder layers to effectively adapt Vision Foundation Models for multi-task ultrasound image analysis while minimizing overfitting and computational costs.

Hui Wan, Libin Lan | 2026-03-03 | 🤖 cs.AI

Self-Correction Inside the Model: Leveraging Layer Attention to Mitigate Hallucinations in Large Vision Language Models

This paper introduces ICLA, an internal self-correction mechanism that uses diagonal cross-layer attention to let Large Vision-Language Models refine their own hidden states and mitigate hallucinations without external signals, demonstrating consistent improvements across benchmarks with minimal additional parameters.

April Fu | 2026-03-03 | 💻 cs

Mamba-CAD: State Space Model For 3D Computer-Aided Design Generative Modeling

Mamba-CAD is a self-supervised generative modeling framework that leverages a Mamba-based encoder-decoder architecture and a new large-scale dataset to effectively generate complex, long-sequence parametric CAD models for industrial applications.

Xueyang Li, Yunzhong Lou, Yu Song + 1 more | 2026-03-03 | 🤖 cs.AI

SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment

SesaHand is a novel framework that enhances 3D hand reconstruction by generating diverse, high-quality synthetic hand images through a controllable generation pipeline that ensures semantic alignment via Chain-of-Thought reasoning and structural alignment via hierarchical fusion and attention mechanisms.

Zhuoran Zhao, Xianghao Kong, Linlin Yang + 3 more | 2026-03-03 | 💻 cs

Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution

This paper proposes an improved adversarial diffusion compression method that distills a heavy 3D diffusion Transformer into a lightweight 2D-based model with 1D temporal convolutions and a dual-head adversarial scheme, achieving a 95% reduction in parameters and an 8$\times$ speedup while effectively balancing spatial detail and temporal consistency for real-world video super-resolution.

Bin Chen, Weiqi Li, Shijie Zhao + 4 more | 2026-03-03 | 💻 cs

Explainable Continuous-Time Mask Refinement with Local Self-Similarity Priors for Medical Image Segmentation

The paper introduces LSS-LTCNet, an efficient and explainable framework that combines Local Self-Similarity texture priors with continuous-time neural dynamics to achieve state-of-the-art foot ulcer segmentation and boundary precision on the MICCAI FUSeg dataset.

Rajdeep Chatterjee, Sudip Chakrabarty, Trishaani Acharjee | 2026-03-03 | 💻 cs

ReMoT: Reinforcement Learning with Motion Contrast Triplets

This paper introduces ReMoT, a unified training paradigm that combines a rule-based framework for generating a large-scale motion-contrast dataset with Group Relative Policy Optimization to significantly enhance VLMs' spatio-temporal consistency and reasoning capabilities, achieving state-of-the-art performance on both new and standard benchmarks.

Cong Wan, Zeyu Guo, Jiangyang Li + 5 more | 2026-03-03 | 💻 cs

OPGAgent: An Agent for Auditable Dental Panoramic X-ray Interpretation

This paper introduces OPGAgent, a multi-tool agentic system that enhances the accuracy and auditability of dental panoramic X-ray interpretation by coordinating specialized perception modules through a hierarchical evidence gathering process and a consensus mechanism, while also proposing the OPG-Bench benchmark for comprehensive evaluation beyond standard VQA metrics.

Zhaolin Yu, Litao Yang, Ben Babicka + 7 more | 2026-03-03 | 🤖 cs.AI

DreamWorld: Unified World Modeling in Video Generation

DreamWorld introduces a unified framework that integrates complementary world knowledge into video generation through a Joint World Modeling Paradigm, employing Consistent Constraint Annealing and Multi-Source Inner-Guidance to overcome visual instability and achieve superior temporal, spatial, and semantic consistency compared to existing models.

Boming Tan, Xiangdong Zhang, Ning Liao + 5 more | 2026-03-03 | 💻 cs

High Dynamic Range Imaging Based on an Asymmetric Event-SVE Camera System

This paper presents a hardware-algorithm co-designed HDR imaging system that integrates an asymmetric event-SVE camera with a novel two-stage alignment framework and a cross-modal reconstruction network to achieve superior highlight recovery and edge fidelity in extreme illumination conditions.

Pengju Sun, Banglei Guan, Jing Tao + 4 more | 2026-03-03 | 💻 cs
