cs.CV papers | Gist.Science

Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning

This paper proposes Prompt-Driven Noise Generation (PNG), a novel framework that synthesizes realistic sRGB noise using prompt-driven feature learning to eliminate dependency on camera metadata and enhance generalizability for real-world image denoising.

Jaekyun Ko, Dongjin Kim, Soomin Lee + 2 more | 2026-03-06 | 💻 cs

Interpretable Pre-Release Baseball Pitch Type Anticipation from Broadcast 3D Kinematics

This paper presents a scalable, interpretable framework that achieves 80.4% accuracy in classifying eight professional baseball pitch types using only monocular 3D body kinematics. It finds that upper-body mechanics, particularly wrist position and trunk tilt, are the primary predictors, and it establishes an empirical ceiling for grip-based distinctions.

Jerrin Bright, Michelle Lu, John Zelek | 2026-03-06 | 🤖 cs.AI

Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation

This paper proposes a novel two-stage framework for Computed Tomography Report Generation that leverages structure-wise image-text contrastive learning with learnable visual queries and a dynamic negative queue to effectively capture anatomical correspondences and achieve state-of-the-art performance.

Hong Liu, Dong Wei, Qiong Peng + 4 more | 2026-03-06 | 💻 cs

DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization

This paper proposes DeformTrace, a hybrid architecture combining State Space Models with deformable dynamics and relay tokens to achieve state-of-the-art temporal forgery localization by addressing challenges in boundary ambiguity, sparse forgeries, and long-range modeling.

Xiaodong Zhu, Suting Wang, Yuanming Zheng + 5 more | 2026-03-06 | 🤖 cs.AI

Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation

This paper proposes FedMEPD, a novel federated learning framework that addresses intermodal heterogeneity and the need for personalization in multimodal brain tumor segmentation by employing federated modality-specific encoders, a server-side fusion decoder for global optimization, and partially personalized decoders enhanced by cross-attention mechanisms to handle clients with incomplete imaging modalities.

Hong Liu, Dong Wei, Qian Dai + 3 more | 2026-03-06 | 💻 cs

FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation

FedAFD is a unified multimodal federated learning framework that enhances both client and server performance by employing a bi-level adversarial alignment and granularity-aware fusion for personalized local learning, alongside a similarity-guided ensemble distillation mechanism to effectively handle model heterogeneity and modality discrepancies.

Min Tan, Junchao Ma, Yinfu Feng + 6 more | 2026-03-06 | 🤖 cs.AI

Locality-Attending Vision Transformer

This paper introduces Locality-Attending Vision Transformer (LocAtViT), a simple add-on that enhances vision transformer segmentation performance by modulating self-attention with a learnable Gaussian kernel to prioritize local spatial details, achieving significant gains on benchmarks without compromising classification accuracy or altering the training regime.

Sina Hajimiri, Farzad Beizaee, Fereshteh Shakeri + 3 more | 2026-03-06 | 💻 cs

FC-VFI: Faithful and Consistent Video Frame Interpolation for High-FPS Slow Motion Video Generation

The paper proposes FC-VFI, a novel video frame interpolation method that leverages latent temporal modeling, semantic matching lines, and a temporal difference loss to achieve high-fidelity, motion-consistent 4x and 8x frame rate upscaling from 30 FPS to 120/240 FPS at 2560×1440 resolution, overcoming the fidelity and consistency limitations of existing diffusion-based approaches.

Ganggui Ding, Hao Chen, Xiaogang Xu | 2026-03-06 | 💻 cs

AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM

The paper proposes AdaIAT, an adaptive method that dynamically increases attention to generated text based on layer-wise thresholds and head-specific characteristics, effectively reducing hallucinations in Large Vision-Language Models while preventing repetitive descriptions and preserving linguistic coherence.

Li'an Zhong, Ziqiang He, Jibin Zheng + 3 more | 2026-03-06 | 💻 cs

Beyond the Patch: Exploring Vulnerabilities of Visuomotor Policies via Viewpoint-Consistent 3D Adversarial Object

This paper proposes a viewpoint-consistent 3D adversarial texture optimization method using differentiable rendering, Expectation over Transformation with a Coarse-to-Fine curriculum, and saliency-guided perturbations to effectively expose and exploit vulnerabilities in robot visuomotor policies under dynamic camera viewpoints.

Chanmi Lee, Minsung Yoon, Woojae Kim + 2 more | 2026-03-06 | 💻 cs

Person Detection and Tracking from an Overhead Crane LiDAR

This paper addresses the challenge of person detection and tracking from an overhead crane LiDAR by curating a new annotated dataset, evaluating adapted 3D detectors like VoxelNeXt and SECOND with integrated tracking algorithms, and demonstrating high accuracy and real-time feasibility to bridge the gap between standard driving benchmarks and industrial overhead sensing.

Nilusha Jayawickrama, Henrik Toikka, Risto Ojala | 2026-03-06 | 🤖 cs.LG

Adaptive Prototype-based Interpretable Grading of Prostate Cancer

This paper proposes a novel adaptive prototype-based weakly-supervised framework that enhances the interpretability and reliability of automated prostate cancer grading by mimicking pathologists' workflow through explicit reasoning and dynamic prototype selection, achieving robust performance on benchmark datasets.

Riddhasree Bhattacharyya, Pallabi Dutta, Sushmita Mitra | 2026-03-06 | 💻 cs

TimeWarp: Evaluating Web Agents by Revisiting the Past

The paper introduces TimeWarp, a benchmark that evaluates web agents across evolving UI versions to expose their vulnerability to design changes, and proposes TimeTraj, a plan distillation algorithm that significantly improves agent robustness by training on trajectories collected from multiple web versions.

Md Farhan Ishmam, Kenneth Marino | 2026-03-06 | 🤖 cs.AI

Location-Aware Pretraining for Medical Difference Visual Question Answering

This paper introduces a location-aware pretraining framework utilizing automatic referring expressions and grounded captioning to enhance vision encoders for fine-grained spatial reasoning, achieving state-of-the-art performance in medical difference visual question answering on chest X-rays.

Denis Musinguzi, Caren Han, Prasenjit Mitra | 2026-03-06 | 🤖 cs.AI

VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters

VisionPangu is a compact 1.7B-parameter multimodal model that leverages an InternVL-derived vision encoder, an OpenPangu language backbone, and dense human-authored supervision from the DOCCI dataset to achieve competitive, detailed image captioning without relying on large-scale architectures.

Jiaxin Fan, Wenpo Song | 2026-03-06 | 💬 cs.CL

Revisiting an Old Perspective Projection for Monocular 3D Morphable Models Regression

This paper introduces a novel camera model that extends orthographic projection with a shrinkage parameter to effectively capture perspective distortion in close-up monocular 3D Morphable Model regression, enabling stable and accurate fitting for head-mounted camera footage.

Toby Chong, Ryota Nakajima | 2026-03-06 | 💻 cs

BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement

BiEvLight is a bi-level learning framework that addresses the noise coupling challenge in low-light image enhancement by dynamically optimizing event denoising as a task-aware prior, thereby significantly improving enhancement quality on real-world noisy datasets.

Zishu Yao, Xiang-Xiang Su, Shengning Zhou + 3 more | 2026-03-06 | 💻 cs

3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

This paper introduces 3D-RFT, the first framework to apply Reinforcement Learning with Verifiable Rewards (RLVR) to video-based 3D scene understanding, which outperforms existing supervised fine-tuning methods and larger models by directly optimizing evaluation metrics like 3D IoU and F1-Score through Group Relative Policy Optimization (GRPO).

Xiongkun Linghu, Jiangyong Huang, Baoxiong Jia + 1 more | 2026-03-06 | 🤖 cs.AI

Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding

The paper introduces VideoHV-Agent, a multi-agent framework that improves long video understanding by replacing reactive retrieval with a structured "think-then-verify" process where hypotheses are formulated, clues are derived, and evidence is grounded before generating a final answer, achieving state-of-the-art accuracy with enhanced interpretability and lower computational cost.

Zheng Wang, Haoran Chen, Haoxuan Qin + 3 more | 2026-03-06 | 💻 cs

A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction

The paper introduces Wallaroo, a simple autoregressive model that unifies multi-modal understanding, image generation, and editing through next-token prediction, featuring multi-resolution support and bilingual capabilities while achieving competitive performance via a decoupled visual encoding and four-stage training strategy.

Jie Zhu, Hanghang Ma, Jia Wang + 6 more | 2026-03-06 | 💻 cs
