Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach
This paper addresses the limitations of existing visual emotion evaluation methods for Multimodal Large Language Models (MLLMs) by proposing an open-vocabulary, automated Emotion Statement Judgment framework. Evaluation with this framework shows that current models are strong at context-based emotion interpretation but lag substantially behind humans in understanding subjective perception.