Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach

This paper addresses the limitations of existing visual emotion evaluation methods for Multimodal Large Language Models (MLLMs) by proposing an open-vocabulary, automated Emotion Statement Judgment framework. The evaluation reveals that current models are strong at context-based interpretation but lag well behind humans in understanding subjective perception.

Daiqing Wu, Dongbao Yang, Sicheng Zhao, et al. (2026-03-03, cs)

CircuitSense: A Hierarchical MLLM Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process

The paper introduces CircuitSense, a hierarchical benchmark of over 8,000 circuit problems that evaluates Multimodal Large Language Models across perception, analysis, and design tasks. It reveals a critical performance gap: models excel at visual recognition but struggle significantly to derive symbolic equations and perform the mathematical reasoning essential for engineering design.

Arman Akbari, Jian Gao, Yifei Zou, et al. (2026-03-03, cs)