cs.CV papers | Gist.Science

Incomplete Multi-Label Image Recognition by Co-learning Semantic-Aware Features and Label Recovery

This paper proposes a Co-learning framework (CSL) for incomplete multi-label image recognition that unifies semantic-aware feature learning and label recovery through a collaborative mechanism to simultaneously enhance feature discriminability and infer missing labels, achieving state-of-the-art performance on benchmark datasets.

Zhi-Fen He, Ren-Dong Xie, Bo Li + 2 more | 2026-03-03 | 💻 cs

UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation

UniFlow introduces a unified pixel flow tokenizer that resolves the inherent trade-off between visual understanding and generation by leveraging layer-wise adaptive self-distillation on pretrained encoders and a lightweight patch-wise pixel flow decoder, achieving superior performance across diverse benchmarks without sacrificing fidelity.

Zhengrong Yue, Haiyu Zhang, Xiangyu Zeng + 7 more | 2026-03-03 | 💻 cs

There is No VAE: End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-training

This paper introduces a novel two-stage self-supervised pre-training framework that enables end-to-end pixel-space generative modeling, achieving state-of-the-art performance on ImageNet with significantly improved efficiency and quality compared to both prior pixel-space methods and latent-space VAE-based counterparts.

Jiachen Lei, Keli Liu, Julius Berner + 4 more | 2026-03-03 | 💻 cs

Fly-CL: A Fly-Inspired Framework for Enhancing Efficient Decorrelation and Reduced Training Time in Pre-trained Model-based Continual Representation Learning

Fly-CL is a framework inspired by the fly olfactory circuit that enhances pre-trained model-based continual representation learning by efficiently resolving multicollinearity, reducing training time while matching or exceeding state-of-the-art performance.

Heming Zou, Yunliang Zang, Wutong Xu + 1 more | 2026-03-03 | 🤖 cs.AI

Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos

The paper introduces Mono4DGS-HDR, a novel two-stage Gaussian Splatting framework that reconstructs renderable 4D high dynamic range scenes from unposed monocular videos with alternating exposures, achieving superior quality and speed through pose-free initial learning, joint refinement, and temporal luminance regularization.

Jinfeng Liu, Lingtong Kong, Mi Zhou + 2 more | 2026-03-03 | 💻 cs

LightMem: Lightweight and Efficient Memory-Augmented Generation

LightMem is a lightweight, efficient memory-augmented generation system inspired by the Atkinson-Shiffrin human memory model that utilizes a three-stage process of sensory filtering, topic-aware short-term consolidation, and offline long-term updates to significantly improve QA accuracy while drastically reducing token usage and API calls compared to existing baselines.

Jizhan Fang, Xinle Deng, Haoming Xu + 9 more | 2026-03-03 | 💬 cs.CL

BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models

This paper introduces BioCAP, a biological foundation model that leverages synthetic, instance-specific captions generated by multimodal large language models to enhance species classification and text-image retrieval by capturing rich semantic traits beyond simple labels.

Ziheng Zhang, Xinyue Ma, Arpita Chowdhury + 9 more | 2026-03-03 | 💬 cs.CL

VoMP: Predicting Volumetric Mechanical Property Fields

VoMP is a fast, feed-forward deep learning method that predicts spatially-varying volumetric mechanical properties (Young's modulus, Poisson's ratio, and density) for 3D objects by aggregating multi-view features through a Geometry Transformer and decoding them via a physically plausible material manifold learned from a novel, multi-source annotated dataset.

Rishit Dagli, Donglai Xiang, Vismay Modi + 7 more | 2026-03-03 | 🤖 cs.LG

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Concerto is a minimalist self-supervised learning framework that combines 2D-3D joint embedding with 3D intra-modal self-distillation to learn superior spatial representations, achieving state-of-the-art performance in 3D scene perception and enabling open-world perception through language alignment.

Yujia Zhang, Xiaoyang Wu, Yixing Lao + 4 more | 2026-03-03 | 💻 cs

Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance

This paper introduces ProMoE, a Mixture-of-Experts framework for Diffusion Transformers that overcomes the limitations of existing vision MoE approaches by employing a two-step router with explicit guidance to partition tokens by function and refine assignments via prototypes, thereby achieving state-of-the-art performance on ImageNet.

Yujie Wei, Shiwei Zhang, Hangjie Yuan + 8 more | 2026-03-03 | 💻 cs

Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer

The paper introduces "Brain-IT," a novel framework utilizing a Brain Interaction Transformer to model functional brain-voxel clusters for predicting complementary semantic and structural image features, thereby achieving highly faithful fMRI-to-image reconstructions that surpass state-of-the-art methods while requiring significantly less training data.

Roman Beliy, Amit Zalcher, Jonathan Kogman + 2 more | 2026-03-03 | 🧬 q-bio

See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement

This paper presents a method for generating high-resolution, high-quality talking face videos from a single speech input alone, using a speech-conditioned diffusion model with statistical facial priors, region-enhanced lip synchronization, and a Transformer-based discrete codebook for end-to-end detail refinement.

Jinting Wang, Jun Wang, Hei Victor Cheng + 1 more | 2026-03-03 | ⚡ eess

ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

The paper introduces ThinkMorph, a unified model fine-tuned on high-quality interleaved reasoning traces that treats text and image thoughts as complementary modalities, achieving significant performance gains on vision-centric benchmarks and demonstrating emergent multimodal intelligence such as adaptive reasoning and unseen visual manipulation skills.

Jiawei Gu, Yunzhuo Hao, Huichen Will Wang + 5 more | 2026-03-03 | 💻 cs

Revisiting Data Scaling in Medical Image Segmentation via Topology-Aware Augmentation

This study reveals that medical image segmentation follows a geometry-limited power-law scaling behavior characterized by early performance saturation, which can be improved through topology-aware augmentation that enhances sample efficiency by expanding effective topological coverage without altering the fundamental scaling law.

Yuetan Chu, Zhongyi Han, Gongning Luo + 1 more | 2026-03-03 | 💻 cs

VeCoR -- Velocity Contrastive Regularization for Flow Matching

This paper proposes VeCoR, a velocity contrastive regularization method that enhances Flow Matching models by introducing a two-sided attract-repel training scheme to prevent off-manifold errors and significantly improve image quality and stability, particularly in low-step and lightweight configurations.

Zong-Wei Hong, Jing-lun Li, Lin-Ze Li + 2 more | 2026-03-03 | 💻 cs

UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers

This paper introduces UltraViCo, a training-free method that overcomes video length extrapolation limits in Diffusion Transformers by identifying and suppressing attention dispersion, thereby eliminating both periodic repetition and quality degradation to achieve up to 4x extrapolation with significant performance gains.

Min Zhao, Hongzhou Zhu, Yingze Wang + 6 more | 2026-03-03 | 💻 cs

ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images

The paper proposes ReSAM, a point-supervised self-prompting framework that adapts the Segment Anything Model to remote sensing images through a Refine-Requery-Reinforce loop, achieving superior segmentation performance without requiring dense mask annotations.

M. Naseer Subhani | 2026-03-03 | 💻 cs

InnoGym: Benchmarking the Innovation Potential of AI Agents

This paper introduces InnoGym, the first benchmark and framework designed to evaluate the innovation potential of AI agents by measuring both the performance gains over existing solutions and the novelty of their methodologies across 18 real-world engineering and scientific tasks.

Jintian Zhang, Kewei Xu, Jingsheng Zheng + 10 more | 2026-03-03 | 💬 cs.CL

AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

AdaptVision is an efficient Vision-Language Model paradigm that mimics human active vision by using a reinforcement learning framework with Decoupled Turn Policy Optimization to autonomously determine and acquire the minimum necessary visual tokens via a coarse-to-fine process, thereby achieving superior performance with significantly reduced computational overhead compared to existing methods.

Zichuan Lin, Yicheng Liu, Yang Yang + 2 more | 2026-03-03 | 💬 cs.CL

Fourier-Attentive Representation Learning: A Fourier-Guided Framework for Few-Shot Generalization in Vision-Language Models

This paper proposes Fourier-Attentive Representation Learning (FARL), a novel framework that enhances few-shot generalization in Vision-Language Models by explicitly disentangling image structure and style via Fourier analysis and a dual cross-attention mechanism to guide robust vision-language alignment.

Hieu Dinh Trung Pham, Huy Minh Nhat Nguyen, Cuong Tuan Nguyen | 2026-03-03 | 💻 cs
