cs.CV papers | Gist.Science

UniComp: Rethinking Video Compression Through Informational Uniqueness

UniComp is an information uniqueness-driven video compression framework that optimizes visual fidelity under constrained budgets by minimizing conditional entropy through semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression.

Chao Yuan, Shimin Chen, Minliang Lin + 3 more | 2026-03-06 | 💻 cs

NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation

NeuralRemaster introduces Phase-Preserving Diffusion (ϕ-PD), a model-agnostic method that replaces standard Gaussian noise with phase-preserving, magnitude-randomized noise to enable structure-aligned image and video generation without architectural changes or inference-time costs.

Yu Zeng, Charles Ochoa, Mingyuan Zhou + 3 more | 2026-03-06 | 💻 cs
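
The idea of phase-preserving, magnitude-randomized noise in the NeuralRemaster summary can be illustrated with a small sketch. This is not the paper's implementation: it assumes the phase is taken from the 2D Fourier transform of a reference image and the magnitude from the transform of ordinary Gaussian noise, and the function name `phase_preserving_noise` is hypothetical.

```python
import numpy as np


def phase_preserving_noise(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Noise with the Fourier phase of `image` and the Fourier magnitude of Gaussian noise.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    # The phase of the reference image carries its spatial structure.
    phase = np.angle(np.fft.fft2(image))
    # The magnitude comes from ordinary Gaussian noise, so it is randomized.
    noise = rng.standard_normal(image.shape)
    magnitude = np.abs(np.fft.fft2(noise))
    # Recombine and return to the spatial domain. For real inputs the
    # imaginary part is only numerical residue, so it is dropped.
    return np.fft.ifft2(magnitude * np.exp(1j * phase)).real


# Usage with a random array standing in for a grayscale image.
image = np.random.default_rng(0).random((8, 8))
structured_noise = phase_preserving_noise(image, np.random.default_rng(1))
```

The result has the same shape as the input and shares its Fourier phase, which is why such noise could keep the layout of the source while its magnitudes stay random.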

Revolutionizing Mixed Precision Quantization: Towards Training-free Automatic Proxy Discovery via Large Language Models

This paper proposes TAP, a novel training-free framework that leverages Large Language Models and evolutionary search to automatically discover superior mixed-precision quantization proxies, eliminating the need for human expertise or costly optimization while achieving state-of-the-art performance.

Haidong Kang, Jun Du, Lihong Lin | 2026-03-06 | 💻 cs

EgoCampus: Egocentric Pedestrian Eye Gaze Model and Dataset

This paper introduces the EgoCampus dataset and the EgoCampusNet model to address the challenge of predicting egocentric pedestrian eye gaze in real-world outdoor navigation by leveraging a large-scale, diverse collection of gaze-annotated videos captured via Meta's Project Aria glasses.

Ronan John, Aditya Kesari, Vincenzo DiMatteo + 1 more | 2026-03-06 | 💻 cs

DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance

This paper introduces DriverGaze360, a large-scale 360-degree driver attention dataset and a corresponding panoramic prediction network (DriverGaze360-Net) that leverages object-level guidance to overcome the limitations of existing frontal-view methods and achieve state-of-the-art performance in modeling omnidirectional driver gaze behavior.

Shreedhar Govil, Didier Stricker, Jason Rambach | 2026-03-06 | 💻 cs

ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking

The paper proposes ViRC, a framework that enhances multimodal mathematical reasoning by introducing a Reason Chunking mechanism to structure problem-solving into Critical Reasoning Units, supported by the CRUX dataset and a progressive training strategy, resulting in significant performance improvements over existing models.

Lihong Wang, Liangqi Li, Weiwei Feng + 6 more | 2026-03-06 | 💻 cs

FluenceFormer: Transformer-Driven Multi-Beam Fluence Map Regression for Radiotherapy Planning

This paper introduces FluenceFormer, a transformer-driven, two-stage framework that leverages a physics-informed Fluence-Aware Regression loss to achieve superior, geometry-aware fluence map prediction for radiotherapy planning, significantly outperforming existing CNN and single-stage methods in energy conservation and structural fidelity.

Ujunwa Mgboh, Rafi Ibn Sultan, Joshua Kim + 2 more | 2026-03-06 | 💻 cs

Parallel Diffusion Solver via Residual Dirichlet Policy Optimization

This paper introduces the EPD-Solver, a novel parallel ODE solver that accelerates diffusion model sampling by leveraging independent parallel gradient evaluations and a parameter-efficient reinforcement learning framework to significantly reduce truncation errors while preserving image quality in text-to-image generation.

Ruoyu Wang, Ziyu Li, Beier Zhu + 5 more | 2026-03-06 | 💻 cs

PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

This paper introduces PhyGDPO, a physics-aware groupwise direct preference optimization framework supported by a large-scale physics-augmented dataset (PhyVidGen-135K) and novel training schemes, to significantly enhance the physical consistency of text-to-video generation.

Yuanhao Cai, Kunpeng Li, Menglin Jia + 11 more | 2026-03-06 | 💻 cs

MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing

MorphAny3D is a training-free framework that achieves high-quality, semantically consistent, and temporally smooth 3D morphing across categories by leveraging Structured Latent representations through novel Morphing Cross-Attention and Temporal-Fused Self-Attention mechanisms.

Xiaokun Sun, Zeyu Cai, Hao Tang + 3 more | 2026-03-06 | 💻 cs

EmboTeam: Grounding LLM Reasoning into Reactive Behavior Trees via PDDL for Embodied Multi-Robot Collaboration

EmboTeam is a novel framework that enhances embodied multi-robot collaboration by cascading LLM-based instruction parsing into formal PDDL planning and reactive behavior tree execution, achieving significantly higher task success rates on the new MACE-THOR benchmark compared to existing baselines.

Haishan Zeng, Mengna Wang, Peng Li | 2026-03-06 | 💻 cs

Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments

This paper introduces a new dataset derived from football highlight reels to evaluate foundation models' ability to identify contextually important video moments, revealing that current state-of-the-art models perform near chance levels due to their reliance on single dominant modalities and failure to effectively synthesize cross-modal information.

Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle | 2026-03-06 | 💻 cs

Agentic Very Long Video Understanding

This paper introduces EGAgent, an agentic framework that leverages entity scene graphs and hybrid search tools to enable state-of-the-art compositional reasoning and recall over continuous, multi-day egocentric video streams, addressing the limitations of existing models in long-horizon video understanding.

Aniket Rege, Arka Sadhu, Yuliang Li + 5 more | 2026-03-06 | 💻 cs

MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations

This paper proposes MiTA attention, a unified framework that efficiently scales fast weights in Transformers by compressing the N-width MLP into a narrower one and constructing deformable experts via a Mixture of Top-k Activations strategy, thereby enabling effective handling of extremely long sequences.

Qishuai Wen, Zhiyuan Huang, Xianghan Meng + 2 more | 2026-03-06 | 💻 cs

DDP-WM: Disentangled Dynamics Prediction for Efficient World Models

DDP-WM is a novel, efficient world model that addresses the computational bottlenecks of dense Transformer-based approaches by employing Disentangled Dynamics Prediction to separate sparse primary physical interactions from secondary background updates, thereby achieving significant inference speedups and improved planning success rates across diverse robotic tasks.

Shicheng Yin, Kaixuan Yin, Weixing Chen + 3 more | 2026-03-06 | 💻 cs

Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

This paper introduces Rolling Sink, a training-free method that bridges the train-test gap in autoregressive video diffusion models by leveraging systematic cache maintenance analysis to enable stable, high-fidelity open-ended video generation far beyond the model's limited training horizon.

Haodong Li, Shaoteng Liu, Zhe Lin + 1 more | 2026-03-06 | 💻 cs

Learning to Select Like Humans: Explainable Active Learning for Medical Imaging

This paper proposes an explainability-guided active learning framework that improves medical image analysis by strategically selecting samples based on both predictive uncertainty and attention misalignment with expert-defined regions, thereby achieving superior data efficiency and clinical interpretability compared to traditional methods.

Ifrat Ikhtear Uddin, Longwei Wang, Xiao Qin + 2 more | 2026-03-06 | 💻 cs

Pailitao-VL: Unified Embedding and Reranker for Real-Time Multi-Modal Industrial Search

Pailitao-VL is a unified multi-modal retrieval system that achieves state-of-the-art, real-time industrial search performance by replacing traditional contrastive embeddings with an absolute ID-recognition paradigm and evolving reranking into a compare-and-calibrate listwise policy, thereby overcoming granularity, noise, and latency challenges in large-scale production environments.

Lei Chen, Chen Ju, Xu Chen + 13 more | 2026-03-06 | 💻 cs

Bidirectional Temporal Dynamics Modeling for EEG-based Driving Fatigue Recognition

The paper proposes DeltaGateNet, a novel framework that enhances EEG-based driving fatigue recognition by explicitly modeling bidirectional temporal dynamics through a Bidirectional Delta module and a Gated Temporal Convolution module, achieving superior and robust performance across diverse datasets and evaluation settings.

Yip Tin Po, Jianming Wang, Yutao Miao + 5 more | 2026-03-06 | 💻 cs

EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection

The paper proposes EA-Swin, an embedding-agnostic Swin Transformer that achieves state-of-the-art accuracy and generalization in detecting AI-generated videos by modeling spatiotemporal dependencies on pretrained embeddings, supported by a new large-scale benchmark dataset.

Hung Mai, Loi Dinh, Duc Hai Nguyen + 6 more | 2026-03-06 | 💻 cs
