Interpretable Debiasing of Vision-Language Models for Social Fairness
This paper introduces DeBiasLens, an interpretable, model-agnostic framework that uses sparse autoencoders to identify and selectively deactivate social-attribute neurons in vision-language models, mitigating social biases without compromising semantic knowledge.
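The core mechanism described above can be illustrated with a minimal sketch: a sparse autoencoder (SAE) maps model activations into an overcomplete feature space, the features flagged as encoding a social attribute are zeroed, and the remaining features are decoded back into an activation vector. This is not the paper's implementation; the weights, dimensions, and the `debias` helper below are illustrative assumptions, with toy random weights standing in for an SAE actually trained on the model's activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy dimensions; real SAEs are far larger

# Toy SAE parameters (in practice, trained to reconstruct activations).
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU gives a sparse, non-negative feature code.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    return f @ W_dec + b_dec

def debias(x, attribute_features):
    """Reconstruct the activation with the identified social-attribute
    SAE features zeroed out, leaving all other features untouched."""
    f = encode(x)
    f[..., attribute_features] = 0.0
    return decode(f)

x = rng.normal(0, 1.0, d_model)              # one model activation vector
x_clean = debias(x, attribute_features=[3, 17, 42])  # hypothetical feature ids
```

In a real pipeline, `x_clean` would replace `x` in the model's forward pass at the layer the SAE was trained on, so that downstream computation proceeds without the ablated attribute features.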