Causal Interpretation of Sparse Autoencoder Features in Vision

Abstract

Understanding what sparse autoencoder (SAE) features in vision transformers truly represent is usually done by inspecting the patches where a feature's activation is highest. However, self-attention mixes information across the entire image, so an activated patch often co-occurs with the feature's firing without causing it. We propose Causal Feature Explanation (CaFE), which leverages the Effective Receptive Field (ERF): we treat each SAE feature's activation as a target and apply input-attribution methods (AttnLRP, Integrated Gradients) to identify the image patches that causally drive that activation. Across CLIP-ViT features, ERF maps frequently diverge from naïve activation maps, revealing hidden context dependencies (e.g., a "roaring face" feature that requires eyes and a nose, not just an open mouth). Patch-insertion tests confirm that CaFE recovers or suppresses feature activations far more effectively than activation-ranked patches. Our results show that CaFE yields more faithful, semantically precise explanations of vision-SAE features, and highlight the risk of misinterpretation when relying solely on activation location.
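To make the attribution step concrete, the sketch below shows one way to compute an Integrated Gradients attribution map for a single SAE feature, treating the feature's activation (summed over patch tokens) as the scalar target. This is an illustrative sketch, not the paper's implementation: the interfaces `model(x, output_hidden_states=True)`, `sae.encode(hidden)`, and the arguments `feature_idx`, `layer_idx`, and `steps` are assumed placeholders to be adapted to the actual CLIP-ViT and SAE wrappers.

```python
# Hypothetical sketch: Integrated Gradients attribution for one SAE feature.
# `model` and `sae` are assumed handles; adapt to the real CLIP-ViT / SAE code.
import torch

def sae_feature_attribution(model, sae, image, feature_idx, layer_idx, steps=32):
    """Attribute the activation of SAE feature `feature_idx` to input pixels."""
    baseline = torch.zeros_like(image)          # black-image baseline (assumption)
    total_grad = torch.zeros_like(image)

    for alpha in torch.linspace(0.0, 1.0, steps):
        # Point on the straight-line path from the baseline to the image.
        x = (baseline + alpha * (image - baseline)).requires_grad_(True)
        hidden = model(x, output_hidden_states=True).hidden_states[layer_idx]
        # Scalar target: the feature's activation summed over all patch tokens.
        target = sae.encode(hidden)[..., feature_idx].sum()
        grad, = torch.autograd.grad(target, x)
        total_grad += grad

    # IG: path-averaged gradient scaled by the input-baseline difference.
    return (image - baseline) * total_grad / steps
```

The resulting per-pixel map can be pooled over each ViT patch to obtain a patch-level ERF map, which is then compared against the naïve activation map and evaluated with the patch-insertion tests described above.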

Publication
Non-Proceedings Track