
Fundamental Limits of Neural Network Sparsification: Evidence from Catastrophic Interpretability Collapse

Dip Roy, Rajiv Misra, Sanjay Kumar Singh

arXiv:2603.18056v1. Abstract: Extreme neural network sparsification (90% activation reduction) presents a critical challenge for mechanistic interpretability: understanding whether interpretable features survive aggressive compression. This work investigates feature survival under severe capacity constraints in hybrid Variational Autoencoder--Sparse Autoencoder (VAE-SAE) architectures. We introduce an adaptive sparsity scheduling framework that progressively reduces active neurons from 500 to 50 over 50 training epochs, and provide empirical evidence for fundamental limits of the sparsification-interpretability relationship. Testing across two benchmark datasets -- dSprites and Shapes3D -- with both Top-k and L1 sparsification methods, our key finding reveals a pervasive paradox: while global representation quality (measured by Mutual Information Gap) remains stable, local feature interpretability collapses systematically. Under Top-k sparsification, dead neuron rates reach $34.4\pm0.9\%$ on dSprites and $62.7\pm1.3\%$ on Shapes3D at k=50. L1 regularization -- a fundamentally different "soft constraint" paradigm -- produces equal or worse collapse: $41.7\pm4.4\%$ on dSprites and $90.6\pm0.5\%$ on Shapes3D. Extended training for 100 additional epochs fails to recover dead neurons, and the collapse pattern is robust across all tested threshold definitions. Critically, the collapse scales with dataset complexity: Shapes3D (RGB, 6 factors) shows $1.8\times$ more dead neurons than dSprites (grayscale, 5 factors) under Top-k and $2.2\times$ under L1. These findings establish that interpretability collapse under sparsification is intrinsic to the compression process rather than an artifact of any particular algorithm, training duration, or threshold choice.
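To make the mechanism described in the abstract concrete, here is a minimal sketch (not the authors' released code) of a linearly annealed Top-k sparsity schedule that reduces the active-unit budget from 500 to 50 over 50 epochs. The PyTorch framing, function names, and the linear shape of the annealing are assumptions for illustration only.

```python
# Illustrative sketch of an adaptive Top-k sparsity schedule, as described in
# the abstract: active latent units go from 500 to 50 over 50 training epochs.
# The linear annealing and function names are assumptions, not the paper's code.
import torch


def active_units_for_epoch(epoch: int, k_start: int = 500, k_end: int = 50,
                           total_epochs: int = 50) -> int:
    """Linearly anneal the Top-k budget from k_start down to k_end."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return int(round(k_start + frac * (k_end - k_start)))


def topk_sparsify(latents: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest activations per sample and zero out the rest."""
    _, indices = torch.topk(latents, k, dim=-1)
    mask = torch.zeros_like(latents).scatter_(-1, indices, 1.0)
    return latents * mask


# Example: midway through training roughly half the budget remains.
z = torch.randn(8, 500)          # batch of 500-dimensional latent codes
k = active_units_for_epoch(25)   # ~270 active units at epoch 25
z_sparse = topk_sparsify(z, k)
```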

Executive Summary

This study investigates the fundamental limits that neural network sparsification places on the survival of interpretable features under extreme capacity constraints. The authors propose an adaptive sparsity scheduling framework and evaluate it on two benchmark datasets (dSprites and Shapes3D) with both Top-k and L1 sparsification. Their key finding is a paradox: while global representation quality (measured by the Mutual Information Gap) remains stable, local feature interpretability collapses systematically. The collapse is intrinsic to the compression process rather than an artifact of any particular algorithm, training duration, or threshold choice, underscoring the need to understand the relationship between sparsification and interpretability in neural networks.
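Because the paradox is stated in terms of the Mutual Information Gap (MIG), a rough sketch of how MIG is commonly computed may help; the binning scheme and mutual-information estimator below are assumptions and need not match the paper's exact evaluation protocol.

```python
# Hedged sketch of the Mutual Information Gap (MIG) disentanglement metric.
# Latents are discretized so discrete mutual information can be estimated;
# the bin count and estimator are illustrative assumptions.
import numpy as np
from sklearn.metrics import mutual_info_score


def mig(latents: np.ndarray, factors: np.ndarray, n_bins: int = 20) -> float:
    """latents: (N, num_latents) continuous codes; factors: (N, num_factors) discrete labels."""
    # Discretize each latent dimension into equal-width bins.
    binned = np.stack(
        [np.digitize(latents[:, j],
                     np.histogram_bin_edges(latents[:, j], bins=n_bins)[1:-1])
         for j in range(latents.shape[1])], axis=1)

    gaps = []
    for k in range(factors.shape[1]):
        v = factors[:, k]
        # Mutual information between every latent dimension and this factor.
        mi = np.array([mutual_info_score(binned[:, j], v)
                       for j in range(binned.shape[1])])
        top2 = np.sort(mi)[-2:]            # second-highest and highest MI
        h_v = mutual_info_score(v, v)      # entropy of the factor (nats)
        gaps.append((top2[1] - top2[0]) / max(h_v, 1e-12))
    return float(np.mean(gaps))
```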

Key Points

  • The authors propose an adaptive sparsity scheduling framework to investigate feature survival under severe capacity constraints.
  • The study demonstrates a systematic collapse of local feature interpretability under extreme sparsification, even though global representation quality (Mutual Information Gap) remains stable.
  • The collapse is found to be intrinsic to the compression process, rather than an artifact of any particular algorithm, training duration, or threshold choice (a dead-neuron-rate sketch follows this list).
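The dead-neuron rates quoted in the abstract depend on a threshold definition, which the paper reports testing several of. The following hedged sketch shows one plausible way such a rate could be measured; the specific threshold value is chosen purely for illustration and is not taken from the paper.

```python
# Hedged sketch of a dead-neuron-rate measurement: a latent unit counts as
# "dead" if its activation magnitude never exceeds a small threshold across an
# evaluation set. The threshold below is an illustrative assumption.
import torch


def dead_neuron_rate(activations: torch.Tensor, threshold: float = 1e-6) -> float:
    """Fraction of latent units whose activation never exceeds `threshold`.

    activations: (num_samples, num_units) tensor of post-sparsification codes.
    """
    max_per_unit = activations.abs().max(dim=0).values
    dead = (max_per_unit < threshold).float()
    return dead.mean().item()


# Example: a batch in which 3 of 10 units were zeroed for every sample.
acts = torch.randn(100, 10)
acts[:, :3] = 0.0
print(dead_neuron_rate(acts))  # -> 0.3
```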

Merits

Methodological Innovation

The authors introduce an adaptive sparsity scheduling framework, providing a novel approach to investigating the sparsification-interpretability relationship.

Empirical Rigor

The study utilizes two benchmark datasets and multiple sparsification methods to provide robust empirical evidence for the collapse of interpretability under sparsification.

Insights into Neural Network Interpretability

The study sheds light on fundamental limits of neural network sparsification, showing that local feature interpretability can collapse even while global disentanglement metrics remain stable.

Demerits

Limited Generalizability

The study focuses on a specific type of neural network architecture (hybrid VAE-SAE) and two benchmark datasets, which may limit the generalizability of the findings to other architectures and datasets.

Lack of Theoretical Analysis

The study relies on empirical evidence alone; a more thorough theoretical analysis of the sparsification-interpretability relationship would be needed to explain why the collapse occurs.

Expert Commentary

The paper provides robust empirical evidence that local interpretability collapses under extreme sparsification even when global representation quality holds steady, and the adaptive sparsity scheduling framework is a useful methodological contribution for probing that relationship. However, the focus on a single hybrid VAE-SAE architecture and two synthetic benchmark datasets limits how far the conclusions generalize, and the lack of a theoretical account leaves the mechanisms behind the collapse unexplained.

Recommendations

  • Future studies should develop a more comprehensive theoretical analysis of the sparsification-interpretability relationship and test whether the findings generalize to other neural network architectures and datasets.
  • The development of explainable AI techniques that take into account the interpretability of compressed models is crucial for ensuring the transparency and accountability of AI decision-making processes.

Sources

  • arXiv:2603.18056v1 -- Dip Roy, Rajiv Misra, Sanjay Kumar Singh, "Fundamental Limits of Neural Network Sparsification: Evidence from Catastrophic Interpretability Collapse."