A longlist of theories of impact for interpretability

Neel Nanda

AI Alignment Forum, March 11, 2022

Abstract

This article proposes a long list of reasons why interpretability research matters for reducing existential risk (x-risk) from misaligned artificial general intelligence. It lists 19 theories of impact with supporting points and elaborates on the most interesting disagreement in their evaluation. Theories 1-6 focus on interpretability’s role in identifying misalignment, enabling coordination, auditing, and providing empirical evidence for or against threat models. Theories 7-16 explore the role of interpretability in improving alignment: improving human feedback, incorporating interpretability tools into training loops, regulating aligned AI systems, and inspiring a cultural shift towards better understanding of models. Theories 17-19 relate to predicting and intervening in training to prevent deceptive alignment. The author acknowledges that there is no consensus on which theory is most impactful but emphasizes the importance of interpretability research in reducing x-risk. – AI-generated abstract.
