Interpretability will not reliably find deceptive AI

Neel Nanda

Effective Altruism Forum, May 29, 2025

Abstract

Current interpretability research paradigms are unlikely to produce highly reliable methods for evaluating or monitoring the safety of superintelligent systems. Interpretability remains a valuable tool and should be part of a broader defense-in-depth strategy, but it is not a silver bullet. Both interpretability and black-box methods face fundamental limitations. Interpretability methods are susceptible to error, lack a ground truth for comparison, and face challenges in proving the absence of deception. Black-box methods can be circumvented by sufficiently intelligent systems. Despite these limitations, a pragmatic approach involves developing the best possible portfolio of monitoring and evaluation tools. Interpretability can provide a valuable signal, even if imperfect, and can be used in conjunction with black-box methods to create a more robust system. For example, interpretability can be used to enhance black-box evaluations, such as by manipulating whether a model believes it is being evaluated. It can also be used to debug mysterious behavior and generate hypotheses that can be verified by other means. While high reliability may be unattainable, maximizing the chances of catching misalignment remains a worthwhile objective. – AI-generated abstract.