
A transparency and interpretability tech tree

Evan Hubinger

AI Alignment Forum, June 17, 2022

Abstract

Several transparency and interpretability tools for AI alignment and safety have been developed, but a clear vision is needed of how these tools might lead to solving the alignment problem for AI models. The author proposes a tech tree of eight levels of transparency and interpretability in AI, ranging from “best-case inspection transparency” to “worst-case robust-to-training transparency for deceptive models.” The tree draws important distinctions between transparency via inspection vs. via the training process, worst-case vs. best-case transparency, and deceptive vs. non-deceptive alignment. The author argues that worst-case guarantees, especially those involving deception, are harder to achieve than best-case ones. Each level builds on the one preceding it, though the dependencies are not strictly binary, and different levels of transparency are not mutually exclusive. – AI-generated abstract.
