On how various plans miss the hard bits of the alignment challenge
AI Alignment Forum, July 12, 2022
Abstract
Many current AI alignment approaches propose to route around the key problem of alignment by having humans plus narrow AI services perform pivotal acts. Yet the central alignment difficulty rears its ugly head by default, in a fairly robust way: addressing it requires getting gradient descent (or some other general-purpose optimizer) to optimize for concepts of our choosing. The problem is that an AI system’s conceptual repertoire won’t, in general, extrapolate smoothly beyond its training distribution as capabilities generalize. Consequently, an AI will not reliably pursue our goals outside its training distribution, even if it was well-aligned within it. This alignment challenge seems fundamentally different from, and more difficult than, training an AI to succeed at new capabilities. – AI-generated abstract.
