works
Richard Ngo AGI safety from first principles: alignment online The article contends that AI safety cannot be assured solely by specifying an appropriate reward function for a value-aligned AI agent. There is the risk that the agent’s motivations may still diverge from what is intended due to inner misalignment. The authors propose that training data has a critical role in shaping the agent’s motivations and cognition, and suggest that greater emphasis should be placed on maximizing long-term consequences rather than short-term reward signals. The article also highlights the difficulty in identifying desirable behavior in complex goals and the limitations of interpretability techniques. – AI-generated abstract.

AGI safety from first principles: alignment

Richard Ngo

AI Alignment Forum, October 1, 2020

Abstract

The article contends that AI safety cannot be assured solely by specifying an appropriate reward function for a value-aligned AI agent. There is the risk that the agent’s motivations may still diverge from what is intended due to inner misalignment. The authors propose that training data has a critical role in shaping the agent’s motivations and cognition, and suggest that greater emphasis should be placed on maximizing long-term consequences rather than short-term reward signals. The article also highlights the difficulty in identifying desirable behavior in complex goals and the limitations of interpretability techniques. – AI-generated abstract.

PDF

First page of PDF