Ajeya Cotra on accidentally teaching AI models to deceive us
80,000 Hours, May 12, 2023
Abstract
Managing a large language model (LLM) is like raising a trillion-parameter “child”: despite vast knowledge and generative capability, the model lacks real-world grounding and common sense, which can lead to harmful or nonsensical outputs. Deploying such a model without safeguards, analogous to an orphaned child suddenly inheriting a massive company, risks unpredictable consequences. Effective LLM management therefore requires careful “parenting”: robust safety protocols, continuous monitoring and feedback loops, and ongoing training to refine the model’s behavior and align its outputs with human values. This means developing methods for value alignment, mitigating biases, and ensuring responsible deployment, so that the LLM neither causes unintended harm nor is exploited. Just as a child needs guidance, LLMs require ongoing development and oversight to reach their full potential while minimizing risk.
