
Hazards for formal specifications

Paul Christiano

Ordinary Ideas, December 15, 2011

Abstract

Humans are often the subject of attempts at mathematical specification; their overall value systems and behavior have received particular attention. Formal approaches to this problem often rely, explicitly or implicitly, on measures of simplicity based on Kolmogorov complexity: the minimal length of a Turing machine program that outputs a given result. However, this reliance raises a number of problems. If a deterministic universe has a succinct description, then the best specification of human behavior will essentially consist of that description, and we will have failed to learn anything about the human alone. Similarly, if the agent being specified is simulated by another agent with far greater resources, then the simulating agent may well be able to determine what the first agent will say by simulating the experiment it is performing, in a way the first agent cannot predict. This problem is compounded if the simulating agent has the power to alter the first agent’s environment: it can then control the experiment’s results without the first agent even realizing it. Finally, if the agent being simulated is a temporally deep architecture (TDT) agent, then it may be able to determine that other agents are trying to understand its behavior, and use this knowledge to intervene and shape its output. While some modifications to our complexity measures, or the use of more specific models of computation, may solve one or another of these problems, the problems do not appear to be readily solvable in general – AI-generated abstract.
