
Testing the natural abstraction hypothesis: Project intro

John Wentworth

LessWrong, April 6, 2021

Abstract

The natural abstraction hypothesis suggests that most high-level abstract concepts used by humans are “natural”: the physical world contains subsystems for which all the information relevant “far away” can be contained in a (relatively) low-dimensional summary. These subsystems are exactly the high-level “objects” or “categories” or “concepts” we recognize in the world. If true, this hypothesis would dramatically simplify the problem of human-aligned AI. It would imply that a wide range of architectures will reliably learn similar high-level concepts from the physical world, that those high-level concepts are exactly the objects/categories/concepts which humans care about (i.e. inputs to human values), and that we can precisely specify those concepts.

The natural abstraction hypothesis is mainly an empirical claim, which needs to be tested in the real world. My main plan for testing this involves a feedback loop between:

– Calculating abstractions in (reasonably-realistic) simulated systems
– Training cognitive models on those systems
– Empirically identifying patterns in which abstractions are learned by which cognitive models in which environments
– Proving theorems about which abstractions are learned by which cognitive models in which environments

The holy grail of the project would be an “abstraction thermometer”: an algorithm capable of reliably identifying the abstractions in an environment and representing them in a standard format. In other words, a tool for measuring abstractions. This tool could then be used to measure abstractions in the real world, in order to test the natural abstraction hypothesis.
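To make the core claim concrete, here is a minimal toy sketch (my own illustration, not code from the project): in a linear-Gaussian Markov chain, everything a "far away" variable can know about a local cluster of variables is contained in a one-dimensional summary, namely the cluster's boundary variable. The check below empirically verifies that once the summary is regressed out, no correlation with the rest of the cluster remains.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy "physical system": a linear-Gaussian Markov chain X_0 -> X_1 -> ... -> X_10.
# The far-away variable X_10 depends on the local cluster (X_0, X_1, X_2)
# only through the one-dimensional summary X_2.
x = rng.normal(size=n)
chain = [x]
for _ in range(10):
    x = 0.8 * x + rng.normal(size=n)
    chain.append(x)
x0, x2, x10 = chain[0], chain[2], chain[10]

def residual(target, summary):
    """Regress the summary out of the target (least-squares)."""
    beta = np.cov(target, summary)[0, 1] / np.var(summary)
    return target - beta * summary

# X_0 and X_10 are correlated on their own...
raw_corr = np.corrcoef(x0, x10)[0, 1]
# ...but conditional on the summary X_2, the correlation vanishes.
partial_corr = np.corrcoef(x0, residual(x10, x2))[0, 1]

print(f"corr(X_0, X_10)             = {raw_corr:.4f}")
print(f"corr(X_0, X_10 | summary)   = {partial_corr:.4f}")
```

The "abstraction thermometer" would have to discover such summaries automatically in much messier systems, rather than having the summary (here, X_2) handed to it by construction.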
