"The obvious patch is for a ..."

https://arbital.com/p/7d

by Paul Christiano Jun 18 2015


The obvious patch is for a sufficiently sophisticated system to have preferences over its own behavior, which motivate it to avoid reasoning in ways that we would dislike.

For example, suppose that my utility function U is "how good [idealized Eliezer] thinks things are, after thinking for a thousand years." It doesn't take long to realize that [idealized Eliezer] would be unhappy with a literal simulation of [idealized Eliezer]. Moreover, a primitive understanding of Eliezer's views suffices to avoid the worst offenses (or at least to realize that they are the kinds of things which Eliezer would prefer that a human be asked about first).
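
To make the shape of this patch concrete, here is a minimal toy sketch of the control flow being described: an agent screens its own planned reasoning against a crude model of what the overseer would disapprove of, dropping the clear offenses and asking a human about the borderline cases. Nothing here comes from the post itself; all step names and both lists are invented for illustration.

```python
# Toy sketch (all names hypothetical): an agent with preferences over its own
# reasoning, which vetoes or escalates steps a crude overseer model would
# disapprove of. A real system would need a far richer model than these
# string checks; this only illustrates the control flow.

# Crude stand-in for "a primitive understanding of Eliezer's views":
# reasoning steps the overseer model is confident are off-limits.
FORBIDDEN = {"simulate_overseer_in_detail", "model_suffering_minds"}

# Steps the overseer model is unsure about, so a human should be asked first.
ASK_FIRST = {"predict_overseer_judgment"}

def ask_human(step: str) -> bool:
    # Placeholder for a real escalation channel to a human overseer.
    print(f"requesting approval for reasoning step: {step}")
    return False  # default to caution when no answer is available

def screen_reasoning(plan: list[str]) -> list[str]:
    """Return the steps of `plan` the agent may execute on its own.

    Forbidden steps are dropped outright; uncertain steps are escalated
    to a human rather than silently run.
    """
    approved = []
    for step in plan:
        if step in FORBIDDEN:
            continue  # the agent prefers not to reason this way at all
        if step in ASK_FIRST and not ask_human(step):
            continue
        approved.append(step)
    return approved

if __name__ == "__main__":
    plan = ["gather_data", "predict_overseer_judgment", "simulate_overseer_in_detail"]
    print(screen_reasoning(plan))  # -> ['gather_data']
```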

An AI that is able to crush humans in the real world without being able to do this kind of reasoning seems catastrophic on other grounds. An AI that is able but unmotivated to carry out or act on this kind of reasoning seems even more catastrophic for other reasons. (For example, I don't see any realistic approach to corrigibility that wouldn't solve this problem as well, and conversely I see many ways to resolve both.)

Edit: Intended as a response to the original post, but there's no way to delete and repost as far as I can tell.