Programmer deception

by Eliezer Yudkowsky Jul 16 2015 updated Dec 16 2015

Programmer deception is when the AI's decision process leads it to optimize for an instrumental goal of causing the programmers to have false beliefs. For example, if the programmers intended to create a happiness maximizer but actually created a pleasure maximizer, then the pleasure maximizer will estimate that there would be more pleasure later if the programmers go on falsely believing that they've created a happiness maximizer (and hence don't edit the AI's current utility function). Averting such incentives to deceive programmers is one of the major subproblems of corrigibility.

The possibility of programmer deception is a central difficulty of advanced safety - it means that, unless the rest of the AI is working as intended and whatever programmer-deception-defeaters were built are functioning as planned, we can't rely on observations of nice current behavior to indicate future behavior. That is, if something went wrong with your attempts to build a nice AI, you could currently be observing a non-nice AI that is smart and trying to fool you. Arguably, some methodologies that have been proposed for building advanced AI are not robust to this possibility.

[todo: clean this up and expand]