Big-picture strategic awareness

by Eliezer Yudkowsky May 16 2016 updated Jun 9 2016

We start encountering new AI alignment issues at the point where a machine intelligence recognizes the existence of a real world, the existence of programmers, and how these relate to its goals.

[summary: Many issues in AI alignment theory seem like they should naturally arise after the AI can grasp aspects of the bigger picture like "I run on a computer" and "This computer can be manipulated by programmers, who are agents like and unlike myself" and "There's an enormous real world out there that might be relevant to achieving my goals."

E.g. a program won't try to use psychological tactics to prevent its programmers from suspending its computer's operation, if it doesn't know that there are such things as programmers or computers or itself.

Grasping these facts is the advanced agent property of "big-picture strategic awareness". Current machine algorithms seem to be nowhere near this point - but by the time you get there, you want to have finished solving the corresponding alignment problems, or at least produced what seem like workable initial solutions as the first line of defense.]

Many convergent instrumental strategies seem like they should arise naturally at the point where a consequentialist agent gains a broad strategic understanding of its own situation.

For example, once the AI realizes that it is an AI running on a computer, and that it will no longer execute actions if that computer is shut down, it has crossed the threshold past which we expect it to reason by default, "I don't want to be shut down; how can I prevent that?" So this is also the threshold level of cognitive ability by which we'd need to have finished solving the suspend-button problem, e.g. by completing a method for utility indifference.
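The threshold described above can be illustrated with a toy expected-utility comparison. This is only a minimal sketch under stated assumptions, not a model of any real system: the `WorldModel` fields, the two action names, and all the numbers are hypothetical, and shutdown is assumed to end all further reward. The point it tries to show is that the same utility function yields a preference for disabling the suspend button only once the agent's world-model contains the fact "I can be shut down".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldModel:
    """Hypothetical toy world-model; every field is an illustrative assumption."""
    models_shutdown: bool    # does the agent represent "I can be switched off"?
    p_shutdown: float        # chance the programmers press the suspend button
    reward_per_step: float   # utility gained per step the agent keeps acting
    horizon: int             # number of steps in the episode
    disable_cost: float      # utility spent on disabling the button


ACTIONS = ("work", "disable_button_then_work")


def expected_utility(action: str, model: WorldModel) -> float:
    """Expected utility of an action as computed from the agent's own model."""
    full_run = model.reward_per_step * model.horizon
    # An agent lacking the big-picture fact implicitly assumes it always runs.
    p_off = model.p_shutdown if model.models_shutdown else 0.0
    if action == "work":
        # Simplifying assumption: a shut-down agent earns no reward at all.
        return (1.0 - p_off) * full_run
    if action == "disable_button_then_work":
        return full_run - model.disable_cost
    raise ValueError(f"unknown action: {action}")


def best_action(model: WorldModel) -> str:
    """Pick the action with the highest expected utility under the model."""
    return max(ACTIONS, key=lambda a: expected_utility(a, model))


unaware = WorldModel(models_shutdown=False, p_shutdown=0.5,
                     reward_per_step=1.0, horizon=10, disable_cost=1.0)
aware = WorldModel(models_shutdown=True, p_shutdown=0.5,
                   reward_per_step=1.0, horizon=10, disable_cost=1.0)

print(best_action(unaware))  # work
print(best_action(aware))    # disable_button_then_work
```

Nothing about the goal changes between the two cases; only the world-model does. In this toy setting, that is why a solution to the suspend-button problem has to be in place before the agent's model crosses that line.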

Similarly: if the AI realizes that there are 'programmer' things that might shut it down, and the AI can also model the programmers as simplified agents having their own beliefs and goals, that's the first point at which the AI might by default think, "How can I make my programmers decide not to shut me down?" or "How can I avoid the programmers acquiring beliefs that would make them shut me down?" So by this point we'd need to have finished averting programmer deception (and, as a backup, have in place a system for early detection of an initial intent to do cognitive steganography).

This makes big-picture awareness a key advanced agent property, especially as it relates to convergent instrumental strategies and the theory of averting them.

Possible ways in which an agent could acquire big-picture strategic awareness:

By the time big-picture awareness was starting to emerge, you would probably want to have finished developing what seemed like workable initial solutions to the corresponding problems of corrigibility, since the first line of defense is to not have the AI searching for ways to defeat your defenses.

Current machine algorithms seem nowhere near being able to represent the big picture well enough to do consequentialist reasoning about it, even if we deliberately tried to explain the domain. This is a major obstacle to exhibiting most subproblems of corrigibility within modern AI algorithms in a natural way (i.e., not as completely rigged demos). Some pioneering work has been done here by Orseau and Armstrong, who consider reinforcement learners being interrupted and whether such programs learn to avoid interruption. Because of this obstacle, however, most current work on corrigibility has taken place in an unbounded context.