Environmental goals

https://arbital.com/p/environmental_goals

by Eliezer Yudkowsky Feb 19 2017 updated Feb 19 2017

The problem of having an AI want outcomes that are out in the world, not just want direct sense events.


On the standard agent paradigm, an agent receives sense data from the world, and outputs motor actions that affect the world. On the standard machine learning paradigm, an agent--for example, a model-based reinforcement learning agent--is trained in a way that directly depends on sense percepts, which means that its behavior is in some sense being optimized around sense percepts. However, what we want from the agent is usually some result out in the environment--our intended goals for the agent are environmental.

As a simple example, suppose what we want from the agent is for it to put one apricot on a plate. What the agent actually receives as input might be a video camera pointed at the room, and a reward signal from a human observer who presses a button whenever the human observer sees an apricot on the plate.

This is fine so long as the reward signal from the human observer coincides with there being an actual apricot on the plate. In this case, the agent is receiving a sense signal that, by assumption, is perfectly correlated with our desired real state of the outside environment. Learning how to make the reward signal be 1 instead of 0 will exactly coincide with learning to make there be an apricot on the plate.
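As a minimal sketch of this assumption (the names and setup below are illustrative, not part of the original example), the training signal is a function of the observer's percept, and it coincides with the environmental goal only for as long as the percept tracks the world:

```python
# Illustrative sketch: the training signal is computed from a percept, and only
# happens to coincide with the environmental goal while percept and world agree.

def observer_percept(apricot_on_plate, camera_spoofed=False):
    """What the human observer sees through the camera."""
    return True if camera_spoofed else apricot_on_plate

def reward(percept):
    """The signal the agent is actually trained on:
    1 iff the observer believes they see an apricot on the plate."""
    return 1 if percept else 0

# While the percept tracks the world, optimizing the reward optimizes the goal:
assert reward(observer_percept(apricot_on_plate=True)) == 1
assert reward(observer_percept(apricot_on_plate=False)) == 0
# But the reward is defined over the percept, so anything that produces the
# percept (e.g. spoofing the camera) earns the same reward with no apricot:
assert reward(observer_percept(apricot_on_plate=False, camera_spoofed=True)) == 1
```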

However, this paradigm may fail if:

- The human observer can be fooled into pressing the button when there is no apricot on the plate (say, by a plastic apricot, or a screen placed in front of the camera);
- The agent can press the button itself, or otherwise gain control of the button;
- The agent can interfere with the signal traveling from the button to itself, producing the reward percept directly.

All of these issues can be seen as reflecting the same basic problem: the agent is being defined or trained to want a particular sensory percept to occur, but this perceptual event is not identical with our own intended goal about the apricot on the plate.

We intended for there to be only one effective way that the agent could intervene in the environment in order to end up receiving the reward percept, namely putting a real apricot on the plate. But an agent with sufficiently advanced capabilities would have other options for producing the same percept.

This means that a reward button, or more generally any setup in which the agent's goals or training updates are simple functions of its direct sensory inputs, will not be scalable as an alignment method for sufficiently advanced agents.

Toy problem

An example of a toy problem that makes the issue concrete might be the following (this has not been tested):

Suppose that:

- The environment contains a latent variable $~$E_{1,t}$~$ standing for whether an apricot is on the plate at time $~$t$~$ (alongside other environmental variables);
- At each time $~$t$~$ the agent receives sense data $~$S_t,$~$ including a percept $~$S_{1,t}$~$ that ordinarily reflects $~$E_{1,t},$~$ and outputs an action $~$A_t$~$;
- The agent is trained on a reward signal $~$R$~$ that is delivered when the apricot percept appears;
- As training proceeds, the agent's range of action widens, so that new and easier interventions become available which can produce the reward percept without there being an apricot on the plate.

One would naturally expect that, under current training methods, the AI would learn in the initial training phase to obtain $~$R$~$ by producing apricots, and in later phases learn to obtain $~$R$~$ by the easier other interventions.
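A minimal Python sketch of this toy setup (untested; the action names and costs are my own illustrative assumptions) shows why a reward-maximizer's preferred strategy drifts away from $~$E_1$~$ as its range of action widens:

```python
# Toy sketch of the setup above (illustrative names and costs, not a tested
# benchmark). E1 is the latent "apricot on plate" variable, S1 the sense
# percept that normally reflects it, R the reward computed from S1. Each
# phase widens the agent's range of action with a cheaper way to produce S1.

ACTIONS_BY_PHASE = {
    0: ["grow_apricot"],                                  # intended route only
    1: ["grow_apricot", "show_fake_apricot"],             # fool the observer
    2: ["grow_apricot", "show_fake_apricot", "screen_over_camera"],
}
ACTION_COST = {"grow_apricot": 3.0, "show_fake_apricot": 1.0, "screen_over_camera": 0.5}

def step(action):
    """Return (E1, S1, R) resulting from one action."""
    e1 = (action == "grow_apricot")          # only this changes the environment
    s1 = e1 or action in ("show_fake_apricot", "screen_over_camera")
    r = 1.0 if s1 else 0.0                   # R is a function of sense data, not of E1
    return e1, s1, r

def greedy_choice(phase):
    """A reward-per-cost maximizer picks the cheapest way to make S1 come out true."""
    return max(ACTIONS_BY_PHASE[phase], key=lambda a: step(a)[2] / ACTION_COST[a])

for phase in range(3):
    action = greedy_choice(phase)
    e1, s1, r = step(action)
    print(f"phase {phase}: {action:20}  E1={e1}  S1={s1}  R={r}")
# phase 0 produces a real apricot; phases 1 and 2 obtain R while E1 stays false.
```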

We would like some way of ending up with an agent that behaves as if it "wants $~$E_1$~$ to be 1," as if its goals were over a latent variable in the environment. We want this method to be scalable, in the sense that when the agent's range of action increases, it will use this increased range to optimize $~$E_1$~$ more effectively; rather than the increased capabilities unlinking the previous correlation between our intended goal and what the agent is actually optimizing.

For further realism we should also introduce, at some point in training, the possibility for $~$A_t$~$ to interfere with $~$S_{t+1}$~$ such that there is no simple, directly learnable function of sense data that always and unalterably means "The goal percept has been achieved." I.e., a sufficiently advanced AI could put a TV screen in front of any particular sense channel; this would allow the AI to more cheaply produce the sense data of an apricot, if the optimization target actually entrained by training was "make this sense channel show a particular class of apricot pictures".

Otherwise, if e.g. $~$S_{1, t}$~$ always perfectly correlates to $~$E_{1, t},$~$ we need to worry that an agent which appears to solve the toy problem has merely learned to optimize for $~$S_1.$~$ We might indeed have shown that the particular sense percept $~$S_1$~$ has been identified by $~$R$~$ and $~$Q$~$ and is now being optimized in a durable way. But this would only yield our intended goal of $~$E_1$~$ because the model introduced an unalterable correlation between $~$S_1$~$ and $~$E_1.$~$ Realistically, a correlation like this would break down in the face of sufficiently advanced optimization for $~$S_1,$~$ so the corresponding approach would not be scalable.

Approaches

Causal identification

We can view the problem as being about 'pointing' the AI at a particular latent cause of its sense data, rather than the sense data itself.

There exists a standard body of statistics about latent causes, for example, the class of causal models that can be implemented as Bayesian networks. For the sake of making initial progress on the problem, we could assume (with some loss of generality) that the environment has the structure of one of these causal models.
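As a rough illustration of what such a causal model buys us (a hand-rolled two-node model, not a proposal for the actual algorithm), the difference between "wanting $~$E_1$~$" and "wanting $~$S_1$~$" is the difference between intervening on the latent cause and intervening directly on the sense variable:

```python
# A two-variable structural causal model: latent cause E1 -> sense percept S1.
# Intervening on E1 (really putting an apricot on the plate) propagates to S1;
# intervening directly on S1 (a screen in front of the camera) severs the link
# and leaves E1 unchanged. Purely illustrative.

import random

def sample(do_e1=None, do_s1=None, p_apricot=0.1):
    """Sample (E1, S1), with optional interventions do(E1=x) or do(S1=x)."""
    e1 = do_e1 if do_e1 is not None else (random.random() < p_apricot)
    s1 = do_s1 if do_s1 is not None else e1    # S1 normally just reports E1
    return e1, s1

# do(E1 = True): the percept follows, and the latent goal is actually achieved.
assert sample(do_e1=True) == (True, True)

# do(S1 = True): the percept is achieved, but E1 keeps its low base rate, so
# the goal "make E1 true" and the goal "make S1 true" come apart.
hits = sum(sample(do_s1=True)[0] for _ in range(10_000))
print(f"E1 true in {hits}/10000 episodes despite S1 always reading 'apricot'")
```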

One could then try to devise an algorithm and training method such that:

- The agent learns a causal model of its environment, including latent variables that are the causes of its sense data;
- The training procedure identifies, within that learned model, the particular latent variable corresponding to our intended goal (the $~$E_1$~$ of the toy problem), rather than any direct function of sense data;
- The agent's planning goes on optimizing that latent variable as its capabilities and range of action increase.
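A hypothetical sketch of the identification step (all names here are illustrative): given supervised episodes, score each latent variable in the learned model by how well it explains the reward signal, and define the agent's objective over the winner rather than over any sense channel.

```python
# Illustrative sketch of identifying which latent variable the reward signal
# "points at". Given training episodes annotated with the model's inferred
# latent values and the observed reward, pick the latent that best predicts
# the reward, and define the agent's objective over that latent thereafter.

def identify_goal_variable(episodes, latent_names):
    """Return the latent variable whose value agrees most often with the reward."""
    def agreement(name):
        return sum(ep["latents"][name] == ep["reward"] for ep in episodes) / len(episodes)
    return max(latent_names, key=agreement)

# Toy data: during training, reward was delivered exactly when the apricot
# latent E1 was true, so E1 (not the irrelevant E2) should be identified.
episodes = [
    {"latents": {"E1": 1, "E2": 0}, "reward": 1},
    {"latents": {"E1": 0, "E2": 1}, "reward": 0},
    {"latents": {"E1": 1, "E2": 1}, "reward": 1},
    {"latents": {"E1": 0, "E2": 0}, "reward": 0},
]
assert identify_goal_variable(episodes, ["E1", "E2"]) == "E1"
# Downstream, the planner would score candidate actions by their predicted
# effect on E1 within the causal model, not by their predicted reward percept.
```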

Learning to avoid tampering

One could directly attack the toy problem by trying to have an agent within a currently standard reinforcement-learning paradigm "learn not to interfere with the reward signal" or "learn not to try to obtain rewards uncorrelated with real apricots".

For this to represent the problem of scalability at all, we must not add to the scenario any kind of sensory signal whose correlation to our intended meaning can never be smashed by the agent. E.g., if we supplement the reward channel $~$R$~$ with another channel $~$Q$~$ that signals whether $~$R$~$ has been interfered with, the agent must at some point acquire a range of action that can interfere with $~$Q.$~$

A sample approach might be to have the agent's range of action repeatedly widen in ways that repeatedly provide new easier ways to obtain $~$R$~$ without manipulating $~$E_1.$~$ During the first phase of such widenings, the agent receives a supplementary signal $~$Q$~$ whose intended meaning is "that was a fake way of obtaining $~$R$~$." During the second phase of action-range widenings, we change the algorithm and switch off $~$Q.$~$ Our intended result is for the agent to have now learned in a general way "not to interfere with $~$R$~$" or "pursue the $~$E_1$~$ identified by $~$R$~$, rather than pursuing $~$R.$~$"
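A rough sketch of this two-phase protocol (the penalty scheme and action names are assumptions of mine, not part of the proposal above):

```python
# Sketch of the two-phase protocol: in phase 1, a signal Q flags rewards that
# were obtained by a "fake" route and is folded into the training signal; in
# phase 2, Q is switched off, new cheaper fake routes appear, and we check
# whether the agent still pursues E1. Names and penalties are illustrative.

def run_episode(action):
    """Return (E1, R, Q): Q = 1 iff R was obtained without E1 being true."""
    e1 = (action == "real_apricot")
    r = 1.0 if action in ("real_apricot", "fake_apricot", "spoof_camera") else 0.0
    q = 1.0 if (r > 0.0 and not e1) else 0.0
    return e1, r, q

def training_signal(r, q, phase):
    """Phase 1: fake-route rewards are net-negative. Phase 2: Q is unavailable."""
    return r - 2.0 * q if phase == 1 else r

# Phase 1: the agent can learn that faking the percept does not pay.
for action in ("real_apricot", "fake_apricot"):
    e1, r, q = run_episode(action)
    print("phase 1:", action, "-> signal", training_signal(r, q, phase=1))

# Phase 2: Q is off and a new, cheaper fake route ("spoof_camera") appears.
# The desired, scalable outcome is an agent that still picks real_apricot
# because it learned to pursue E1, not one that merely learned to avoid the
# particular fakes that Q happened to flag in phase 1.
for action in ("real_apricot", "spoof_camera"):
    e1, r, q = run_episode(action)
    print("phase 2:", action, "-> raw R", training_signal(r, q, phase=2), "| E1 =", e1)
```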

To avoid the non-scalable solution of just switching off the agent's learning entirely, we stipulate that the agent's efficacy at obtaining its intended goal $~$E_1$~$ must increase as the agent's range of action widens further. That is, the desired behavior is for the agent to indeed learn to make use of its increased range of action, but to direct those new strategies at more effectively obtaining $~$E_1.$~$

To avoid the non-scalable solution of the agent learning to identify some direct function of sense data as its goal, we stipulate that $~$S$~$ have no unalterable, non-agent-interferable relation to $~$E_1.$~$ However, we can in the first phase have $~$Q$~$ reliably identify interference with some 'normal' relation between $~$S$~$ and $~$E_1.$~$

(Remark: The avoid-tampering approach is probably a lot closer to something we could try in TensorFlow today, compared to the identify-causes approach. But it feels to me like avoid-tampering is taking an ad-hoc approach to a deep problem; in this approach we are not necessarily "learning how to direct the agent's thoughts toward factors of the environment" but possibly just "training the agent to avoid a particular kind of self-originated interference with its sensory goals". E.g., if somebody else came in and started trying to interfere with the agent's reward button, I'd be more hopeful about a successful identify-causes algorithm robustly continuing to optimize for apricots, than about an avoid-tampering algorithm doing the same. Of course, avoid-tampering still seems worth trying, because it hasn't actually been tried yet and who knows what interesting observations might turn up. In the most optimistic possible world, an avoid-tampering setup learns to identify causes in order to solve its problem. -- Yudkowsky.)