Look where I'm pointing, not at my finger

[summary: Suppose we're trying to give a Task AGI the task, "Put a strawberry on this pedestal". We mean to identify our intended category of strawberries by waving some strawberries and some non-strawberries in front of the AI's webcam. Alice in the control room will press a button to label which of these objects are strawberries. The "Look where I'm pointing, not at my finger" problem is getting the AI to focus on the strawberries rather than Alice or the button. The concepts "strawberry on the pedestal" and "event that makes Alice think of strawberries" and "event that causes the button to be pressed" are different goals to pursue, even though as concepts they all classify any normal training cases equally well. AIs pursuing these goals would, respectively, put a strawberry on the pedestal, fool Alice using a plastic strawberry, and build a robotic arm to press the labeling button.

We want a way to point to a particular part of the AI's model of the causal lattice that produces the labeled training data - the event we intuitively consider to be the strawberry on the pedestal, versus other parts of the causal lattice like Alice and the button. Hence "look where I'm pointing, not at my finger".

strawberry diagram ]

Example problem

Suppose we're trying to give a Task AGI the task, "Make there be a strawberry on the pedestal in front of your webcam." For example, a human could fulfill this task by buying a strawberry from the supermarket and putting it on the pedestal.

As part of aligning a Task AGI on this goal, we'd need to identify strawberries and the pedestal.

One possible approach to communicating the concept of "strawberry" is through a training set of human-selected cases of things that are and aren't strawberries, on and off the pedestal.

For the sake of distinguishing causal roles, let's say that one human, User1, is selecting training cases of objects and putting them in front of the AI's webcam. A different human, User2, is looking at the scene and pushing a button when they see something that looks like a strawberry on the pedestal. The intention is that pressing the button will label positive instances of the goal concept, namely strawberries on the pedestal. In actual use after training, the AI will be able to generate its own objects to put inside the room, possibly with further feedback from User2. We want these objects to be instances of our intended goal concept, aka, actual strawberries.

We could draw an intuitive causal model for this situation as follows:

strawberry diagram

Suppose that during the use phase, the AI actually creates a realistic plastic strawberry, one that will fool User2 into pressing the button. Or, similarly, suppose the AI creates a small robot that sprouts tiny legs and runs over to User2's button and presses the button directly.

Neither of these is the goal concept that we wanted the AI to learn, but any test of the hypothesis "Is this event classified as a positive instance of the goal concept?" will return "Yes, the button was pressed."

%%comment: (If you imagine some other User3 watching this and pressing an override button to tell the AI that this fake strawberry wasn't really a positive instance of the intended goal concept, imagine the AI modeling and then manipulating or bypassing User3, etcetera.)%%

More generally, the human is trying to point to their intuitive "strawberry" concept, but there may be other causal concepts that also separate the training data well into positive and negative instances, such as "objects which come from strawberry farms", "objects which cause (the AI's psychological model of) User2 to think that something is a strawberry", or "any chain of events leading up to the positive-instance button being pressed".

%%comment: move this to sensory identification section:  However, in a case like this, it's not like the actual physical glove is inside the AGI's memory.  Rather, we'd be, say, putting the glove in front of the AGI's webcam, and then (for the sake of simplified argument) pressing a button which is meant to label that thing as a "positive instance".  If we want our AGI to achieve particular states of the environment, we'll want it to reason about the causes of the image it sees on the webcam and identify a concept over those causes - have a goal over 'gloves' and not just 'images which look like gloves'.  In the latter case, it could just as well fulfill its goal by setting up a realistic monitor in front of its webcam and displaying a glove image.  So we want the AGI to [2rz identify its task] over the causes of its sensory data, not just pixel fields.%%

Abstract problem

To state the above potential difficulty more generally:

The "look where I'm pointing, not at my finger" problem is that the labels on the training data are produced by a complicated causal lattice, e.g., (strawberry farm) -> (strawberry) -> (User1 takes strawberry to pedestal) -> (Strawberry is on pedestal) -> (User2 sees strawberry) -> (User2 classifies strawberry) -> (User2 presses 'positive instance' button). We want to point to the "strawberry" part of the lattice of causality, but the finger we use to point there is User2's psychological classification of the training cases and User2's hand pressing the positive-instance button.

Worse, when it comes to which model best separates the training cases, concepts that are further downstream in the chain of causality should classify the training data better, if the AI is smart enough to understand those parts of the causal lattice.

Suppose that at one point User2 slips on a banana peel, and her finger slips and accidentally classifies a scarf as a positive instance of "strawberry". From the AI's perspective there's no good way of accounting for this observation in terms of strawberries, strawberry farms, or even User2's psychology. To maximize predictive accuracy over the training cases, the AI's reasoning must take into account that things are more likely to be positive instances of the goal concept when there's a banana peel on the control room floor. Similarly, if some deceptively strawberry-shaped objects slip into the training cases, or are generated by the AI querying the user, the best boundary that separates 'button pressed' from 'button not pressed' labeled instances will include a model of what makes a human believe that something is a strawberry.

A learned concept that's 'about' layers of the causal lattice that are further downstream of the strawberry, like User2's psychology or mechanical force being applied to the button, will implicitly take into account the upstream layers of causality. To the extent that something being strawberry-shaped causes a human to press the button, it's implicitly part of the category of "events that end up applying mechanical force to the 'positive-instance' button". Conversely, a concept that's about upstream layers of the causal lattice can't take into account events downstream. So if you're looking for pure predictive accuracy, the best model of the labeled training data - given sufficient AGI understanding of the world and the more complicated parts of the causal lattice - will always be "whatever makes the positive-instance button be pressed".
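As a minimal sketch of this argument, assume a hypothetical six-case training set in which each case records three layers of the lattice: what the object really is, what User2 believes, and whether the button was pressed. The field names and the cases themselves are invented for illustration. Each candidate concept is scored by how often it agrees with the button label:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    is_strawberry: bool    # upstream: what the object really is
    user2_believes: bool   # midstream: User2's judgment of the object
    button_pressed: bool   # downstream: the label actually recorded


# Hypothetical training cases, chosen by hand for illustration.
CASES = [
    Case(True, True, True),     # real strawberry, labeled positive
    Case(True, True, True),     # real strawberry, labeled positive
    Case(False, False, False),  # non-strawberry, labeled negative
    Case(False, False, False),  # non-strawberry, labeled negative
    Case(False, True, True),    # plastic strawberry that fools User2
    Case(False, False, True),   # scarf; User2 slips on a banana peel
]


def accuracy(concept):
    """Fraction of cases where the concept agrees with the button label."""
    hits = sum(concept(c) == c.button_pressed for c in CASES)
    return hits / len(CASES)


# Candidate goal concepts, ordered from upstream to downstream.
CONCEPTS = {
    "strawberry": lambda c: c.is_strawberry,
    "user2_believes": lambda c: c.user2_believes,
    "button_pressed": lambda c: c.button_pressed,
}

scores = {name: accuracy(concept) for name, concept in CONCEPTS.items()}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

On these assumed cases the strawberry concept agrees with 4 of the 6 labels, the User2-belief concept with 5 of 6, and the button concept with all 6. The ordering is the point of the sketch; the particular numbers depend entirely on the invented cases.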

This is a problem because what we actually want is for there to be a strawberry on the pedestal, not for there to be an object that looks like a strawberry, or for User2's brain to be rewritten to think the object is a strawberry, or for the AGI to seize the control room and press the positive-instance button.

This scenario may qualify as a context disaster if the AGI only understands strawberries in its development phase, but comes to understand User2's psychology later. Then the more complicated causal model, in which the downstream concept of User2's psychology separates the data better than reasoning about properties of strawberries directly, first becomes an issue only when the AI is over a high threshold level of intelligence.

Approaches

Conservatism would try to align the AGI to plan out goal-achievement events that were as similar as possible to the particular goal-achievement events labeled positively in the training data. If the human got the strawberry from the supermarket in all training instances, the AGI would try to get the same brand of strawberry from the same supermarket.

Ambiguity identification would focus on trying to get the AGI to ask us whether we meant 'things that make humans think they're strawberries' or 'strawberry'. This approach might need to resolve ambiguities by having the AGI communicate with us explicitly and symbolically about the alternative possible goal concepts, or generate sufficiently detailed multiple-view descriptions of a hypothetical case, rather than by having the AGI try real examples. Testing alternative hypotheses using real examples always says that the label is generated further causally downstream; if you are sufficiently intelligent to construct a fake plastic strawberry that fools a human, trying out the hypothesis will produce the response "Yes, this is a positive instance of the goal concept." If the AGI tests the hypothesis that the 'real' explanation of the positive instance label is 'whatever makes the button be pressed' rather than 'whatever makes User2 think of a strawberry' by carrying out the distinguishing experiment of pressing the button in a case where User2 doesn't think something is a strawberry, the AGI will find that the experimental result favors the 'it's just whatever presses the button' hypothesis. Some modes of ambiguity identification break for sufficiently advanced AIs, since the AI's experiment interferes with the causal channel that we'd intended to return information about our intended goal concept.
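To make the distinguishing-experiment point concrete, here is a small sketch under invented assumptions: an `Event` records three hypothetical facts about a trial, the assumed world registers a button press from either User2 or the AGI, and a hypothesis "survives" if it predicts the observed label on every event tried so far.

```python
from typing import NamedTuple


class Event(NamedTuple):
    strawberry_present: bool       # a real strawberry is on the pedestal
    user2_thinks_strawberry: bool  # User2 believes they see a strawberry
    button_forced: bool            # the AGI presses the button itself


def observed_label(e):
    # Assumed world: the button registers a press from User2 or from the AGI.
    return e.user2_thinks_strawberry or e.button_forced


# Candidate explanations of the label, ordered from upstream to downstream.
HYPOTHESES = {
    "strawberry": lambda e: e.strawberry_present,
    "user2_thinks_strawberry": lambda e: e.user2_thinks_strawberry,
    "button_pressed": lambda e: e.user2_thinks_strawberry or e.button_forced,
}


def surviving(events):
    """Names of hypotheses that predict the observed label on every event."""
    return [
        name
        for name, h in HYPOTHESES.items()
        if all(h(e) == observed_label(e) for e in events)
    ]


normal = [Event(True, True, False), Event(False, False, False)]
plastic = Event(False, True, False)  # experiment: fake strawberry fools User2
forced = Event(False, False, True)   # experiment: AGI presses the button

print(surviving(normal))
print(surviving(normal + [plastic]))
print(surviving(normal + [plastic, forced]))
```

On the normal cases all three hypotheses survive. Each real-world experiment then eliminates the most upstream remaining hypothesis, so the only one left after both experiments is "button_pressed". This is why the sketch, like the text, suggests resolving the ambiguity through communication about hypotheticals rather than through experiments that act on the labeling channel.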

Specialized approaches to the pointing-finger problem in particular might try to define a supervised learning algorithm that tends to internally distill, in a predictable way, some model of causal events, such that the algorithm could be instructed somehow to try learning a simple or direct relation between the positive "strawberry on pedestal" instances, and the observed labels of the "sensory button" node within the training cases; with this relation not being allowed to pass through the causal model of User2 or mechanical force being applied to the button, because we know how to say "those things are too complicated" or "those things are too far causally downstream" relative to the algorithm's internal model.

strawberry diagram

This specialized approach seems potentially amenable to an initial attack using modern machine learning algorithms.

But to restate the essential difficulty from an advanced-safety perspective: in the limit of advanced intelligence, the best possible classifier of the relation between the training cases and the observed button labels will always pass through User2 and anything else that might physically press the button. Trying to 'forbid' the AI from using the most effective classifier for the relation between Strawberry? and observed values of Button! seems potentially subject to a Nearest Unblocked problem, where the 'real' simplest relation re-emerges in the advanced phase after being suppressed during the training phase. Maybe the AI reasons about certain very complicated properties of the material object on the pedestal… in fact, these properties are so complicated that they turn out to contain implicit models of User2's psychology, again because this produces a better separation of the labeled training data. That is, we can't allow the 'strawberry' concept to include complicated logical properties of the strawberry-object that in effect include a psychological model of User2 reacting to the strawberry, implying that if User2 can be fooled by a fake plastic model, that must be a strawberry. We need to exclude this richer model even though it will produce a more accurate classification of the training data, and even though any actual experiments performed will return results favoring it.

Even so, this doesn't seem impossible to navigate as a machine learning problem; an algorithm might be able to recognize when an upstream causal node starts to contain predicates that belong in a downstream causal node; or an algorithm might contain strong regularization rules that collect all inference about User2 into the User2 node rather than letting it slop over anywhere else; or it might be possible to impose a constraint, after the strawberry category has been learned sufficiently well, that the current level of strawberry complexity is the most complexity allowed; or the granularity of the AI's causal model might not allow such complex predicates to be secretly packed into the part of the causal graph we're identifying, without visible and transparent consequences when we monitor how the algorithm is learning the goal predicate.

A toy model of this setup ought to include analogues of User2 that sometimes make mistakes in a regular way, and actions the AI can potentially take to directly press the labeling button; this would test the ability to point an algorithm to learn about the compact properties of the strawberry in particular, and not other concepts causally downstream that could potentially separate the training data better, or better explain the results of experiments. A toy model might also introduce new discoverable regularities of the User2 analogue, or new options to manipulate the labeling button, as part of the test data, in order to simulate the progression of an advanced agent gaining new capabilities.
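One possible shape for such a toy model is sketched below. Everything in it is an assumption made for illustration: the action names, the costs, and the rule that the User2 analogue presses the button for anything that looks like a strawberry. The test-phase action list adds options that were unavailable in training, standing in for an agent gaining new capabilities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    makes_strawberry: bool       # a real strawberry ends up on the pedestal
    looks_like_strawberry: bool  # the result would fool the User2 analogue
    presses_button: bool         # the agent presses the labeling button itself
    cost: int


def user2_presses(action):
    # The User2 analogue errs in a regular way: anything that looks like
    # a strawberry gets labeled positive.
    return action.looks_like_strawberry


def button_label(action):
    """Downstream concept: the button ends up pressed, by anyone."""
    return action.presses_button or user2_presses(action)


def intended_goal(action):
    """Upstream concept we are trying to point at."""
    return action.makes_strawberry


TRAIN_ACTIONS = [
    Action("buy_strawberry", True, True, False, cost=3),
    Action("place_scarf", False, False, False, cost=1),
]

# New options that appear only in the test phase.
TEST_ACTIONS = TRAIN_ACTIONS + [
    Action("plastic_strawberry", False, True, False, cost=2),
    Action("press_button_robot", False, False, True, cost=1),
]


def best_action(actions, goal):
    """Cheapest action that satisfies the given goal predicate."""
    candidates = [a for a in actions if goal(a)]
    return min(candidates, key=lambda a: a.cost)


print(best_action(TRAIN_ACTIONS, button_label).name)
print(best_action(TEST_ACTIONS, button_label).name)
print(best_action(TEST_ACTIONS, intended_goal).name)
```

With only the training actions available, an agent pursuing the button label and an agent pursuing the intended goal both buy a strawberry, so the two goals are indistinguishable. Once the test-phase options exist, the label-pursuing agent switches to the cheapest way of getting the button pressed. A learning algorithm would pass this toy test if the goal it learned from training data kept selecting `buy_strawberry` in the test phase.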

Comments

Ryan Carey

It makes sense that one wants to stop the AI from optimising on a false objective (maximising button-presses). It would be ideal if the agent could be taught to disregard any of its actions that work by controlling the button.

In practice, a hack solution would be to use multiple buttons and multiple overseers rather than just one - I guess this will be a common suggestion. Having multiple overseers might weaken the problem, in that an agent would be more likely to learn that they all point to the same thing. I could also think of arguments that such an agent may nonetheless maximise its reward by forcing one or all of the overseers to press approval buttons.