Identifying causal goal concepts from sensory data

https://arbital.com/p/identify_causal_goals

by Eliezer Yudkowsky Apr 14 2016

If the intended goal is "cure cancer" and you show the AI healthy patients, it sees, say, a pattern of pixels on a webcam. How do you get to a goal concept *about* the real patients?


Suppose we want an AI to carry out some goals involving strawberries, and as a result, we want to identify to the AI the concept of "strawberry". One of the potential ways we could do this is by showing the AI objects that a teacher classifies as strawberries or non-strawberries. However, in the course of doing this, what the AI actually sees will be, e.g., a pattern of pixels on a webcam - the actual, physical strawberry is not directly accessible to the AI's intelligence. When we show the AI a strawberry, what we're really trying to communicate is "A certain proximal cause of this sensory data is a strawberry", not "This arrangement of sensory pixels is a strawberry." An AI that learns the latter concept might try to carry out its goal by putting a picture in front of its webcam; the former AI has a goal that actually involves something in its environment.
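
To make the distinction concrete, here is a minimal sketch (a hypothetical toy model, not part of the original discussion) in which a real strawberry and a photo of a strawberry produce identical webcam pixels. A goal predicate defined over the pixels is satisfied in both cases; a goal predicate defined over the proximal cause of the pixels is satisfied only by the real strawberry.

```python
# Toy model (invented for illustration): a world state has a latent cause
# ("real strawberry" vs. "photo of strawberry") and the pixels that cause
# produces. Both causes render to the same pixel pattern, so a goal defined
# on pixels cannot tell them apart, while a goal defined on the cause can.

from dataclasses import dataclass

@dataclass
class WorldState:
    cause: str    # what is actually in front of the camera
    pixels: str   # the sense data that cause produces

def render(cause: str) -> str:
    # A real strawberry and a photo of one yield the same pixel pattern.
    if cause in ("real strawberry", "photo of strawberry"):
        return "red-round-pixel-pattern"
    return "other-pixels"

def sensory_goal(state: WorldState) -> bool:
    # "This arrangement of sensory pixels is a strawberry."
    return state.pixels == "red-round-pixel-pattern"

def causal_goal(state: WorldState) -> bool:
    # "A certain proximal cause of this sensory data is a strawberry."
    return state.cause == "real strawberry"

real = WorldState("real strawberry", render("real strawberry"))
spoof = WorldState("photo of strawberry", render("photo of strawberry"))

print(sensory_goal(real), sensory_goal(spoof))  # True True  - spoof satisfies the sensory goal
print(causal_goal(real), causal_goal(spoof))    # True False - only the real object satisfies the causal goal
```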

The open problem of "identifying causal goal concepts from sensory data" or "identifying environmental concepts from sensory data" is about getting an AI to form causal goal concepts instead of sensory goal concepts. Since almost no human-intended goal will ever be satisfiable solely in virtue of an advanced agent arranging to see a certain field of pixels, safe ways of identifying goals to sufficiently advanced goal-based agents will presumably involve some way of identifying goals among the causes of sense data.

A "toy" (and still pretty difficult) version of this open problem might be to exhibit a machine algorithm that (a) has a causal model of its environment, (b) can learn concepts over any level of its causal model including sense data, (c) can learn and pursue a goal concept, (d) has the potential ability to spoof its own senses or create fake versions of objects, and (e) is shown to learn a proximal causal goal rather than a goal about sensory data as shown by it pursuing only the causal version of that goal even if it would have the option to spoof itself.

For a more elaborated version of this open problem, see "Look where I'm pointing, not at my finger".