Goal-concept identification

[summary: The problem of communicating to an AI a very simple local concept on the order of "strawberries" or "give me a strawberry". This level of the problem is meant to include subproblems of local categorization like "I meant a real strawberry like the ones I already showed you, not a fake strawberry that looks similar through your webcam". It isn't meant to include larger problems like verifying that a plan uses only known, whitelisted methods or identifying all possible harmful effects we could care about.]

The problem of trying to figure out how to communicate to an AGI an intended goal concept on the order of "give me a strawberry, and not a fake plastic strawberry either".

At this level of the problem, we're not concerned with e.g. larger problems of safe plan identification such as not mugging people for strawberries, or minimizing side effects. We're not (at this level of the problem) concerned with identifying each and every one of the components of human value, as they might be impacted by side effects more distant in the causal graph. We're not concerned with philosophical uncertainty about what we should mean by "strawberry". We suppose that in an intuitive sense, we do have a pretty good idea of what we intend by "strawberry", such that there are things that are definitely strawberries and we're pretty happy with our sense of that so long as nobody is deliberately trying to fool it.

We just want to communicate a local goal concept that distinguishes edible strawberries from plastic strawberries, or nontoxic strawberries from poisonous strawberries. That is: we want to say "strawberry" in an understandable way that's suitable for fulfilling a task of "just give Sally a strawberry", possibly in conjunction with other features like conservatism or low impact or mild optimization.

For some open subproblems of the obvious approach that goes through showing actual strawberries to the AI's webcam, see "Identifying causal goal concepts from sensory data" and "Look where I'm pointing, not at my finger".

Comments

Paul Christiano

I think it's going to be hard to talk or think clearly about these problems (even at the level of separating them into distinct problems or telling which are real problems) until we get more specific about what a goal is, what a concept is, etc. What does the overall system actually look like, even very roughly?

I guess your take is that this is tied up in a very hard-to-separate way from the design of AI itself.

I understand that it is good to throw out some concrete problems before embarking on the project of clarifying our models of powerful AI systems. But I suspect you need at least some model of a powerful AI system where the questions make sense, just to keep things vaguely on track.