"I don't see indirect specif..."

https://arbital.com/p/7g

by Paul Christiano Jun 18 2015


I don't see indirect specifications as encountering these difficulties; all of the contenders so far go straight for the throat (defining behavior directly in terms of perceptions) rather than trying to pick out the programmer in the AI's ontology. Even formal accounts of, e.g., language learning seem like they will have to go for the throat in this sense (learning the correspondence between language and an initially unknown world, based on perceptions), rather than manually binding nouns to parts of a particular ontology or something like that. So whatever mechanism you used to initially learn what a "programmer" is, it seems like you can use the same mechanism to learn what a programmer is under your new physical theory (or, more likely, your beliefs about the referent of "programmer" will automatically adjust along with your beliefs about physics, and indeed will be used to help inform your changing beliefs about physics).

The "direct" approaches, that pick out what is valuable directly in the hard-coded ontology of the AI, seem clearly unsatisfactory on other grounds.


Comments

Paul Christiano

Six months and several discussions later, this still seems like a serious concern (Nick Bostrom seemed to have the same response independently, and to consider it a pretty serious objection).

It really seems like the problem is an artifact of the toy example of diamond-maximization. This "easy" problem is so easy, in a certain sense, that it tempts us toward a particular class of simple strategies in which we literally specify a model of the world and say what diamond is.

Those strategies seem like an obvious dead end in the real case, and I think everyone agrees about that. They also seem like an almost-as-obvious dead end even in the diamond-maximization case.

That's fine, but it means that the real justification is quite different from the simple story offered here. Everyone at MIRI I have talked to has fallen back to some variant of this more subtle justification when pressed. I don't know of anywhere that the real justification has been fleshed out in publicly available writing. I would be game for a public discussion about it.

It does seem like there is some real problem about getting agents to actually care about stuff in the real world. This just seems like a very strange way of describing or attacking the problem.