by Paul Christiano Mar 11 2016

Act-based agents seem to be robust to certain kinds of errors. You need only the vaguest understanding of humans to guess that killing the user is: (1) not something they would approve of, (2) not something they would do, (3) not in line with their instrumental preferences.

Eliezer objects to this post's optimism about robustness.

Concretely, the complaint seems to be that a human-predictor would form generalizations like "the human takes and approves of actions that maximize expected utility," for some notion of "utility," some notion of counterfactuals, and so on. It might then end up killing the user (or making some other irreversibly bad decision) because the bad action is the utility-maximizing thing to do according to the learned values/decision theory/priors/etc. (which aren't identical to the human's values/decision theory/priors/etc.).

I'm not impressed by this objection.

Clearly this would be an objectively bad prediction of the human, so the question is entirely about how hard it is to notice that it is a bad, or at least uncertain, prediction. To a human it appears comically bad. So the question is: to what extent is that just because we are humans predicting humans?

I don't think these points totally capture my position, but hopefully they help explain where I am coming from. I still feel pretty good about the argument in the "robustness" section of this post. It really does seem pretty easy to predict that the human won't generally endorse actions that leave them dead.