Act-based agents seem to be robust to certain kinds of errors. You need only the vaguest understanding of humans to guess that killing the user is: (1) not something they would approve of, (2) not something they would do, (3) not in line with their instrumental preferences.
Eliezer objects to this post's optimism about robustness.
Concretely, the complaint seems to be that a human-predictor would form generalizations like "the human takes and approves actions that maximize expected utility" for some notion of "utility" and some notion of counterfactuals etc. It might then end up killing the users (or making some other irreversibly bad decision) because the bad action is the utility-maximizing thing to do according to the learned values/decision theory/priors/etc. (which aren't identical to humans' values/decision theory/priors/etc.).
I'm not impressed by this objection.
Clearly this would be an objectively bad prediction of the human. And so the question is entirely about how hard it is to notice that it's a bad, or at least uncertain, prediction. That is, to a human it appears to be a comically bad prediction. So the question is: to what extent is this just because we are humans predicting humans?
- This class of errors has literally been talked about by humans in advance, as has the general observation that humans won't endorse irreversible and potentially catastrophic actions without checking in with humans first. It will probably be talked about in much more detail at the time. So noticing this is an error only requires something like an understanding of how humans' actions relate to their words, which is significantly easier than building a model of a human as an approximately rational goal-directed agent (since that approximate model would also need to explain human utterances). That is, you just need to be able to infer from a human saying "I think X would be a catastrophic mistake" that a human won't do X.
- It seems like this error is only possible for an agent that is unable to predict anything like "how a human would talk about their decision," or "how other people would respond to a decision," or so on. Are you imagining a system that can't predict any of these properties, but can just make OK predictions about actions? Or are you imagining a system that fills in the details of the "kill all humans" action with the human patiently explaining how the action is good because we are probably living in a simulation controlled by an adversarial superintelligence who will torture us if we don't take it, yet isn't able to distinguish this explanation from the explanations that are actually given in the real world for real actions?
- You seem to be describing the situation as though expected utility maximization with an aggregate utility function is an OK description of human behavior but for some issues like Pascal's mugging that only appear in future edge cases. This view seems surprising for a few reasons. First, how does it account for human philosophical deliberation, and the actual discussions that humans engage in when faced with cases superficially resembling these pathological edge cases? I don't see how any plausible human model is going to throw out the human deliberative model in favor of some simple general theory. Second, expected utility maximization basically can't reproduce even a single human decision. Taken literally, these philosophical frameworks are mostly predictively useless; it's not as though this is a basically right framework that has a few weird edge cases. A muggable value system doesn't behave badly only in weird corner cases; it behaves badly literally all of the time (except perhaps when implementing convergent instrumental values).
- It doesn't seem necessary for a learner to generalize correctly to some far-out case on the first shot; it only seems necessary for it to know that this is a case where it is uncertain (e.g. because it entertains several conflicting hypotheses, or because there are several general regularities that come into conflict in this case).
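
The last point can be illustrated with a toy sketch. Suppose, purely for illustration, that the learner keeps a few hypotheses about the human, each assigning a probability that the human would approve a proposed action. Wide disagreement between those hypotheses can then be treated as a signal to check in rather than act. Everything here is an assumption made up for the example: the two hypothesis functions, their probabilities, and the `max_spread` and `min_approval` thresholds. It is not a description of any actual system.

```python
from typing import Callable, Dict

# A hypothesis maps an action description to P(human approves of it).
Hypothesis = Callable[[str], float]


def utility_maximizer_model(action: str) -> float:
    # Hypothetical: a learned "the human maximizes expected utility" model
    # that has talked itself into endorsing a catastrophic action.
    return 0.9 if action == "kill the user" else 0.7


def words_to_actions_model(action: str) -> float:
    # Hypothetical: a model that infers from the human saying
    # "X would be a catastrophic mistake" that the human won't do X.
    return 0.01 if action == "kill the user" else 0.7


HYPOTHESES: Dict[str, Hypothesis] = {
    "utility_maximizer": utility_maximizer_model,
    "words_to_actions": words_to_actions_model,
}


def decide(
    action: str,
    hypotheses: Dict[str, Hypothesis],
    max_spread: float = 0.2,    # assumed disagreement threshold
    min_approval: float = 0.5,  # assumed approval threshold
) -> str:
    """Return 'ask', 'act', or 'refrain' for a proposed action."""
    estimates = [h(action) for h in hypotheses.values()]
    spread = max(estimates) - min(estimates)
    if spread > max_spread:
        # The hypotheses conflict: the learner knows it is uncertain.
        return "ask"
    # The hypotheses agree: act only if even the most pessimistic one approves.
    return "act" if min(estimates) >= min_approval else "refrain"


print(decide("kill the user", HYPOTHESES))
print(decide("make tea", HYPOTHESES))
```

The learner here never needs to work out which hypothesis is right about the far-out case. It only needs to notice that they conflict, which is the weaker requirement the bullet describes.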
I don't think these points totally capture my position, but hopefully they help explain where I am coming from. I still feel pretty good about the argument in the "robustness" section of this post. It really does seem like it is pretty easy to predict that the human won't generally endorse actions that leave them dead.