"I don't think you've correc..."


by Paul Christiano Dec 31 2015

I don't think you've correctly diagnosed the disagreement yet (your strawman position is obviously crazy, given that some forms of "predicting humans" are already tractable while others won't be until humans are obsolete).

When I imply that "making prediction errors about humans isn't a big deal," it's not because I think that algorithms won't make such errors. It's because the resulting failures don't look malignant.

We are concerned about a particular class of failures, namely those that lead to intelligent optimization of alien goals. So to claim that mis-predicting humans is catastrophic, you need to explain why it leads to this special kind of failure. This seems to be where we disagree. Misunderstanding human values doesn't seem to lead necessarily to this kind of catastrophe, as long as you get right the part where human values are the things that humans want. Other failures cause you to transiently do things that aren't quite what the humans want, which is maybe regrettable but basically falls into the same category as other errors about object-level tasks.

A simple example:

Suppose that I am giving instructions to a pseudo-genie (like a genie, but it follows predicted rather than actual instructions), and the pseudo-genie is predicting what instructions I would give it. I fully expect the pseudo-genie not to predict any instructions that predictably lead to it killing me, except in exceptional circumstances, or in cases where I get to sign off on dying first, and so on.

This is not a subtle question requiring full coverage of human value, or nuances of the definition of "dying." I also don't think there is any edge instantiation here in the problematic sense. There is edge instantiation in that the human may pick instructions that are as good as possible subject to the constraint of not killing anyone, but I claim that this kind of edge instantiation does not put significant extra burden on our predictor.

Do we disagree about this point? That is, do you think that such a pseudo-genie would predict me issuing instructions that lead to me dying? If not, do you think that outcomes like "losing effective control of the machines that I've built" or "spawning a brooding elder god" are much subtler than dying and therefore more likely to be violated?

(I actually think that killing the user directly is probably a much more likely failure mode than spawning an alien superintelligence.)

I do also think that a classifier trained to identify "instructions that the human would object vigorously to with 1% probability" could identify most instructions that a human would in fact object vigorously to. (At least in the context of the pseudo-genie, where this classifier is being applied to predicted human actions. If the plans are optimized for not being classified as objectionable, which seems like it should never ever happen, then indeed something may go wrong.)
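To make the thresholding concrete, here is a minimal sketch of what "flag anything with at least a 1% chance of a vigorous objection" might look like. Everything in it is an assumption made for illustration: `flag_objectionable`, `toy_predictor`, and the probabilities are hypothetical stand-ins, not a description of any actual system.

```python
from typing import Callable, Iterable, List

# The 1% figure comes from the text; treating it as a hard cutoff is an assumption.
OBJECTION_THRESHOLD = 0.01


def flag_objectionable(
    instructions: Iterable[str],
    predict_objection_prob: Callable[[str], float],
    threshold: float = OBJECTION_THRESHOLD,
) -> List[str]:
    """Return the instructions whose predicted probability of a vigorous
    objection is at or above the threshold."""
    return [i for i in instructions if predict_objection_prob(i) >= threshold]


def toy_predictor(instruction: str) -> float:
    """Hypothetical stand-in for a learned model of the human.
    The numbers are made up purely for illustration."""
    made_up = {
        "fetch coffee": 0.0001,
        "rearrange the lab": 0.03,
        "disable the off switch": 0.6,
    }
    return made_up.get(instruction, 0.0)


# The classifier is applied to predicted instructions; nothing here
# searches for instructions that evade the flag.
predicted = ["fetch coffee", "rearrange the lab", "disable the off switch"]
flagged = flag_objectionable(predicted, toy_predictor)
```

With a cutoff this low, the sketch deliberately over-flags: "rearrange the lab" is flagged even though the human would probably not object. That is the trade the 1% threshold makes in order to catch most of the instructions the human really would object to.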

If you consider the low-impact paradigm, then the idea is that you can get a lot of the same intended benefit of "do no harm" via "try not to needlessly affect things and tell me about the large effects you do expect so I can check, even if this involves a number of needlessly avoided effects and needless checks"

I think I understand the motivation. I'm expressing skepticism that this is really an easier problem. Sorry if "do no harm" was an unfairly ambitious paraphrase.

One motivating observation is that human predictions of other humans seem to be complete overkill for running my argument; that is, the kinds of errors you must be concerned about are totally unlike the errors that a sophisticated person might make when reasoning about another person. If you disagree about this then that seems like a great opportunity to flesh out our disagreement, since I think it is a slam dunk and it seems way easier to reason about.

Assuming that we agree on that point, then we can perhaps agree on a simpler claim: for a strictly superhuman AI, there would be no reason to have actual human involvement. Human involvement is needed only in domains where humans actually have capabilities, especially for reasoning about other humans, that our early AI lacks.

That is, in some sense the issue (in your scenario) seems to be that AI systems are good at some tasks and humans are good at other tasks, and we want to build a composite system that has both abilities. This is quite different from the usual relationship in AI control, where the human is contributing goals rather than abilities.