"Sorry, I tried to be concre..."


by Paul Christiano Dec 31 2015

Sorry, I tried to be concrete about what we were discussing, but I will try harder:

Consider some putative design for a genie that behaves safely with human involvement.

Now form a pseudo-genie that works as follows. Every time the original genie would consult a human (or provide an opportunity for human intervention), the pseudo-genie consults the human only with some small probability. Otherwise, it predicts how the human would respond and behaves as if it had actually received the predicted feedback.
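The wrapper described above can be sketched in code. This is only an illustrative toy, and every interface here (the predictor, the human oracle, the query probability) is a hypothetical stand-in, not part of any actual system:

```python
import random

class PseudoGenie:
    """Toy sketch of the pseudo-genie construction: wrap a genie design
    that normally consults a human, and replace most consultations with
    a prediction of the human's response."""

    def __init__(self, predictor, human, query_prob=0.01, rng=None):
        self.predictor = predictor    # hypothetical model of the human's responses
        self.human = human            # callable: query -> the human's actual response
        self.query_prob = query_prob  # small probability of a real consultation
        self.rng = rng or random.Random()

    def get_feedback(self, query):
        """Return the feedback the pseudo-genie acts on, plus (when the
        human was actually consulted) whether the prediction was accurate."""
        predicted = self.predictor(query)
        if self.rng.random() < self.query_prob:
            # Rare real consultation: act on the human's answer, and note
            # whether the prediction agreed with it.
            actual = self.human(query)
            return actual, (actual == predicted)
        # Usual case: behave as if the predicted feedback had been received.
        return predicted, None
```

The occasional real consultations are what make the construction checkable: they give a stream of cases where the prediction can be compared against the human's actual response.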

My weak claim is that the pseudo-genie will not have catastrophic failures unless either (1) it makes an inaccurate prediction, or (2) the real genie has a catastrophic failure. This seems obvious on its face. But your most recent comment seems to reject this claim, so it might be a good place to focus in order to clear up the discussion.

(I agree that even the best possible predictor cannot always make accurate predictions, so the relevance of the weak claim is not obvious. But you might hope that in situations that actually arise, very powerful systems will make accurate predictions.)

My strong claim is that if the human behaves sensibly, the pseudo-genie will not have catastrophic failures unless either (1) it makes a prediction which seems obviously and badly wrong, or (2) the real genie has a catastrophic failure.

Even the strong claim is far from perfect reassurance, because the AI might expect to be in a simulation in which the human is about to be replaced by an adversarial superintelligence, and so make predictions that seem obviously and badly wrong. For the moment I am setting that difficulty aside; if you are willing to concede the point modulo that difficulty, then I'll declare us on the same page.

it really sounds like you're assuming the problem of Friendly AI - reducing "does useful pivotal things and does not kill you" to "have a sufficiently good answer to some well-specified question whose interpretation doesn't depend on any further reflectively consistent degrees of freedom" - has been fully solved as just one step in your argument

No, I'm just arguing that if you had an AI that works well with human involvement, then you can make one that works well with minimal human involvement, modulo certain well-specified problems in AI (namely, making good enough predictions about humans). Those problems almost but not quite avoid reflectively consistent degrees of freedom (the predictions still have a dependence on the prior).

This is just one of ten steps in the act-based approach, so to the extent that we disagree, it seems important to clear that up.