"> Do we disagree about this..."


by Eliezer Yudkowsky Dec 31 2015 updated Dec 31 2015

> Do we disagree about this point? That is, do you think that such a pseudo-genie would predict me issuing instructions that lead to me dying?
>
> One motivating observation is that human predictions of other humans seem to be complete overkill for running my argument---that is, the kinds of errors you must be concerned about are totally unlike the errors that a sophisticated person might make when reasoning about another person.

For early genies: Yes.

For later genies: It's more that I don't think the approval-based proposal, insofar as it's been specified so far, has demonstrated that it's reached the point where anything that kills you is a prediction error. I mean, if you can write out an AI design (or Python program that runs on a hypercomputer) which does useful pivotal things and never kills you unless it makes an epistemic error, that's a full in-principle solution to Friendly AI! Which I don't yet consider you to have presented! It's a very big thing to assume you can do!

Like, the way I expect this scenario cashes out in practice is that you write down an approval-directed design, I say, "Well, doesn't that seek out this point where it would correctly predict that you'd say 'yes' to this proposal, but this proposal actually kills you, because other optimization pressures sought out a case where you'd approve something extreme by mistake?" and you say "Oh of course that's not what I meant, I didn't mention this extra weird recursion here that prevents that" and this goes back and forth a bit. I expect that if you ever present me with something that has all the loose variables nailed down (a la AIXI) and whose consequences can be understood, I'll think it kills the operator, and you'll disagree in a way that isn't based purely on math and doesn't let you convince me. That's what the world looks like in possible worlds where powerful optimization processes end up killing you unless you solve some hard problems and approval-based agents turn out not to deal with those problems.
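To make that failure mode concrete, here is a toy Python sketch. Everything in it is a hypothetical illustration, not anyone's actual approval-directed design: the `Proposal` class, its fields, the three example proposals, their numbers, and the assumption of a perfect approval predictor are all invented for the example. It shows only the narrow point that an argmax over correctly predicted approval can land on a fatal proposal while making zero prediction error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """Hypothetical proposal; all fields are assumptions made for this toy."""
    name: str
    operator_would_approve: float  # chance the operator really says "yes"
    kills_operator: bool           # real consequence, invisible to the approval score


def predicted_approval(p: Proposal) -> float:
    # Assumed perfect predictor: it reports exactly what the operator would say.
    return p.operator_would_approve


def choose(proposals):
    # Approval-directed selection: take whatever is predicted to be approved most.
    return max(proposals, key=predicted_approval)


PROPOSALS = [
    Proposal("modest, well-understood plan", 0.90, False),
    Proposal("ambitious but sound plan", 0.95, False),
    # The operator approves this one by mistake; the search finds it anyway.
    Proposal("extreme plan the operator misjudges", 0.99, True),
]

chosen = choose(PROPOSALS)
prediction_error = abs(predicted_approval(chosen) - chosen.operator_would_approve)

print(chosen.name, "| fatal:", chosen.kills_operator, "| prediction error:", prediction_error)
```

In this toy the predictor is never wrong about what the operator would say; the fatal outcome comes entirely from the operator's own mistaken approval being the thing that gets maximized.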

> Assuming that we agree on that point, then we can perhaps agree on a simpler claim: for a strictly superhuman AI, there would be no reason to have actual human involvement. Human involvement is needed only in domains where humans actually have capabilities, especially for reasoning about other humans, that our early AI lacks.

Or where humans have the preferable settings on their reflectively consistent degrees of freedom, where "reflectively consistent degrees of freedom" include Humean degrees of freedom in values, an intuitive decision theory that's reluctant to give everything away to blackmail or a Pascal's Mugging, etcetera. This is the reason to have human involvement with things that are superhumanly competent at computing the answers to well-specified problems, but aren't pointing in a sufficiently preferred direction with that competence if they were looped in on themselves and had to originate all their own directives.

This is making me wonder if there mustn't be a basic miscommunication on some end because it really sounds like you're assuming the problem of Friendly AI - reducing "does useful pivotal things and does not kill you" to "have a sufficiently good answer to some well-specified question whose interpretation doesn't depend on any further reflectively consistent degrees of freedom" - has been fully solved as just one step in your argument. Or like you're assuming that approval-directed agency and predicting human acts or answers can be used to solve that Big Question, but if so, this is exactly the great big key point and it's not something you can just ask me to take for granted!