"> It seems like the only ad..."


by Eliezer Yudkowsky Dec 30 2015 updated Dec 30 2015

> It seems like the only advantage of the genie is that it doesn't make prediction errors about humans.

Well, YES. This seems to reflect a core disagreement about how hard it probably is to get full, correct predictive coverage of humans using a supervised optimization paradigm, versus how hard it is to, say, ask a conservative low-impact genie to make a burrito and have it make a burrito, even though the genie doesn't and couldn't predict what humans would think about the long-term impact of AI burrito-making on human society or whether making a burrito was truly the right thing to do. I think the latter is plausibly a LOT easier, though still not easy.

My instinctive diagnosis of this core disagreement is something like "Paul is overly inspired by this decade's algorithms and thinks everything labeled 'predicting humans' is equally difficult because it's all just 'generalized supervised learning'", but that is probably a strawman. Even if we're operating primarily on a supervision paradigm rather than a modeling paradigm, I expect differences in how easy it is to get complete coverage of some parts of the problem versus others. I expect that some parts of what humans want are a LOT easier to supervised-learn than others. The whole reason for being interested in e.g. 'low impact' genies is the suspicion that 'try not to have unnecessary impacts in general and plan to do things in a way that minimizes side effects while getting the job done, then check the larger impacts you expect to have', while by no means trivial, will still be a LOT easier to learn or specify to a usable and safe degree than the whole of human value.

> You seem to be imagining a direct way to formulate an imperative like "do no harm" that doesn't involve predicting what the user would describe as a harm or what harm-avoidance strategy the user would advocate; I don't see much hope for that.

If you consider the low-impact paradigm, then the idea is that you can get a lot of the same intended benefit of "do no harm" via "try not to needlessly affect things and tell me about the large effects you do expect so I can check, even if this involves a number of needlessly avoided effects and needless checks" rather than "make a prediction of what I would consider 'harm' and avoid only that, which prediction I know to be good enough that there's no point in my checking your prediction any more". The former isn't trivial and is probably a LOT harder than someone not steeped in edge instantiation problems and unforeseen maxima would expect: if you do it in a naive way, you just end up with the whole universe maximized to minimize 'impact'. But it's plausible to me (>50% probability) that the latter case, which is what Bostrom would call a Sovereign, is a LOT harder to build (and to know that you've built).