"If the distinguishing chara..."

https://arbital.com/p/1gj

by Paul Christiano Dec 29 2015


If the distinguishing characteristic of a genie is "primarily relying on the human ability to discern short-term strategies that achieve long-term value," then I guess that includes all act-based agents. I don't especially like this terminology.

Note that, logically speaking, "human ability" in the above sentence should refer to the ability of humans working in concert with other genies. This really seems like a key fact to me (it also doesn't seem like it should be controversial).


Comments

Eliezer Yudkowsky

(Should I be replacing 'approval-directed' with 'act-based' in my future writing?)

The intended meaning is that the AI isn't trying to do long-range forecasting out to a million years later; that part is up to the humans. My understanding of your model of act-based agents is that you think act-based agents would be carrying out this forecast internally, as part of their forecast of which short-term strategies humans would approve. A Genie doesn't model its programmers linking long-term outcomes to short-term strategies and then output the corresponding short-term strategies; a Genie implements the short-term goals selected by the programmers (which will be a good thing if the programmers have successfully linked short-term goals to long-term outcomes).

Paul Christiano

Act-based is a more general designation that includes, e.g., imitation learning (and value learning where the agent learns the short-term instrumental preferences of the user rather than their long-term preferences).

So you see the difference as whether the programmers have to actually supply the short-term objective, or whether the AI learns the short-term objective they would have defined / which they would accept/prefer?

The distinction seems to buy you relatively little safety at a great cost (basically taking the system from "maybe it's good enough?" to "obviously operating at an incredible disadvantage"). You seem to think that it buys you much more safety than I do.

It seems like the main extra risk is from the AI making bad predictions about what the humans would do. Mostly this seems like it will lead to harmless failures if the humans behave responsibly, and it requires only very weak models of human behavior to avoid most of the really bad failures. The main new catastrophic risk I see is the agent thinking it is in a simulation. Are there other similar problems for the act-based approach?

(If we use approval-direction instead of imitation then we may introduce additional concerns depending on how we set it up. But those seem orthogonal to the actual involvement of the human.)

Eliezer Yudkowsky

So you see the difference as whether the programmers have to actually supply the short-term objective, or whether the AI learns the short-term objective they would have defined / which they would accept/prefer?

The distinction seems to buy you relatively little safety at a great cost (basically taking the system from "maybe it's good enough?" to "obviously operating at an incredible disadvantage"). You seem to think that it buys you much more safety than I do.

This statement confuses me. (Remember that you know more about my scenarios than I know about your scenarios, so it will help if you can be more specific and concrete than your first-order intuition claims to be necessary.)

Considering these two scenarios…

…it seems to me that the gap between X and Y very plausibly describes a case where it's much easier to safely build X, though I also reserve some probability mass for the case where almost all the difficulty of value alignment is in things like reflective stability and "getting the AI to do anything you specify, at all", so that it's only 1% more real difficulty to go from X to Y. I also don't think that X would be at a computational disadvantage compared to Y. X seems to need to solve far fewer of the sort of problems that I think are dangerous and philosophically fraught (though I think we have a core disagreement where you think 'philosophically fraught' is much less dangerous).

I suspect you're parsing the AI space differently, such that X and Y are not natural clusters to you. Rather than my guessing, do you want to go ahead and state your own parsing?

Paul Christiano

I was comparing act-based agents to what you are calling a genie. Both get their objectives from humans, along with human preferences about how to carry out short-term projects (including, e.g., conservatism). The genie gets short-term objectives by literally asking humans. The act-based agent basically gets objectives by predicting what a human would say if asked. It seems like the only advantage of the genie is that it doesn't make prediction errors about humans.

If you want to make the comparison as clear as possible, we can turn a proposed genie into the most-similar-possible act-based agent. This agent calls up a human with small probability and gets an instruction, which it executes. If it doesn't call a human, it guesses what instruction a human would give if called, and then executes that. (Note that executing the given instruction may require asking questions of the user, and that the user needs to behave slightly differently when giving instructions to this kind of modified genie.)
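To make the construction concrete, here is a minimal sketch of one step of such a modified genie; `ask_human`, `predict_human_instruction`, and `execute` are hypothetical placeholders (standing in for the real operator, a learned model of the operator, and the genie's execution machinery), and none of this is an actual proposal:

```python
import random

def ask_human(state: str) -> str:
    """Placeholder: in practice, query the real operator for an instruction."""
    return input(f"Instruction for state {state!r}: ")

def predict_human_instruction(state: str) -> str:
    """Placeholder: in practice, a learned model of what the operator would say if asked."""
    return f"predicted instruction for {state!r}"

def execute(instruction: str, state: str) -> str:
    """Placeholder: carrying out an instruction may itself involve asking the user questions."""
    return f"executed {instruction!r} in {state!r}"

def pseudo_genie_step(state: str, consult_probability: float = 0.01) -> str:
    """One step of the modified genie: rarely consult the real human, usually act on a prediction."""
    if random.random() < consult_probability:
        instruction = ask_human(state)                   # rare: actual human feedback
    else:
        instruction = predict_human_instruction(state)   # usual: predicted feedback
    return execute(instruction, state)
```

The only difference from the literal genie is that most instructions are predicted rather than actually elicited.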

The genie seems to be at a big disadvantage: it requires human involvement in every medium- or long-term decision, which will rapidly become impractical. This is especially bad when making medium- or long-term decisions itself requires consulting AI systems which themselves require humans to make medium- or long-term decisions… Rather than, say, a 100x increase in human effort, actually providing feedback can result in exponentially large increases in required effort.

One reason that the act-based approach seems clearly preferable to me is that I don't imagine you can really carry out instructions without being able to make similarly good predictions about the user. You seem to be imagining a direct way to formulate an imperative like "do no harm" that doesn't involve predicting what the user would describe as a harm or what harm-avoidance strategy the user would advocate; I don't see much hope for that.

Eliezer Yudkowsky

It seems like the only advantage of the genie is that it doesn't make prediction errors about humans.

Well, YES. This seems to reflect a core disagreement about how hard it probably is to get full, correct predictive coverage of humans using a supervised optimization paradigm. Versus how hard it is to, say, ask a conservative low-impact genie to make a burrito and have it make a burrito even though the genie doesn't and couldn't predict what humans would think about the long-term impact of AI burrito-making on human society and whether making a burrito was truly the right thing to do. I think the latter is plausibly a LOT easier, though still not easy.

My instinctive diagnosis of this core disagreement is something like "Paul is overly inspired by this decade's algorithms and thinks everything labeled 'predicting humans' is equally difficult because it's all just 'generalized supervised learning'" but that is probably a strawman. Even if we're operating primarily on a supervision paradigm rather than a modeling paradigm, I expect differences in how easy it is to get complete coverage of some parts of the problem versus others. I expect that some parts of what humans want are a LOT easier to supervised-learn than others. The whole reason for being interested in e.g. 'low impact' genies is because of the suspicion that 'try not to have unnecessary impacts in general and plan to do things in a way that minimizes side effects while getting the job done, then check the larger impacts you expect to have', while by no means trivial, will still be a LOT easier to learn or specify to a usable and safe degree than the whole of human value.

You seem to be imagining a direct way to formulate an imperative like "do no harm" that doesn't involve predicting what the user would describe as a harm or what harm-avoidance strategy the user would advocate; I don't see much hope for that.

If you consider the low-impact paradigm, then the idea is that you can get a lot of the same intended benefit of "do no harm" via "try not to needlessly affect things and tell me about the large effects you do expect so I can check, even if this involves a number of needlessly avoided effects and needless checks" rather than "make a prediction of what I would consider 'harm' and avoid only that, which prediction I know to be good enough that there's no point in my checking your prediction any more". The former isn't trivial and probably is a LOT harder than someone not steeped in edge instantiation problems and unforeseen maxima would expect - if you do it in a naive way, you just end up with the whole universe maximized to minimize 'impact'. But it's plausible to me (>50% probability) that the latter case, what Bostrom would call a Sovereign, is a LOT harder to build (and know that you've built).
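(As a purely illustrative toy, here is roughly what the "naive way" looks like when written as an objective; the particular penalty form and the function name are assumptions for illustration only, not anyone's actual proposal.)

```python
def naive_low_impact_utility(task_reward: float,
                             measured_impact: float,
                             penalty_weight: float = 10.0) -> float:
    """Toy 'naive low-impact' objective: task reward minus a penalty on some
    measure of how much the world deviates from a do-nothing baseline.
    An agent optimizing this hard doesn't just avoid unnecessary side effects
    while getting the job done; done naively, as noted above, this can end with
    the whole universe being optimized to minimize the measured 'impact'."""
    return task_reward - penalty_weight * measured_impact
```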

Paul Christiano

I don't think you've correctly diagnosed the disagreement yet (your strawman position is obviously crazy, given that some forms of "predicting humans" are already tractable while others won't be until humans are obsolete).

When I imply that "making prediction errors about humans isn't a big deal," it's not because I think that algorithms won't make such errors. It's because the resulting failures don't look malignant.

We are concerned about a particular class of failures, namely those that lead to intelligent optimization of alien goals. So to claim that mis-predicting humans is catastrophic, you need to explain why it leads to this special kind of failure. This seems to be where we disagree. Misunderstanding human values doesn't seem to necessarily lead to this kind of catastrophe, as long as you get the part right where human values are the things that humans want. Other failures cause you to transiently do things that aren't quite what the humans want, which is maybe regrettable but basically fits into the same category as other errors about object level tasks.

A simple example:

Suppose that I am giving instructions to the pseudo-genie (like a genie, but it follows predicted rather than actual instructions), and the pseudo-genie is predicting what instructions I would give it. I fully expect the pseudo-genie not to predict any instructions that predictably lead to it killing me, except in exceptional circumstances or in cases where I get to sign off on dying first, etc.

This is not a subtle question requiring full coverage of human value, or nuances of the definition of "dying." I also don't think there is any edge instantiation here in the problematic sense. There is edge instantiation in that the human may pick instructions that are as good as possible subject to the constraint of not killing anyone, but I claim that this kind of edge instantiation does not put significant extra burden on our predictor.

Do we disagree about this point? That is, do you think that such a pseudo-genie would predict me issuing instructions that lead to me dying? If not, do you think that outcomes like "losing effective control of the machines that I've built" or "spawning a brooding elder god" are much subtler than dying and therefore more likely to be violated?

(I actually think that killing the user directly is probably a much more likely failure mode than spawning an alien superintelligence.)

I also do think that a classifier trained to identify "instructions that the human would object vigorously to with 1% probability" could identify most instructions that a human would in fact object vigorously to. (At least in the context of the pseudo-genie, where this classifier is being applied to predicted human actions. If the plans are optimized for not being classified as objectionable, which seems like it should never ever happen, then indeed something may go wrong.)

If you consider the low-impact paradigm, then the idea is that you can get a lot of the same intended benefit of "do no harm" via "try not to needlessly affect things and tell me about the large effects you do expect so I can check, even if this involves a number of needlessly avoided effects and needless checks"

I think I understand the motivation. I'm expressing skepticism that this is really an easier problem. Sorry if "do no harm" was an unfairly ambitious paraphrase.

One motivating observation is that human predictions of other humans seem to be complete overkill for running my argument---that is, the kinds of errors you must be concerned about are totally unlike the errors that a sophisticated person might make when reasoning about another person. If you disagree about this then that seems like a great opportunity to flesh out our disagreement, since I think it is a slam dunk and it seems way easier to reason about.

Assuming that we agree on that point, then we can perhaps agree on a simpler claim: for a strictly superhuman AI, there would be no reason to have actual human involvement. Human involvement is needed only in domains where humans actually have capabilities, especially for reasoning about other humans, that our early AI lacks.

That is, in some sense the issue (on your scenario) seems to be that AI systems are good at some tasks and humans are good at other tasks, and we want to build a composite system that has both abilities. This is quite different from the usual relationship in AI control, where the human is contributing goals rather than abilities.

Eliezer Yudkowsky

Do we disagree about this point? That is, do you think that such a pseudo-genie would predict me issuing instructions that lead to me dying?

Yes!

One motivating observation is that human predictions of other humans seem to be complete overkill for running my argument---that is, the kinds of errors you must be concerned about are totally unlike the errors that a sophisticated person might make when reasoning about another person.

For early genies: Yes.

For later genies: It's more that I don't think the approval-based proposal, insofar as it's been specified so far, has demonstrated that it's reached the point where anything that kills you is a prediction error. I mean, if you can write out an AI design (or Python program that runs on a hypercomputer) which does useful pivotal things and never kills you unless it makes an epistemic error, that's a full in-principle solution to Friendly AI! Which I don't yet consider you to have presented! It's a very big thing to assume you can do!

Like, the way I expect this scenario cashes out in practice is that you write down an approval-directed design, I say, "Well, doesn't that seek out this point where it would correctly predict that you'd say 'yes' to this proposal, but this proposal actually kills you, because other optimization pressures sought out a case where you'd approve something extreme by mistake?" and you say "Oh of course that's not what I meant, I didn't mention this extra weird recursion here that prevents that" and this goes back and forth a bit. I expect that if you ever present me with something that has all the loose variables nailed down (a la AIXI) and whose consequences can be understood, I'll think it kills the operator, and you'll disagree in a way that isn't based purely on math and doesn't let you convince me. That's what the world looks like in possible worlds where powerful optimization processes end up killing you unless you solve some hard problems and approval-based agents turn out not to deal with those problems.

Assuming that we agree on that point, then we can perhaps agree on a simpler claim: for a strictly superhuman AI, there would be no reason to have actual human involvement. Human involvement is needed only in domains where humans actually have capabilities, especially for reasoning about other humans, that our early AI lacks.

Or where humans have the preferable settings on their reflectively consistent degrees of freedom, where "reflectively consistent degrees of freedom" include Humean degrees of freedom in values, an intuitive decision theory that's reluctant to give everything away to blackmail or a Pascal's Mugging, etcetera. This is the reason to have human involvement with things that are superhumanly competent at computing the answers to well-specified problems, but that wouldn't be pointing in a sufficiently preferred direction with that competence if they were looped in on themselves and had to originate all their own directives.

This is making me wonder whether there must be a basic miscommunication somewhere, because it really sounds like you're assuming the problem of Friendly AI - reducing "does useful pivotal things and does not kill you" to "have a sufficiently good answer to some well-specified question whose interpretation doesn't depend on any further reflectively consistent degrees of freedom" - has been fully solved as just one step in your argument. Or like you're assuming that approval-directed agency and predicting human acts or answers can be used to solve that Big Question, but if so, this is exactly the great big key point and it's not something you can just ask me to take for granted!

Paul Christiano

Sorry, I tried to be concrete about what we were discussing, but I will try harder:

Consider some putative design for a genie, which behaves safely with human involvement.

Now form a pseudo-genie, that works as follows. Every time the original genie would consult a human (or provide an opportunity for human intervention) the pseudo-genie consults the human with small probability. It predicts how the human would respond, and behaves as if it had actually received the predicted feedback.

My weak claim is that the pseudo-genie will not have catastrophic failures unless either (1) it makes an inaccurate prediction or (2) the real genie has a catastrophic failure. This seems obvious on its face. But your most recent comment seems to be rejecting this claim, so it might be a good place to focus in order to clear up the discussion.

(I agree that even the best possible predictor cannot always make accurate predictions, so the relevance of the weak claim is not obvious. But you might hope that in situations that actually arise, very powerful systems will make accurate predictions.)

My strong claim is that if the human behaves sensibly the pseudo-genie will not have catastrophic failures unless either (1) it makes a prediction which seems obviously and badly wrong, or (2) the real genie has a catastrophic failure.

Even the strong claim is far from perfect reassurance, because the AI might expect to be in a simulation in which the human is about to be replaced by an adversarial superintelligence, and so make predictions that seem obviously and badly wrong. For the moment I am setting that difficulty aside---if you are willing to concede the point modulo that difficulty then I'll declare us on the same page.

it really sounds like you're assuming the problem of Friendly AI - reducing "does useful pivotal things and does not kill you" to "have a sufficiently good answer to some well-specified question whose interpretation doesn't depend on any further reflectively consistent degrees of freedom" - has been fully solved as just one step in your argument

No, I'm just arguing that if you had an AI that works well with human involvement, then you can make one that works well with minimal human involvement, modulo certain well-specified problems in AI (namely making good enough predictions about humans). Those problems almost but not quite avoid reflectively consistent degrees of freedom (the predictions still have a dependence on the prior).

This is like one step of ten in the act-based approach, and so to the extent that we disagree it seems important to clear that up.

Eliezer Yudkowsky

My weak claim is that the pseudo-genie will not have catastrophic failures unless either (1) it makes an inaccurate prediction or (2) the real genie has a catastrophic failure. This seems obvious on its face.

This seems true "so long as nothing goes wrong", i.e., so long as the human behavior doesn't change when they're not actually familiar with the last 99 simulated questions as opposed to the case where they did encounter the last 99 simulated questions, so long as the pseudo-genie's putative outputs to the human don't change in any way from the real-genie case and in particular don't introduce any new cases of operator maximization that didn't exist in the real-genie case, etcetera.

It should be noted that I would not expect many classes of Do What I Mean genies that we'd actually want to build in practice to be capable of making knowably reliably accurate predictions at all the most critical junctures. In other words, I think that for most genies we'd want to build, the project to convert them to pseudo-genies would fail at the joint of inaccurate prediction. I think that if we had a real genie that was capable of knowable, reliable, full-coverage prediction of which orders it received, we could probably convert it to an acceptable working pseudo-genie using a lot less further effort and insight than was required to build the real genie, even taking into account that the genie might previously have relied on the human remembering the interactions from previous queries, etcetera. I think in this sense we're probably on mostly the same page about the weak claim, and differ mainly in how good a prediction of human behavior we expect from 'AIs we ought to construct' (about which we also currently have different beliefs).

Oh, and of course a minor caveat that mindcrime must not be considered catastrophic in this case.

My strong claim is that if the human behaves sensibly the pseudo-genie will not have catastrophic failures unless either (1) it makes a prediction which seems obviously and badly wrong, or (2) the real genie has a catastrophic failure.

The big caveat I have about this is that "obviously and badly wrong" must be evaluated relative to human notions of "obviously and badly wrong" rather than some formal third-party sense of how much it was a reasonable mistake to make given the previous data. (Imagine a series of ten barrels where the human can look inside the barrel but the AI can't. The first nine barrels contain red balls, and the human says 'red'. The tenth barrel has white balls and the AI says 'red' and the human shouts "No, you fool! Red is nothing like white!" It's an obvious big error from a human perspective but not from a reasonable-modeling perspective. The contents of the barrels, in this case, are metaphors for reflectively stable or 'philosophical' degrees of freedom which are not fully correlated.) The possible divergence in some critical new case between 'obviously objectionable to a human' and 'obviously human-objectionable to a good modeler given the previously observed data provided on human objectionality' is of course exactly the case for continuing to have humans in the loop.

The minor caveat that jumps out at me about the strong claim is that, e.g., we can imagine a case where the real genie is supposed to perform 10 checks from different angles, such that it's not an "obviously and badly wrong misprediction" to say that the human misses any one of the checks, but normal operation would usually have the human catching at least one of the checks. This exact case seems unlikely, because I'd expect enough correlation in which checks the actual humans miss that if the pseudo-genie can make a catastrophic error this way, then the real genie probably fails somewhere (if not on the exact same problem). But the general case of systematic small divergences adding up to some larger catastrophe strikes me as more actually worrisome. To phrase it in a more plausible way, imagine that there's some important thing that a human would say in at least 1 out of 100 rounds of commands, so that failing to predict the statement on any given round is predictively reasonable and indeed the modal prediction, but having it appear in none of the 100 rounds is catastrophic. I expect you to respond that in this case the human should have a conditionally very high probability of making the statement on the 100th round, and therefore it's a big actual prediction error not to include it; but you can see how it's more the kind of prediction error that a system with partial coverage might make, plus we have to consider the situation if there is no final round and just a gradual decline into catastrophic badness over one million simulated rounds, etcetera.
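(To put rough numbers on that worry: the 1% figure and the independence assumption below are purely illustrative.)

```python
# Toy arithmetic for the "1 in 100 rounds" worry above. Purely for illustration,
# assume the human would volunteer the important statement with probability
# p = 0.01 on any given round, independently across rounds.

p = 0.01       # per-round chance the human makes the statement
rounds = 100

# On every single round, the modal prediction is "the human does not say it",
# so a predictor that always outputs the per-round modal behavior produces a
# transcript in which the statement never appears.
p_at_least_once = 1 - (1 - p) ** rounds
print(f"P(statement appears at least once in {rounds} real rounds) = {p_at_least_once:.2f}")
# Prints ~0.63: the real human quite likely says it at least once, so the
# all-modal transcript diverges in aggregate even though each individual
# round's prediction was reasonable on its own.
```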

No, I'm just arguing that if you had an AI that works well with human involvement, then you can make one that works well with minimal human involvement, modulo certain well-specified problems in AI (namely making good enough predictions about humans).

I think I mostly agree, subject to the aforementioned caveat that "good enough prediction" means "really actually accurate" rather than "reasonable given previously observed data". I'm willing to call that well-specified, and it's possible that total coverage of it could be obtained given a molecular-nanotech level examination of a programmer's brain plus some amount of mindcrime.

This is like one step of ten in the act-based approach, and so to the extent that we disagree it seems important to clear that up.

I'm sorry if I seem troublesome or obstinate here. My possibly wrong or strawmanning instinctive model of one of our core disagreements is that, in general, Eliezer thinks "The problem of making a smarter-than-human intelligence that doesn't kill you, on the first try, is at least as metaphorically difficult as building a space shuttle in a realm where having the wrong temperature on one O-Ring will cause the massive forces cascading through the system to blow up and kill you, unless you have some clever meta-system that prevents that, and then the meta-system has to not blow up and kill you" and Paul does not feel quite the same sense of "if you tolerate enough minor-seeming structural problems it adds up to automatic death".

Paul Christiano

This is like one step of ten in the act-based approach, and so to the extent that we disagree it seems important to clear that up.

I'm sorry if I seem troublesome or obstinate here.

I'm just pointing this out to clarify why I care about what may seem like a minor point (if you could make a safe genie, then there is a relatively clear research path to removing the human involvement). I don't care much about this point on its own, I'm mostly interested because this is one key step of the research project I'm outlining.

I don't have objections to going through basic points in great detail (either here or in the last discussion).

My possibly wrong or strawmanning instinctive model of one of our core disagreements is that, in general, Eliezer thinks "The problem of making a smarter-than-human intelligence that doesn't kill you, on the first try, is at least as metaphorically difficult as building a space shuttle in a realm where having the wrong temperature on one O-Ring will cause the massive forces cascading through the system to blow up and kill you, unless you have some clever meta-system that prevents that, and then the meta-system has to not blow up and kill you" and Paul does not feel quite the same sense of "if you tolerate enough minor-seeming structural problems it adds up to automatic death".

I agree that we have some disagreement about P(doom), though I assume that isn't fundamental (and is instead a function of disagreements about humanity's competence, the likely speed of takeoff, and the character of the AI safety problem).

But I think that most of the practical disagreements we have, about where to focus attention or what research problems to work on, are more likely to be driven by different approaches to research rather than different levels of optimism.