Coherent extrapolated volition (alignment target)

https://arbital.com/p/cev

by Eliezer Yudkowsky Apr 27 2016 updated Oct 24 2017

A proposed direction for an extremely well-aligned autonomous superintelligence - do what humans would want, if we knew what the AI knew, thought that fast, and understood ourselves.


Introduction

"Coherent extrapolated volition" (CEV) is Eliezer Yudkowsky's proposed thing-to-do with an extremely advanced AGI, if you're extremely confident of your ability to align it on complicated targets.

Roughly, a CEV-based superintelligence would do what currently existing humans would want* the AI to do, if counterfactually:

  1. We knew everything the AI knew;
  2. We could think as fast as the AI and consider all the arguments;
  3. We knew ourselves perfectly and had better self-control or self-modification ability;

to whatever extent most existing humans, thus extrapolated, would predictably want* the same things. (For example, in the limit of extrapolation, nearly all humans might want* not to be turned into paperclips, but might not agree* on the best pizza toppings. See below.)
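The "do only what extrapolated humans would predictably agree on" idea above can be caricatured as a threshold rule. This is purely an illustrative sketch, not a proposed mechanism; the function, the threshold, and the vote numbers are all made up.

```python
# Toy sketch (illustrative only): act only where extrapolated preferences
# predictably agree*. All names and numbers here are hypothetical.

def coherent_actions(extrapolated_votes, threshold=0.9):
    """Return the actions endorsed by at least `threshold` of extrapolated voters.

    extrapolated_votes: dict mapping an action to the fraction of
    extrapolated humans who would want* that action taken.
    """
    return {a for a, support in extrapolated_votes.items() if support >= threshold}

votes = {
    "don't turn humans into paperclips": 0.999,  # near-universal agreement in the limit
    "put pineapple on all pizza": 0.31,          # no predictable agreement
}
print(coherent_actions(votes))  # only the near-universal item survives
```

The point of the caricature: the paperclip question clears any reasonable threshold, the pizza-topping question clears none, and CEV is supposed to act on the former while remaining silent on the latter.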

CEV is meant to be the literally optimal or ideal or normative thing to do with an autonomous superintelligence, if you trust your ability to perfectly align a superintelligence on a very complicated target. (See below.)

CEV is rather complicated and meta and hence not intended as something you'd do with the first AI you ever tried to build. CEV might be something that everyone inside a project agreed was an acceptable mutual target for their second AI. (The first AI should probably be a Task AGI.)

For the corresponding metaethical theory see Extrapolated volition (normative moral theory).

[todo: start splitting the subsections into separate pages, then build learning paths.]

Concept

%%knows-requisite(Extrapolated volition (normative moral theory)):

See "Extrapolated volition (normative moral theory)".

%%

%%!knows-requisite(Extrapolated volition (normative moral theory)):

Extrapolated volition is the metaethical theory that when we ask "What is right?", then insofar as we're asking something meaningful, we're asking "What would a counterfactual idealized version of myself want* if it knew all the facts, had considered all the arguments, and had perfect self-knowledge and self-control?" (As a metaethical theory, this would make "What is right?" a mixed logical and empirical question, a function over possible states of the world.)

A very simple example of extrapolated volition might be to consider somebody who asks you to bring them orange juice from the refrigerator. You open the refrigerator and see no orange juice, but there's lemonade. You imagine that your friend would want you to bring them lemonade if they knew everything you knew about the refrigerator, so you bring them lemonade instead. On an abstract level, we can say that you "extrapolated" your friend's "volition": you took your model of their mind and decision process (their "volition") and imagined a counterfactual version of that mind with better information about the contents of your refrigerator, thereby "extrapolating" this volition.

Having better information isn't the only way that a decision process can be extrapolated; we can also, for example, imagine that a mind has more time in which to consider moral arguments, or better knowledge of itself. Maybe you currently want revenge on the Capulet family, but if somebody had a chance to sit down with you and have a long talk about how revenge affects civilizations in the long run, you could be talked out of that. Maybe you're currently convinced that you advocate for green shoes to be outlawed out of the goodness of your heart, but if you could actually see a printout of all of your own emotions at work, you'd see there was a lot of bitterness directed at people who wear green shoes, and this would change your mind about your decision.

In Yudkowsky's version of extrapolated volition considered on an individual level, the three core directions of extrapolation are:

%%

Motivation

Different people initially react differently to the question "Where should we point a superintelligence?" or "What should an aligned superintelligence do?" - not just different beliefs about what's good, but different frames of mind about how to ask the question.

Some common reactions:

  1. "Different people want different things! There's no way you can give everyone what they want. Even if you pick some way of combining things that people want, you'll be the one saying how to combine it. Someone else might think they should just get the whole world for themselves. Therefore, in the end you're deciding what the AI will do, and any claim to some sort of higher justice or normativity is nothing but sophistry."
  2. "What we should do with an AI is obvious - it should optimize liberal democratic values. That already takes into account everyone's interests in a fair way. The real threat is if bad people get their hands on an AGI and build an AGI that doesn't optimize liberal democratic values."
  3. "Imagine the ancient Greeks telling a superintelligence what to do. They'd have told it to optimize for glorious deaths in battle. Programming any other set of inflexible goals into a superintelligence seems equally stupid; it has to be able to change and grow."
  4. "What if we tell the superintelligence what to do and it's the wrong thing? What if we're basically confused about what's right? Shouldn't we let the superintelligence figure that out on its own, with its assumed superior intelligence?"

An initial response to each of these frames might be:

  1. "Okay, but suppose you're building a superintelligence and you're trying not to be a jerk about it. If you say, 'Whatever I do originates in myself, and therefore is equally selfish, so I might as well declare myself God-Emperor of the Universe' then you're being a jerk. Is there anything you could do instead which would be less like being a jerk? What's the least jerky thing you could do?"
  2. "What if you would, after some further discussion, want to tweak your definition of 'liberal democratic values' just a little? What if it's predictable that you would do that? Would you really want to be stuck with your off-the-cuff definition a million years later?"
  3. "Okay, so what should the Ancient Greeks have done if they did have to program an AI? How could they not have doomed future generations? Suppose the Ancient Greeks are clever enough to have noticed that sometimes people change their minds about things and to realize that they might not be right about everything. How can they use the cleverness of the AGI in a constructively specified, computable fashion that gets them out of this hole? You can't just tell the AGI to compute what's 'right', you need to put an actual computable question in there, not a word."
  4. "You asked, what if we're basically confused about what's right - well, in that case, what does the word 'right' even mean? If you don't know what's right, and you don't know how to compute what's right, then what are we even talking about? Do you have any ground on which to say that an AGI which only asks 'Which outcome leads to the greatest number of paperclips?' isn't computing rightness? If you don't think a paperclip maximizer is computing rightness, then you must know something about the rightness-question which excludes that possibility - so let's talk about how to program that rightness-question into an AGI."

Arguendo by CEV's advocates, all of these lines of discussion eventually end up converging on the idea of coherent extrapolated volition. For example:

  1. Asking what everyone would want* if they knew what the AI knew, and doing what they'd all predictably agree on, is just about the least jerky thing you can do. If you tell the AI to give everyone a volcano lair because you think volcano lairs are neat, you're not being selfish, but you're being a jerk to everyone who doesn't want a volcano lair. If you have the AI just do what people actually say, they'll end up hurting themselves with dumb wishes and you'd be a jerk. If you only extrapolate your friends and have the AI do what only you'd want, you're being jerks to everyone else.
  2. Yes, liberal democratic values are good; so is apple pie. Apple pie is a good thing but it's not the only good thing. William Frankena's list of ends-in-themselves included "Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment" and then 25 more items, and the list certainly isn't complete. The only way you're going to get a complete list is by analyzing human minds; and even then, if our descendants would predictably want something else a million years later, we ought to take that into account too.
  3. Every improvement is a change, but not every change is an improvement. Just letting a superintelligence change at random doesn't encapsulate moral progress. Saying that change toward more liberal democratic values is progress, presumes that we already know the destination or answer. We can't even just ask the AGI to predict what civilizations would think a thousand years later, since (a) the AI itself impacts this and (b) if the AI did nothing, maybe in a thousand years everyone would have accidentally blissed themselves out while trying to modify their own brains. If we want to do better than the hypothetical ancient Greeks, we need to define a sufficiently abstract and meta criterion that describes valid directions of progress - such as changes in moral beliefs associated with learning new facts, for example; or moral change that would predictably occur if we considered a larger set of arguments; or moral change that would predictably occur if we understood ourselves better.
  4. This one is a long story: Metaethics deals with the question of what sort of entity 'rightness' is exactly - tries to reconcile this strange ineffable 'rightness' business with a universe made out of particle fields. Even though it seems like human beings wanting to murder people wouldn't make murder right, there's also nowhere in the stars or mountains where we can actually find it written that murder is wrong. At the end of a rather long discussion, we decide that for any given person speaking at a given point in time, 'rightness' is a logical constant which, although not counterfactually dependent on the state of the person's brain, must be analytically identified with the extrapolated volition of that brain; and we show that (only) this stance gives consistent answers to all the standard questions in metaethics. (This discussion takes a while, on the order of explaining how deterministic laws of physics don't show that you have unfree will.)

(To do: Write dialogues from each of these four entrance points.) [todo: write these and split them up into separate subpages]

Situating CEV in contemporary metaethics

See the corresponding section in "Extrapolated volition (normative moral theory)".

Scary design challenges

There are several reasons why CEV is way too challenging to be a good target for any project's first try at building machine intelligence:

  1. A CEV agent would be intended to carry out an autonomous open-ended mission. This implies all the usual reasons we expect an autonomous AI to be harder to make safe than a Task AGI.
  2. CEV is a weird goal. It involves recursion.
  3. Even the terms in CEV, like "know more" or "extrapolate a human", seem complicated and value-laden. You might have to build a high-level Do What I Know I Mean agent, and then tell it to do CEV. Do What I Know I Mean is complicated enough that you'd need to build an AI that can learn DWIKIM, so that DWIKIM can be taught rather than formally specified. So we're looking at something like CEV, running on top of DWIKIM, running on top of a goal-learning system, at least until the first time the CEV agent rewrites itself.

Doing this correctly the very first time we build a smarter-than-human intelligence seems improbable. The only way this would make a good first target is if the CEV concept is formally simpler than it currently seems, and timelines to AGI are unusually long and permit a great deal of advance work on safety.

If AGI is 20 years out (or less), it seems wiser to think in terms of a Task AGI performing some relatively simple pivotal act. The role of CEV is then to answer the question, "What can you all agree in advance that you'll try to do next, after you've executed your Task AGI and gotten out from under the shadow of immediate doom?"

What if CEV fails to cohere?

A frequently asked question is "What if extrapolating human volitions produces incoherent answers?"

Arguendo according to the original motivation for CEV, if this happens in some places, a Friendly AI ought to ignore those places. If it happens everywhere, you probably picked a silly way to construe an extrapolated volition and you ought to rethink it. %note: Albeit in practice, you would not want an AI project to take a dozen tries at defining CEV. This would indicate something extremely wrong about the method being used to generate suggested answers. Whatever final attempt passed would probably be the first answer whose remaining flaws happened to be hidden, rather than an answer with all flaws eliminated.%

That is:

The original motivation for CEV can also be viewed from the perspective of "What is it to help someone?" and "How can one help a large group of people?", where the intent behind the question is to build an AI that renders 'help' as we really intend that. The elements of CEV can be seen as caveats to the naive notion of "Help is giving people whatever they ask you for!" in which somebody asks you to bring them orange juice but the orange juice in the refrigerator is poisonous (and they're not trying to poison themselves).

What about helping a group of people? If two people ask for juice and you can only bring one kind of juice, you should bring a non-poisonous kind of juice they'd both like, to the extent any such juice exists. If no such juice exists, find a kind of juice that one of them is meh about and that the other one likes, and flip a coin or something to decide who wins. You are then being about as helpful as it is possible to be.

Can there be no way to help a large group of people? This seems implausible. You could at least give the starving ones pizza with a kind of pizza topping they currently like. To the extent your philosophy claims "Oh noes even that is not helping because it's not perfectly coherent," you have picked the wrong construal of 'helping'.

It could be that, if we find that every reasonable-sounding construal of extrapolated volition fails to cohere, we must arrive at some entirely other notion of 'helping'. But then this new form of helping also shouldn't involve bringing people poisonous orange juice that they don't know is poisoned, because that still intuitively seems unhelpful.

Helping people with incoherent preferences

What if somebody believes themselves to prefer onions to pineapple on their pizza, prefer pineapple to mushrooms, and prefer mushrooms to onions? In the sense that, offered any two slices from this set, they would pick according to the given ordering?

(This isn't an unrealistic example. Numerous experiments in behavioral economics demonstrate exactly this sort of circular preference. For instance, you can arrange 3 items such that each pair of them brings a different salient quality into focus for comparison.)

One may worry that we couldn't 'coherently extrapolate the volition' of somebody with these pizza preferences, since these local choices obviously aren't consistent with any coherent utility function. But how could we help somebody with a pizza preference like this?
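The claim that these circular choices "aren't consistent with any coherent utility function" can be checked mechanically: any utility function u would need u(onions) > u(pineapple) > u(mushrooms) > u(onions), which is a contradiction, and this shows up as a cycle in the preference graph. A minimal sketch (my formulation, not anything from the original text):

```python
# Check whether a set of pairwise preferences admits any utility function.
# A consistent assignment of real-valued utilities exists iff the directed
# "strictly preferred to" graph has no cycle; we test that with DFS.

def has_consistent_utility(preferences):
    """preferences: list of (better, worse) pairs.
    Returns True iff some utility function satisfies every pair."""
    graph = {}
    for better, worse in preferences:
        graph.setdefault(better, []).append(worse)
        graph.setdefault(worse, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = {node: WHITE for node in graph}

    def cyclic(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and cyclic(nxt)):
                return True
        color[node] = BLACK
        return False

    return not any(color[n] == WHITE and cyclic(n) for n in graph)

pizza = [("onions", "pineapple"), ("pineapple", "mushrooms"), ("mushrooms", "onions")]
print(has_consistent_utility(pizza))  # False: the cycle rules out any utility function
```

Dropping any one of the three pairs breaks the cycle and makes the remaining preferences satisfiable, which is one way of seeing why the extrapolation step has to do *something* nontrivial with preferences like these rather than just reading off a utility function.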

Well, appealing to the intuitive notion of helping:

Conversely, these alternatives seem less helpful:

Arguendo by advocates of CEV: If you blank the complexities of 'extrapolated volition' out of your mind; and ask how you could reasonably help people as best as possible if you were trying not to be a jerk; and then try to figure out how to semiformalize whatever mental procedure you just followed to arrive at your answer for how to help people; then you will eventually end up at CEV again.

Role of meta-ideals in promoting early agreement

A primary purpose of CEV is to represent a relatively simple meta-level ideal that people can agree upon, even where they might disagree on the object level. By a hopefully analogous example, two honest scientists might disagree on the correct mass of an electron, but agree that the experimental method is a good way to resolve the answer.

Imagine Millikan believes an electron's mass is 9.1e-28 grams, and Nannikan believes the correct electron mass is 9.1e-34 grams. Millikan might be very worried about Nannikan's proposal to program an AI to believe the electron mass is 9.1e-34 grams; Nannikan doesn't like Millikan's proposal to program in 9.1e-28; and both of them would be unhappy with a compromise mass of 9.1e-31 grams. They might still agree on programming an AI with some analogue of probability theory and a simplicity prior, and letting a superintelligence come to the conclusions implied by Bayes and Occam, because the two can agree on an effectively computable question even though they think the question has different answers. Of course, this is easier to agree on when the AI hasn't yet produced an answer, or if the AI doesn't tell you the answer.
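The Millikan/Nannikan point can be illustrated numerically: two agents with sharply different priors, who agree only on the update procedure, converge once shared evidence comes in. This is a hedged toy illustration; the two mass hypotheses, the arbitrary units, and the Gaussian noise model are all invented for the example.

```python
# Hedged illustration: agree on the *procedure* (Bayesian updating with an
# agreed noise model), disagree on priors, and let evidence decide.
# All numbers are made up and in arbitrary units, not real electron masses.

import math

def posterior(prior, hypotheses, observations, noise=0.5):
    """Update a discrete prior over hypothesized values, assuming each
    observation is the true value plus Gaussian noise of std dev `noise`.
    Works in log space to avoid underflow; returns normalized posterior."""
    logpost = [math.log(p) for p in prior]
    for obs in observations:
        for i, h in enumerate(hypotheses):
            logpost[i] += -((obs - h) ** 2) / (2 * noise ** 2)
    z = max(logpost)
    weights = [math.exp(lp - z) for lp in logpost]
    total = sum(weights)
    return [w / total for w in weights]

hypotheses = [2.0, 8.0]            # the two disputed values (arbitrary units)
observations = [2.1, 1.9, 2.0]     # shared experimental data

millikan = posterior([0.9, 0.1], hypotheses, observations)  # prior favors 2.0
nannikan = posterior([0.1, 0.9], hypotheses, observations)  # prior favors 8.0
print(millikan[0], nannikan[0])    # both end up near 1.0 on the same answer
```

The design point mirrors the text: before the data arrives, Millikan and Nannikan can sign off on `posterior` itself, because it is an effectively computable question whose answer neither of them has yet seen.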

It's not guaranteed that every human embodies the same implicit moral questions; indeed, this seems unlikely. This means that Alice and Bob might still expect their extrapolated volitions to disagree about things. Even so, while the outputs are still abstract and not-yet-computed, Alice doesn't have much of a place to stand on which to appeal to Carol, Dennis, and Evelyn by saying, "But as a matter of morality and justice, you should have the AI implement my extrapolated volition, not Bob's!" To appeal to Carol, Dennis, and Evelyn about this, you'd need them to believe that Alice's EV was more likely to agree with their EVs than Bob's was - and at that point, why not come together on the obvious Schelling point of extrapolating everyone's EVs?

Thus, one of the primary purposes of CEV (selling points, design goals) is that it's something that Alice, Bob, and Carol can agree now that Dennis and Evelyn should do with an AI that will be developed later; we can try to set up commitment mechanisms now, or check-and-balance mechanisms now, to ensure that Dennis and Evelyn are still working on CEV later.

Role of 'coherence' in reducing expected unresolvable disagreements

A CEV is not necessarily a majority vote. A lot of people with an extrapolated weak preference* might be counterbalanced by a few people with a strong extrapolated preference* in the opposite direction. Nick Bostrom's "parliamentary model" for resolving uncertainty between incommensurable ethical theories, permits a subtheory very concerned about a decision to spend a large amount of its limited influence on influencing that particular decision.

This means that, e.g., a vegan or animal-rights activist should not need to expect that they must seize control of a CEV algorithm in order for the result of CEV to protect animals. It doesn't seem like most of humanity would be deriving huge amounts of utility from hurting animals in a post-superintelligence scenario, so even a small part of the population that strongly opposes* this scenario should be decisive in preventing it.
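The intensity-weighting idea above can be caricatured in a few lines. To be clear, this is not Bostrom's actual parliamentary mechanism (which involves delegates trading limited influence across many decisions); it is just a minimal sketch of the narrower point that outcomes can track total strength of preference rather than headcount.

```python
# Rough sketch (not Bostrom's actual parliamentary model): pick the option
# with the greatest total preference intensity, so a small group with
# strong preferences* can outweigh a large, nearly indifferent majority.

def collective_choice(population):
    """population: list of (preferred_option, intensity) with intensity >= 0.
    Returns the option with the highest summed intensity."""
    totals = {}
    for option, intensity in population:
        totals[option] = totals.get(option, 0.0) + intensity
    return max(totals, key=totals.get)

# 90 voters mildly prefer one outcome; 10 voters strongly oppose* it:
population = [("hurt animals", 0.01)] * 90 + [("protect animals", 1.0)] * 10
print(collective_choice(population))  # the strongly-held minority preference wins
```

This matches the vegan example in the text: 90 people deriving almost no utility from an outcome contribute total weight 0.9, while 10 people who strongly oppose* it contribute 10.0, so the minority is decisive without needing to seize control of the mechanism.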

Moral hazard vs. debugging

One of the points of the CEV proposal is to have minimal moral hazard (aka, not tempting the programmers to take over the world or the future); but this may be compromised if CEV's results don't go literally unchecked.

Part of the purpose of CEV is to stand as an answer to the question, "If the ancient Greeks had been the ones to invent superintelligence, what could they have done that would not, from our later perspective, irretrievably warp the future? If the ancient Greeks had programmed in their own values directly, they would have programmed in a glorious death in combat. Now let us consider that perhaps we too are not so wise." We can imagine the ancient Greeks writing a CEV mechanism, peeking at the result of this CEV mechanism before implementing it, and being horrified by the lack of glorious-deaths-in-combat in the future and value system thus revealed.

We can also imagine that the Greeks, trying to cut down on moral hazard, virtuously refuse to peek at the output; but it turns out that their version of CEV has some unforeseen behavior when actually run by a superintelligence, and so their world is turned into paperclips.

This is a safety-vs.-moral-hazard tradeoff between (a) the benefit of being able to look at CEV outputs in order to better-train the system or just verify that nothing went horribly wrong; and (b) the moral hazard that comes from the temptation to override the output, thus defeating the point of having a CEV mechanism in the first place.

There's also a potential safety hazard just with looking at the internals of a CEV algorithm; the simulated future could contain all sorts of directly mind-hacking cognitive hazards.

Rather than giving up entirely and embracing maximum moral hazard, one possible approach to this issue might be to have some single human that is supposed to peek at the output and provide a 1 or 0 (proceed or stop) judgment to the mechanism, without any other information flow being allowed to the programmers if the human outputs 0. (For example, the volunteer might be in a room with explosives that go off if 0 is output.)

"Selfish bastards" problem

Suppose that Fred is funding Grace to work on a CEV-based superintelligence; and Evelyn has decided not to oppose this project. The resulting CEV is meant to extrapolate the volitions of Alice, Bob, Carol, Dennis, Evelyn, Fred, and Grace with equal weight. (If you're reading this, you're more than usually likely to be one of Evelyn, Fred, or Grace.)

Evelyn and Fred and Grace might worry: "What if a supermajority of humanity consists of 'selfish* bastards', such that their extrapolated volitions would cheerfully vote* for a world in which it was legal to own artificial sapient beings as slaves so long as they personally happened to be in the slaveowning class; and we, Evelyn and Fred and Grace, just happen to be in the minority that strongly doesn't want, and wouldn't want*, the future to be like that?"

That is: What if humanity's extrapolated volitions diverge in such a way that from the standpoint of our volitions - since, if you're reading this, you're unusually likely to be one of Evelyn or Fred or Grace - 90% of extrapolated humanity would choose* something such that we would not approve of it, and our volitions would not approve* of it, even after taking into account that we don't want to be jerks about it and that we don't think we were born with any unusual or exceptional right to determine the fate of humanity?

That is, let the scenario be as follows:

90% of the people (but not we who are collectively sponsoring the AI) are selfish bastards at the core, such that any reasonable extrapolation process (it's not just that we picked a broken one) would lead to them endorsing a world in which they themselves had rights, but it was okay to create artificial people and hurt them. Furthermore, they would derive enough utility from being personal God-Emperors that this would override our minority objection even in a parliamentary model.

We can see this hypothetical outcome as potentially undermining every sort of reason that we, who happen to be in a position of control to prevent that outcome, should voluntarily relinquish that control to the remaining 90% of humanity:

Rather than giving up entirely and taking over the world, or exposing ourselves to moral hazard by peeking at the results, one possible approach to this issue might be to run a three-stage process.

This process involves some internal references, so the detailed explanation needs to follow a shorter summary explanation.

In summary:

In detail:

The particular fallback of "kick out from the extrapolation any weighted portions of extrapolated decision processes that would act unilaterally and without caring for others, given unchecked power" is meant to have a property of poetic justice, or rendering objections to it self-defeating: If it's okay to act unilaterally, then why can't we unilaterally kick out the unilateral parts? This is meant to be the 'simplest' or most 'elegant' way of kicking out a part of the CEV whose internal reasoning directly opposes the whole reason we ran CEV in the first place, but imposing the minimum possible filter beyond that.

Thus if Alice (who by hypothesis is not in any way a contributor) says, "But I demand you altruistically include the extrapolation of me that would unilaterally act against you if it had power!" then we reply, "We'll try that, but if it turns out to be a sufficiently bad idea, there's no coherent interpersonal grounds on which you can rebuke us for taking the fallback option instead."

Similarly in regards to the Fail option at the end, to anyone who says, "Fairness demands that you run Fallback CEV even if you wouldn't like* it!" we can reply, "Our own power may not be used against us; if we'd regret ever having built the thing, fairness doesn't oblige us to run it."

Why base CEV on "existing humans" and not some other class of extrapolees?

One frequently asked question about the implementation details of CEV is either:

In particular, it's been asked why restrictive answers to Question 1 don't also imply the more restrictive answer to Question 2.

Why not include mammals?

We'll start by considering some replies to the question, "Why not include all mammals into CEV's extrapolation base?"

To expand on this last consideration, we can reply: "Even if you would regard it as more just to have the right animal-protecting outcome baked into the future immediately, so that your EV didn't need to expend some of its voting strength on assuring it, not everyone else might regard that as just. From our perspective as programmers we have no particular reason to listen to you rather than Alice. We're not arguing about whether animals will be protected if a minority vegan-type subpopulation strongly want* that and the rest of humanity doesn't care*. We're arguing about whether, if you want* that but a majority doesn't, your EV should justly need to expend some negotiating strength in order to make sure animals are protected. This seems pretty reasonable to us as programmers from our standpoint of wanting to be fair, not be jerks, and not start any slap-fights over world domination."

This third reply is particularly important because taken in isolation, the first two replies of "You could be wrong about that being a good idea" and "Even if you care about their welfare, maybe you wouldn't like their EVs" could equally apply to argue that contributors to the CEV project ought to extrapolate only their own volitions and not the rest of humanity:

The proposed way of addressing this was to run a composite CEV with a Contributor-CEV check and a Fallback-CEV fallback. But then why not run an Animal-CEV with a Contributor-CEV check before trying the Everyone-CEV?

One answer would go back to the third reply above: Nonhuman mammals aren't sponsoring the CEV project, allowing it to pass, or potentially getting angry at people who want to take over the world with no seeming concern for fairness. So they aren't part of the Schelling Point for "everyone gets an extrapolated vote".

Why not extrapolate all sapients?

Similarly if we ask: "Why not include all sapient beings that the SI suspects to exist everywhere in the measure-weighted multiverse?"

Why not extrapolate deceased humans?

"Why not include all deceased human beings as well as all currently living humans?"

In this case, we can't then reply that they didn't contribute to the human project (e.g. I. J. Good). Their EVs are also less likely to be alien than in any other case considered above.

But again, we fall back on the third reply: "The people who are still alive" is a simple Schelling circle to draw that includes everyone in the current political process. To the extent it would be nice or fair to extrapolate Leo Szilard and include him, we can do that if a supermajority of EVs decide* that this would be nice or just. To the extent we don't bake this decision into the model, Leo Szilard won't rise from the grave and rebuke us. This seems like reason enough to regard "The people who are still alive" as a simple and obvious extrapolation base.

Why include people who are powerless?

"Why include very young children, uncontacted tribes who've never heard about AI, and retrievable cryonics patients (if any)? They can't, in their current state, vote for or against anything."


Comments

Paul Christiano

Even so, while the outputs are still abstract and not-yet-computed, Alice doesn't have much of a place to stand on which to appeal to Carol, Dennis, and Evelyn by saying, "But as a matter of morality and justice, you should have the AI implement my extrapolated volition, not Bob's!"

They may not have a moral argument, but they can surely have an argument.

And so on, this is a tiny fraction of the plausible alternatives. I don't really think that any is a strong Schelling point, and certainly none is so strong that you can't argue for one of the others.

You say that the purpose of not being a jerk is so that people can cooperate, rather than turning the development of AI into a conflict. If that's your goal, wouldn't the default approach be to give each individual enough influence to ensure that they have no incentive to defect? If you try to assign weight democratically, you are massively reducing the influence of many particular individuals, including almost every researcher, investor, and regulator. That does not seem like the most natural recipe for eliminating conflict!

As another way of putting it, suppose that I was to be made dictator of the world tomorrow. What should I do, if I wanted to not be a jerk? One proposal is to redistribute all resources equally amongst living humans. Another is to do nothing. People will justifiably object to both, I don't think there is a simple story about which is right (setting aside pragmatic concerns about feasibility).

You can try to get out of this, by claiming that the pie is going to grow so much that this kind of conflict is a non-issue. I think that's true to the extent that people just want to live happy, normal lives. But many people have preferences over what happens in the world, not only about their own lives. From an aggregative altruistic perspective these are the preferences that are really important, and they are almost necessarily in tension since realizing any of them demands some resources.

Robert Peetsalu

Personal vs Global CEV could also be mentioned here.

Upon reading the ideal advisor theories paper, an idea came to mind about how to protect CEV from Sobel's fourth objection, in which the ideal advisor recommends actions that would lead to death because it knows that its original self would want to commit suicide after seeing how inferior and hopeless their life is compared to a perfect self. If we limit the "better version of ourselves" to having only superior knowledge and skills, and nothing that we couldn't obtain ourselves given enough time and resources, then it wouldn't view us as disabled or hopeless, only misinformed. Hence there would be a way out, and the perfectly informed self would also know all the ways to improve the situation. So it wouldn't recommend a merciful death unless the original self already had suicidal tendencies. What a nice topic to discuss =P