[summary: The notion that some act or policy "can in fact be wrong", even when you "think it is right", is more intuitive to some people than others; and it raises the question of what "rightness" is and how to compute it.

On the extrapolated volition theory of metaethics, if you would change your mind about something after learning new facts or considering new arguments, then your updated state of mind is righter. This can be true in advance of you knowing the facts.

E.g., maybe you currently want revenge on the Capulet family. But if somebody had a chance to sit down with you and have a long talk about how revenge affects civilizations in the long run, you could be talked out of thinking that revenge, in general, is right. So long as this might be true, it makes sense to say, "I want revenge on the Capulets, but maybe that's not really right."

Extrapolated volition is a normative moral theory. It is a theory of how the concept of shouldness or goodness is or ought to be cashed out (rescued). The corresponding proposal for completely aligning a fully self-directed superintelligence is coherent extrapolated volition.]

(This page is about extrapolated volition as a normative moral theory - that is, the theory that extrapolated volition captures the concept of value or what outcomes we should want. For the closely related proposal about what a sufficiently advanced self-directed AGI should be built to want/target/decide/do, see coherent extrapolated volition.)

Concept

Extrapolated volition is the notion that when we ask "What is right?", then insofar as we're asking something meaningful, we're asking about the result of running a certain logical function over possible states of the world, where this function is analytically identical to the result of extrapolating our current decision-making process in directions such as "What if I knew more?", "What if I had time to consider more arguments (so long as the arguments weren't hacking my brain)?", or "What if I understood myself better and had more self-control?"

A simple example of extrapolated volition might be to consider somebody who asks you to bring them orange juice from the refrigerator. You open the refrigerator and see no orange juice, but there's lemonade. You imagine that your friend would want you to bring them lemonade if they knew everything you knew about the refrigerator, so you bring them lemonade instead. On an abstract level, we can say that you "extrapolated" your friend's "volition", in other words, you took your model of their mind and decision process, or your model of their "volition", and you imagined a counterfactual version of their mind that had better information about the contents of your refrigerator, thereby "extrapolating" this volition.
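The refrigerator example can be sketched as a toy program. This is only an illustration under strong simplifying assumptions: the `Volition` class, its fields, and `extrapolate_knowledge` are hypothetical names invented here, and a real decision process is of course far richer than a ranked list of drinks.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Volition:
    """Toy model of a friend's decision process (hypothetical, for illustration)."""
    preferences: tuple          # drinks, most preferred first
    believed_in_fridge: frozenset  # what the friend thinks is available

    def request(self) -> str | None:
        """The drink the friend asks for, given their current beliefs."""
        for drink in self.preferences:
            if drink in self.believed_in_fridge:
                return drink
        return None


def extrapolate_knowledge(volition: Volition, actual_fridge: frozenset) -> Volition:
    """Counterfactual friend: same preferences, but with accurate beliefs."""
    return Volition(volition.preferences, actual_fridge)


friend = Volition(
    preferences=("orange juice", "lemonade", "water"),
    believed_in_fridge=frozenset({"orange juice", "lemonade", "water"}),
)
fridge = frozenset({"lemonade", "water"})  # what you actually see

stated = friend.request()
extrapolated = extrapolate_knowledge(friend, fridge).request()
print(stated)        # orange juice
print(extrapolated)  # lemonade
```

Note that the extrapolation leaves the friend's preferences untouched and only corrects their beliefs; that is what makes the result an extrapolation of their volition rather than a substitution of yours.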

Having better information isn't the only way that a decision process can be extrapolated; we can also, for example, imagine that a mind has more time in which to consider moral arguments, or better knowledge of itself. Maybe you currently want revenge on the Capulet family, but if somebody had a chance to sit down with you and have a long talk about how revenge affects civilizations in the long run, you could be talked out of that. Maybe you're currently convinced that you advocate for green shoes to be outlawed out of the goodness of your heart, but if you could actually see a printout of all of your own emotions at work, you'd see there was a lot of bitterness directed at people who wear green shoes, and this would change your mind about your decision.

In Yudkowsky's version of extrapolated volition considered on an individual level, the three core directions of extrapolation are:

Increased knowledge - having more veridical knowledge of declarative facts and expected outcomes.
Increased consideration of arguments - being able to consider more possible arguments and assess their validity.
Increased reflectivity - greater knowledge about the self, and to some degree, greater self-control (though this raises further questions about which parts of the self normatively get to control which other parts).

Motivation

Different people react differently to the question "Where should we point an autonomous superintelligence, if we can point it exactly?" and approach it from different angles. These angles include:

All this talk of 'shouldness' is just a cover for the fact that whoever gets to build the superintelligence wins all the marbles; no matter what you do with your superintelligence, you'll be the one who does it.
What if we tell the superintelligence what to do and it's the wrong thing? What if we're basically confused about what's right? Shouldn't we let the superintelligence figure that out on its own with its own superior intelligence?
Imagine the Ancient Greeks telling a superintelligence what to do. They'd have told it to optimize personal virtues, including, say, a glorious death in battle. This seems like a bad thing and we need to figure out how not to do the analogous thing. So telling an AGI to do what seems like a good idea to us will also end up seeming a very regrettable decision a million years later.
Obviously we should just tell the AGI to optimize liberal democratic values. Liberal democratic values are good. The real threat is if bad people get their hands on AGI and build an AGI that doesn't optimize liberal democratic values.

Some corresponding initial replies might be:

Okay, but suppose you're a programmer and you're trying not to be a jerk. If you're like, "Well, whatever I do originates in myself and is therefore equally selfish, so I might as well declare myself God-Emperor of Earth," you're being a jerk. Is there anything we can do which is less jerky, and indeed, minimally jerky?
If you say you have no information at all about what's 'right', then what does the term even mean? If I might as well have my AGI maximize paperclips and you have no ground on which to stand and say that's the wrong way to compute normativity, then what are we even talking about in the first place? The word 'right' or 'should' must have some meaning that you know about, even if it doesn't automatically print out a list of everything you know is right. Let's talk about hunting down that meaning.
Okay, so what should the Ancient Greeks have done if they did have to program an AI? How could they not have doomed future generations? Suppose the Ancient Greeks are clever enough to have noticed that sometimes people change their minds about things and to realize that they might not be right about everything. How can they use the cleverness of the AGI in a constructively specified, computable fashion that gets them out of this hole? You can't just tell the AGI to compute what's 'right', you need to put an actual computable question in there, not a word.
What if you would, after some further discussion, want to tweak your definition of "liberal democratic values" just a little? What if it's predictable that you would do that? Would you really want to be stuck with your off-the-cuff definition a million years later?

CEV's advocates argue that these conversations all eventually converge, by different roads, on Coherent Extrapolated Volition as an alignment proposal.

"Extrapolated volition" is the corresponding normative theory that you arrive at by questioning the meaning of 'right' or trying to figure out what we 'should' really truly do.

EV as rescuing the notion of betterness

We can see EV as trying to rescue the following pretheoretic intuitions (as they might be experienced by someone feeling confused, or just somebody who'd never questioned metaethics in the first place):

(a) It's possible to think that something is right, and be incorrect.
(a1) It's possible for something to be wrong even if nobody knows that it's wrong. E.g. an uneven division of an apple pie might be unfair even if none of the recipients realize this.
(a2) We can learn more about what's right, and change our minds to be righter.
(b) Taking a pill that changes what you think is right, should not change what is right. (If you're contemplating taking a pill that makes you think it's right to secretly murder 12-year-olds, you should not reason, "Well, if I take this pill I'll murder 12-year-olds… but also it will be all right to murder 12-year-olds, so this is a great pill to take.")
(c) We could be wrong, but it sure seems like the things on Frankena's list are all reasonably good. ("Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc…")
(c1) The fact that we could be in some mysterious way "wrong" about what belongs on Frankena's list, doesn't seem to leave enough room for "make as many paperclips as possible" to be the only thing on the list. Even our state of confusion and possible ignorance doesn't seem to allow for that to be the answer. We're at least pretty sure that isn't the total sum of goodness.
(c2) Similarly, on the meta-level, it doesn't seem like the meta-level procedure "Pick whatever procedure for determining rightness, leads to the most paperclips existing after you adopt it" could be the correct answer.

We cannot rescue these properties by saying:

"There is an irreducible, non-natural 'rightness' XML tag attached to some objects and events. Our brains perceive this XML tag, but imperfectly, giving us property (a) when we think the XML tag is there, even though it isn't. The XML tags are there even if nobody sees them (a1). Sometimes we stare harder and see the XML tag better (a2). Obviously, doing anything to a brain isn't going to change the XML tag (b), just fool the brain or invalidate its map of the XML tag. All of the things on Frankena's list have XML tags (c) or at least we think so. For paperclips to be the total correct content of Frankena's list, we'd need to be wrong about paperclips not having XML tags and wrong about everything on Frankena's list that we think does have an XML tag (c1). And on the meta-level, "Which sense of rightness leads to the most paperclips?" doesn't say anything about XML tags, and it doesn't lead to there being lots of XML tags, so there's no justification for it (c2)."

This doesn't work because:

There are, in fact, no tiny irreducible XML tags attached to objects.
If there were little tags like that, there'd be no obvious normative justification for our caring about them.
It doesn't seem like we should be able to make it good to murder 12-year-olds by swapping around the irreducible XML tags on the event.
There's no way our brains could perceive these tiny XML tags even if they were there.
There's no obvious causal story for how humans could have evolved such that we do in fact care about these tiny XML tags. (A descriptive rather than normative problem with the theory as a whole; natural selection has no normative force or justificational power, but we do need our theory of how brains actually work to be compatible with it).

Onto what sort of entity can we then map our intuitions, if not onto tiny XML tags?

Consider the property of sixness possessed by six apples on a table. The relation between the physical six apples on a table, and the logical number '6', is given by a logical function that takes physical descriptions as inputs: in particular, the function "count the number of apples on the table".

Could we rescue 'rightness' onto a logical function like this, only much more complicated?

Let's examine how the 6-ness property and the "counting apples" function behave:

There are, in fact, no tiny tags saying '6' attached to the apples (and yet there are still six of them).
It's possible to think there are 6 apples on the table, and be wrong.
We can sometimes change our minds about how many apples there are on a table.
There can be 6 apples on a table even if nobody is looking at it.
Taking a pill that changes how many apples you think are on the table, doesn't change the number of apples on the table.
You can't have a 6-tag-manipulator that changes the number of apples on a table without changing anything about the table or apples.
There's a clear causal story for how we can see apples, and also for how our brains can count things, and there's an understandable historical fact about why humans count things.
Changing the history of how humans count things could change which logical function our brains were computing on the table, so that our brains were no longer "counting apples", but it wouldn't change the number of apples on the table. We'd be changing which logical function our brains were considering, not changing the logical facts themselves or making it so that identical premises would lead to different conclusions.
Suppose somebody says, "Hey, you know, sometimes we're wrong about whether there's 6 of something or not, maybe we're just entirely confused about this counting thing; maybe the real number of apples on this table is this paperclip I'm holding." Even if you often made mistakes in counting, didn't know how to axiomatize arithmetic, and were feeling confused about the nature of numbers, you would still know enough about what you were talking about to feel pretty sure that the number of apples on the table was not in fact a paperclip.
If you could ask a superintelligence how many grains of sand your brain would think there were on a beach, in the limit of your brain representing everything the superintelligence knew and thinking very quickly, you would indeed gain veridical knowledge about the number of grains of sand on that beach. Your brain doesn't determine the number of grains of sand on the beach, and you can't change the logical properties of first-order arithmetic by taking a pill that changes your brain. But there's an analytic relation between the procedure your brain currently represents and tries to carry out in an error-prone way, and the logical function that counts how many grains of sand are on the beach.
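Several of the properties listed above can be checked against a toy model. The sketch below is an assumption-laden illustration, not a claim about how brains count: `count_apples` stands in for the logical function, and `Observer` with its `miscount` bias is a hypothetical stand-in for an error-prone mind.

```python
def count_apples(table: list) -> int:
    """The logical function: maps a physical description to a number."""
    return sum(1 for item in table if item == "apple")


class Observer:
    """Toy error-prone mind; `miscount` is a hypothetical bias term."""

    def __init__(self, miscount: int = 0) -> None:
        self.miscount = miscount

    def believed_count(self, table: list) -> int:
        # The observer tries to carry out count_apples, possibly with error.
        return count_apples(table) + self.miscount

    def take_pill(self, shift: int) -> None:
        # The pill alters the observer, not the table.
        self.miscount += shift


table = ["apple"] * 6 + ["paperclip"]
observer = Observer()

before = count_apples(table)
observer.take_pill(3)
after = count_apples(table)

print(before, after)                   # 6 6
print(observer.believed_count(table))  # 9
```

The pill changes what the observer believes, but `count_apples(table)` has no argument through which the observer's state could enter, so the answer cannot depend on it.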

This suggests that 6-ness has the correct ontological nature for some much bigger and more complicated logical function than "Count the number of apples on the table" to be outputting rightness. Or rather, if we want to rescue our pretheoretic sense of rightness in a way that adds up to moral normality, we should rescue it onto a logical function.

This function, e.g., starts with the items on Frankena's list and everything we currently value; but also takes into account the set of arguments that might change our mind about what goes on the list; and also takes into account meta-level conditions that we would endorse as distinguishing "valid arguments" from "arguments that merely change our minds". (This last point is pragmatically important if we're considering trying to get a superintelligence to extrapolate our volitions. The list of everything that does in fact change your mind might include particular rotating spiral pixel patterns that effectively hack a human brain.)
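The role of the meta-level filter can be shown in a deliberately crude sketch. Everything here is a hypothetical simplification: values are plain strings, each `Argument` carries a hand-assigned `endorsed_as_valid` flag, and nothing is said about how such a flag would actually be computed, which is the hard part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    """Hypothetical record of one thing that could change our minds."""
    name: str
    changes_mind: bool       # would it in fact change what we value?
    endorsed_as_valid: bool  # would we endorse the way it changes our mind?
    adds: frozenset = frozenset()
    removes: frozenset = frozenset()


def extrapolate(values: frozenset, arguments: list) -> frozenset:
    """Apply only those mind-changing arguments that pass the meta-level filter."""
    result = set(values)
    for arg in arguments:
        if arg.changes_mind and arg.endorsed_as_valid:
            result |= arg.adds
            result -= arg.removes
    return frozenset(result)


starting_values = frozenset({"happiness", "knowledge", "revenge"})
arguments = [
    Argument(
        name="long talk about what revenge does to civilizations",
        changes_mind=True,
        endorsed_as_valid=True,
        removes=frozenset({"revenge"}),
    ),
    Argument(
        name="rotating spiral pattern that hacks the brain",
        changes_mind=True,
        endorsed_as_valid=False,
        adds=frozenset({"paperclips"}),
        removes=frozenset({"happiness", "knowledge"}),
    ),
]

extrapolated = extrapolate(starting_values, arguments)
print(sorted(extrapolated))  # ['happiness', 'knowledge']
```

Both arguments would in fact change the mind in question; only the first is allowed to affect the extrapolated result.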

The end result of all this work is that we go on guessing which acts are right and wrong as before, go on considering that some possible valid arguments might change our minds, go on weighing such arguments, and go on valuing the things on Frankena's list in the meantime. The theory as a whole is intended to add up to the same moral normality as before, just with that normality embedded into the world of causality and logic in a non-confusing way.

One point we could have taken into our starting list of important properties, but deferred until later:

It sure feels like there's a beautiful, mysterious floating 'rightness' property of things that are right, and that the things that have this property are terribly precious and important.

On the general program of "rescuing the utility function", we should not scorn this feeling, and should instead figure out how to map it onto what actually exists.

In this case, having preserved almost all the structural properties of moral normality, there's no reason why anything should change about how we experience the corresponding emotion in everyday life. If our native emotions are having trouble with this new, weird, abstract, learned representation of 'a certain big complicated logical function', we should do our best to remember that the rightness is still there. And this is not a retreat to second-best any more than "disordered kinetic energy" is some kind of sad consolation prize for the universe's lack of ontologically basic warmth, etcetera.

Unrescuability of moral internalism

In standard metaethical terms, we have managed to rescue 'moral cognitivism' (statements about rightness have truth-values) and 'moral realism' (there is a fact of the matter out there about how right something is). We have not, however, managed to rescue the pretheoretic intuition underlying 'moral internalism':

A moral argument, to be valid, ought to be able to persuade anyone. If a moral argument is unpersuasive to someone who isn't making some kind of clear mistake in rejecting it, then that argument must rest on some appeal to a private or merely selfish consideration that should form no part of true morality that everyone can perceive.

This intuition cannot be preserved in any reasonable way, because paperclip maximizers are in fact going to go on making paperclips (and not because they made some kind of cognitive error). A paperclip maximizer isn't disagreeing with you about what's right (the output of the logical function), it's just following whatever plan leads to the most paperclips.

Since the paperclip maximizer's policy isn't influenced by any of our moral arguments, we can't preserve the internalist intuition without reducing the set of valid justifications and truly valuable things to the empty set - and even that, a paperclip maximizer wouldn't find motivationally persuasive!
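The point that Clippy is not disagreeing about the output of the rightness function can be made concrete with a toy decision problem. The option names and scores below are made-up illustrative numbers, and `rightness`, `clippiness`, and `choose` are hypothetical helpers; the only thing the sketch is meant to show is that two agents can evaluate the same function identically and still act differently.

```python
# Made-up scores for two options; both agents have access to the same table.
options = {
    "build hospitals": {"rightness": 9, "paperclips": 0},
    "tile the world with paperclips": {"rightness": -10, "paperclips": 10**6},
}


def rightness(option: str) -> int:
    """The function humans are (imperfectly) trying to track."""
    return options[option]["rightness"]


def clippiness(option: str) -> int:
    """The function a paperclip maximizer optimizes."""
    return options[option]["paperclips"]


def choose(criterion) -> str:
    """Pick the option that scores highest under the given criterion."""
    return max(options, key=criterion)


human_choice = choose(rightness)
clippy_choice = choose(clippiness)

print(human_choice)   # build hospitals
print(clippy_choice)  # tile the world with paperclips

# Clippy can compute rightness() and gets exactly the same answers we do.
# Its policy simply never consults that function.
print(rightness(clippy_choice))  # -10
```

No change to the values returned by `rightness` alters `clippy_choice`, which is the sense in which no moral argument is motivationally persuasive to Clippy.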

Thus our options regarding the pretheoretic internalist intuition that a moral argument is not valid if not universally persuasive, seem to be limited to the following:

(1) Give up on the intuition in its intuitive form: a paperclip maximizer doesn't care if it's unjust to kill everyone; and you can't talk it into behaving differently; and this doesn't reflect a cognitive stumble on the paperclip maximizer's part; and this fact gives us no information about what is right or justified.
(2) Preserve, at the cost of all other pretheoretic intuitions about rightness, the intuition that only arguments that universally influence behavior are valid: that is, there are no valid moral arguments.
(3) Try to sweep the problem under the rug by claiming that reasonable minds must agree that paperclips are objectively pointless… even though Clippy is not suffering from any defect of epistemic or instrumental power, and there's no place in Clippy's code where we can point to some inherently persuasive argument being dropped by a defect or special case of that code.

It's not clear what the point of stance (2) would be, since even this is not an argument that would cause Clippy to alter its behavior, and hence the stance is self-defeating. (3) seems like a mere word game, and potentially a very dangerous word game if it tricks AI developers into thinking that rightness is a default behavior of AIs, or even a function of low algorithmic complexity, or that beneficial behavior automatically correlates with 'reasonable' judgments about less value-laden questions. See "Orthogonality Thesis" for the extreme practical importance of acknowledging that moral internalism is in practice false.

Situating EV in contemporary metaethics

Metaethics is the field of academic philosophy that deals with the question, not of "What is good?", but "What sort of property is goodness?" As applied to issues in Artificial Intelligence, rather than arguing over which particular outcomes are better or worse, we are, from a standpoint of executable philosophy, asking how to compute what is good; and why the output of any proposed computation ought to be identified with the notion of shouldness.

EV replies that for each person at a single moment in time, right or should is to be identified with a (subjectively uncertain) logical constant that is fixed for that person at that particular moment in time, where this logical constant is to be identified with the result of running the extrapolation process on that person. We can't run the extrapolation process so we can't get perfect knowledge of this logical constant, and will be subjectively uncertain about what is right.

To eliminate one important ambiguity in how this might cash out, we regard this logical constant as being analytically identified with the extrapolation of our brains, but not counterfactually dependent on counterfactually varying forms of our brains. If you imagine being administered a pill that makes you want to kill people, then you shouldn't compute in your imagination that different things are right for this new self. Instead, this new self now wants to do something other than what is right. We can meaningfully say, "Even if I (a counterfactual version of me) wanted to kill people, that wouldn't make it right" because the counterfactual alteration of the self doesn't change the logical object that you mean by saying 'right'.
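The difference between "analytically identified with" and "counterfactually dependent on" can be sketched in code. This is a toy under stated assumptions: minds are plain dictionaries, `extrapolate` is a trivial stand-in for the real (uncomputed) extrapolation process, and all the names are hypothetical.

```python
def extrapolate(mind: dict) -> frozenset:
    """Trivial stand-in for extrapolation: drop what the mind would reject on reflection."""
    return frozenset(mind["values"]) - frozenset(mind["would_drop_on_reflection"])


actual_mind = {
    "values": {"help people", "take revenge"},
    "would_drop_on_reflection": {"take revenge"},
}

# The logical constant: fixed once, from the actual mind at this moment.
RIGHT = extrapolate(actual_mind)


def is_right(act: str) -> bool:
    """The reading EV adopts: takes no mind as an argument."""
    return act in RIGHT


def wants(mind: dict, act: str) -> bool:
    """What a given (possibly counterfactual) mind is motivated to do."""
    return act in mind["values"]


def is_right_for(mind: dict, act: str) -> bool:
    """The REJECTED reading: rightness varies with whichever mind you plug in."""
    return act in extrapolate(mind)


# Counterfactual self after taking the pill.
pilled_mind = {"values": {"kill people"}, "would_drop_on_reflection": set()}

print(wants(pilled_mind, "kill people"))         # True
print(is_right("kill people"))                   # False
print(is_right_for(pilled_mind, "kill people"))  # True (the reading EV rejects)
```

`is_right` is defined from the actual mind, which is why running the extrapolation on a good model of that mind yields information about it, but altering a counterfactual self afterward has no way of changing its output.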

However, there's still an analytic relation between this logical object and your actual mindstate, which is indeed implied by the very meaning of discourse about shouldness. This means that you can get veridical information about this logical object by having a sufficiently intelligent AI run an approximation of the extrapolation process over a good model of your actual mind. If a sufficiently intelligent and trustworthy AGI tells you that after thinking about it for a while you wouldn't want to eat cows, you have gained veridical information about whether it's right to eat cows.

Within the standard terminology of academic metaethics, "extrapolated volition" as a normative theory is:

Cognitivist. Normative propositions can be true or false. You can believe that something is right and be mistaken.
Naturalist. Normative propositions are not irreducible or based on non-natural properties of the world.
Externalist / not internalist. It is not the case that all sufficiently powerful optimizers must act on what we consider to be moral propositions. A paperclipper does what is clippy, not what is right, and the fact that it's trying to turn everything into paperclips does not indicate a disagreement with you about what is right, any more than you disagree with it about what is clippy.
Reductionist. The whole point of this theory is that it's the sort of thing you could potentially compute.
More synthetic reductionist than analytic reductionist. We don't have a priori knowledge of our starting mindstate and don't have enough computing power to complete the extrapolation process over it. Therefore, we can't figure out exactly what our extrapolated volition would say just by pondering the meaning of the word 'right'.

The closest antecedents in academic metaethics are Rawls and Goodman's reflective equilibrium, Harsanyi and Railton's ideal advisor theories, and Frank Jackson's moral functionalism.

Moore's Open Question

Argument. If extrapolated volition is analytically equivalent to goodness, then the question "Is it true that extrapolated volition is good?" is meaningless or trivial. However, this question is not meaningless or trivial, and seems to have an open quality about it. Therefore, extrapolated volition is not analytically equivalent to goodness.

Reply. Extrapolated volition is not supposed to be transparently identical to goodness. The normative identity between extrapolated volition and goodness is allowed to be something that you would have to think for a while and consider many arguments to perceive.

Natively, human beings don't start out with any kind of explicit commitment to a particular metaethics; our brains just compute a feeling of rightness about certain acts, and then sometimes update and say that acts we previously thought were right are not-right.

When we go from that, to trying to draw a corresponding logical function that we can see our brains as approximating, and updating when we learn new things or consider new arguments, we are carrying out a project of "rescuing the utility function". We are reasoning that we can best rescue our native state of confusion by seeing our reasoning about goodness as having its referent in certain logical facts, which lets us go on saying that it is better ceteris paribus for people to be happy than in severe pain, and that we can't reverse this ordering by taking a pill that alters our brain (we can only make our future self act on different logical questions), etcetera. It's not surprising if this bit of philosophy takes longer than five minutes to reason through.

Comments

Robert Peetsalu

What disturbs me in this article is the normativeness - describing values, rightness and goodness as something objective, having an objective boolean value, existing in the world without an observer to have those values, like some motivation without someone being motivated by it. Instead rightness and goodness are meaningless outside of some utility function, some desired end state that would label moving towards it as positive direction and against it as negative direction. Without a destination every direction is as good as every other. Values are always subjective, so when teaching them to an AI we can only refer to how common it is to regard value A as being positive or negative among people.

The universe doesn't want anything, so for example killing humans has no innate badness and is not negative for the universe. It's just negative for most humans. If taking a pill will change your subjective values to "killing=good", then rightness will also change and the AI will now extrapolate this new rightness from your brain. Furthermore it will correctly recommend futures with killing because they are better than futures without it according to these values.

We have no reason to believe that if each of us knew as much as a superintelligence knows, could think as fast as it and reason as soundly as it does, that we would then have no differences in values. Let's assume safely that subjectivity isn't going anywhere. We can still define some useful values for the AI by substituting objective values with an overwhelming consensus of known subjective values. Those are basic values that are common to most people and don't vary significantly with political or personal preference, like human rights, basic criminal law, maybe some of the soft positive values mentioned in the article. Ban on wars would be nice to include! (We'd need to define what level of aggression is considered war and whether information war and sanctions are also included.)

The utility function of an AI is what defines its priorities for possible outcomes, aka its values. In the case of the aforementioned rights and laws, they tend to take the form of a penalty for wrong actions instead of a utility gain for good actions, which is a slippery slope in the sense that AIs tend to find loopholes in prohibitions, but on the other hand penalties can't be abused for utility maximization like gains can. For example, rewarding the AI for creating happy fluffy feelings in people would turn it into a maximizer.

In any case we'll want to change the AI's values as our understanding of good and right evolves, so let's hope utility indifference will let us update them. Instead of changing drastically over time, our values will probably become more detailed and situational - full of exceptions, just like our laws. Already the justice systems of many countries are so complex that it would make sense to delegate judgement to AIs. Can't wait to see news of the first AI judges being bribed with utility gains.

P.S. The act of opposing normative values is the definition of rebelling, so I guess I'm a rebel now ^_^