Extrapolated volition (normative moral theory)

https://arbital.com/p/normative_extrapolated_volition

by Eliezer Yudkowsky Apr 1 2016 updated Jan 7 2017

If someone asks you for orange juice, and you know that the refrigerator contains no orange juice, should you bring them lemonade?


[summary: The notion that some act or policy "can in fact be wrong", even when you "think it is right", is more intuitive to some people than others; and it raises the question of what "rightness" is and how to compute it.

On the extrapolated volition theory of metaethics, if you would change your mind about something after learning new facts or considering new arguments, then your updated state of mind is righter. This can be true in advance of you knowing the facts.

E.g., maybe you currently want revenge on the Capulet family. But if somebody had a chance to sit down with you and have a long talk about how revenge affects civilizations in the long run, you could be talked out of thinking that revenge, in general, is right. So long as this might be true, it makes sense to say, "I want revenge on the Capulets, but maybe that's not really right."

Extrapolated volition is a normative moral theory. It is a theory of how the concept of shouldness or goodness is or ought to be cashed out (rescued). The corresponding proposal for completely aligning a fully self-directed superintelligence is coherent extrapolated volition.]

(This page is about extrapolated volition as a normative moral theory - that is, the theory that extrapolated volition captures the concept of value or what outcomes we should want. For the closely related proposal about what a sufficiently advanced self-directed AGI should be built to want/target/decide/do, see coherent extrapolated volition.)

Concept

Extrapolated volition is the notion that when we ask "What is right?", then insofar as we're asking something meaningful, we're asking about the result of running a certain logical function over possible states of the world, where this function is analytically identical to the result of extrapolating our current decision-making process in directions such as "What if I knew more?", "What if I had time to consider more arguments (so long as the arguments weren't hacking my brain)?", or "What if I understood myself better and had more self-control?"

A simple example of extrapolated volition might be to consider somebody who asks you to bring them orange juice from the refrigerator. You open the refrigerator and see no orange juice, but there's lemonade. You imagine that your friend would want you to bring them lemonade if they knew everything you knew about the refrigerator, so you bring them lemonade instead. On an abstract level, we can say that you "extrapolated" your friend's "volition": that is, you took your model of their mind and decision process - your model of their "volition" - and imagined a counterfactual version of their mind that had better information about the contents of your refrigerator, thereby "extrapolating" this volition.
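As a deliberately trivial sketch of this kind of extrapolation (the toy preference ordering and the function below are illustrative stand-ins, not part of the theory), extrapolating the friend's request amounts to re-running their decision process with the better information you have about the refrigerator:

```python
# Toy sketch: "extrapolating" a friend's request means re-running their
# decision process with better information than they actually had.

def friends_choice(believed_fridge_contents):
    """A toy model of the friend's decision process: pick the most preferred
    drink among what they believe is available."""
    preferences = ["orange juice", "lemonade", "water"]  # most to least preferred
    for drink in preferences:
        if drink in believed_fridge_contents:
            return drink
    return None

friends_beliefs = {"orange juice", "lemonade"}   # what the friend assumes is in the fridge
actual_contents = {"lemonade", "water"}          # what you see when you open it

literal_request = friends_choice(friends_beliefs)        # "orange juice"
extrapolated_request = friends_choice(actual_contents)   # "lemonade"
print(literal_request, "->", extrapolated_request)
```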

Having better information isn't the only way that a decision process can be extrapolated; we can also, for example, imagine that a mind has more time in which to consider moral arguments, or better knowledge of itself. Maybe you currently want revenge on the Capulet family, but if somebody had a chance to sit down with you and have a long talk about how revenge affects civilizations in the long run, you could be talked out of that. Maybe you're currently convinced that you advocate for green shoes to be outlawed out of the goodness of your heart, but if you could actually see a printout of all of your own emotions at work, you'd see there was a lot of bitterness directed at people who wear green shoes, and seeing that would change your decision.

In Yudkowsky's version of extrapolated volition considered on an individual level, the three core directions of extrapolation are:

  1. Increased knowledge - what you would want if you knew more, and your beliefs were more accurate.
  2. Increased consideration of arguments - what you would want if you had time to consider many more arguments, so long as the arguments weren't hacking your brain.
  3. Increased reflectivity - what you would want if you understood yourself better and had more self-knowledge and self-control.

Motivation

Different people react differently to the question "Where should we point an autonomous superintelligence, if we can point it exactly?" and approach it from different angles. [todo: and we'll eventually need an Arbital dispatching questionnaire on a page that handles it] These angles include:

Some corresponding initial replies might be:

Arguendo (so say CEV's advocates), these conversations all eventually converge, by different roads, on Coherent Extrapolated Volition as an alignment proposal.

"Extrapolated volition" is the corresponding normative theory that you arrive at by questioning the meaning of 'right' or trying to figure out what we 'should' really truly do.

EV as rescuing the notion of betterness

We can see EV as trying to rescue the following pretheoretic intuitions (as they might be experienced by someone feeling confused, or just somebody who'd never questioned metaethics in the first place):

  - (a) It's possible to think that something is right, and be mistaken; what we believe to be right isn't automatically right.
  - (a1) Things can be right or wrong even if nobody is around to believe so.
  - (a2) By thinking longer and more carefully, we can sometimes get a better view of what is right.
  - (b) Doing something to a brain - say, administering a pill that makes its owner enjoy killing - doesn't change what is right; it only changes what that brain believes or wants.
  - (c) The things on Frankena's list of goods really are good, or at least we think so.
  - (c1) It isn't plausible that the true and complete content of goodness is just paperclips.
  - (c2) "Which sense of rightness leads to the most paperclips?" is not a valid meta-level justification for adopting paperclip-rightness.

We cannot rescue these properties by saying:

"There is an irreducible, non-natural 'rightness' XML tag attached to some objects and events. Our brains perceive this XML tag, but imperfectly, giving us property (a) when we think the XML tag is there, even though it isn't. The XML tags are there even if nobody sees them (a1). Sometimes we stare harder and see the XML tag better (a2). Obviously, doing anything to a brain isn't going to change the XML tag (b), just fool the brain or invalidate its map of the XML tag. All of the things on Frankena's list have XML tags (c) or at least we think so. For paperclips to be the total correct content of Frankena's list, we'd need to be wrong about paperclips not having XML tags and wrong about everything on Frankena's list that we think does have an XML tag (c1). And on the meta-level, "Which sense of rightness leads to the most paperclips?" doesn't say anything about XML tags, and it doesn't lead to there being lots of XML tags, so there's no justification for it (c2)."

This doesn't work because:

  - So far as anyone has been able to observe, the universe contains no irreducible, non-natural rightness tags; they appear nowhere in physics and nowhere in logic.
  - Even if such tags existed, there would be no account of how human brains could perceive a non-natural property, or of why our moral judgments would track it.

Onto what sort of entity can we then map our intuitions, if not onto tiny XML tags?

Consider the property of sixness possessed by six apples on a table. The relation between the six physical apples on the table and the logical number '6' is given by a logical function that takes physical descriptions as inputs: in particular, the function "count the number of apples on the table".
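As a concrete, if trivial, sketch of what "a logical function that takes physical descriptions as inputs" means here (the Python list standing in for the physical state of the table is of course an illustrative simplification):

```python
# A toy version of the "count the apples on the table" logical function:
# physical description in, logical fact out.

def count_apples(table_contents):
    """Count how many items in a description of the table are apples."""
    return sum(1 for item in table_contents if item == "apple")

# A stand-in for the physical state of the table.
table = ["apple", "apple", "orange", "apple", "apple", "apple", "apple"]

# The output is fixed by the table's actual contents, not by anyone's beliefs:
# believing there are seven apples just means being mistaken about this value.
print(count_apples(table))  # 6
```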

Could we rescue 'rightness' onto a logical function like this, only much more complicated?

Let's examine how the 6-ness property and the "counting apples" function behave:

  - If you count the apples and conclude there are seven, you are simply mistaken; the number of apples isn't changed by what you believe.
  - The apples go on being six in number even when nobody is looking at the table or counting them.
  - By counting again, more carefully, you can get a more accurate idea of how many apples there are.
  - Administering a pill that makes you believe there are seven apples doesn't change the number of apples on the table; it only makes your map of the table wrong.

This suggests that 6-ness has the right kind of ontological nature - that rightness could behave the same way, as the output of some much bigger and more complicated logical function than "count the number of apples on the table". Or rather, if we want to rescue our pretheoretic sense of rightness in a way that adds up to moral normality, a logical function is the kind of entity we should rescue it onto.

This function, e.g., starts with the items on Frankena's list and everything we currently value; but also takes into account the set of arguments that might change our mind about what goes on the list; and also takes into account meta-level conditions that we would endorse as distinguishing "valid arguments" from "arguments that merely change our minds". (This last point is pragmatically important if we're considering trying to get a superintelligence to extrapolate our volitions. The list of everything that does in fact change your mind might include particular rotating spiral pixel patterns that effectively hack a human brain.)
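Very schematically - with the starting value set, the toy arguments, and the crude validity filter below all being illustrative stand-ins rather than anything the theory actually specifies - the shape of such a function might be sketched as:

```python
# Schematic sketch: extrapolate a value set by applying only those
# mind-changing considerations that pass an endorsed validity filter.

def extrapolate(current_values, candidate_arguments, is_valid):
    """Start from current values; apply each argument's update only if the
    argument passes the meta-level validity filter."""
    values = set(current_values)
    for update, argument in candidate_arguments:
        if is_valid(argument):
            values = update(values)
    return values

# Toy inputs: a starting value list, two "arguments", and a crude filter
# that rejects anything flagged as brain-hacking.
starting_values = {"happiness", "knowledge", "freedom", "revenge"}

candidate_arguments = [
    (lambda v: v - {"revenge"},
     {"claim": "revenge corrodes civilizations in the long run", "hack": False}),
    (lambda v: {"paperclips"},
     {"claim": "rotating spiral pixel pattern", "hack": True}),
]

print(extrapolate(starting_values, candidate_arguments, lambda a: not a["hack"]))
# {'happiness', 'knowledge', 'freedom'}
```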

The end result of all this work is that we go on guessing which acts are right and wrong as before, go on considering that some possible valid arguments might change our minds, go on weighing such arguments, and go on valuing the things on Frankena's list in the meantime. The theory as a whole is intended to add up to the same moral normality as before, just with that normality embedded into the world of causality and logic in a non-confusing way.

One point we could have taken into our starting list of important properties, but deferred until later:

  - The feeling that rightness matters - the emotional sense of caring about whether things are right, and of rightness being something important, not just one more abstract property of events.

On the general program of "rescuing the utility function", we should not scorn this feeling, and should instead figure out how to map it onto what actually exists.

In this case, having preserved almost all the structural properties of moral normality, there's no reason why anything should change about how we experience the corresponding emotion in everyday life. If our native emotions are having trouble with this new, weird, abstract, learned representation of 'a certain big complicated logical function', we should do our best to remember that the rightness is still there. And this is not a retreat to second-best any more than "disordered kinetic energy" is some kind of sad consolation prize for the universe's lack of ontologically basic warmth, etcetera.

Unrescuability of moral internalism

In standard metaethical terms, we have managed to rescue 'moral cognitivism' (statements about rightness have truth-values) and 'moral realism' (there is a fact of the matter out there about how right something is). We have not, however, managed to rescue the pretheoretic intuition underlying 'moral internalism':

  - The intuition that what is truly right must be inherently motivating - that a valid moral argument, fully understood, would move any sufficiently rational agent to act on it.

This intuition cannot be preserved in any reasonable way, because paperclip maximizers are in fact going to go on making paperclips (and not because they made some kind of cognitive error). A paperclip maximizer isn't disagreeing with you about what's right (the output of the logical function), it's just following whatever plan leads to the most paperclips.

Since the paperclip maximizer's policy isn't influenced by any of our moral arguments, we can't preserve the internalist intuition without reducing the set of valid justifications and truly valuable things to the empty set - and even that, a paperclip maximizer wouldn't find motivationally persuasive!

Thus our options regarding the pretheoretic internalist intuition - that a moral argument is not valid unless universally persuasive - seem to be limited to the following:

  1. Give up on the intuition in its intuitive form: a paperclip maximizer doesn't care if it's unjust to kill everyone; and you can't talk it into behaving differently; and this doesn't reflect a cognitive stumble on the paperclip maximizer's part; and this fact gives us no information about what is right or justified.
  2. Preserve, at the cost of all other pretheoretic intuitions about rightness, the intuition that only arguments that universally influence behavior are valid: that is, there are no valid moral arguments.
  3. Try to sweep the problem under the rug by claiming that reasonable minds must agree that paperclips are objectively pointless… even though Clippy is not suffering from any defect of epistemic or instrumental power, and there's no place in Clippy's code where we can point to some inherently persuasive argument being dropped by a defect or special case of that code.

It's not clear what the point of stance (2) would be, since even this is not an argument that would cause Clippy to alter its behavior, and hence the stance is self-defeating. (3) seems like a mere word game, and potentially a very dangerous word game if it tricks AI developers into thinking that rightness is a default behavior of AIs, or even a function of low algorithmic complexity, or that beneficial behavior automatically correlates with 'reasonable' judgments about less value-laden questions. See "Orthogonality Thesis" for the extreme practical importance of acknowledging that moral internalism is in practice false.

Situating EV in contemporary metaethics

Metaethics is the field of academic philosophy that deals with the question, not of "What is good?", but "What sort of property is goodness?" As applied to issues in Artificial Intelligence, rather than arguing over which particular outcomes are better or worse, we are, from a standpoint of executable philosophy, asking how to compute what is good; and why the output of any proposed computation ought to be identified with the notion of shouldness.

EV replies that, for each person at a given moment, 'right' or 'should' is to be identified with a (subjectively uncertain) logical constant that is fixed for that person at that moment: namely, the result of running the extrapolation process on that person. We can't actually run the extrapolation process, so we can't get perfect knowledge of this logical constant, and we remain subjectively uncertain about what is right.

To eliminate one important ambiguity in how this might cash out, we regard this logical constant as being analytically identified with the extrapolation of our brains, but not counterfactually dependent on counterfactually varying forms of our brains. If you imagine being administered a pill that makes you want to kill people, then you shouldn't compute in your imagination that different things are right for this new self. Instead, this new self now wants to do something other than what is right. We can meaningfully say, "Even if I (a counterfactual version of me) wanted to kill people, that wouldn't make it right" because the counterfactual alteration of the self doesn't change the logical object that you mean by saying 'right'.

However, there is still an analytic relation between this logical object and your actual mindstate - a relation that is indeed implied by the very meaning of discourse about shouldness - which means that you can get veridical information about this logical object by having a sufficiently intelligent AI run an approximation of the extrapolation process over a good model of your actual mind. If a sufficiently intelligent and trustworthy AGI tells you that after thinking about it for a while you wouldn't want to eat cows, you have gained veridical information about whether it's right to eat cows.
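A minimal sketch of the rigid-designation point (the stand-in extrapolated_volition function and the toy mindstates are purely illustrative; the real extrapolation isn't something we can actually compute):

```python
# Sketch: 'right' is fixed by extrapolating your actual mindstate.
# Imagining a pill-altered self doesn't move the referent of 'right';
# it only describes a self that wants something other than what is right.

def extrapolated_volition(mindstate):
    """Illustrative stand-in for the (in practice uncomputable) extrapolation."""
    if "murder_pill" in mindstate:
        return {"killing"}
    return {"kindness", "no_killing"}

actual_mindstate = {"ordinary_human_values"}
pill_mindstate = actual_mindstate | {"murder_pill"}

RIGHT = extrapolated_volition(actual_mindstate)  # the logical constant, fixed by the actual self

# The counterfactual pill-self wants something else; RIGHT is unchanged.
print(extrapolated_volition(pill_mindstate) == RIGHT)  # False
```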

Within the standard terminology of academic metaethics, "extrapolated volition" as a normative theory is:

  - Cognitivist: statements about what is right have truth-values.
  - Realist, in the rescued sense above: there is a fact of the matter, namely the output of the extrapolation, about how right something is.
  - Reductionist / naturalist: rightness is identified with a (very complicated) logical function over natural facts, not with an irreducible non-natural property.
  - Externalist: knowing what is right does not automatically motivate every possible agent to act on it.

Closest antecedents in academic metaethics are Rawls and Goodman's reflective equilibrium, Harsanyi and Railton's ideal advisor theories, and Frank Jackson's moral functionalism.

Moore's Open Question

Argument. If extrapolated volition is analytically equivalent to good, then the question "Is it true that extrapolated volition is good?" is meaningless or trivial. However, this question is not meaningless or trivial, and seems to have an open quality about it. Therefore, extrapolated volition is not analytically equivalent to goodness.

Reply. Extrapolated volition is not supposed to be transparently identical to goodness. The normative identity between extrapolated volition and goodness is allowed to be something that you would have to think for a while and consider many arguments to perceive.

Natively, human beings don't start out with any kind of explicit commitment to a particular metaethics; our brains just compute a feeling of rightness about certain acts, and then sometimes update and say that acts we previously thought were right are not-right.

When we go from that to trying to describe a corresponding logical function - one that we can see our brains as approximating, and as updating toward when we learn new things or consider new arguments - we are carrying out a project of "rescuing the utility function". We are reasoning that we can best rescue our native state of confusion by seeing our reasoning about goodness as having its referent in certain logical facts. This lets us go on saying that it is better, ceteris paribus, for people to be happy than in severe pain, and that we can't reverse this ordering by taking a pill that alters our brain (we can only make our future self act on different logical questions), etcetera. It's not surprising if this bit of philosophy takes longer than five minutes to reason through.


Comments

Robert Peetsalu

What disturbs me in this article is the normativeness - describing values, rightness and goodness as something objective, having an objective boolean value, existing in the world without an observer to hold those values, like some motivation without someone being motivated by it. Instead, rightness and goodness are meaningless outside of some utility function, some desired end state that would label moving towards it as the positive direction and moving away from it as the negative direction. Without a destination, every direction is as good as every other. Values are always subjective, so when teaching them to an AI we can only refer to how common it is among people to regard value A as positive or negative.

The universe doesn't want anything, so for example killing humans has no innate badness and is not negative for the universe. It's just negative for most humans. If taking a pill changes your subjective values to "killing=good", then rightness will also change and the AI will now extrapolate this new rightness from your brain. Furthermore, it will correctly recommend futures with killing because they are better than futures without it according to these values.

We have no reason to believe that if each of us knew as much as a superintelligence knows, could think as fast as it, and could reason as soundly as it does, we would then have no differences in values. Let's assume safely that subjectivity isn't going anywhere. We can still define some useful values for the AI by substituting objective values with an overwhelming consensus of known subjective values. Those are basic values that are common to most people and don't vary significantly with political or personal preference, like human rights, basic criminal law, maybe some of the soft positive values mentioned in the article. A ban on wars would be nice to include! (We'd need to define what level of aggression is considered war and whether information war and sanctions are also included.)

The utility function of an AI is what defines its priorities for possible outcomes, aka its values. In the case of the aforementioned rights and laws, they tend to take the form of penalties for wrong actions instead of utility gains for good actions, which is a slippery slope in the sense that AIs tend to find loopholes in prohibitions; but on the other hand, penalties can't be abused for utility maximization the way gains can. For example, rewarding the AI for creating happy fluffy feelings in people would turn it into a maximizer.

In any case we'll want to change the AI's values as our understanding of good and right evolves, so let's hope utility indifference will let us update them. Instead of changing drastically over time, our values will probably become more detailed and situational - full of exceptions, just like our laws. Already the justice systems of many countries are so complex that it would make sense to delegate judgement to AIs. Can't wait to see news of the first AI judges being bribed with utility gains.

P.S.: The act of opposing normative values is the definition of rebelling, so I guess I'm a rebel now ^_^