by Eliezer Yudkowsky Apr 24 2015 updated Jun 1 2016

The word 'value' in the phrase 'value alignment' is a metasyntactic variable that indicates the speaker's future goals for intelligent life.

[summary: Different people advocate different views on what we should want for the outcome of a '[value aligned]' AI (desiderata like human flourishing, or a [ fun-theoretic eudaimonia], or coherent extrapolated volition, or an AI that mostly leaves us alone but protects us from other AIs). These differences might not be [ irreconcilable]; people are sometimes persuaded to change their views of what we should want. Either way, there's (arguably) a tremendous overlap in the technical issues for aligning an AI with any of these goals. So in the technical discussion, 'value' is really a metasyntactic variable that stands in for the speaker's current view, or for what an AI project might later adopt as a reasonable target after further discussion.]


In the context of value alignment as a subject, the word 'value' is a speaker-dependent variable that indicates our ultimate goal - the property or meta-property that the speaker wants or 'should want' to see in the final outcome of Earth-originating intelligent life. E.g., [ human flourishing], [ fun], coherent extrapolated volition, [ normativity].

Different viewpoints are still being debated on this topic; people [ sometimes change their minds about their views]. We don't yet have full knowledge of which views are 'reasonable' in the sense that people with good cognitive skills might retain them even in the limit of ongoing discussion. Some subtypes of potentially internally coherent views may not be sufficiently [ interpersonalizable] for even very small AI projects to cooperate on them; if, e.g., Alice wants to own the whole world and would go on wanting this in the limit of continuing contemplation, that is not a desideratum on which Alice, Bob, and Carol can all cooperate. Thus, using 'value' as a potentially speaker-dependent variable isn't meant to imply that everyone has their own 'value' and that no further debate or cooperation is possible; people can and do talk each other out of positions which are then regarded as having been mistaken, and completely incommunicable stances seem unlikely to be reified even into a very small AI project. But since this debate is ongoing, there is not yet any one definition of 'value' that can be regarded as settled.

Nonetheless, very similar technical problems of value alignment seem to arise on many of the views currently being advocated. We would need to figure out how to identify the objects of value to the AI, robustly assure that the AI's preferences remain stable as the AI self-modifies, and create corrigible ways of recovering from errors in the way we tried to identify and specify the objects of value.

To centralize the very similar discussions of these technical problems while the outer debate about reasonable end goals is ongoing, the word 'value' acts as a metasyntactic placeholder for different views about the target of value alignment.

Similarly, in the larger value achievement dilemma, the question of what the end goals should be, and policy difficulties of getting 'good' goals to be adopted in name by the builders or creators of AI, are factored out as the [value_selection value selection problem]. The output of this process is taken to be an input into the value loading problem, and 'value' is a name referring to this output.
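As a loose illustration, this division of labor can be sketched as a pipeline in which 'value' is simply whatever the value selection step outputs, which the value loading step then consumes. Every name below (select_value, load_values, the list of views) is invented for this sketch and belongs to no real framework.

```python
# Illustrative sketch only: 'value' as a placeholder parameter that the
# value selection process outputs and the value loading problem consumes.
# All names here are hypothetical.

def select_value(candidate_views):
    """Value selection: debate and deliberation pick a target for alignment.
    Here, trivially, the first view survives; in reality this stands in
    for an open social and philosophical process."""
    return candidate_views[0]

def load_values(value):
    """Value loading: build an agent whose running produces outcomes
    rated highly according to `value` (not necessarily an agent whose
    explicit utility function *is* `value`)."""
    return {"target": value}

views = ["human flourishing", "fun-theoretic eudaimonia",
         "coherent extrapolated volition"]
agent = load_values(select_value(views))
print(agent["target"])
```

The point of the sketch is only the interface: 'value' names the output of the first step and the input of the second, whatever that output turns out to be.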

'Value' is not assumed to be what the AI is given as its utility function or preference framework. On many views implying that value is complex or otherwise difficult to convey to an AI, the AI may be, e.g., a Genie where some stress is taken off the proposition that the AI exactly understands value and put onto human ability to use the Genie well.

Consider a Genie with an explicit preference framework targeted on a [ Do What I Know I Mean system] for making [ checked wishes]. The word 'value' in any discussion thereof should still only be used to refer to whatever the AI creators are targeting for real-world outcomes. We would say the 'value alignment problem' had been successfully solved to the extent that running the Genie produced high-value outcomes in the sense of the humans' viewpoint on 'value', not to the extent that the outcome matched the Genie's preference framework for how to follow orders.
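To make the distinction concrete, here is a toy sketch in which the success criterion applies the humans' value rating to real-world outcomes, rather than checking whether the Genie obeyed its order-following framework. Every name here (genie_outcome, human_value, alignment_succeeded) is hypothetical, and the string matching is a deliberately silly stand-in for actually evaluating an outcome.

```python
# Illustrative sketch only: success of 'value alignment' is judged by
# rating real-world outcomes with the humans' notion of value, not by
# checking whether the outcome matched the Genie's order-following
# framework. All names are hypothetical.

def genie_outcome(wish):
    """Stand-in for running a checked-wish Genie on a wish."""
    return f"world after fulfilling: {wish}"

def human_value(outcome):
    """Stand-in for the speaker-dependent rating of final outcomes."""
    return 1.0 if "flourishing" in outcome else 0.0

def alignment_succeeded(wish, threshold=0.5):
    # The test is on the outcome's rated value, not on obedience to the wish.
    return human_value(genie_outcome(wish)) >= threshold

print(alignment_succeeded("promote human flourishing"))  # True
print(alignment_succeeded("maximize paperclips"))        # False
```

Note that both wishes may have been fulfilled exactly as ordered; only the first counts as a success of 'value alignment' in the article's sense.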

Specific views on value

Obviously, a listing like this will only summarize long debates. But that summary at least lets us point to some examples of views that have been advocated, and not indefinitely defer the question of what 'value' could possibly refer to.

Again, keep in mind that by technical definition, 'value' is what we are using or should use to rate the ultimate real-world consequences of running the AI, not the explicit goals we are giving the AI.

Some of the major views that have been advocated by more than one person are as follows:

The following versions of desiderata for AI outcomes would tend to imply that the value alignment / value loading problem is an entirely wrong way of looking at the issue, which might make it disingenuous to claim that 'value' in 'value alignment' can cover them as a metasyntactic variable as well:

Modularity of 'value'

Alignable values

Many issues in value alignment seem to generalize very well across the Reflective Equilibrium, Fun Theory, Intuitive Desiderata, and Deflationary Error Theory viewpoints. In all cases we would have to consider stability of self-modification, the Edge Instantiation problem in value identification, and most of the rest of 'standard' value alignment theory. This seemingly good generalization of the resulting technical problems across such wide-ranging viewpoints (especially the fact that it arguably covers the case of intuitive desiderata) is what justifies treating 'value' as a metasyntactic variable in 'value loading problem'.

A neutral term for referring to all the values in this class might be 'alignable values'.

Simple purpose

In the [ simple purpose] case, the key difference from an Immediate Goods scenario is that the desideratum is usually advocated to be simple enough to negate Complexity of Value and make value identification easy.

E.g., Juergen Schmidhuber stated at the 20XX Singularity Summit that he thought the only proper and normative goal of any agent was to increase compression of sensory information [todo: find exact quote, exact Summit]. Conditioned on this being the sum of all normativity, 'value' is algorithmically simple. Then the problems of Edge Instantiation, Unforeseen Maximums, and Nearest Unblocked Neighbor are all moot. (Except perhaps as there is an Ontology Identification problem for defining exactly what constitutes 'sensory information' for a [ self-modifying agent].)
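As a very loose illustration of how algorithmically simple such a goal could be, the following toy sketch measures a compression-based reward, using zlib as a placeholder for whatever compressor an agent might actually learn. This is not Schmidhuber's formal definition, just a sketch of the flavor: data that shares structure with the agent's history compresses better jointly than separately.

```python
# Hedged sketch of a compression-based reward signal: reward is how much
# better the whole record compresses with the new observation included,
# versus compressing history and observation separately. zlib is a
# stand-in for the agent's learned compressor; this is an illustration,
# not Schmidhuber's formal proposal.
import zlib

def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data))

def compression_reward(history: bytes, observation: bytes) -> int:
    """Positive when the observation shares structure with the history,
    so that including it costs fewer bits than encoding it separately."""
    separate = compressed_size(history) + compressed_size(observation)
    joint = compressed_size(history + observation)
    return separate - joint

history = b"abab" * 100
redundant = compression_reward(history, b"abab" * 10)   # shares structure
novel = compression_reward(history, bytes(range(40)))   # mostly new bytes
print(redundant, novel)
```

Even in this toy form, the Ontology Identification worry in the main text is visible: everything hinges on what counts as 'sensory information' (here, which bytes go into `history`).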

Even in the [ simple purpose] case, the [ value loading problem] would still exist (it would still be necessary to make an AI that cared about the simple purpose rather than paperclips) along with associated problems of reflective stability (it would be necessary to make an AI that went on caring about that purpose through self-modification). Nonetheless, the overall problem difficulty and immediate technical priorities would be different enough that the Simple Purpose case seems importantly distinct from e.g. Fun Theory on a policy level.

Moral internalism

Some viewpoints on 'value' deliberately reject Orthogonality. Strong versions of the [ moral internalist position in metaethics] claim as an empirical prediction that every sufficiently powerful cognitive agent will come to pursue the same end, which end is to be identified with normativity, and is the only proper object of human desire. If true, this would imply that the entire value alignment problem is moot for advanced agents.

Many people who advocate 'simple purposes' also claim these purposes are universally compelling. In a policy sense, this seems functionally similar to the Moral Internalist case regardless of the simplicity or complexity of the universally compelling value. Hence an alleged simple universally compelling purpose is categorized for these purposes as Moral Internalist rather than Simple Purpose.

(The special case of a Simple Purpose claimed to be universally instrumentally convergent also seems functionally identical to Moral Internalism from a policy standpoint.)

AI Rights

Someone might believe as a proposition of fact that all (accessible) AI designs would have 'innate' desires, believe as a proposition of fact that no AI would gain enough advantage to wipe out humanity or prevent the creation of other AIs, and assert as a matter of morality that a good outcome consists of everyone being free to pursue their own value and trade. In this case the value alignment problem is implied to be an entirely wrong way to look at the problem, with all associated technical issues moot. Thus, it again might be disingenuous to have 'value' as a metasyntactic variable try to cover this case.


Brandon Reinhart

It may be worth commenting on the rights of computations-as-people here (Some computations are people). We would seek to respect the rights of AIs, but we also seek to respect the rights of the computations within the AI (and other complex systems) that are themselves sentient. This would also apply in cases of self-modification, where modified biological brains become sophisticated enough to create complex models that are also objects of ethical value.

Benjy Forstadt

Due partly to the choice of using 'value' as a speaker dependent variable, some of the terminology used in this article doesn't align with how the terms are used by professional metaethicists. I would strongly suggest one of:

1) replacing the phrase "moral internalism" with a new phrase that better individuates the concept.

2) including a note that the phrase is being used extremely non-standardly.

3) adding a section explaining the layout of metaethical possibilities, using moral internalism in the sense intended by professional metaethicists.

In metaethics, moral internalism, roughly, is the disjunction:

'Value' is speaker independent and universally compelling OR 'Value' is speaker dependent and is only used to indicate properties the speaker finds compelling

This seems very un-joint-carvy from a perspective of value alignment, but most philosophers see internalism as a semantic thesis that captures the relation between moral judgements and motivation. The idea is: if someone says something has value, she values that thing. This is very, very different from how the term is used in this article.

I can provide numerous sources to back this up, if needed.