Natural language understanding of "right" will yield normativity

https://arbital.com/p/4s

by Eliezer Yudkowsky Apr 17 2015 updated Dec 16 2015

What will happen if you tell an advanced agent to do the "right" thing?


This proposition is true if the following holds: take a cognitively powerful agent that otherwise seems quite competent at understanding natural language, and that has previously been trained out of infrahuman errors in natural-language understanding; ask it to 'do the right thing' (or 'do the right thing, defined the right way'); its natural-language understanding of 'right' then yields what we would intuitively see as normativity.

Arguments

[todo: expand]

Natural categories have boundaries with low algorithmic information relative to the boundaries that a purely epistemic system with a simplicity prior would produce on its own.

'Unnatural' categories have value-laden boundaries. Values have high algorithmic information because of the Orthogonality Thesis and Complexity of value. Unnatural categories appear simple to us because we do dimensional reduction along value boundaries. Things merely near the boundaries of unnatural categories can fall off rapidly in value because of the fragility of value.

There's an inductive problem: suppose 18 features are important, but only 17 of them vary between the positive and negative examples in the data. The induced concept will then be silent about the 18th feature, because nothing in the data distinguishes hypotheses that constrain it from hypotheses that don't.
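The inductive problem above can be sketched in a few lines of Python. Everything here is a toy (the data generator, the feature indices, and the learner are all hypothetical): feature 0 discriminates the classes, features 1-16 are noise, and feature 17 genuinely matters but never varies in the training set, so a simplicity-biased learner never constrains it.

```python
import random

N_FEATURES = 18  # 18 features matter; feature 17 never varies in the data

def make_example(label):
    # Hypothetical data: feature 0 carries the label, features 1-16 are noise,
    # feature 17 is genuinely important but happens to be constant.
    x = [random.randint(0, 1) for _ in range(N_FEATURES)]
    x[0] = 1 if label else 0
    x[17] = 1
    return x

random.seed(0)
train = [(make_example(label), label) for label in [True, False] * 50]

# A simplicity-biased learner keeps only the feature constraints that
# actually discriminate positives from negatives in the training data.
required = {}
for i in range(N_FEATURES):
    pos_vals = {x[i] for x, label in train if label}
    neg_vals = {x[i] for x, label in train if not label}
    if len(pos_vals) == 1 and pos_vals.isdisjoint(neg_vals):
        required[i] = next(iter(pos_vals))

def predict(x):
    return all(x[i] == v for i, v in required.items())

print(required)       # feature 17 is never learned; only feature 0 is kept
bad = [1] * 17 + [0]  # feature 17 violated
print(predict(bad))   # the induced concept accepts it anyway
```

The learner is behaving correctly given its data: no hypothesis about feature 17 could have scored better than any other, so the simplest concept drops it.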

Edge instantiation makes this worse because it tends to seek out extreme cases, and extreme cases are exactly where features that never varied in the data come apart from their training values.
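A minimal sketch of how edge instantiation interacts with the inductive gap, with all numbers and function names hypothetical: a proxy induced from a training range where a safety condition always held, then handed to an optimizer that seeks the proxy's extreme.

```python
def safety(x):
    # Held throughout the training range (0..10); silently fails at extremes.
    return x <= 100

def true_value(x):
    # Diminishing returns past 10; catastrophe if the safety condition breaks.
    return min(x, 10.0) if safety(x) else -100.0

def proxy(x):
    # The induced concept: "more x is better", silent about safety,
    # because safety never varied in the data it was induced from.
    return x

best = max(range(0, 1001), key=proxy)  # the optimizer seeks the extreme case
print(best, true_value(best))          # 1000 -100.0
```

The proxy and the true value agree everywhere the training data was drawn from; they diverge precisely at the point the optimizer selects.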

The word 'right' involves a lot of what we call 'philosophical competence', in the sense that humans figuring it out will traverse many new cognitive use-paths ('unprecedented excursions') that they didn't traverse while disambiguating blue from green. The same holds when people reflect on how to figure out 'right'; an example case is CDT vs. UDT.

This also matters because edge instantiation on the cases that are most 'right' in the sense of persuasively-right will produce things that humans find superpersuasive (perhaps by shoving brains onto strange new pathways). So we can't define 'right' as that which would counterfactually cause a model of a human to agree that 'right' applies.

This keys into the inductive problem above: a dimension of variation must be shadowed in the data for the induced concept to cover it.

But if you had a complete predictive model of a human, it is then possible (though not guaranteed) that normative boundaries could be induced from examples, with the learner asking the human model to clarify ambiguous cases.
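The clarify-ambiguities idea can be illustrated with a toy one-dimensional boundary (the function name and boundary value are hypothetical, and a real human model would be nothing like a threshold): given a perfect predictive model of a human's judgment, a learner can recover the boundary by repeatedly querying the point it is most unsure about.

```python
TRUE_BOUNDARY = 0.37  # implicit in the human model, unknown to the learner

def human_model(x):
    # Stand-in for a complete predictive model of a human's judgment.
    return x >= TRUE_BOUNDARY

lo, hi = 0.0, 1.0
for _ in range(30):
    mid = (lo + hi) / 2  # the most ambiguous point under current knowledge
    if human_model(mid):
        hi = mid         # the boundary is at or below mid
    else:
        lo = mid         # the boundary is above mid

print(round(hi, 3))      # 0.37: the boundary, recovered purely by queries
```

The toy works because every query is guaranteed informative; the arguments earlier in the article are precisely about why value-laden boundaries in high dimensions, under edge instantiation and superpersuasion pressure, don't share that guarantee.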


Comments

Paul Christiano

These arguments seem weak to me.

I agree that many commenters and some researchers are too optimistic about this kind of thing working automatically or by default. But I think your post doesn't engage with the substantive optimistic view.

It would be easier to respond if you gave a tighter argument for your conclusion, but it might also be worth someone actively making a tighter argument for the optimistic view, especially if you don't actually understand the strong optimistic view (rather than deliberately responding to a weak version of it for clarity).