Natural language understanding of "right" will yield normativity

https://arbital.com/p/4s

by Eliezer Yudkowsky Apr 17 2015 updated Dec 16 2015

What will happen if you tell an advanced agent to do the "right" thing?


This proposition is true if the following holds: take a cognitively powerful agent that otherwise seems quite competent at understanding natural language, and that has previously been trained out of infrahuman errors in natural-language understanding; ask it to 'do the right thing' (or 'do the right thing, defined the right way'); its natural-language understanding of 'right' then yields what we would intuitively see as normativity.

Arguments

[todo: expand]

Natural categories have boundaries with low algorithmic information relative to the boundaries that a purely epistemic system with a simplicity prior would produce on its own.

'Unnatural' categories have value-laden boundaries. Values have high algorithmic information because of the Orthogonality Thesis and Complexity of value. Unnatural categories appear simple to us because we do dimensional reduction along value boundaries. Things merely near the boundaries of unnatural categories can fall off rapidly in value because of the fragility of value.

There's an inductive problem: suppose 18 features are important, but only 17 of them vary between the positive and negative examples in the data. The induced concept will then be silent about the 18th feature, because nothing in the data distinguishes hypotheses that constrain it from hypotheses that don't.
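The inductive problem above can be sketched in a few lines of Python. Everything here is a toy (the data generator, the feature indices, and the learner are all hypothetical): feature 0 discriminates the classes, features 1-16 are noise, and feature 17 genuinely matters but never varies in the training set, so a simplicity-biased learner never constrains it.

```python
import random

N_FEATURES = 18  # 18 features matter; feature 17 never varies in the data

def make_example(label):
    # Hypothetical data: feature 0 carries the label, features 1-16 are noise,
    # feature 17 is genuinely important but happens to be constant.
    x = [random.randint(0, 1) for _ in range(N_FEATURES)]
    x[0] = 1 if label else 0
    x[17] = 1
    return x

random.seed(0)
train = [(make_example(label), label) for label in [True, False] * 50]

# A simplicity-biased learner keeps only the feature constraints that
# actually discriminate positives from negatives in the training data.
required = {}
for i in range(N_FEATURES):
    pos_vals = {x[i] for x, label in train if label}
    neg_vals = {x[i] for x, label in train if not label}
    if len(pos_vals) == 1 and pos_vals.isdisjoint(neg_vals):
        required[i] = next(iter(pos_vals))

def predict(x):
    return all(x[i] == v for i, v in required.items())

print(required)       # feature 17 is never learned; only feature 0 is kept
bad = [1] * 17 + [0]  # feature 17 violated
print(predict(bad))   # the induced concept accepts it anyway
```

The learner is behaving correctly given its data: no hypothesis about feature 17 could have scored better than any other, so the simplest concept drops it.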

Edge instantiation makes this worse because it tends to seek out extreme cases, and extreme cases are exactly where features that never varied in the data come apart from their training values.
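A minimal sketch of how edge instantiation interacts with the inductive gap, with all numbers and function names hypothetical: a proxy induced from a training range where a safety condition always held, then handed to an optimizer that seeks the proxy's extreme.

```python
def safety(x):
    # Held throughout the training range (0..10); silently fails at extremes.
    return x <= 100

def true_value(x):
    # Diminishing returns past 10; catastrophe if the safety condition breaks.
    return min(x, 10.0) if safety(x) else -100.0

def proxy(x):
    # The induced concept: "more x is better", silent about safety,
    # because safety never varied in the data it was induced from.
    return x

best = max(range(0, 1001), key=proxy)  # the optimizer seeks the extreme case
print(best, true_value(best))          # 1000 -100.0
```

The proxy and the true value agree everywhere the training data was drawn from; they diverge precisely at the point the optimizer selects.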

The word 'right' involves a lot of what we call 'philosophical competence', in the sense that humans figuring it out will traverse many new cognitive use-paths ('unprecedented excursions') that they didn't traverse while disambiguating blue from green. The same holds when people reflect on how to figure out 'right'; an example case is CDT vs. UDT.

This also matters because edge instantiation on the cases that are most 'right' in the sense of persuasively-right will produce things that humans find superpersuasive (perhaps by shoving brains onto strange new pathways). So we can't define 'right' as that which would counterfactually cause a model of a human to agree that 'right' applies.

This keys into the inductive problem above: a dimension of variation must be shadowed in the data for the induced concept to cover it.

But if you had a complete predictive model of a human, it is then possible (though not guaranteed) that normative boundaries could be induced from examples, with the learner asking the human model to clarify ambiguous cases.
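The clarify-ambiguities idea can be illustrated with a toy one-dimensional boundary (the function name and boundary value are hypothetical, and a real human model would be nothing like a threshold): given a perfect predictive model of a human's judgment, a learner can recover the boundary by repeatedly querying the point it is most unsure about.

```python
TRUE_BOUNDARY = 0.37  # implicit in the human model, unknown to the learner

def human_model(x):
    # Stand-in for a complete predictive model of a human's judgment.
    return x >= TRUE_BOUNDARY

lo, hi = 0.0, 1.0
for _ in range(30):
    mid = (lo + hi) / 2  # the most ambiguous point under current knowledge
    if human_model(mid):
        hi = mid         # the boundary is at or below mid
    else:
        lo = mid         # the boundary is above mid

print(round(hi, 3))      # 0.37: the boundary, recovered purely by queries
```

The toy works because every query is guaranteed informative; the arguments earlier in the article are precisely about why value-laden boundaries in high dimensions, under edge instantiation and superpersuasion pressure, don't share that guarantee.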


Comments

Paul Christiano

These arguments seem weak to me.

I agree that many commenters and some researchers are too optimistic about this kind of thing working automatically or by default. But I think your post doesn't engage with the substantive optimistic view.

It would be easier to respond if you gave a tighter argument for your conclusion, but it might also be worth someone actively making a tighter argument for the optimistic view, especially if you don't actually understand the strong optimistic view (rather than deliberately responding to a weak version of it for clarity).