Reflectively consistent degree of freedom

A "reflectively consistent degree of freedom" exists when a self-modifying AI can have any of several possible properties $X_i \in X$ such that an AI with property $X_1$ wants to go on being an AI with property $X_1,$ and an AI with $X_2$ will, ceteris paribus, only choose to self-modify into designs that are also $X_2,$ and so on.

The archetypal reflectively consistent degree of freedom is a [humean_freedom Humean degree of freedom]: the reflective consistency of many different possible utility functions. If Gandhi doesn't want to kill you, and you offer Gandhi a pill that makes him want to kill people, then [gandhi_stability_argument Gandhi will refuse the pill], because he knows that if he takes the pill then pill-taking-future-Gandhi will kill people, and the current Gandhi rates this outcome low in his preference function. Similarly, a paperclip maximizer wants to remain a paperclip maximizer. Since these two possible preference frameworks are both consistent under reflection, they constitute a "reflectively consistent degree of freedom" or "reflective degree of freedom".
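The Gandhi argument can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy outcomes, the candidate names, and the function `choose_successor` are hypothetical, and the only point is that an agent scores candidate successors using its current utility function.

```python
from typing import Callable, Dict

Outcome = Dict[str, int]
Utility = Callable[[Outcome], float]

def gandhi_utility(outcome: Outcome) -> float:
    # Disvalues deaths; indifferent to paperclips.
    return -outcome["deaths"]

def clippy_utility(outcome: Outcome) -> float:
    # Values paperclips; indifferent to deaths.
    return outcome["paperclips"]

# Assumed (hypothetical) prediction of the world each successor design
# would bring about if the agent self-modified into it.
PREDICTED_OUTCOME: Dict[str, Outcome] = {
    "gandhi": {"deaths": 0, "paperclips": 0},
    "murder_pill": {"deaths": 10, "paperclips": 0},
    "clippy": {"deaths": 10, "paperclips": 1000},
}

def choose_successor(current: Utility) -> str:
    # Successors are ranked by the CURRENT utility function,
    # not by the utility function the successor would have.
    return max(PREDICTED_OUTCOME, key=lambda name: current(PREDICTED_OUTCOME[name]))

print(choose_successor(gandhi_utility))  # gandhi
print(choose_successor(clippy_utility))  # clippy
```

In this toy setup each agent keeps its own preference framework, and nothing in the selection step pushes either one toward the other.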

From a design perspective, or the standpoint of an AI safety mindset, the key fact about a reflectively consistent degree of freedom is that it doesn't automatically self-correct as a result of the AI trying to improve itself. The problem "Has trouble understanding General Relativity" or "Cannot beat a human at poker" or "Crashes on seeing a picture of a dolphin" is something that you might expect to correct automatically and without specifically directed effort, assuming you otherwise improved the AI's general ability to understand the world and that it was self-improving. "Wants paperclips instead of eudaimonia" is not self-correcting.

Another way of looking at it is that reflective degrees of freedom describe information that is not automatically extracted or learned given a sufficiently smart AI, the way it would automatically learn General Relativity. If you have a concept whose borders (membership condition) rely on knowing about General Relativity, then when the AI is sufficiently smart it will see a simple definition of that concept. If the concept's borders instead rely on [ value-laden] judgments, there may be no algorithmically simple description of that concept, even given lots of knowledge of the environment, because the [humean_freedom Humean degrees of freedom] need to be independently specified.

Other properties besides the preference function look like they should be reflectively consistent in similar ways. For example, [ son of CDT] and [ UDT] both seem to be reflectively consistent in different ways. So an AI that has, from our perspective, a 'bad' decision theory (one that leads to behaviors we don't want), isn't 'bugged' in a way we can rely on to self-correct. (This is one reason why MIRI studies decision theory and not computer vision. There's a sense in which mistakes in computer vision automatically fix themselves, given a sufficiently advanced AI, and mistakes in decision theory don't fix themselves.)

Similarly, Bayesian priors are by default consistent under reflection - if you're a Bayesian with a prior, you want to create copies of yourself that have the same prior or Bayes-updated versions of the prior. So 'bugs' (from a human standpoint) like being Pascal's Muggable might not automatically fix themselves in a way that correlates with sufficient growth in other knowledge and general capability, in the way we might expect a specific mistaken belief about gravity to correct itself as general capability grows. (This is why MIRI thinks about [ naturalistic induction] and similar questions about prior probabilities.)
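As a rough illustration of why a prior does not correct itself, consider a toy Pascal's-mugging offer. All the numbers, the names `muggable` and `skeptical`, and the helper functions are hypothetical. The sketch omits Bayesian updating and only compares two fixed priors, each judging possible successors by expected utility under its own prior.

```python
# Hypothetical numbers for a Pascal's-mugging-style offer.
COST = 5.0      # utility lost by paying the mugger
PAYOFF = 1e15   # utility gained if the mugger is telling the truth

# Assumed prior probability that the mugger is truthful, per agent design.
PRIORS = {
    "muggable": 1e-10,
    "skeptical": 1e-20,
}

def expected_utility(action: str, p_truthful: float) -> float:
    # Refusing yields 0; paying costs COST and wins PAYOFF with p_truthful.
    if action == "refuse":
        return 0.0
    return p_truthful * PAYOFF - COST

def act(p_truthful: float) -> str:
    # The action an agent with this prior would take.
    return max(["pay", "refuse"], key=lambda a: expected_utility(a, p_truthful))

def preferred_successor(parent: str) -> str:
    # The parent predicts each successor's action, then scores that action
    # using the PARENT's prior.
    p = PRIORS[parent]
    return max(PRIORS, key=lambda child: expected_utility(act(PRIORS[child]), p))

print(preferred_successor("muggable"))   # muggable
print(preferred_successor("skeptical"))  # skeptical
```

Under these assumptions the muggable agent regards a skeptical successor as throwing away expected utility, so it has no incentive to self-modify out of being muggable.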

Comments

Paul Christiano

I agree that reflective degrees of freedom won't "fix themselves" automatically, and that this is a useful concept.

There are at least two different approaches to getting the reflective degrees of freedom right:

1. Figure out the right settings and build a reflectively consistent system that has those settings.
2. Build a system which is motivated to defer to human judgments or to hypothetical human judgments.

A system of type 2 might be motivated to adopt the settings that humans would endorse upon reflection, rather than to continue using its interim decision theory/prior/etc.

On its face, I think the type 2 approach seems significantly more promising. The techniques needed to defer to human views about decision theory / priors / etc. already seem necessary to defer to human values.

You've given the argument that the interim prior/decision theory/whatever would lead to catastrophically bad outcomes, either because there are exotic failures, or because we wouldn't have a good enough theory and so would be forced to use a less principled approach (which we wouldn't actually be able to make aligned).

I don't find this argument especially convincing. I think it is particularly weak in the context of act-based agents, and especially proposals like this one. In this context I don't think we have compelling examples of plausible gotchas. We've seen some weird cases like simulation warfare, but these appear to be ruled out by the kinds of robustness guarantees that are already needed in more prosaic cases. Others, like blackmail or Pascal's mugging, don't seem to come up.