I agree that reflective degrees of freedom won't "fix themselves" automatically, and that this is a useful concept.
There are at least two different approaches to getting the reflective degrees of freedom right:
1. Figure out the right settings and build a reflectively consistent system that has those settings.
2. Build a system which is motivated to defer to human judgments or to hypothetical human judgments.
A system of type 2 might be motivated to adopt the settings that humans would endorse upon reflection, rather than to continue using its interim decision theory/prior/etc.
On its face, I think the type 2 approach seems significantly more promising. The techniques needed to defer to human views about decision theory / priors / etc. already seem necessary for deferring to human values.
You've given the argument that the interim prior/decision theory/whatever would lead to catastrophically bad outcomes, either because there are exotic failures, or because we wouldn't have a good enough theory and so would be forced to use a less principled approach (which we wouldn't actually be able to make aligned).
I don't find this argument especially convincing. I think it is particularly weak in the context of act-based agents, and especially for proposals like this one. In this context I don't think we have compelling examples of plausible gotchas. We've seen some weird cases like simulation warfare, but these appear to be ruled out by the kinds of robustness guarantees that are already needed in more prosaic cases. Others, like blackmail or Pascal's mugging, don't seem to come up.