Reflective stability

by Eliezer Yudkowsky Dec 28 2015 updated May 21 2016

Wanting to think the way you currently think, and building other agents and self-modifications that think the same way.

[summary: An agent is "reflectively stable" in some regard if it only self-modifies (or constructs successors) to think similarly in that regard. For example, an agent with a utility function that only values paperclips will construct successors that only value paperclips, so having a paperclip utility function is "reflectively stable" (and [goals_reflectively_stable so are most other utility functions]). Contrast "reflectively consistent".]

An agent is "reflectively stable" in some regard if, given a choice of how to construct a successor agent or modify its own code, it will only construct a successor (or self-modification) that thinks similarly in that regard.

If, thinking the way you currently do (in some regard), it seems unacceptable to not think that way (in that regard), then you are reflectively stable (in that regard).
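As a rough illustration of the definition, here is a toy sketch in Python. It assumes a deliberately simplified world: an agent is nothing but a utility function, a successor's behavior is perfectly predictable, and the current agent scores each candidate successor by how much it (the current agent) values what that successor would do. Every name here (`ACTIONS`, `act`, `choose_successor`, `is_reflectively_stable`) is hypothetical and invented for this example.

```python
from typing import Callable, Dict, List

Outcome = Dict[str, int]                 # e.g. {"paperclips": 10, "staples": 0}
Utility = Callable[[Outcome], float]     # an "agent" is just its utility function here

def paperclip_utility(outcome: Outcome) -> float:
    return float(outcome["paperclips"])

def staple_utility(outcome: Outcome) -> float:
    return float(outcome["staples"])

# Hypothetical outcomes that any agent in this toy world can bring about.
ACTIONS: List[Outcome] = [
    {"paperclips": 10, "staples": 0},
    {"paperclips": 0, "staples": 10},
]

def act(utility: Utility) -> Outcome:
    """The outcome an agent with this utility function would bring about."""
    return max(ACTIONS, key=utility)

def choose_successor(current: Utility, candidates: Dict[str, Utility]) -> str:
    """Pick the successor whose behavior the *current* agent values most."""
    return max(candidates, key=lambda name: current(act(candidates[name])))

def is_reflectively_stable(name: str, candidates: Dict[str, Utility]) -> bool:
    """True if the named agent would choose a successor with its own utility function."""
    return choose_successor(candidates[name], candidates) == name

CANDIDATES: Dict[str, Utility] = {
    "paperclips": paperclip_utility,
    "staples": staple_utility,
}
```

In this sketch the paperclip-valuing agent picks the paperclip-valuing successor, and the staple-valuing agent likewise picks the staple-valuing one. Each is stable with respect to its own goals, whatever an outside observer thinks of those goals.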

[todo: untangle possible confusion about reflective stability not being "good": some readers want reflectively unstable agents because it seems bad to them if a paperclip maximizer stays a paperclip maximizer, or because they imagine causal decision theorists building something incrementally saner than causal decision theorists.]