"> The obvious patch is for ..."

https://arbital.com/p/7p

by Eliezer Yudkowsky Jun 18 2015


> The obvious patch is for a sufficiently sophisticated system to have preferences over its own behavior, which motivate it to avoid reasoning in ways that we would dislike.

My worry here would be that we'll run into a Nearest Unblocked Neighbor problem in our attempts to define sapience as a property of computer simulations.

For example, suppose that my utility function U is "how good [idealized Eliezer] thinks things are, after thinking for a thousand years." It doesn't take long to realize that [idealized Eliezer] would be unhappy with a literal simulation of [idealized Eliezer].

Let's say that sapience_1 is a definition that covers most of the 'actual definition of sapience' (e.g., what we'd come up with given unlimited time to think) that I'll call sapience_0, relative to some measure on probable computer programs. But there are still exceptions: there are sapient_0 things not detected by sapience_1. Among the hypotheses not flagged by sapience_1, the best one for predicting an actually sapient mind seems unusually likely to be one of the special cases that is still in sapience_0. It might even just be an obfuscated ordinary sapient program, rather than one with an exotic kind of sapience, if sapience_1 doesn't incorporate some advanced-safe way of preventing obfuscation.
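A toy numerical sketch may make the selection effect concrete (this is my illustration, not part of the original exchange; all names, rates, and the accuracy-sapience correlation are stipulated). The point: even if sapience_1's false negatives are rare under the base measure, an optimizer that takes the *best* unblocked predictor lands on a false negative almost every time, because accurately predicting a sapient mind is stipulated to correlate with being something like a sapient simulation of it.

```python
import random
from dataclasses import dataclass

random.seed(0)

@dataclass
class Hypothesis:
    sapient_0: bool  # the "true" sapience property (stipulated for the toy model)
    flagged_1: bool  # whether the imperfect detector sapience_1 flags it
    accuracy: float  # how well it predicts the target sapient mind

def sample_hypothesis() -> Hypothesis:
    sapient = random.random() < 0.05  # most programs under the measure aren't sapient
    # Stipulation: faithfully predicting a sapient mind tends to require
    # something like a sapient simulation of it, so sapient hypotheses
    # cluster at the top of the accuracy range.
    accuracy = random.uniform(0.90, 1.00) if sapient else random.uniform(0.00, 0.95)
    flagged = sapient and random.random() < 0.98  # detector misses 2% of sapient cases
    return Hypothesis(sapient, flagged, accuracy)

pool = [sample_hypothesis() for _ in range(200_000)]
allowed = [h for h in pool if not h.flagged_1]  # hypotheses surviving the block

# Base rate: false negatives are rare among the allowed hypotheses...
fn_rate = sum(h.sapient_0 for h in allowed) / len(allowed)
# ...but the optimizer doesn't sample at the base rate; it takes the argmax.
best = max(allowed, key=lambda h: h.accuracy)

print(f"sapient_0 fraction among allowed hypotheses: {fn_rate:.3%}")
print(f"best allowed predictor is sapient_0: {best.sapient_0}")
```

With these stipulated numbers, only about 0.1% of the unblocked pool is sapient_0, yet the argmax over predictive accuracy selects a sapient_0 hypothesis on essentially every run; the optimization pressure seeks out exactly the detector's blind spots.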

We can't throw a superhumanly sophisticated definition at the problem (e.g., the true sapience_0 plus an advanced-safe block against obfuscation) without already asking the AI to simulate us, or to predict the results of simulating us, in order to obtain this hypothetical sapience_2.

> Moreover, a primitive understanding of Eliezer's views suffices to avoid the worst offenses (or at least to realize that they are the kinds of things which Eliezer would prefer that a human be asked about first).

This just isn't obvious to me. It seems likely to me that an extremely advanced understanding of Eliezer's idealized views is required to answer questions about what Eliezer would say about consciousness, with extreme accuracy, without that understanding itself amounting to a simulation detailed enough to raise exactly the worry above.