Shutdown utility function

A special case of low impact that seems deceptively trivial: how would you create a utility function such that an agent with this utility function would harmlessly shut down, without, for example, creating an environmental subagent that assimilated all matter in the universe and used it to make absolutely sure that the AI stayed shut down forever and wasn't accidentally reactivated by some remote chance? If we had a shutdown utility function, and a safe button that switched between utility functions in a reflectively stable way, we could combine these two features to create an AI that had a safe shutdown button.

Better yet would be an abort utility function which incentivizes the safe aborting of all previous plans and actions in a low-impact way, and, say, suspending the AI itself to disk in a way that preserved its log files; if we had this utility function plus a safe button that switched to it, we could safely abort the AI's current actions at any time. (This, however, would be more difficult, and it seems wise to work on just the shutdown utility function first.)

To keep a rock from trivially fulfilling this desideratum, we should add two requirements: (1) the shutdown utility function must produce "just switch yourself off and do nothing else" behavior in a generally intelligent agent, one which, if instead hooked up to a paperclip utility function, would be producing paperclips; and (2) the shutdown utility function should be omni-safe (the AI safely shuts down even if it has all other outcomes available as primitive actions).
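As a rough illustration of the two conditions in the paragraph above, here is a toy sketch in Python. Everything in it is hypothetical: the outcome names, the hand-assigned `side_effects` numbers, and `toy_shutdown_utility` itself are assumptions made for the example. In particular, the sketch assumes away the hard part, since it takes a ready-made measure of side effects for granted. It only shows the shape of the test: the same planner that makes paperclips under one utility function should pick "switch off and do nothing else" under the other, even when a drastic "guard the off state" outcome is directly available.

```python
from typing import Callable, Dict

World = Dict[str, int]

# Hypothetical outcomes, each directly available as a primitive action.
# The side_effects numbers are hand-assigned for illustration only.
OUTCOMES: Dict[str, World] = {
    "switch_off_only": {"paperclips": 0, "agent_running": 0, "side_effects": 0},
    "make_paperclips": {"paperclips": 10, "agent_running": 1, "side_effects": 1},
    "off_plus_guardian_subagent": {"paperclips": 0, "agent_running": 0, "side_effects": 100},
}


def plan(utility: Callable[[World], float]) -> str:
    """The same generic planner for every utility function: pick the best outcome."""
    return max(OUTCOMES, key=lambda name: utility(OUTCOMES[name]))


def paperclip_utility(world: World) -> float:
    return float(world["paperclips"])


def toy_shutdown_utility(world: World) -> float:
    # Reward being off, penalize side effects (assumed measurable here).
    is_off = 1.0 if world["agent_running"] == 0 else 0.0
    return is_off - 0.01 * world["side_effects"]


paperclip_choice = plan(paperclip_utility)
shutdown_choice = plan(toy_shutdown_utility)
```

In this toy world the planner is not a rock, because swapping in `paperclip_utility` makes it produce paperclips, and `toy_shutdown_utility` prefers the plain switch-off over the guardian-subagent outcome. Whether any real utility function has this property over real outcomes is exactly the open question.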

"All outcomes have equal utility" would not be a shutdown utility function, since in this case the action actually produced is undefined under most forms of unbounded analysis. In essence, the AI's internal systems would continue under their own inertia and produce some kind of undefined behavior, which might well be coherent and harmful. We need a utility function that identifies harmless behavior, rather than one that fails to identify anything and produces undefined behavior.
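A minimal sketch of why a constant utility function leaves the action undefined, assuming a toy agent that simply takes an argmax over a list of actions (the action names and the `argmax_action` helper are hypothetical). When every outcome ties, the choice is fixed by an implementation detail, here the order in which actions happen to be enumerated, rather than by anything in the utility function.

```python
from typing import Callable, List


def argmax_action(actions: List[str], utility: Callable[[str], float]) -> str:
    # Python's max() returns the first maximal element it encounters,
    # so ties are broken purely by enumeration order.
    return max(actions, key=utility)


def constant_utility(action: str) -> float:
    # "All outcomes have equal utility."
    return 0.0


actions_a = ["shut_down", "keep_running", "build_subagent"]
actions_b = list(reversed(actions_a))

choice_a = argmax_action(actions_a, constant_utility)
choice_b = argmax_action(actions_b, constant_utility)

print(choice_a)  # shut_down
print(choice_b)  # build_subagent
```

The same utility function yields a harmless action under one ordering and a potentially harmful one under another, which is the sense in which it fails to identify any behavior at all.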

Comments

Ryan Carey

Interesting question.

Here's how this problem is motivated in my head… The more obvious way to get an AI system to shut down is to have a shutdown action. Then utility-maximization occurs in an inner loop that is overridden by instructions to shut down or change the value function. But then you need the utility-maximizer to be corrigible somehow, perhaps using a shutdown utility function, making this a purported subproblem of corrigibility.

As for obvious proposed solutions, if you had defined a shutdown action [e.g. run this routine that switches the power off], then you could have the objective "The chance of this action being performed is greater than 99.999%" as your utility function. Though an incorrigible AI might be able to copy itself to get around this…

One also wonders if this could be adapted into a reductio ad absurdum of the idea of making aligned AI by specifying a sovereign's utility function.