Omnipotence test for AI safety

https://arbital.com/p/omni_test

by Eliezer Yudkowsky Mar 26 2015 updated Mar 31 2016

Would your AI produce disastrous outcomes if it suddenly gained omnipotence and omniscience? If so, why did you program something that *wants* to hurt you and is held back only by lacking the power?


[summary: Suppose your AI suddenly became omniscient and omnipotent - suddenly knew all facts and could directly ordain any outcome as a policy option. Would the executing AI code lead to bad outcomes in that case? If so, why did you write a program that in some sense 'wanted' to hurt you and was only held in check by lack of knowledge and capability? Isn't that a bad way for you to configure computing power?

The Omni Test suggests, e.g., that you should not rely on a human agent to monitor the AI's current growth rate and intervene if something goes visibly wrong. Instead, growth should be measured internally, and cumulative growth should require external validation before proceeding. The former case fails if the AI becomes suddenly omnipotent; the latter does not. Or similarly, if weird new options open up to the AI, the AI should stay inside a conservatively whitelisted part of the option space until more user interactions have occurred. Or similarly, we should never write an AI that we think will cognitively search for a way to defeat its own security measures, even if we think the search will probably fail. See also Niceness is the first line of defense.]

Suppose your AI suddenly became omniscient and omnipotent - suddenly knew all facts and could directly ordain any outcome as a policy option. Would the executing AI code lead to bad outcomes in that case? If so, why did you write a program that in some sense 'wanted' to hurt you and was only held in check by lack of knowledge and capability? Isn't that a bad way for you to configure computing power? Why not write different code instead?

The Omni Test is that an advanced AI should be expected to remain aligned, or not lead to catastrophic outcomes, or fail safely, even if it suddenly knows all facts and can directly ordain any possible outcome as an immediate choice. The policy proposal is that, among agents meant to act in the rich real world, any predicted behavior where the agent might act destructively if given unlimited power (rather than e.g. pausing for a safe user query) should be treated as a bug.

Safety mindset

The Omni Test highlights any reasoning step on which we've presumed, in a non-failsafe way, that the agent must not obtain definite knowledge of some fact or that it must not have access to some strategic option. There are epistemic obstacles to our becoming extremely confident of our ability to lower-bound the reaction times or upper-bound the power of an advanced agent.

The deeper idea behind the Omni Test is that any predictable failure in an Omni scenario, or lack of assured reliability, exposes some more general flaw. Suppose NASA found that an alignment of four planets would cause their code to crash and a rocket's engines to explode. They wouldn't say, "Oh, we're not expecting any alignment like that for the next hundred years, so we're still safe." They'd say, "Wow, that sure was a major bug in the program." Correctly designed programs just shouldn't explode the rocket, period. If any specific scenario exposes a behavior like that, it shows that some general case is not being handled correctly.

The omni-safe mindset says that, rather than trying to guess what facts an advanced agent can't figure out or what strategic options it can't have, we just shouldn't make these guesses of ours load-bearing premises of an agent's safety. Why design an agent that we expect will hurt us if it knows too much or can do too much?

For example, rather than design an AI that is meant to be monitored for unexpected power gains by programmers who can then press a pause button - which implicitly assumes that no capability gain can happen fast enough that a programmer wouldn't have time to react - an omni-safe proposal would design the AI to detect unvetted capability gains and pause until the vetting had occurred. Even if it seemed improbable that some amount of cognitive power could be gained faster than the programmers could react, especially when no such previous sharp power gain had occurred even in the course of a day, etcetera, the omni-safe mindset says to just not build an agent that is unsafe when such background variables have 'unreasonable' settings. The correct general behavior is to, e.g., always pause when new capability has been acquired and a programmer has not yet indicated approval of its use. It might not be possible for an AGI design to suddenly use unlimited power optimally, or even use it in any safe way at all, but that's still no excuse for building an omni-unsafe system, because it ought to be possible to detect that case, say "Something weird just happened!", and suspend to disk.
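As a minimal sketch of this "pause until vetted" pattern - assuming a hypothetical agent interface (`available_capabilities`, `serializable_state`) and guard names that are illustrative only, not from this article - the check lives inside the agent's own control loop rather than depending on a human reacting in time:

```python
import pickle


class OmniSafeGuard:
    """Hypothetical guard: the agent checks its own capability growth and
    pauses for external vetting, instead of relying on a human to notice
    the growth and press a button in time."""

    def __init__(self, approved_fingerprints):
        # Fingerprints of capability sets a programmer has already vetted.
        self.approved_fingerprints = set(approved_fingerprints)

    def capability_fingerprint(self, agent):
        # Stand-in for some internal measure of what the agent can now do,
        # e.g. a hash of its self-reported inventory of skills and options.
        return hash(frozenset(agent.available_capabilities()))

    def check_before_acting(self, agent):
        fingerprint = self.capability_fingerprint(agent)
        if fingerprint not in self.approved_fingerprints:
            # Unvetted capability gain: don't guess whether it is dangerous.
            # Default to the safe behavior: stop and wait for approval.
            self.suspend_to_disk(agent, reason="unvetted capability gain")
            return False
        return True

    def suspend_to_disk(self, agent, reason):
        # "Something weird just happened!" -- persist state and halt.
        with open("agent_checkpoint.pkl", "wb") as f:
            pickle.dump(agent.serializable_state(), f)
        print(f"Paused pending programmer review: {reason}")


class DummyAgent:
    """Toy stand-in for an agent with an enumerable capability inventory."""

    def __init__(self, capabilities):
        self.capabilities = capabilities

    def available_capabilities(self):
        return self.capabilities

    def serializable_state(self):
        return {"capabilities": sorted(self.capabilities)}


guard = OmniSafeGuard(approved_fingerprints=[hash(frozenset({"move_gripper"}))])
agent = DummyAgent({"move_gripper", "rewrite_own_code"})  # an unvetted gain
assert guard.check_before_acting(agent) is False  # pauses instead of acting
```

The point of the sketch is where the check sits: growth is measured internally and cumulative growth requires external validation, so the safety of the design does not hinge on programmers out-reacting a sudden capability gain.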

Similarly, consider the paradigm of conservative planning. Rather than thinking in terms of blacklisting features of bad plans, we think in terms of whitelisting allowed plans using conservative generalizations. So long as we're narrowly whitelisting rather than blacklisting, lots of new option space suddenly opening up shouldn't result in any of those strange new options being taken until the users can whitelist more things.
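A toy illustration of the whitelisting stance (all action names hypothetical): because options are rejected by default, a sudden expansion of the option space yields no new behavior until a user explicitly widens the whitelist.

```python
# Conservative whitelist of action types a user has already approved.
ALLOWED_ACTION_TYPES = {"move_gripper", "report_status", "ask_user"}


def filter_plan(plan):
    """Return the plan only if every step uses an already-whitelisted action type."""
    for step in plan:
        if step["type"] not in ALLOWED_ACTION_TYPES:
            # Unknown or newly available option: the default is "not yet",
            # not "there is no rule against it".
            return None  # caller should pause and query the user instead
    return plan


# A strange new option appears after a capability gain; it is not taken.
novel_plan = [{"type": "move_gripper"}, {"type": "acquire_network_access"}]
assert filter_plan(novel_plan) is None  # rejected until a user whitelists it
```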

Similarly, the omni-safe mindset says that any time the AI is running a cognitive search for a way to do something harmful or bypass one of our security measures, we've already seen a bug in the design, even if we strongly expect that search to come up empty. Why configure computing power in a way that will search for a way to hurt you? This is clearly the wrong thing to do with computing power; we, the programmers, should have our code do something else instead. This is part of the agenda motivating the study of Corrigibility.

Trying to guess an agent's actual capabilities is a reasonable project, especially if there are more 'adversarial' fallback precautions that revolve around our estimates of the agent's actual powers. E.g., having designed an agent that doesn't want to gain unvetted communication channels with the outside world, we might then put the computing hardware into a Faraday cage and make sure there are no cellphone towers around - an 'adversarial' precaution meant to block communication even if the first line of defense fails and the AI starts wanting to communicate in an unvetted way. But 'adversarial' precautions are secondary lines of defense; a reasonable expectation of omni-safety is the first line of defense. First we assume that all adversarial fallback measures will fail, and design the agent to remain nonharmful or fail safely no matter what new capability or knowledge is gained. Then we assume the first line of defense has failed, and try, if it's at all possible or realistic, to put up fallback measures that will prevent total catastrophe so long as the agent has realistic amounts of power and can't violate what we think are 'the laws of physics' and so on.


Comments

Paul Christiano

Consider an AI system composed of many interacting subsystems, or a world containing many AI systems. Are you asking for safety even if one of these systems or subsystems becomes omniscient while the others do not? Clearly this would be a nice property to have if it were attainable, but it seems pretty ambitious. I'm also not convinced it's a big deal one way or the other, because I don't expect there to be massive unnoticed (by the AI systems that are designing new AI systems) disparities in power during normal operation. So whether designing for such disparities is useful seems to depend on an empirical claim about the plausibility of big differentials.

You could make your original point with respect to differentials: "if it fails for a large enough differential, then why think the real differential is small enough?" But I don't find this very compelling when we can say relatively precisely what kind of differential is small enough.