Directing, vs. limiting, vs. opposing

https://arbital.com/p/direct_limit_oppose

by Eliezer Yudkowsky Jan 16 2017 updated May 23 2017

Getting the AI to compute the right action in a domain; versus getting the AI to not compute at all in an unsafe domain; versus trying to prevent the AI from acting successfully. (Prefer the first two.)


[summary: With respect to the theory of constructing sufficiently advanced AIs in ways that yield good outcomes, we can try to 'direct' the AI (get it to compute the right action within a domain), 'limit' the AI (have it not compute at all within a domain whose computations we don't yet trust), or 'oppose' the AI (try to prevent it from successfully doing something it is in fact trying to do). The first two approaches are preferred.]

'Directing' versus 'limiting' versus 'opposing' is a proposed conceptual distinction between three ways of getting good outcomes and avoiding bad outcomes, when running a sufficiently advanced Artificial Intelligence:

- Direction: we try to get the right thing to happen within a domain by having the AI compute the right action there.
- Limitation: we try to keep a bad thing from happening by having the AI not compute within that domain at all.
- Opposition: we try to prevent a bad outcome even though the AI is computing (or would compute) ways to bring it about.

For example: arranging for the AI to positively want to suspend to disk when its shutdown button is pressed is direction; declining to run the AI's planning process inside a domain we haven't yet aligned is limitation; and a big red lever that cuts power to the computing center, which we hope the AI cannot circumvent, is opposition.

A fourth category not reducible to the other three might be stabilizing, e.g. numerical stability of floating-point algorithms, not having memory leaks in the code, etcetera. These are issues that a sufficiently advanced AI would fix in itself automatically, but an insufficiently advanced AI might not, which causes problems either if early errors introduce changes that are reflectively stable later, or if we are intending to run the AI in insufficiently-advanced mode.
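As a very low-level illustration of the numerical-stability point, here is a minimal Python sketch (my own example, not from the original article): naive accumulation of floating-point values drifts as rounding errors pile up, while compensated (Kahan) summation keeps the error bounded. The specific values are only illustrative.

```python
def naive_sum(values):
    """Accumulate values directly; rounding error grows with the number of terms."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated summation: track and re-add the low-order bits lost at each step."""
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total


values = [0.1] * 10_000_000
print(naive_sum(values))  # drifts noticeably away from 1,000,000
print(kahan_sum(values))  # much closer to 1,000,000
```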

Not running adversarial searches

Under the [agents_as_searches] view, we can see an agent as embodying computations that search out actions within a domain. The non-adversarial principle then says that we should not be running a search which is looking for ways to hurt us.

We can achieve this desideratum in one of two ways:

- Direction: ensure that the computations the AI runs inside the domain are aligned, so that its searches there are pointed at outcomes we actually want.
- Limitation: don't have the AI run computations inside that domain at all, until we do trust them.

If both of these measures fail, then as a [strictly fallback second line of defense], we could try to make a bad search return a null result (the AI doesn't find any way to do the harmful thing).
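To make the distinction concrete, here is a toy Python sketch (my own illustration, not a design from the article); `actions_in`, `score`, and `looks_harmful` are hypothetical stand-ins for the AI's option generator, its hopefully-aligned evaluation function, and a last-ditch output filter.

```python
ALIGNED_DOMAINS = {"chess", "scheduling"}  # domains whose searches we currently trust


def choose_action(domain, actions_in, score, looks_harmful):
    # Limitation: if we haven't aligned the AI's computations in this domain,
    # don't run the search there at all.
    if domain not in ALIGNED_DOMAINS:
        return None

    # Direction: within a trusted domain, the evaluation function itself is
    # supposed to point the search at the right actions.
    best = max(actions_in(domain), key=score, default=None)

    # Fallback: try to make a bad search return a null result.
    # We should not be *relying* on this check catching anything.
    if best is not None and looks_harmful(best):
        return None
    return best
```

The only point of the sketch is the ordering of the defenses: the whitelist and the evaluation function are the measures we rely on; the final filter is the layer we hope never matters.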

For example: The non-adversarial principle implies that if you build a shutdown button for the AI that suspends the AI to disk when pressed, you must also ensure:

- That the AI positively wants to be suspended to disk when the button is pressed (or at least does not prefer to go on running);
- That the AI is not searching for ways to prevent the button from being pressed, or to resume operation after being suspended.

Or:

- That the AI is not running computations about the shutdown button, or about its own suspension, at all.

The first set of measures would be 'direction'--if those measures work, it's okay for the AI to think here because we think those thoughts are pointed in the right direction. The second approach would be limitation--we don't think the AI's computations are aligned inside this domain yet, so until then, we shouldn't be running computations there.

As a final, strictly fallback line of defense, you could put a big red lever in the computing center that shut off all the electricity even if the AI was failing to suspend to disk. (But you're not relying on this working; it's just there as a maybe-it-might-work measure that might buy you a little more probable safety even if the primary lines of defense failed.)
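For concreteness, an oppositional fallback of this kind might look like the toy watchdog below (my own sketch, not the article's); `ai_process` is a hypothetical process handle and `suspend_requested` a hypothetical signal that the shutdown button was pressed. In a real deployment the 'lever' would cut power at the hardware level rather than killing a process.

```python
import time


def watchdog(ai_process, suspend_requested, grace_seconds=30):
    """Force termination if the cooperative suspend-to-disk path fails."""
    requested_at = None
    while ai_process.poll() is None:  # AI process still running?
        if suspend_requested():
            if requested_at is None:
                requested_at = time.monotonic()
            elif time.monotonic() - requested_at > grace_seconds:
                ai_process.kill()  # the "big red lever"
                return "forced"
        time.sleep(1)
    return "suspended cleanly"
```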

Relation to other non-adversarial ideas

The direction/limitation/opposition distinction can help state other ideas from the AI safety mindset. For example:

The principle that 'niceness is the first line of defense' can be rephrased as follows: When designing an AGI, we should imagine that all 'oppositional' measures are absent or have failed, and think only about 'direction' and 'limitation'. Any oppositional measures are then added on top of that, just in case.

Similarly, the Omnipotence test for AI safety says that when thinking through our primary design for alignment, we should think as if the AGI will just get Internet access on some random Tuesday. This says that we should design an AGI that is limited by [whitelisting not wanting to act in newly opened domains without some programmer action], rather than relying on the AI to be unable to reach the Internet until we've finished aligning it.
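A minimal sketch of that kind of limitation, under my own assumptions rather than any design in the article: the AI treats every domain as closed unless a programmer has explicitly enabled it, so newly gaining the ability to reach a domain (say, the Internet) does not by itself lead to acting there.

```python
class DomainWhitelist:
    """Domains are closed by default; acting in one requires prior programmer action."""

    def __init__(self):
        self._enabled = set()

    def enable(self, domain):
        # Explicit programmer action, not something the AI does for itself.
        self._enabled.add(domain)

    def may_act_in(self, domain):
        return domain in self._enabled


whitelist = DomainWhitelist()
whitelist.enable("chess")

assert whitelist.may_act_in("chess")
assert not whitelist.may_act_in("internet")  # newly reachable is not the same as enabled
```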