Relevant limited AI

It is an open problem to propose a limited AI that would be relevant to the value achievement dilemma - an agent cognitively constrained along some dimensions that render it much safer, but still able to perform some task useful enough to prevent catastrophe.

Basic difficulty

Consider an Oracle AI so constrained that it is allowed only to output proofs, in HOL, of input theorems. These proofs are checked by a simple and secure-seeming verifier running in a sandbox, whose exact code is unknown to the Oracle. The verifier outputs 1 if the proof is valid and 0 otherwise, then discards the proof data. Suppose also that the Oracle is in a shielded box, and so on.
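The one-bit interface described above can be illustrated with a toy sketch. This is not a HOL checker: it assumes a deliberately tiny proof system (a fixed set of axioms plus modus ponens), and the names `verify`, `implies`, and the example axioms are hypothetical, introduced only for illustration. The point it shows is that the caller learns a single bit and never sees the proof itself.

```python
from typing import FrozenSet, Sequence, Tuple, Union

# A formula is an atom (a string) or an implication ("->", antecedent, consequent).
# ASSUMPTION: this toy logic stands in for HOL purely for illustration.
Formula = Union[str, Tuple[str, "Formula", "Formula"]]


def implies(a: Formula, b: Formula) -> Formula:
    """Build the implication a -> b."""
    return ("->", a, b)


def verify(theorem: Formula, proof: Sequence[Formula], axioms: FrozenSet[Formula]) -> int:
    """Return 1 if `proof` is a valid derivation of `theorem`, else 0.

    Each step must be an axiom or follow by modus ponens from earlier steps.
    Only the single bit leaves this function; the proof is not returned or stored.
    """
    derived = []
    for step in proof:
        is_axiom = step in axioms
        # Modus ponens: some earlier `premise` and earlier `premise -> step`.
        by_modus_ponens = any(
            ("->", premise, step) in derived for premise in derived
        )
        if not (is_axiom or by_modus_ponens):
            return 0
        derived.append(step)
    # The last line of the proof must be the theorem that was asked about.
    return 1 if derived and derived[-1] == theorem else 0


# Hypothetical example: from p, p -> q, and q -> r, derive r.
axioms = frozenset({"p", implies("p", "q"), implies("q", "r")})
good_proof = ["p", implies("p", "q"), "q", implies("q", "r"), "r"]
bad_proof = ["r"]  # asserts the conclusion without deriving it

accepted = verify("r", good_proof, axioms)
rejected = verify("r", bad_proof, axioms)
```

Under these assumptions `accepted` is 1 and `rejected` is 0. A real verifier for HOL would be far larger, which is part of why the text calls it only "secure-seeming".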

It's possible that this Provability Oracle has been so constrained that it is cognitively containable (it has no classes of options we don't know about). If the verifier is unhackable, it gives us trustworthy knowledge that a theorem is provable. But this limited system is not obviously useful in a way that enables humanity to extricate itself from its larger dilemma. Nobody has yet stated a plan which could save the world if only we had a superhuman capacity to detect which theorems were provable in Zermelo-Fraenkel set theory.

Saying "The solution is for humanity to only build Provability Oracles!" does not resolve the value achievement dilemma because humanity does not have the coordination ability to 'choose' to develop only one kind of AI over the indefinite future, and the Provability Oracle has no obvious use that prevents non-Oracle AIs from ever being developed. Thus our larger value achievement dilemma would remain unsolved. It's not obvious how the Provability Oracle would even constitute significant strategic progress.

Open problem

Describe a cognitive task or real-world task for an AI to carry out that makes great progress upon the value achievement dilemma if executed correctly, and that can be done with a limited AI that:

Has a real-world solution state that is exceptionally easy to pinpoint using a utility function, thereby avoiding some of the problems of edge instantiation, unforeseen maximums, context change, programmer maximization, and the other pitfalls of advanced safety, if there is otherwise a trustworthy solution for low-impact AI; or
Seems exceptionally implementable using a known-algorithm non-self-improving agent, thereby averting problems of stable self-modification, if there is otherwise a trustworthy solution for a known-algorithm non-self-improving agent; or
Constrains the agent's option space so drastically as to make the strategy space not be rich (and the agent hence containable), while still containing a trustworthy, otherwise unfindable solution to some challenge that resolves the larger dilemma.

[todo: Additional difficulties]

[todo: (Fill in this section later; all the things that go wrong when somebody eagerly says something along the lines of "We just need AI that does X!")]

Comments

Paul Christiano

Normally I think that you set the bar too high for yourself. In this case, I think that you would be justified in setting the bar much higher (I guess if we disagreed in the same direction in every case, it wouldn't be clear that we were really disagreeing).

If you design a "safe" AI which is much less efficient (say 10x more expensive to do the same things) than an unsafe AI, that may be useful, but it does not seem to resolve what you call the value achievement dilemma. It would need to be coupled with very good coordination to prevent people from deploying the more efficient, unsafe AI.

So I think it is reasonable to set the bar at safe systems that act in the world (acquire resources, produce things, influence politics…) nearly as effectively as any unsafe system that we could construct using the same underlying technologies.

This kind of requirement seems much more important than (e.g.) ensuring that your system remains safe if it were to suddenly become infinitely powerful.

This disagreement likely relates to our disagreement about the likely pace and dynamics of AI development. One difference is that in this case assuming a fast takeoff may actually be less conservative. So if you want to push the "plan for the worst" line, it seems like you should probably be pessimistic about an intelligence explosion where that would be inconvenient, but also be pessimistic about the tolerable gaps in efficiency where that would be inconvenient.