[summary: The Zermelo-Fraenkel provability oracle is a [ thought experiment] illustrating the safety-capability tradeoff: a narrowly Boxed advanced agent, wrapped in multiple layers of software simulation, that can only output purported proofs of prespecified theorems in Zermelo-Fraenkel set theory, which proofs are then routed to a Boxed [proof-verifier] which checks the proof to make sure all the derivations follow syntactically. Then the verifier outputs a 0 or 1 saying whether the prespecified theorem was proven from the Zermelo-Fraenkel axioms. While this might be restrictive enough to actually be safe, and the result trustworthy, it's not clear what pivotal achievement could be carried out thereby.]
The Zermelo-Fraenkel provability oracle is the endpoint of the following conversation:
Alice: Let's build an AI that only answers questions. That sounds pretty safe to me. In fact, I don't understand why all you lot keep talking about building AIs that don't just answer questions, when the question-answering systems sound so much safer to me.
Bob: Suppose you screw up the goal system and the AI starts wanting to do something besides answer questions? This system is only as safe as your ability to make an AI with a stable goal system that actually implements your intended goal of "not doing anything except answering questions", which is [oracle_utility_function difficult to formalize].
Alice: But surely it's safer? Surely it's easier to develop a goal system that just safely answers questions?
Bob: How much safer? How much easier? If we hypothesize restricting the abilities of the AI in this way, we're reducing the potential payoff if we actually succeed. An AI that can act in the world and build its own nanotechnology can potentially protect us from other AIs becoming superintelligent, at least as a temporary fix to the immediate dilemma and the immediate danger of death, if not as an eternal solution. An AI that can act in the real world using nanotechnology would be, in our parlance, a pivotal achievement. Getting from a question-answering AI to a safer world seems a lot harder, unless the question you ask the AI is "How to do I build an intelligent, directed nanotechnological system?" in which case you might as not well bother pretending it's an Oracle. So we're losing something by supposing that the system only answers questions - we're losing some of the pivotal-ness of the hypothetical success. If you'll pardon the extreme example, if the Oracle system is only 0.01% safer or 99.9% as difficult to build as a Genie, then it definitely wouldn't be worth the decreased probability of being able to win the larger dilemma using a successful Oracle instead of a successful Genie. So the question is, how useful is the Oracle compared to the Genie, and given that you're going to use it that way, how much safer is it, or easier to build, compared to the Genie? An "Oracle" that immediately writes the code for a Genie is equally useful, but it's definitely not any easier to build safely.
Alice: But there are substantial gains in safety, or in ease of building the system to adequate safety, from adopting a question-answering design. Like, imagine that the computer is in a sealed chamber on the Moon, without any attached robots or robot arms, and just a single radio downlink to send the answers to the questions we ask. Then we'll be safe even if the goal design fails and it turns against us, because, you know, it'll be stuck on the Moon without any hands. To me this obviously seems a lot safer than, say, a Genie running around in the world with its own nanotechnology. Less useful assuming it works, yes, but also a lot less dangerous assuming it fails.
Bob: Are you sure you're properly respecting the intelligence of something vastly smarter than you? We're not smart enough to consider all the strategies it might imagine. What if there's some creative thing you can do even if you're in a sealed chamber on the Moon? There are radio waves leaving the chamber that depend on the flows of electrons in transistors -
Alice: Okay, a sealed chamber and we're running it in three layers of sandboxed simulation and we haven't told the AI about the hardware it's running on or about any of the emulation layers. I think it'd be pretty hard for the AI to manipulate the radio waves in that condition. It just wouldn't be able to figure out a relation between anything it could control, and the radio waves going out, and even if it could modulate the radio waves some tiny amount, do you really think that could somehow end the world?
[todo: finish the conversation]
Work in progress
bob: it can manipulate the human
alice: we won't tell it about humans
bob: it may be able to deduce an awful lot by looking at its own code.
alice: okay it gives a bare answer, like just a few bits.
bob: timing problem
alice: time it, obviously.
bob: what the heck do you think you can do with just a few bits that's interesting?
alice: we'll ask it which of two policies is the better policy for getting the world out of the AI hole.
bob: but how can you trust it?
alice: it can't coordinate with future AIs, but humans are trustworthy because we're honorable.
bob: cognitive uncontainability zaps you! it can totally coordinate. we have proof thereof. so any question it answers could just be steering the future toward a hostile AI. you can't trust it if you screwed up the utility function.
alice: okay fine, it outputs proofs of predetermined theorems in ZF that go into a verifier which outputs 0 or 1 and is then destroyed by thermite.
bob: I agree that this complete system might actually work.
bob: but unfortunately it's now useless.
alice: useless? we could ask it if the Riemann Hypothesis is provable in ZF -
bob: and how is that going to save the world, exactly?
alice: it could generally advance math progress.
bob: and now we're back at the "contributing one more thingy to a big pool" that people do when they're trying to promote their non-pivotal exciting fun thing to pivotalness.
alice: that's not fair. it could advance math a lot. beyond what humans could do for two hundred years.
bob: but the kind of math we do at MIRI frankly doesn't depend on whether RH is true. it's very rare that we know exactly what theorem we might want to prove, and we don't really know whether it's a consequence of ZF, and if we do know, we can move on. that's basically never happened in our whole history, except for maybe 7 days at a time. if we'd had your magic box the whole time, there are certain fields of math that would be decades ahead, but our field would be maybe a week or a month ahead. most of our work is in thinking up the theorems, not proving them.
alice: so what if it could suggest theorems to prove? like show you the form of a tiling agent?
bob: if it's showing us the outline of a framework for AI and proving something about it, I'm now very worried that it's going to manage to smuggle in some deadly trick of decision theory that lets it take over somehow. we've found all sorts of gotchas that could potentially do that, like Pascal's Mugging, probable environment hacking, blackmail… by unrestricting it to be able to suggest which theorems it can prove, you've given it a rich output channel and a wide range of options for influencing future AI design.
- supports methodology of foreseeable difficulties and its rule of explicitization and critique: when something looks easy or safe to someone, it often isn't.
- intuitive notions of what sort of restricted or limited AI should be "much easier to build" are often wrong, and this screws up the first intuition of how much safety we're gaining in exchange for putatively relinquishing some capabilities
- by the time you correct this bias, the amount of capability you actually have to relinquish in or order to get a bunch of safety and development ease, often makes the putative AI not have any obvious use
- it is wise to be explicit about how you intend to use the restricted AI to save the world, because trying to think of concrete details here will often reveal that anything you think of either (1) isn't actually very safe or (2) isn't actually very useful.