[summary(Brief): In modern AI and especially in value alignment theory, there's a sharp divide between "problems we know how to solve using unlimited computing power", and "problems we can't state how to solve using computers larger than the universe". Not knowing how to do something with unlimited computing power reveals that you are confused about the structure of the problem. It is presently a disturbing fact that we don't know how to solve many aspects of value alignment even given unlimited computing power - we can't state a program that would be a nice AI if we ran it on a sufficiently large computer. "Unbounded analysis" tries to improve what we understand how to do in principle, in that sense.]

[summary: In modern AI and especially in value alignment theory, there's a sharp divide between "problems we know how to solve using unlimited computing power", and "problems we can't state how to solve using computers larger than the universe". Not knowing how to do something with unlimited computing power reveals that you are confused about the structure of the problem. The first paper ever written on computer chess was by Shannon in 1950, which gave in passing the unbounded algorithm for solving chess. The 1997 program Deep Blue beat the human world champion 47 years later. In 1836, Edgar Allen Poe carefully argued that no automaton could ever play chess, since at each turn there are many possible moves by the player and opponent, but gears and wheels can only represent deterministic motions. The year 1836 was confused about the nature of the cognitive problem involved in chess; by 1950, we understood the work in principle that needed to be performed; in 1997 it was finally possible to play chess better than humans do. It is presently a disturbing fact that we don't know how to build a nice AI even given unlimited computing power, and there are arguments for tackling this problem specifically as such.]

[summary(Technical): Much current technical work in value alignment theory takes place against a background of unbounded computing power, and other simplifying assumptions such as perfect knowledge, environments simpler than the agent, full knowledge of other agents' code, etcetera. Real-world agents will be bounded, have imperfect knowledge, etcetera. There are four primary reasons for doing unbounded analysis anyway:

If we don't know how to solve a problem even given unlimited computing power, that means we're confused about the nature of the work to be performed. It is often worthwhile to tackle this basic conceptual confusion in a setup that exposes the confusing part of the problem as simply as possible, and doesn't introduce further complications until the core issue is resolved.
Introducing 'realistic' complications can make it difficult to engage in cooperative discourse about which ideas have which consequences - AIXI was a watershed moment because it was formally specified and there was no way to say "Oh, I didn't mean that" when somebody pointed out that AIXI would try to seize control of its own reward channel.
Increasing the intelligence of an advanced agent may sometimes move its behavior closer to ideals and further from specific complications of an algorithm. Early, stupid chess algorithms had what seemed to humans like weird idiosyncracies tied to their specific algorithms. Modern chess programs, far beyond the human champions, can from an intuitive human perspective be seen as just making good chess moves.
Current AI algorithms are often incapable of demonstrating future phenomena that seem like they should predictably occur later, and whose interesting properties seem like they can be described using an unbounded algorithm as an example. E.g. current AI algorithms are very far from doing free-form self-reprogramming, but this is predictably the sort of issue we might encounter later.]

Summary

"Unbounded analysis" refers to determining the behavior of a computer program that would, to actually run, require an unphysically large amount of computing power, or sometimes hypercomputation. If we know how to solve a problem using unlimited computing power, but not with real-world computers, then we have an "unbounded solution" but no "bounded solution".

As a central example, consider computer chess. The first paper ever written on computer chess, by Claude Shannon in 1950, gave an unbounded solution for playing perfect chess by exhaustively searching the tree of all legal chess moves. (Since a chess game draws if no pawn is moved and no piece is captured for 50 moves, this is a finite search tree.) Shannon then passed to considering more bounded ways of playing imperfect chess, such as evaluating the worth of a midgame chess position by counting the balance of pieces, and searching a smaller tree up to midgame states. It wasn't until 47 years later, in 1997, that Deep Blue beat Garry Kasparov for the world championship, and there were multiple new basic insights along the way, such as alpha-beta pruning.

In 1836, there was a sensation called the Mechanical Turk, allegedly a chess-playing automaton. Edgar Allen Poe, who was also an amateur magician, wrote an essay arguing that the Turk must contain a human operator hidden in the apparatus (which it did). Besides analyzing the Turk's outward appearance to locate the hidden compartment, Poe carefully argued as to why no arrangement of wheels and gears could ever play chess in the first place, explicitly comparing the Turk to "the calculating machine of Mr. Babbage":

Arithmetical or algebraical calculations are, from their very nature, fixed and determinate. Certain data being given, certain results necessarily and inevitably follow […] But the case is widely different with the Chess-Player. With him there is no determinate progression. No one move in chess necessarily follows upon any one other. From no particular disposition of the men at one period of a game can we predicate their disposition at a different period […] Now even granting that the movements of the Automaton Chess-Player were in themselves determinate, they would be necessarily interrupted and disarranged by the indeterminate will of his antagonist. There is then no analogy whatever between the operations of the Chess-Player, and those of the calculating machine of Mr. Babbage […] It is quite certain that the operations of the Automaton are regulated by mind, and by nothing else. Indeed this matter is susceptible of a mathematical demonstration, a priori.

(In other words: In an algebraical problem, each step follows with the previous step of necessity, and therefore can be represented by the determinate motions of wheels and gears as in Charles Babbage's proposed computing engine. In chess, the player's move and opponent's move don't follow with necessity from the board position, and therefore can't be represented by deterministic gears.)

This is an amazingly sophisticated remark, considering the era. It even puts a finger on the part of chess that is computationally difficult, the combinatorial explosion of possible moves. And it is still entirely wrong.

Even if you know an unbounded solution to chess, you might still be 47 years away from a bounded solution. But if you can't state a program that solves the problem in principle, you are in some sense confused about the nature of the cognitive work needed to solve the problem. If you can't even solve a problem given infinite computing power, you definitely can't solve it using bounded computing power. (Imagine Poe trying to write a chess-playing program before he'd had the insight about search trees.)

We don't presently know how to write a Python program that would be a nice AI if we ran it on a unphysically large computer. Trying directly to cross this conceptual gap by carving off pieces of the problem and trying to devise unbounded solutions to them is "the methodology of unbounded analysis in AI alignment theory".

Since "bounded agent" has come to mean in general an agent that is realistic, the term "unbounded agent" also sometimes refers to agents that:

Perfectly know their environments.
Can fully simulate their environments (the agent is larger than its environment and/or the environment is very simple).
Operate in turn-based, discrete time (the environment waits for the agent to compute its move).
Cartesian agents that are perfectly separated from the environment except for sensory inputs and motor outputs.

These are two importantly distinct kinds of unboundedness, and we'll refer to the above list of properties as "unrealism" to distinguish them here. Sufficiently unrealistic setups are also called "toy models" or "toy problems".

Unrealistic setups have disadvantages, most notably that the results, observations, and solutions may fail to generalize to realistic applications. Correspondingly, the classic pitfall of unbounded analysis is that it's impossible to run the code, which means that certain types of conceptual and empirical errors are more likely to go uncaught (see below).

There are nonetheless several forces pushing technical work in value alignment theory toward unbounded analyses and unrealistic setups, at least for now:

Attacking confusion in the simplest settings. If we don't know how to solve a problem given unlimited computing power, this means we're confused about the nature of the work to be performed. It is sometimes worthwhile to tackle this conceptual confusion in a setup that tries to expose the confusing part of the problem as simply as possible. Trying to bound the proposed solution or make it realistic can introduce a lot of complications into this discussion, arguably unnecessary ones. (Deep Blue was far more complicated than Shannon's ideal chess program, and it wouldn't be doing Edgar Allen Poe any favors to show him Deep Blue's code and hide away Shannon's ideal outline.)
Unambiguous consequences and communication. Introducing 'realistic' complications can make it difficult to engage in cooperative discourse about which ideas have which consequences. (AIXI was a watershed moment for alignment theory because AIXI was formally specified and there was no way to say "Oh, I didn't mean that" when somebody pointed out that AIXI would try to seize control of its own reward channel.)
More advanced agents might be less idiosyncratic. Increasing the cognitive power of an agent may sometimes move its behavior closer to ideals and further from specific complications of an algorithm. (From the perspective of a human onlooker, early chess algorithms seemed to have weird idiosyncracies tied to their specific algorithms. Modern chess programs can from an intuitive human perspective be seen as just making good chess moves.)
Runnable toy models often aren't a good fit for advanced-agent scenarios. Current AI algorithms are often not good natural fits for demonstrating certain phenomena that seem like they might predictably occur in sufficiently advanced AIs. Usually it's a lot of work to carve off a piece of such a phenomenon and pose it as an unbounded problem. But that can still be significantly easier to fit into an unbounded setting than into a toy model. (All examples here are too complicated for one sentence, but see the subsection below and the discussion of Utility indifference and Tiling agents theory.)

Even to the extent that these are good reasons, standard pitfalls of unbounded analyses and unrealistic setups still apply, and some of our collective and individual precautions against them are discussed below.

For historical reasons, computer scientists are sometimes suspicious of unbounded or unrealistic analyses, and may wonder aloud if they reflect unwillingness or incapacity to do a particular kind of work associated with real-world code. For discussion of this point see [ Why MIRI uses unbounded analysis].

Attacking confusion in the simplest settings.

Imagine somebody claiming that an ideal chess program ought to evaluate the ideal goodness of each move, and giving their philosophical analysis in terms of a chess agent which knows the perfect goodness of each move, without ever giving an effective specification of what determines the ideal goodness, but using the term $\gamma$ to symbolize it so that the paper looks very formal. We can imagine an early programmer sitting down to write a chess-playing program, crafting the parts that take in user-specified moves and the part that displays the chess screen, using random numbers for move goodness until they get around to writing the "move-goodness determining module", and then finally finding themselves being unable to complete the program; at which point they finally recognize that all the talk of "the ideal goodness $\gamma$ of a chess move" wasn't an effective specification.

Part of the standard virtue ethics of computer science includes an injunction to write code in order to force this kind of grad student to realize that they don't know how to effectively specify something, even if they symbolized it using a Greek letter. But at least this kind of ineffectiveness seems to be something that some people can learn to detect without actually running the code - consider that, in our example above, the philosopher-programmer realized that they didn't know how to compute $\gamma$ at the point where they couldn't complete that part of the program, not at a point where the program ran and failed. Adding on all the code to take in user moves and display a chess board on the screen only added to the amount of time required to come to this realization; once they know what being unable to write code feels like, they might be able to come to the same realization much faster by standing in front of a whiteboard and failing to write pseudocode.

At this point, it sometimes makes sense to step back and try to say exactly what you don't know how to solve - try to crisply state what it is that you want an unbounded solution for. Sometimes you can't even do that much, and then you may actually have to spend some time thinking 'philosophically' - the sort of stage where you talk to yourself about some mysterious ideal quantity of move-goodness and you try to pin down what its properties might be. It's important not to operate in this stage under the delusion that your move-goodness $\gamma$ is a well-specified concept; it's a symbol to represent something that you're confused about. By asking for an unbounded solution, or even an effectively-specified representation of the unbounded problem, that is, asking for pseudo-code which could be run given nothing more than an unphysically large computer (but no otherwise-unspecified Oracle of Move Goodness that outputs $\gamma$ ), we're trying to invoke the "Can I write code yet?" test on whatever philosophical reasoning we're doing.

Can trying to write running code help at this stage? Yes, depending on how easy it is to write small programs that naturally represent the structure of the problem you're confused about how to solve. Edgar Allen Poe might have been willing to concede that he could conceive of deterministic gears that would determine whether a proposed chess move was legal or illegal, and it's possible that if he'd actually tried to build that automaton and then try to layer gears on top of that to pick out particular legal moves by whatever means, he might have started to think that maybe chess was computable after all - maybe even have hit upon a representation of search, among other possible things that gears could do, and so realized how in principle the problem could be solved. But this adds a delay and a cost to build the automaton and try out variations of it, and complications from trying to stay within the allowed number of gears; and it's not obvious that there can't possibly be any faster way to hit upon the idea of game-tree search, say, by trying to write pseudocode or formulas on a whiteboard, thinking only about the core structure of the game. If we ask how it is that Shannon had an easier time coming up with the unbounded solution (understanding the nature of the work to be performed) than Poe, the most obvious cause would be the intervening work by Church and Turing (among others) on the nature of computation.

And then in other cases it's not obvious at all how you could well-represent a problem using current AI algorithms, but with enough work you can figure out how represent the problem in an unbounded setting.

The pitfall of simplying away the key, confusing part of the problem.

The tiling agents problem, in the rocket alignment metaphor, is the metaphorical equivalent of trying to fire a cannonball around a perfectly spherical Earth with no atmosphere - to obtain an idea how any kind of "stable orbit" can work at all. Nonmetaphorically, the given problem is to exhibit the simplest nontrivial case of stable self-modification - an agent that, given its current reasoning, wants to create an agent with similar properties as a successor, i.e., preserving its current goals.

In a perfect world, we'd be able to, with no risk, fire up running code for AI algorithms that reason freely about self-modification and have justified beliefs about how alternative versions of their own code will behave and the outer consequences of that behavior (the way you might imagine what would happen in the real world if you took a particular drug affecting your cognition). But this is way, way beyond current AI algorithms to represent in any sort of natural or realistic way.

A bad "unbounded solution" would be to suppose agents that could exactly simulate their successors, determine exactly what their successors would think, extrapolate the exact environmental consequences, and evaluate those. If you suppose an exact simulation ability, you don't need to consider how the agent would reason using generalizations, abstractions, or probabilities… but this trivial solution doesn't feel like it sheds any light or gets us any closer to understanding reflective stability; that is, it feels like the key part of the problem has been simplified away and solving what remains was too easy and didn't help.

Faced with an "unbounded solution" you don't like, the next step is to say crisply exactly what is wrong with it in the form of a new desideratum for your solution. In this case, our reply would be that for Agent 1 to exactly simulate Agent 2, Agent 1 must be larger than Agent 2, and since we want to model stable self-modification, we can't introduce a requirement that Agent 2 be strictly weaker than Agent 1. More generally, we apply the insight of Vinge's Principle to this observation and arrive at the desiderata of Vingean uncertainty and Vingean reflection, which we also demand that an unbounded solution exhibit.

This illustrates one of the potential general failure modes of using an unbounded setup to shed conceptual light on a confusing problem - namely, when you simplify away the key confusing issue you wanted to resolve, and end up with a trivial-seeming solution that sheds no light on the original problem. A chief sign of this is when your paper is too easy to write. The next action is to say exactly what you simplified away, and put it in the form of a new desideratum, and try to say exactly why your best current attempts can't meet that desideratum.

So we've now further added: we want the agent to generalize over possible exact actions and behaviors of its successor rather than needing to know its successor's exact actions in order to approve building it.

With this new desideratum in hand, there's now another obvious unbounded model: consider deterministic agents operating in environments with known rules that reason about possible designs and the environment using first-order logic. The agent then uses an unbounded proof search, which no current AI algorithm could tackle in reasonable time (albeit a human engineer would be able to do it with a bunch of painstaking work) to arrive at justified logical beliefs about the effect of its successor on its environment. This is certainly still extremely unrealistic; but has this again simplified away all the interesting parts of the problem? In this case we can definitely reply, "No, it does expose something confusing" since we don't in fact know how to build a tiling agent under this setup. It may not capture all the confusing or interesting parts of the problem, but it seems to expose at least one confusion. Even if, as seems quite possible, it's introduced some new problem that's an artifact of the logical setup and wouldn't apply to agents doing probabilistic reasoning, there's then a relatively crisp challenge of, "Okay, show me how probabilistic reasoning resolves the problem, then."

It's not obvious that there's anything further to be gained by trying to create a toy model of the problem, or a toy model of the best current unsatisfactory partial solution, that could run as code with some further cheating and demo-rigging, but this is being tried anyway. The tiling agents problem did spend roughly nine years exclusively on paper before that, and the best current unsatisfactory solution was arrived at with whiteboards.

The pitfall of residual terms.

Besides "simplifying away the confusing part of the problem", another way that unbounded thinking can "bounce off" a confusing problem is by creating a residual term that encapsulates the confusion. Currently, there are good unbounded specifications for [cartesian Cartesian] non-self-modifying expected reward maximizers: if we allow the agent to use unlimited computing power, don't allow the environment to have unlimited computing power, don't ask the agent to modify itself, separate the agent from its environment by an impermeable barrier through which only sensory information and motor outputs can pass, and then ask the agent to maximize a sensory reward signal, there's a simple Python program which is expected to behave superintelligently given sufficient computing power. If we then introduce permeability into the Cartesian boundary and allow for the possibility that the agent can take drugs or drop an anvil on its own head, nobody has an unbounded solution to that problem any more.

So one way of bouncing off that problem is to say, "Oh, well, my agent calculates the effect of its motor actions on the environment and the expected effect on sensory information and the reward signal, plus a residual term $\gamma$ which stands for the expected utility of all effects of the agent's actions that change the agent's processing or destroys its hardware". How is $\gamma$ to be computed? This is left unsaid.

In this case you haven't omitted the confusing part of the problem, but you've packed it into a residual term you can't give an effective specification for calculating. So you no longer have an unbounded solution - you can't write down the Python program that runs given unlimited computing power - and you've probably failed to shed any important light on the confusing part of the problem. Again, one of the warning signs here is that the paper is very easy to write, and reading it does not make the key problem feel less like a hard opaque ball.

Introducing realistic complications can make it hard to build collective discourse about which ideas have which consequences.

One of the watershed moments in the history of AI alignment theory was Marcus Hutter's proposal for AIXI, not just because it was the first time anyone had put together a complete specification of an unbounded agent (in a [cartesian Cartesian setting]), but also because it was the first time that non-value-alignment could be pointed out in a completely pinned-down way.

(work in progress)

Comments

Jeremy Perret

Hugh has access to the internet\. He can send and receive emails, can browse the internet, and so on\. He can attach money to emails painlessly\. In principle, the people of Earth could all agree not to accept Hugh’s money\. On average, this would make them better off\. But it’s not going to happen\. The world isn’t on Hugh’s side, either\. He doesn’t have any super\-trustworthy assistants, much less thousands of them\. There are just a lot of people who want their share of $10M / day\. In the easy version of the problem, there are also some people who share Hugh’s vision—but even then, who can tell the difference? Hugh has written a lot about his goals and his outlook\. Anyone who’s curious can go learn about Hugh on the internet; lots of people have\. If someone doesn’t want Hugh to learn something, they can pay a news site not to cover it\. They can launch a DDoS against Google\. They could even go to more extreme lengths\. But we’ll assume that Hugh’s connection to the internet is itself tamper\-proof\. And just like the world won’t coordinate to turn down Hugh’s money, they won’t coordinate to set up a parallel version of the internet just to delude him\. Hugh has no prospect of fixing his debilitating sleep disorder\. He could leave his room if he wanted, but what is he going to do in 10 minutes, anyway? He’s already arranged for his room to be secure and well\-stocked, and for his infrastructure to be reliable\. Hugh’s goals do not require the cooperation of any specific person, or any unverifiable private information\. Everything that Hugh wants to do could be done by one of several people $and, as per assumption \#2, we assume that these people will not all collude$\. All of the information that Hugh needs could in principle be verified by Hugh, if he had enough time to spend veryfing it\.

I'm confused by this sentence. How not accepting Hugh's money lead to the people of Earth being better off? Is it the fact that they would have to achieve perfect coordination to do that?