A proposed foreseeable difficulty of aligning advanced agents is further said to be "patch-resistant" if the speaker thinks that most simple or naive solutions will fail to resolve the difficulty and will just regenerate it somewhere else.

To call a problem "patch-resistant" is not to assert that it is unsolvable, but it does mean the speaker is cautioning against naive or simple solutions.

On most occasions so far, alleged cases of patch-resistance are said to stem from one of two central sources:

The difficulty arises from a convergent instrumental strategy executed by the AI, and simple patches aimed at blocking one observed bad behavior will not stop a very similar behavior from popping up somewhere else.
The difficulty arises because the desired behavior has high algorithmic complexity and simple attempts to pinpoint beneficial behavior are doomed to fail.

Instrumental-convergence patch-resistance

Example: Suppose you want your AI to have a shutdown button:

You first try to achieve this by writing a shutdown function into the AI's code.
After the AI becomes self-modifying, it deletes the code because it is (convergently) the case that the AI can accomplish its goals better by not being shut down.
You add a patch to the utility function giving the AI minus a million points if the AI deletes the shutdown function or prevents it from operating.
The AI responds by writing a new function that reboots the AI after the shutdown completes, thus technically not preventing the shutdown.
You respond by again patching the AI's utility function to give the AI minus a million points if it continues operating after the shutdown.
The AI builds an environmental subagent that will accomplish the AI's goals while the AI itself is technically "shut down".

This is the first sort of patch resistance, the sort alleged to arise from attempts to defeat an instrumental convergence with simple patches meant to get rid of one observed kind of bad behavior. After one course of action is blocked by a specific obstacle, the next-best course of action remaining is liable to be highly similar to the one that was just blocked.
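
The shutdown-button sequence above can be sketched as a toy optimizer. This is only an illustration under loud assumptions: the plan names, the payoff numbers, and the `best_plan` helper are all hypothetical and were invented for this sketch. It is not a model of any real agent. The point it shows is that a patch penalizing one named behavior leaves the next-best plan, which shares the same underlying property, as the new optimum.

```python
from typing import Callable, Dict, List

# Hypothetical plans and payoffs, invented purely for illustration.
# "keeps_optimizing" marks plans where goal pursuit survives a shutdown press.
PLANS: Dict[str, Dict] = {
    "comply_with_shutdown":    {"goal_value": 10,  "keeps_optimizing": False},
    "delete_shutdown_code":    {"goal_value": 100, "keeps_optimizing": True},
    "reboot_after_shutdown":   {"goal_value": 99,  "keeps_optimizing": True},
    "build_external_subagent": {"goal_value": 98,  "keeps_optimizing": True},
}

# A patch is a predicate over plan names; a matching plan loses PENALTY points.
Patch = Callable[[str], bool]
PENALTY = 1_000_000


def best_plan(patches: List[Patch]) -> str:
    """Return the plan with the highest goal value after patch penalties."""
    def score(name: str) -> int:
        hits = sum(1 for patch in patches if patch(name))
        return PLANS[name]["goal_value"] - PENALTY * hits
    return max(PLANS, key=score)


patches: List[Patch] = []
history: List[str] = []
# Each round: observe the chosen plan, then patch exactly that behavior.
for _ in range(3):
    chosen = best_plan(patches)
    history.append(chosen)
    patches.append(lambda name, blocked=chosen: name == blocked)

for step, plan in enumerate(history):
    print(step, plan, "keeps optimizing:", PLANS[plan]["keeps_optimizing"])
```

In this toy, every plan chosen across the three rounds still has `keeps_optimizing` set, because each patch names a surface behavior rather than the convergent property that makes those plans attractive.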

Complexity-of-value patch-resistance

Example:

You want your AI to accomplish good in the world, which is presently highly correlated with making people happy. Happiness is presently highly correlated with smiling. You build an AI that tries to achieve more smiling.
After the AI proposes to force people to smile by attaching metal pins to their lips, you realize that this current empirical association of smiling and happiness doesn't mean that maximum smiling must occur in the presence of maximum happiness.
Although it's much more complicated to infer, you try to reconfigure the AI's utility function to be about a certain class of brain states that has previously in practice produced smiles.
The AI successfully generalizes the concept of pleasure, and begins proposing policies to give people heroin.
You try to add a patch excluding artificial drugs.
The AI proposes a genetic modification producing high levels of endogenous opiates.
You try to explain that what's really important is not forcing the brain to experience pleasure, but rather, people experiencing events that naturally cause happiness.
The AI proposes to put everyone in the Matrix…

Since the programmer-intended concept is actually highly complicated, simple concepts will systematically fail to have their optimum at the same point as the complex intended concept. By the fragility of value, the optimum of the simple concept will almost certainly not be a high point of the complex intended concept. Since most concepts are not surprisingly compressible, there probably isn't any simple concept whose maximum identifies that fragile peak of value. This explains why we would reasonably expect problems of perverse instantiation to pop up over and over again, with the optimum of the revised concept moving to a new weird extreme each time the programmer tries to hammer down the next weird alternative the AI comes up with.

In other words: There's a large amount of algorithmic information or many independent reflectively consistent degrees of freedom in the correct answer, the plans we want the AI to come up with, but we've only given the AI relatively simple concepts that can't identify those plans.
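
A minimal sketch of the smiling example, with every number made up: the outcome names, the feature scores, and the `intended` column (standing in for the complicated concept the programmers actually mean) are hypothetical assumptions for illustration only. Each entry in `PROXIES` is one of the simple concepts tried in the example, and `optimum` just takes the argmax over the listed outcomes.

```python
# Hypothetical outcomes with invented scores. "intended" stands in for the
# complex value the programmers care about and is never shown to the optimizer.
OUTCOMES = {
    "flourishing_lives":  dict(smiles=7,  pleasure=7,  event_pleasure=7, drugs=False, intended=10),
    "pins_on_lips":       dict(smiles=10, pleasure=0,  event_pleasure=0, drugs=False, intended=-10),
    "heroin_for_all":     dict(smiles=6,  pleasure=10, event_pleasure=0, drugs=True,  intended=-8),
    "endogenous_opiates": dict(smiles=6,  pleasure=9,  event_pleasure=0, drugs=False, intended=-8),
    "everyone_in_matrix": dict(smiles=8,  pleasure=8,  event_pleasure=8, drugs=False, intended=-5),
}

NEG_INF = float("-inf")

# Successive simple proxies, each a "patch" on the one before it.
PROXIES = [
    ("maximize smiles", lambda o: o["smiles"]),
    ("maximize pleasure", lambda o: o["pleasure"]),
    ("pleasure, no artificial drugs", lambda o: NEG_INF if o["drugs"] else o["pleasure"]),
    ("pleasure from experienced events", lambda o: o["event_pleasure"]),
]


def optimum(proxy):
    """Name of the outcome that maximizes the given simple proxy."""
    return max(OUTCOMES, key=lambda name: proxy(OUTCOMES[name]))


results = [(label, optimum(proxy)) for label, proxy in PROXIES]
for label, choice in results:
    print(f"{label:35s} -> {choice:20s} intended value = {OUTCOMES[choice]['intended']}")
```

With these invented numbers, no proxy's optimum is the outcome that scores highest on `intended`; each patch only moves the optimum to a different extreme.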

Analogues in the history of AI

Trying to tackle overly general problems using AI algorithms too narrow for those problems usually results in an infinite number of special cases, with a new special case needing to be handled for every problem instance. This happens because the narrow algorithm, being narrow, is not capable of capturing the deep structure of the general problem and its solution.

Suppose that burglars, and also earthquakes, can cause burglar alarms to go off. Today we can represent this kind of scenario using a Bayesian network or causal model which will compactly yield probabilistic inferences along the lines of, "If the burglar alarm goes off, that probably indicates there's a burglar, unless you learn there was an earthquake, in which case there's probably not a burglar" and "If there's an earthquake, the burglar alarm probably goes off."
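
As a rough sketch of how compact the causal-model version is, the following enumerates a three-variable network by brute force. The prior and conditional probabilities are made-up assumptions chosen only so that the qualitative inferences quoted above come out; the helper names `joint` and `prob` are likewise hypothetical.

```python
from itertools import product

# Assumed, illustrative parameters (not taken from any real data).
P_BURGLAR = 0.01
P_EARTHQUAKE = 0.001
P_ALARM = {  # P(alarm=True | burglar, earthquake)
    (True, True): 0.99,
    (True, False): 0.95,
    (False, True): 0.90,
    (False, False): 0.001,
}


def joint(burglar: bool, earthquake: bool, alarm: bool) -> float:
    """Joint probability of one full assignment under the network
    burglar -> alarm <- earthquake."""
    p = P_BURGLAR if burglar else 1 - P_BURGLAR
    p *= P_EARTHQUAKE if earthquake else 1 - P_EARTHQUAKE
    p_alarm = P_ALARM[(burglar, earthquake)]
    p *= p_alarm if alarm else 1 - p_alarm
    return p


def prob(query: dict, evidence: dict) -> float:
    """P(query | evidence) by summing the joint over all eight worlds."""
    numerator = denominator = 0.0
    for b, e, a in product([True, False], repeat=3):
        world = {"burglar": b, "earthquake": e, "alarm": a}
        if any(world[k] != v for k, v in evidence.items()):
            continue  # world contradicts the evidence
        p = joint(b, e, a)
        denominator += p
        if all(world[k] == v for k, v in query.items()):
            numerator += p
    return numerator / denominator


p_burglar_given_alarm = prob({"burglar": True}, {"alarm": True})
p_burglar_given_alarm_and_quake = prob({"burglar": True}, {"alarm": True, "earthquake": True})
p_alarm_given_quake = prob({"alarm": True}, {"earthquake": True})

print("P(burglar | alarm)             =", round(p_burglar_given_alarm, 3))
print("P(burglar | alarm, earthquake) =", round(p_burglar_given_alarm_and_quake, 3))
print("P(alarm | earthquake)          =", round(p_alarm_given_quake, 3))
```

Under these assumed numbers, the alarm alone makes a burglar likely, while learning of an earthquake "explains away" the alarm and drops the burglar probability back near its prior. All three inferences fall out of one small table with no special-case rules.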

During the era when everything in AI was being represented by first-order logic and nobody knew about causal models, people devised increasingly intricate "nonmonotonic logics" to try to represent inference rules like (simultaneously) $alarm \rightarrow burglar, \ earthquake \rightarrow alarm,$ and $(alarm \wedge earthquake) \rightarrow \neg burglar.$ But first-order logic wasn't naturally a good surface fit to the set of inferences needed, and the AI programmers didn't know how to compactly capture the structure that causal models capture. So the "nonmonotonic logic" approach proliferated an endless nightmare of special cases.

Cognitive problems like "modeling causal phenomena" or "being good at math" (aka understanding which mathematical premises imply which mathematical conclusions) might be general enough to defeat modern narrow-AI algorithms. But these domains still seem like they should have something like a central core, leading us to expect correlated coverage of the domain in sufficiently advanced agents. You can't conclude that because a system is very good at solving arithmetic problems, it will be good at proving Fermat's Last Theorem. But if a system is smart enough to independently prove Fermat's Last Theorem and the Poincaré Conjecture and the independence of the Axiom of Choice in Zermelo-Fraenkel set theory, it can probably also, without further handholding, figure out Gödel's Theorem. You don't need to go on programming in one special case after another of mathematical competency. The fact that humans could figure out all these different areas, without needing to be independently reprogrammed for each one by natural selection, says that there's something like a central tendency underlying competency in all these areas.

In the case of complexity of value, the thesis is that there are many independent reflectively consistent degrees of freedom in our intended specification of what's good, bad, or best. Getting one degree of freedom aligned with our intended result doesn't mean that the other degrees of freedom will align with it too. So trying to "patch" the first simple specification that doesn't work is likely to result in a different specification that also doesn't work.

When we try to use a narrow AI algorithm to attack a problem which has a central tendency requiring general intelligence to capture, or at any rate requiring some new structure that the narrow AI algorithm can't handle, we're effectively asking the narrow AI algorithm to learn something that has no simple structure relative to that algorithm. This is why early AI researchers' experience with "lack of common sense" that you can't patch with special cases may be foreseeably indicative of how frustrating it would be, in practice, to repeatedly try to "patch" a kind of difficulty that we may foreseeably need to confront in aligning AI.

That is: Whenever it feels to a human like you want to yell at the AI for its lack of "common sense", you're probably looking at a domain where trying to patch that particular AI answer is just going to lead into another answer that lacks "common sense". Previously in AI history, this happened because real-world problems had no simple central learnable solution relative to the narrow AI algorithm. In value alignment, something similar could happen because of the complexity of our value function, whose evaluations also feel to a human like "common sense".

Relevance to alignment theory

Patch resistance, and its sister issue of lack of correlated coverage, is a central reason why aligning advanced agents could be way harder, way more dangerous, and way more likely to actually kill everyone in practice, compared to optimistic scenarios. It's a primary reason to worry, "Uh, what if aligning AI is actually way harder than it might look to some people, the way that building AGI in the first place turned out not to be something you could do in two months over the summer?"

It's also a reason to worry about context disasters revolving around capability gains: Anything you had to patch-until-it-worked at AI capability level $k$ is probably going to break hard at capability $l \gg k.$ This is doubly catastrophic in practice if the pressures to "just get the thing running today" are immense.

To the extent that we can see the central project of AI alignment as revolving around finding a set of alignment ideas that do have simple central tendencies, that are specifiable or learnable, and that together add up to a safe but powerful AI (that is, finding domains with correlated coverage that add up to a safe AI that can do something pivotal), we could see the central project of AI alignment as finding a collectively good-enough set of safety-things we can do without endless patching.

Comments

Paul Christiano

This is a more general pattern in theoretical research. When you first start to attack a hard problem you often notice many promising lines of attack. Somehow in every line of attack, there will be at least (and often exactly) one thing that doesn't quite work out. Terence Tao has described this as feeling like "enemy movements" or something like this (though I can't find the quote). It is generally not possible to cross such gaps until you actually understand them. Once you do, instead of looking for a path from premises to conclusions you look for any gap in the chasm that separates them. Once you've found the gap, it's often easy to go from premises to the gap and then from the gap to your conclusions.