"Eliezer seems to have, and ..."

Eliezer seems to have, and this page seems to reflect, strong intuitions about "self-modification" beyond what you would expect from synonymy with "AI systems doing AI design and implementation." In my view of the world, there is no meaningful distinction between these things, and this post sounds confused. I think it would be worth pushing more on this divergence.

AI work is already done with the aid of powerful computational tools. It seems clear that these tools will become more powerful over time, and that at some point human involvement won't be helpful for further AI progress. (It's not clear how discontinuous progress will be on those tools. I think it will probably be reasonably smooth. I'm open to the possibility of abrupt progress but it's not clear to me how that really changes the picture.) Improvements in tools could yield either more or less human understanding and effective control of the AI systems they improve, depending on the character of those tools.

If you can solve the control/alignment problem with a "KANSI" agent, then it's not clear to me how the introduction of "self-modification" changes the character of the problem.

Here is my understanding of Eliezer's picture (translated into my worldview): we might be able to build AI systems that are extremely good at helping us build capable AI systems, but not nearly as good at helping us solve AI alignment/control or building alignable/controllable AI. In this case, we will either need to have a very generally scalable solution to alignment/control in place (which we can apply to new AI systems as they are developed, without further help from the designers of those new AI systems), or else we may simply be doomed (if no such very scalable solution is possible, e.g. because the only way to solve alignment is to build a certain kind of AI system).

Interestingly, this difficulty is not directly related to the fact that the tools are themselves AI systems which pose an alignment/control problem. Instead the difficulty comes from the uneven capabilities of these systems (from the human perspective), namely that they are very good at AI design but not very good at helping with AI control.

This is at odds with what is written above, so it seems like I don't yet see the real picture. But I'll press on anyway.

One approach to this scenario is to refrain from getting help from our AI-designer AI systems, and instead stick with weak AI systems and proceed along a slower development trajectory. The world could successfully follow such a trajectory only by coordinating pretty well, which might be achieved either with political progress or with a sudden world takeover.

This overall picture makes sense to me. But, it doesn't seem meaningfully distinct from the rest of the broad category "maybe we could build highly inefficient AI systems and then coordinate to avoid competitive pressures to use more efficient alternatives." As usual, this approach seems clearly doomed to me, only accessible or desirable if the world becomes convinced that the AI situation is extraordinarily dire.

The distinction arises because maybe, even once we are coordinating to do AI development slowly, AI systems may design new AI systems of their own accord (and those systems may not be well-controlled). But this seems to be saying: if we mess up the alignment/control problem, then we may find ourselves with a new AI which is not aligned/controlled. But so what? We've already lost the game once our AI is doing things we don't want it to; it's not like we are losing any more.

To make the distinction really relevant, it seems to me you need an extreme view of takeoff speed. Then maybe the possibility of self-modification can turn a local failure into a catastrophe. Translated into my worldview, the story would be something like: once we are developing AI slowly, our project is vulnerable to more reckless competitors. Even if we successfully coordinate to stop all external competitors, our AI project may itself spawn some competitors internally. Despite our apparent strategic advantage, these internal competitors will rapidly become powerful enough to jeopardize the project (or else conceal themselves while they grow more powerful). And so we want to do additional research to ensure that no such internal competitor will emerge.

I don't think this really meshes with Eliezer's view; I'm just laying out my understanding of the view so that it can be corrected.

Comments

Eliezer Yudkowsky

Here is my understanding of Eliezer's picture (translated into my worldview): we might be able to build AI systems that are extremely good at helping us build capable AI systems, but not nearly as good at helping us solve AI alignment/control or building alignable/controllable AI.

This indeed is the class of worrisome scenarios, and one should consider that (a) Eliezer thinks that aligning the rocket is harder than fueling it in general, and (b) that this was certainly true of e.g. Eurisko which was able to get some amount of self-improvement but with all control issues being kicked squarely back to Douglas Lenat. We can also see natural selection's creation of humans in the same light, etcetera. On my view it seems extremely probable that, whatever we have in the way of AI algorithms (short of full FAI) creating other AI algorithms, they'll be helping out not at all with alignment and control and things like reflective stability and so on.

The case where KANSI becomes important is where we get to the level where AGI becomes possible, at a point where there are not huge forgone advantages from whatever types of AI creation of AI algorithms of a type where existing transparency or control work doesn't generalize. You can define a neural network undergoing gradient descent as "improving itself" but relative to current systems this doesn't change the algorithm to the point where we no longer understand what's going on. KANSI is relevant in the scenario where we first reach possible-advanced-AGI levels at a point where an organization with lots of resources and maybe a realistically-sized algorithmic lead, that forgoes the class of AI-improving-AI benefits that would make important subprocesses very hard to understand, is not at a disadvantage relative to a medium-sized organization with fewer resources. This is the level where we can put a big thing together out of things vaguely analogous to deep belief networks or whatever, and just run our current algorithms or minor variations on them, and have the AI's representation be reasonably transparent and known so that we can monitor the AI's thoughts - without some huge amount of work having gone into making transparency reflectively stable and corrigible through self-improvement or getting the AI to help us out with that, etcetera, because we're just taking known algorithms and running them on a vast amount of computing power.

Paul Christiano

on my view it seems extremely probable that, whatever we have in the way of AI algorithms short of full FAI creating other AI algorithms, they'll be helping out not at all with alignment and control

You often say this, but I'm obviously not yet convinced.

As I see it the biggest likely gap is that you can empirically validate work in AI, but maybe cannot validate work on alignment/control except by consulting a human. This is problematic if either human feedback ends up being a major cost/obstacle (e.g. because AI systems are extremely cheap/fast, or because they are too far beyond humans for humans to provide meaningful oversight), or if task definitions that involve human feedback end up being harder by virtue of being mushier goals that don't line up as well with the actual structure of reality.

These objections are more plausible for establishing that control work is a comparative advantage of humans. In that context I would accept them as plausible arguments, though I think there is a pretty good chance of working around them.

But those considerations don't seem to imply that AI will help out "not at all." It seems pretty plausible that you are drawing on some other intuitions that I haven't considered.

Another possible gap is that control may just be harder than capabilities. But in that case the development of AI wouldn't really change the game, it would just make the game go faster, so this doesn't seem relevant to the present discussion. (If humans can solve the control problem anyway, humans+AI systems would have a comparable chance.)

Another possible gap is that there are many more iterations of AI design, and a failure at any time cascades into future iterations. I've pointed out that there can't be many big productivity improvements before any earlier thinking about AI is thoroughly obsolete, but I'm certainly willing to grant that forcing control to keep up for a while does make the problem materially harder (more so the more that our solutions to the control problem are closely tied to details of the AI systems we are building). I agree that sticking with the same AI designs for longer can in some respects make the control problem easier. But it seems like you are talking about a difference-in-kind for safety work, rather than another way to slightly improve safety at the expense of efficacy.

Note: I'm saying that if you can solve the AI control/alignment problem for the AI systems in year N, then the involvement of those AI systems in subsequent AI design doesn't exert a significant additional pressure that makes it harder to solve the control/alignment problem in year N+1. It seems like this is the relevant question in the context of the OP.