Known-algorithm non-self-improving agent

"Known-algorithm non-self-improving" (KANSI) is a strategic scenario and class of possibly-attainable AI designs, where the first pivotal powerful AI has been constructed out of known, human-understood algorithms and is not engaging in extensive self-modification. Its advantage might be achieved by, e.g., being run on a very large cluster of computers. Suppose you could build an AI capable of cracking protein folding and building nanotechnology by running correctly structured algorithms akin to deep learning on Google's or Amazon's computing cluster. If the builders were sufficiently paranoid/sensible to have people continuously monitoring this AI's processes and all the problems it was trying to solve, and to not have this AI engage in self-modification or self-improvement, this would fall into the KANSI class of scenarios. In such scenarios, huge classes of problems in reflective stability, ontology identification, limiting potentially dangerous capabilities, etc., would be much simpler.

Restricting 'good' or approved AI development to KANSI designs would mean deliberately forgoing whatever capability gains might be possible through self-improvement. It's not known whether a KANSI AI could be first to some pivotal level of capability. This would depend on unknown background settings about how much capability can be gained, at what stage, by self-modification. Depending on these background variables, getting a KANSI AI to a capability threshold first might or might not be achievable with any reasonable level of effort and coordination. This is one reason among several why MIRI does not, e.g., restrict its attention to KANSI designs.

Just intending to build a non-self-improving AI out of known algorithms is insufficient to ensure KANSI as a property; this might require further solutions along the lines of Corrigibility. E.g., humans can't modify their own brain functions, but because we're general consequentialists and we don't always think the way we want to think, we created quite simple innovations like, e.g., calculators, out of environmental objects in a world that didn't have any built-in calculators, so that we could think about arithmetic in a different way than we did by default. A KANSI design with a large divergence between how it thinks and how it wants to think might behave similarly, or require constant supervision to detect most cases of the AI starting to behave similarly - and then some cases might slip through the cracks. Since our present study and understanding of reflective stability is very primitive, reflective stability is plausibly still among the things we should be studying even if we want to build a KANSI agent, if only so that the KANSI agent does not diverge too wildly between how it thinks about X and how it would prefer to think about X if given the choice.

Comments

Paul Christiano

Eliezer seems to have, and this page seems to reflect, strong intuitions about "self-modification" beyond what you would expect from synonymy with "AI systems doing AI design and implementation." In my view of the world, there is no meaningful distinction between these things, and this post sounds confused. I think it would be worth pushing more on this divergence.

AI work is already done with the aid of powerful computational tools. It seems clear that these tools will become more powerful over time, and that at some point human involvement won't be helpful for further AI progress. (It's not clear how discontinuous progress will be on those tools. I think it will probably be reasonably smooth. I'm open to the possibility of abrupt progress but it's not clear to me how that really changes the picture.) Improvements in tools could yield either more or less human understanding and effective control of the AI systems they improve, depending on the character of those tools.

If you can solve the control/alignment problem with a "KANSI" agent, then it's not clear to me how the introduction of "self-modification" changes the character of the problem.

Here is my understanding of Eliezer's picture (translated into my worldview): we might be able to build AI systems that are extremely good at helping us build capable AI systems, but not nearly as good at helping us solve AI alignment/control or building alignable/controllable AI. In this case, we will either need to have a very generally scalable solution to alignment/control in place (which we can apply to new AI systems as they are developed, without further help from the designers of those new AI systems), or else we may simply be doomed (if no such very scalable solution is possible, e.g. because the only way to solve alignment is to build a certain kind of AI system).

Interestingly, this difficulty is not directly related to the fact that the tools are themselves AI systems which pose an alignment/control problem. Instead the difficulty comes from the uneven capabilities of these systems (from the human perspective), namely that they are very good at AI design but not very good at helping with AI control.

This is at odds with what is written above, so it seems like I don't yet see the real picture. But I'll press on anyway.

One approach to this scenario is to refrain from getting help from our AI-designer AI systems, and instead stick with weak AI systems and proceed along a slower development trajectory. The world could successfully follow such a trajectory only by coordinating pretty well, which might be achieved either with political progress or with a sudden world takeover.

This overall picture makes sense to me. But it doesn't seem meaningfully distinct from the rest of the broad category "maybe we could build highly inefficient AI systems and then coordinate to avoid competitive pressures to use more efficient alternatives." As usual, this approach seems clearly doomed to me, and only accessible or desirable if the world becomes convinced that the AI situation is extraordinarily dire.

The distinction arises because maybe, even once we are coordinating to do AI development slowly, AI systems may design new AI systems of their own accord (and those systems may not be well-controlled). But this seems to be saying: if we mess up the alignment/control problem, then we may find ourselves with a new AI which is not aligned/controlled. But so what? We've already lost the game once our AI is doing things we don't want it to; it's not like we are losing any more.

To make the distinction really relevant, it seems to me you need an extreme view of takeoff speed. Then maybe the possibility of self-modification can turn a local failure into a catastrophe. Translated into my worldview, the story would be something like: once we are developing AI slowly, our project is vulnerable to more reckless competitors. Even if we successfully coordinate to stop all external competitors, our AI project may itself spawn some competitors internally. Despite our apparent strategic advantage, these internal competitors will rapidly become powerful enough to jeopardize the project (or else conceal themselves while they grow more powerful). And so we want to do additional research to ensure that no such internal competitor will emerge.

I don't think this really meshes with Eliezer's view; I'm just laying out my understanding of the view so that it can be corrected.

Alexei Andreev

One can imagine an agent that is smart about finding and training itself on new features. You seed it with one set of features, but over time it replaces that set with much better features fitting the data. To me it even seems possible that something like that could get to AGI level. This is not "self-modification" in the classic sense, so I'm wondering where that falls in this classification scheme.