Autonomous AGI

by Eliezer Yudkowsky, Dec 28 2015, updated Jun 6 2016

The hardest possible class of Friendly AI to build, with the least moral hazard; an AI intended to neither require nor accept further direction.

An autonomous or self-directed advanced agent is a machine intelligence that acts in the real world in pursuit of its preferences without further user intervention or steering. In Bostrom's typology of advanced agents, this is a "Sovereign", as distinguished from a "Genie" or an "Oracle". ("Sovereign" in this sense means self-sovereign, and is not to be confused with the concept of a Bostromian singleton or any particular kind of social governance.)

Usually, when we say "Sovereign" or "self-directed", we'll be talking about a supposedly aligned AI that acts autonomously by design. Failure to solve the alignment problem probably means the resulting AI is self-directed-by-default.

Trying to construct an autonomous Friendly AI suggests that we trust the AI more than the programmers in any conflict between them, and we're okay with removing all constraints and off-switches except those the agent voluntarily takes upon itself.

A successfully aligned autonomous AGI would carry the least moral hazard of any scenario, since it hands off steering to some fixed preference framework or objective that the programmers can no longer modify. Nonetheless, being really, really, really that sure (not just getting it right, but knowing we've gotten it right) seems like a large enough problem that perhaps we shouldn't be trying to build this class of AI for our first try, and should first target a Task AGI instead, or something else involving ongoing user steering.

An autonomous superintelligence would be the most difficult possible class of AGI to align, requiring total alignment. Coherent extrapolated volition is a proposed alignment target for an autonomous superintelligence, but again, probably not something we should attempt to do on our first try.


Paul Christiano

This topic consistently frustrates me; the proposed typology is obviously incomplete, and I don't think it produces any useful conclusions except by either equivocating between definitions (e.g. when establishing that X is a sovereign and later that sovereigns have property P), by assuming exhaustiveness without justification, or by straightforwardly smuggling in associations.

Note that "an AI intended to act freely in the world according to its own preferences" need not entail "without further direction," since the preferences of the AI may make reference to human direction. And neither of these directly entails the need to get it right on the first try to any greater extent than for any other AI system.

And the complement of these properties doesn't really imply anything at all, certainly not that a system is a genie or an oracle.

Paul Christiano

I obviously disagree with "under intelligence explosion scenarios a Singleton seems like a quite probable result of constructing a Sovereign."

This is true in an uninteresting sense, namely: in the very long run a singleton seems pretty likely. If technological/economic/social change accelerates enough, then from the outside it may look like a singleton appears immediately. But that's not a useful notion for forecasting the character of that singleton or the future trajectory of civilization, and the resulting singleton has little more relation to the early AI than it has to us.

Relatedly, I feel that "sovereign" is a really bad name.