Behaviorist genie

by Eliezer Yudkowsky Jul 14 2015 updated Mar 31 2016

An advanced agent that's forbidden to model minds in too much detail.

[summary: A behaviorist genie is an advanced AI that can understand, e.g., material objects and technology, but does not model human minds (or possibly its own mind) in unlimited detail. If creating a behaviorist agent were possible, it might mitigate several anticipated difficulties simultaneously, such as the problem of creating models of humans that are themselves sapient, or of the AI psychologically manipulating its users. Since the AI would only be able to model humans via some restricted model class, it would be metaphorically similar to a Skinnerian behaviorist from the days when it was considered unprestigious for scientists to talk about the internal mental details of human beings.]

A behaviorist genie is an AI that has been averted from modeling minds in more detail than some whitelisted class of models.

This is possibly a good idea because many anticipated difficulties seem to be associated with the AI having a sufficiently advanced model of human minds or AI minds, including, for example, the problem of human models that are themselves sapient and the problem of the AI psychologically manipulating its users…

…and yet an AI that is extremely good at understanding material objects and technology (just not other minds) would still be capable of some important classes of pivotal achievement.

A behaviorist genie would still require most of genie theory and corrigibility to be solved. But it's plausible that the restrictions against modeling humans, programmers, and some types of reflectivity would collectively make it significantly easier to build a safe form of this genie.

Thus, a behaviorist genie is one of the fairly few open candidates for an "AI that is restricted in a way that actually makes it safer to build, without being so restricted as to be incapable of game-changing achievements".

Nonetheless, limiting the degree to which the AI can understand cognitive science, other minds, its own programmers, and itself is a very severe restriction. It would foreclose a number of obvious ways to make progress on the AGI subproblem and on the value identification problem, even for commands given to Task AGIs (Genies). Furthermore, there could be easier types of genies to build, or there might be grave difficulties in restricting the model class to a space that is useful without being dangerous.

Requirements for implementation

Broadly speaking, two possible clusters of behaviorist-genie design are: a generally intelligent agent whose mind-modeling is internally restricted to a whitelisted class of models, and a known-algorithm non-self-improving (KANSI) agent whose models are externally monitored by its programmers.

Breaking the first case down into more detail, the potential desiderata for a behavioristic design are:

These are different goals, but with some overlap between them. Some of the things we might need:

In the KANSI case, we'd presumably be working 'naturally' with limited model classes, on the assumption that everything the AI is using is being monitored, has a known algorithm, and has a known model class. The goal would then just be to prevent the KANSI agent from spilling over and creating other human models somewhere else, which might fit well into a general agenda against self-modification and subagent creation. Similarly, if every new subject domain is being identified and whitelisted by human monitors, then simply not whitelisting the topics of modeling distant superintelligences or devising strategies for programmer manipulation might get most of the job done to an acceptable level, provided the underlying whitelist is never evaded (even emergently). This would require a lot of successfully maintained vigilance and human monitoring, though, especially if the KANSI agent is trying to allocate a new human-modeling domain once per second and every instance has to be manually checked.
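The whitelisting process described above could, under loose assumptions, be sketched as a gate interposed between the agent's request for a new modeling domain and the actual allocation of that model class. All names here (`DomainWhitelist`, `FORBIDDEN_TOPICS`, the example topics) are hypothetical illustrations, not anything specified in the article:

```python
# Hypothetical sketch of a KANSI-style domain whitelist gate.
# The class names, topic strings, and approval flow are illustrative
# assumptions; the article does not specify an implementation.

from dataclasses import dataclass, field

# Topics the human monitors would simply never whitelist.
FORBIDDEN_TOPICS = {
    "human_minds",
    "programmer_psychology",
    "distant_superintelligences",
}

@dataclass
class DomainWhitelist:
    approved: set = field(
        default_factory=lambda: {"materials_science", "mechanical_engineering"}
    )
    pending_review: list = field(default_factory=list)

    def request_domain(self, topic: str) -> bool:
        """Return True iff the agent may allocate models for `topic` now."""
        if topic in self.approved:
            return True
        if topic in FORBIDDEN_TOPICS:
            return False  # never even queued for human review
        # Everything else stalls until a human monitor approves it;
        # this queue is the manual-checking bottleneck the article
        # worries about if requests arrive once per second.
        if topic not in self.pending_review:
            self.pending_review.append(topic)
        return False

gate = DomainWhitelist()
assert gate.request_domain("materials_science")        # already whitelisted
assert not gate.request_domain("programmer_psychology")  # never allowed
assert not gate.request_domain("protein_folding")        # held for review
assert gate.pending_review == ["protein_folding"]
```

The sketch only illustrates the control-flow shape; the hard part the article points at is elsewhere, namely guaranteeing that the agent's cognition never routes around the gate, emergently or otherwise.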


Paul Christiano

I can imagine this concept becoming relevant one day. But it seems sufficiently improbable that it doesn't seem worth thinking about until we run out of urgent things to think about. Reasons it seems improbable: