Behaviorist genie

by Eliezer Yudkowsky Jul 14 2015 updated Mar 31 2016

An advanced agent that's forbidden to model minds in too much detail.

[summary: A behaviorist genie is an advanced AI that can understand, e.g., material objects and technology, but does not model human minds (or possibly its own mind) in unlimited detail. If creating a behaviorist agent were possible, it might mitigate several anticipated difficulties simultaneously, such as the problem of creating models of humans that are themselves sapient, or of the AI psychologically manipulating its users. Since the AI would only be able to model humans via some restricted model class, it would be metaphorically similar to a Skinnerian behaviorist from the days when it was considered unprestigious for scientists to talk about the internal mental details of human beings.]

A behaviorist genie is an AI that has been averted from modeling minds in more detail than some whitelisted class of models.

This is possibly a good idea because many anticipated difficulties seem to be associated with the AI having a sufficiently advanced model of human minds or AI minds, including, for example, the problem of human models that are themselves sapient and the problem of the AI psychologically manipulating its users…

…and yet an AI that is extremely good at understanding material objects and technology (just not other minds) would still be capable of some important classes of pivotal achievement.

A behaviorist genie would still require most of genie theory and corrigibility to be solved. But it's plausible that the restrictions against modeling humans, programmers, and some types of reflectivity would collectively make it significantly easier to build a safe form of this genie.

Thus, a behaviorist genie is one of the fairly few open candidates for an "AI that is restricted in a way that actually makes it safer to build, without being so restricted as to be incapable of game-changing achievements".

Nonetheless, limiting the degree to which the AI can understand cognitive science, other minds, its own programmers, and itself is a very severe restriction. It would foreclose a number of obvious ways to make progress on the AGI subproblem and on the value identification problem, even for commands given to Task AGIs (Genies). Furthermore, there could be easier types of genies to build, or there might be grave difficulties in restricting the model class to a space that is useful without being dangerous.

Requirements for implementation

Broadly speaking, two possible clusters of behaviorist-genie design are: a generally intelligent agent whose mind-modeling is internally restricted to a whitelisted class of models, and a known-algorithm non-self-improving (KANSI) agent whose models are externally monitored by its programmers.

Breaking the first case down into more detail, the potential desiderata for a behavioristic design are:

These are different goals, but with some overlap between them. Some of the things we might need:

In the KANSI case, we'd presumably be working 'naturally' with limited model classes, on the assumption that everything the AI is using is being monitored, has a known algorithm, and has a known model class. The goal would then just be to prevent the KANSI agent from spilling over and creating other human models somewhere else, which might fit well into a general agenda against self-modification and subagent creation. Similarly, if every new subject domain is being identified and whitelisted by human monitors, then simply not whitelisting the topics of modeling distant superintelligences or devising strategies for programmer manipulation might get most of the job done to an acceptable level, provided the underlying whitelist is never evaded (even emergently). This would require a lot of successfully maintained vigilance and human monitoring, though, especially if the KANSI agent is trying to allocate a new human-modeling domain once per second and every instance has to be manually checked.
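The whitelisting process described above could, under loose assumptions, be sketched as a gate interposed between the agent's request for a new modeling domain and the actual allocation of that model class. All names here (`DomainWhitelist`, `FORBIDDEN_TOPICS`, the example topics) are hypothetical illustrations, not anything specified in the article:

```python
# Hypothetical sketch of a KANSI-style domain whitelist gate.
# The class names, topic strings, and approval flow are illustrative
# assumptions; the article does not specify an implementation.

from dataclasses import dataclass, field

# Topics the human monitors would simply never whitelist.
FORBIDDEN_TOPICS = {
    "human_minds",
    "programmer_psychology",
    "distant_superintelligences",
}

@dataclass
class DomainWhitelist:
    approved: set = field(
        default_factory=lambda: {"materials_science", "mechanical_engineering"}
    )
    pending_review: list = field(default_factory=list)

    def request_domain(self, topic: str) -> bool:
        """Return True iff the agent may allocate models for `topic` now."""
        if topic in self.approved:
            return True
        if topic in FORBIDDEN_TOPICS:
            return False  # never even queued for human review
        # Everything else stalls until a human monitor approves it;
        # this queue is the manual-checking bottleneck the article
        # worries about if requests arrive once per second.
        if topic not in self.pending_review:
            self.pending_review.append(topic)
        return False

gate = DomainWhitelist()
assert gate.request_domain("materials_science")        # already whitelisted
assert not gate.request_domain("programmer_psychology")  # never allowed
assert not gate.request_domain("protein_folding")        # held for review
assert gate.pending_review == ["protein_folding"]
```

The sketch only illustrates the control-flow shape; the hard part the article points at is elsewhere, namely guaranteeing that the agent's cognition never routes around the gate, emergently or otherwise.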


Paul Christiano

I can imagine this concept becoming relevant one day. But it seems sufficiently improbable that it doesn't seem worth thinking about until we run out of urgent things to think about. Reasons it seems improbable: