Consequentialist cognition

by Eliezer Yudkowsky Jul 1 2015 updated Jun 11 2016

The cognitive ability to foresee the consequences of actions, prefer some outcomes to others, and output actions leading to the preferred outcomes.

[summary(Gloss): "Consequentialism" is picking out immediate actions on the basis of which future outcomes you predict will result.

E.g.: Going to the airport, not because you really like airports, but because you predict that if you go to the airport now you'll be in Oxford tomorrow. Or throwing a ball in the direction that your cerebellum predicts will lead to the future outcome of a soda can being knocked off the stump.

An extremely basic and ubiquitous idiom of cognition.]

[summary: "Consequentialism" is the name for the backward step from preferring future outcomes to selecting current actions.

E.g.: You don't go to the airport because you really like airports; you go to the airport so that, in the future, you'll be in Oxford. (If this sounds extremely basic and obvious, it's meant to be.) An air conditioner isn't designed by liking metal that joins together at right angles; it's designed such that the future consequence of running the air conditioner will be cold air.

Consequentialism requires:

One might say that humans are empirically more powerful than mice because we are better consequentialists. If we want to eat, we can envision a spear and throw it at prey. If we want the future consequence of a well-lit room, we can envision a solar power panel.

Many of the issues in AI alignment and the safety of advanced agents arise when a machine intelligence starts to be a consequentialist across particular interesting domains.]

Consequentialist reasoning selects policies on the basis of their predicted consequences: it does action $~$X$~$ because $~$X$~$ is forecasted to lead to preferred outcome $~$Y$~$. Whenever we reason that an agent which prefers outcome $~$Y$~$ over $~$Y'$~$ will therefore do $~$X$~$ instead of $~$X'$~$, we're implicitly assuming that the agent has the cognitive ability to do consequentialism at least about $~$X$~$s and $~$Y$~$s. It does means-end reasoning; it selects means on the basis of their predicted ends plus a preference over ends.

E.g.: When we infer that a paperclip maximizer would try to improve its own cognitive abilities given means to do so, the background assumptions include:

(Technically, since the forecasts of our actions' consequences will usually be uncertain, a coherent agent needs a utility function over outcomes and not just a preference ordering over outcomes.)
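As a minimal sketch of why uncertain forecasts call for a utility function rather than a bare preference ordering, consider the toy example below. The action names, outcome names, probabilities, and utility numbers are all illustrative assumptions, not anything specified above. Both utility tables encode the same ordering over outcomes (Oxford, then home, then stranded), yet they recommend different actions once the forecast is a gamble.

```python
from typing import Dict

# Hypothetical forecast: each action maps to a probability distribution over outcomes.
FORECAST: Dict[str, Dict[str, float]] = {
    "take_flight": {"in_oxford": 0.8, "stranded": 0.2},
    "stay_home": {"at_home": 1.0},
}

def expected_utility(outcome_probs: Dict[str, float], utility: Dict[str, float]) -> float:
    """Probability-weighted average utility of one action's forecasted outcomes."""
    return sum(p * utility[outcome] for outcome, p in outcome_probs.items())

def choose_action(forecast: Dict[str, Dict[str, float]], utility: Dict[str, float]) -> str:
    """Pick the action whose forecasted outcomes have the highest expected utility."""
    return max(forecast, key=lambda action: expected_utility(forecast[action], utility))

# Two utility functions with the SAME ordering: in_oxford > at_home > stranded.
# The numbers are made up for illustration.
mild_downside = {"in_oxford": 10.0, "at_home": 0.0, "stranded": -20.0}
severe_downside = {"in_oxford": 10.0, "at_home": 0.0, "stranded": -50.0}

choice_mild = choose_action(FORECAST, mild_downside)
choice_severe = choose_action(FORECAST, severe_downside)
```

Under these assumed numbers the first table favors flying and the second favors staying home, even though an observer who only knew the ordering over outcomes could not tell the two agents apart.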

The related idea of "backward chaining" is one particular way of solving the cognitive problems of consequentialism: start from a desired outcome/event/future, and figure out what intermediate events are likely to have the consequence of bringing about that event/outcome, and repeat this question until it arrives back at a particular plan/policy/action.
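Backward chaining can be sketched in a few lines, under heavy simplifying assumptions. The rule table below is hypothetical and built around the airport example; each desired event maps to one action and the events that must already hold. The sketch ignores uncertainty, ignores that actions can undo earlier facts, and does no cycle detection, so it is an illustration of the idiom rather than a usable planner.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical rules: event -> (action that brings it about, events required first).
RULES: Dict[str, Tuple[str, List[str]]] = {
    "in_oxford": ("board_flight", ["have_ticket", "at_airport"]),
    "have_ticket": ("buy_ticket", ["at_home"]),
    "at_airport": ("take_taxi", ["at_home"]),
}

def backward_chain(goal: str, current_facts: Set[str]) -> List[str]:
    """Work backward from the goal, returning actions in the order to execute them."""
    plan: List[str] = []
    achieved = set(current_facts)

    def achieve(event: str) -> None:
        if event in achieved:
            return  # already true, nothing to plan
        if event not in RULES:
            raise ValueError(f"no known way to bring about {event!r}")
        action, preconditions = RULES[event]
        for precondition in preconditions:
            achieve(precondition)  # repeat the same question one step earlier
        plan.append(action)
        achieved.add(event)

    achieve(goal)
    return plan

plan = backward_chain("in_oxford", {"at_home"})
```

The search starts from the desired future and only stops asking "what would bring this about?" when it reaches facts that already hold, at which point the accumulated actions form a plan that runs forward in time.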

Many narrow AI algorithms are consequentialists over narrow domains. A chess program that searches far ahead in the game tree is a consequentialist; it outputs chess moves based on the expected result of those chess moves and your replies to them, into the distant future of the board.
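To make the game-tree idea concrete without a chess engine, the sketch below swaps in a much smaller game as a stand-in: a pile of stones from which players alternately remove one to three, with whoever takes the last stone winning. The game, the function names, and the scoring convention are assumptions chosen for brevity. The structure is the same as in the chess case: each move is scored by the final result it leads to, assuming the opponent also replies with their best move.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # a player may remove 1, 2, or 3 stones per turn

@lru_cache(maxsize=None)
def value(stones: int) -> int:
    """Return +1 if the player to move can force a win, otherwise -1."""
    if stones == 0:
        return -1  # the opponent just took the last stone, so the mover has lost
    # A position is as good as the best move available; the opponent's value
    # after our move is negated because their win is our loss.
    return max(-value(stones - m) for m in MOVES if m <= stones)

def best_move(stones: int) -> int:
    """Choose the move whose predicted end-of-game outcome is best for the mover."""
    legal = [m for m in MOVES if m <= stones]
    return max(legal, key=lambda m: -value(stones - m))

move_from_seven = best_move(7)
```

Nothing in `best_move` refers to a liking for particular moves; a move is selected only because of where the rest of the game is forecasted to go after it.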

We can see one of the critical aspects of human intelligence as [cross_consequentialism cross-domain consequentialism]. Rather than only forecasting consequences within the boundaries of a narrow domain, we can trace chains of events that leap from one domain to another. Making a chess move wins a chess game that wins a chess tournament that wins prize money that can be used to rent a car that can drive to the supermarket to get milk. An Artificial General Intelligence that could learn many domains, and engage in consequentialist reasoning that leaped across those domains, would be a sufficiently advanced agent to be interesting from most perspectives on interestingness. It would start to be a consequentialist about the real world.


Some systems are [-pseudoconsequentialist] - they in some ways behave as if outputting actions on the basis of their leading to particular futures, without using an explicit cognitive model and explicit forecasts.

For example, natural selection has a lot of the power of a cross-domain consequentialist; it can design whole organisms around the consequence of reproduction (or rather, inclusive genetic fitness). It's a fair approximation to say that spiders weave webs because the webs will catch prey that the spider can eat. Natural selection doesn't actually have a mind or an explicit model of the world; but millions of years of selecting DNA strands that did in fact previously construct an organism that reproduced, gives an effect sort of like outputting an organism design on the basis of its future consequences. (Although if the environment changes, the difference suddenly becomes clear: natural selection doesn't immediately catch on when humans start using birth control. Our DNA goes on having been selected on the basis of the old future of the ancestral environment, not the new future of the actual world.)

Similarly, a reinforcement-learning system learning to play Pong might not actually have an explicit model of "What happens if I move the paddle here?" - it might just be re-executing policies that had the consequence of winning last time. But there's still a future-to-present connection, a pseudo-backwards-causation, based on the Pong environment remaining fairly constant over time, so that we can sort of regard the Pong player's moves as happening because it will win the Pong game.
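The difference between re-executing what worked and forecasting what will work can be shown with a deliberately tiny toy. Everything here is an assumption made for illustration: the two-action environment, the step at which it changes, and both agents. It is not a description of how any real Pong-playing system is built. The first agent has no model and simply repeats the last action that won; the second consults an explicit model before each move.

```python
from typing import Callable, List

ACTIONS = ["up", "down"]
Outcome = Callable[[int, str], bool]  # (time step, action) -> did the action win?

def env(t: int, action: str) -> bool:
    """Hypothetical environment: 'up' wins for the first 5 steps, then 'down' wins."""
    winning = "up" if t < 5 else "down"
    return action == winning

def run_reinforced_policy(environment: Outcome, steps: int) -> List[bool]:
    """Repeat whatever won last time; switch actions only after a loss."""
    action = ACTIONS[0]
    results = []
    for t in range(steps):
        won = environment(t, action)
        results.append(won)
        if not won:
            # The policy changes only in response to past consequences.
            action = ACTIONS[1] if action == ACTIONS[0] else ACTIONS[0]
    return results

def run_model_based(environment: Outcome, model: Outcome, steps: int) -> List[bool]:
    """Forecast each action's outcome with an explicit model, then act on the forecast."""
    results = []
    for t in range(steps):
        action = next((a for a in ACTIONS if model(t, a)), ACTIONS[0])
        results.append(environment(t, action))
    return results

reinforced = run_reinforced_policy(env, 8)
modeled = run_model_based(env, env, 8)  # assumes the agent's model is perfectly accurate
```

While the environment stays constant the two agents are indistinguishable from outside, which is what makes the "as if" description apt. The reinforced policy loses once at the step where the environment changes, because its action was selected by the old future rather than the new one.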

Ubiquity of consequentialism

Consequentialism is an extremely basic idiom of optimization:

Anything that Aristotle would have considered as having a "final cause", or teleological explanation, without being entirely wrong about that, is something we can see through the lens of cognitive consequentialism or pseudoconsequentialism. A plan, a design, a reinforced behavior, or selected genes: Most of the complex order on Earth derives from one or more of these.

Interaction with advanced safety

Consequentialism or pseudoconsequentialism, over various domains, is an advanced agent property that is a key requisite or key threshold in several issues of AI alignment and advanced safety:

Above all: The human ability to think of a future and plan ways to get there, or think of a desired result and engineer technologies to achieve it, is the source of humans having enough cognitive capability to be dangerous. Most of the magnitude of the impact of an AI, such that we'd want to align it in the first place, would come in a certain sense from that AI being a sufficiently good consequentialist or solving the same cognitive problems that consequentialists solve.

Subverting consequentialism?

Since consequentialism seems tied up in so many issues, some of the proposals for making alignment easier have in some way tried to retreat from, limit, or subvert consequentialism. E.g.:

But since consequentialism is so close to the heart of why an AI would be sufficiently useful in the first place, getting rid of it tends to not be that straightforward. E.g.:

Since 'consequentialism' or 'linking up actions to consequences' or 'figuring out how to get to a consequence' is so close to what would make advanced AIs useful in the first place, it shouldn't be surprising if some attempts to subvert consequentialism in the name of safety run squarely into an unresolvable safety-usefulness tradeoff.

Another concern is that consequentialism may to some extent be a convergent or default outcome of optimizing anything hard enough. E.g., although natural selection is a pseudoconsequentialist process, it optimized for reproductive capacity so hard that it eventually spit out some powerful organisms that were explicit cognitive consequentialists (aka humans).

We might similarly worry that optimizing any internal aspect of a machine intelligence hard enough would start to embed consequentialism somewhere - policies/designs/answers selected from a sufficiently general space that "do consequentialist reasoning" is embedded in some of the most effective answers.

Or perhaps a machine intelligence might need to be consequentialist in some internal aspects in order to be smart enough to do sufficiently useful things - maybe you just can't get a sufficiently advanced machine intelligence, sufficiently early, unless it is, e.g., choosing on a consequentialist basis what thoughts to think about, or engaging in consequentialist engineering of its internal elements.

In the same way that expected utility is the only coherent way of making certain choices, or in the same way that natural selection optimizing hard enough on reproduction started spitting out explicit cognitive consequentialists, we might worry that consequentialism is in some sense central enough that it will be hard to subvert - hard enough that we can't easily get rid of instrumental convergence on problematic strategies just by getting rid of the consequentialism while preserving the AI's usefulness.

This doesn't say that the research avenue of subverting consequentialism is automatically doomed to be fruitless. It does suggest that this is a deeper, more difficult, and stranger challenge than, "Oh, well then, just build an AI with all the consequentialist aspects taken out."


Eric Rogstad

You don't go to the airport because you really like airports; you go to the airport so that, in the future, you'll be in Oxford. An air conditioner is an artifact selected from possibility space such that the future consequence of running the air conditioner will be cold air. A butterfly, by virtue of its DNA having been repeatedly selected to have previously brought about the past consequence of replication, will, under stable environmental conditions, bring about the future consequence of replication. A rat that has previously learned a maze, is executing a policy that previously had the consequence of reaching the reward pellets at the end: A series of turns or behavioral rule that was neurally reinforced in virtue of the future conditions to which it led the last time it was executed. This policy will, given a stable maze, have the same consequence next time. Faced with a superior chessplayer, we enter a state of Vingean uncertainty in which we are more sure about the final consequence of the chessplayer's moves - that it wins the game - than we have any surety about the particular moves made. To put it another way, the main abstract fact we know about the chessplayer's next move is that the consequence of the move will be winning. As a chessplayer becomes strongly superhuman, its play becomes instrumentally efficient in the sense that no abstract description of the moves takes precedence over the consequence of the move. A weak computer chessplayer might be described in terms like "It likes to move its pawn" or "it tries to grab control of the center", but as the chess play improves past the human level, we can no longer detect any divergence from "it makes the moves that will win the game later" that we can describe in terms like "it tries to control the center (whether or not that's really the winning move)".
In other words, as a chessplayer becomes more powerful, we stop being able to describe its moves in any terms that will ever take priority over our belief that the moves have a certain consequence.

I'm not quite sure of this.

Suppose there are two different superhuman chess AIs with different styles -- call them UberTal %note: Widely regarded as a creative genius and the best attacking player of all time, Tal played in a daring, combinatorial style.% and UberPetrosian %note: He was nicknamed "Iron Tigran" due to his almost impenetrable defensive playing style, which emphasised safety above all else.% -- such that a human chess (and AI) expert who watched a match between the two could reliably guess who was who, without being told which AI was playing white and which was playing black (and of course without being able to beat either one).

Would such a situation contradict the claim you are making here?

Or would you argue that we might see such a situation with only weakly-superhuman AIs, but that the further the AIs advanced beyond human abilities, the less we'd be able to detect a characteristic style?