# Introduction to Logical Decision Theory for Analytic Philosophers

https://arbital.com/p/ldt_intro_phil

by Eliezer Yudkowsky Jul 21 2016 updated Jun 1 2018

Why "choose as if controlling the logical output of your decision algorithm" is the most appealing candidate for the principle of rational choice.

(A draft with many missing links and some missing diagrams.)

Is it rational to vote in elections?

Suppose that roughly a hundred thousand people are voting in an election. Then surely the chance of the election coming down to any one vote is tiny. Say there are 50,220 votes for Kang and 50,833 votes for Kodos. This is a close election as such things go. But if you do vote for Kang, that just means 50,221 votes for Kang vs. 50,833 votes for Kodos. If we're to select actions on the basis of their probable consequences, it seems that with overwhelming probability, the consequences of 'voting' and 'not voting' are nearly identical.

An economist who argues that voting is 'irrational' on this basis is deploying a formal answer from the most commonly accepted version of decision theory in contemporary philosophy, namely causal decision theory (CDT). To answer "What would have happened if I'd voted?", CDT says to imagine a world in which everything that happened before your vote stays constant, and the physical, causal consequences downstream of your vote are recomputed. The rules for imagining this are further formalized within the theory of [causal_models causal models].

When the dust settles, CDT's formally computed consequences for "voting" vs. "not voting" are 50,221 vs. 50,220 votes for Kang. The only exception is if your physical act of voting caused someone else to vote (in which case the consequence is 50,222 votes for Kang).

An odd consequence of this view is that if the election is settled by one vote, say 8,001 for Kang vs. 8,000 for Kodos, then all 8,001 Kang voters should each view themselves as having individually swung the whole election - since if counterfactually they had voted for Kodos, the election would have gone to Kodos. Conversely, in an election decided by 8,003 to 8,000 votes, nobody's vote changed anything.
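
The pivotality check described above can be sketched in a few lines. This is only an illustration of the CDT-style counterfactual for a single Kang voter, holding every other vote fixed; the function names `winner` and `is_pivotal` are hypothetical and not part of any formal theory.

```python
def winner(kang: int, kodos: int) -> str:
    """Return the winner of a two-candidate election ('tie' if the counts match)."""
    if kang > kodos:
        return "Kang"
    if kodos > kang:
        return "Kodos"
    return "tie"


def is_pivotal(kang: int, kodos: int) -> bool:
    """CDT-style check for one Kang voter: hold all other votes fixed,
    switch this single vote from Kang to Kodos, and see whether the winner changes."""
    actual = winner(kang, kodos)
    counterfactual = winner(kang - 1, kodos + 1)
    return actual != counterfactual


one_vote_margin = is_pivotal(8001, 8000)    # every Kang voter swings the election
three_vote_margin = is_pivotal(8003, 8000)  # no single Kang voter changes the result
```

On this way of counting, the number of 'decisive' voters jumps from 8,001 to zero when the margin widens by two votes.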

Dilemmas such as these are part of a larger class of scenarios known as Newcomblike decision problems, in which the world contains other agents that are similar to you or that predict your reasoning with significant accuracy. This problem class also includes the Prisoner's Dilemma; whether to turn down a lowball offer when bargaining; and thought experiments involving powerful aliens who are excellent predictors of human behavior.

Logical decision theories are a recently invented family of decision theories, claiming to pose a more attractive alternative to causal decision theory. Logical decision theory asserts that the principle of rational choice is "Decide as though you are choosing the logical output of your decision algorithm." Allegedly, there are major and significant reasons why (a) this is a decision theory that economists and computer scientists should prefer to use; and (b) this rule and its formalizations are more appealing candidates for the principle of rational choice.

This introduction will:

• Overview Newcomblike problems and the contemporary view of them in analytic philosophy;
• Introduce logical decision theories and how they differ formally from other well-known decision theories;
• Reconsider well-known Newcomblike problems in the light of logical decision theory;
• Make the case for logical decision theory as the principle of rational choice;
• Overview some of the more interesting results in logical decision theory;
• And point to further reading.

# Newcomblike problems

Newcomblike problems can be viewed as a set of scenarios where your decision can correlate with events outside you, without your action physically causing those events. These include cases where another agent is similar to you; where the environment contains somebody trying to predict you; or where your goals or decision algorithms correlate with other facts about you.

## Newcomb's Problem

The original Newcomb's Problem was as follows:

An alien named Omega presents you with two boxes, a transparent Box A containing \$1,000, and an opaque Box B. Omega then flies away, leaving you with the choice of whether to take only Box B ('one-box') or to take both Box A plus Box B ('two-box'). Omega has put \$1,000,000 in Box B if and only if Omega predicted that you would take only one box. Otherwise Box B is empty. %note: The original formulation of Newcomb's Problem also specified that if Omega predicts you will decide to try to flip a coin, Omega leaves Box B empty.%

Omega has already departed, so Box B is already empty or already full.

Omega is an excellent predictor of human behavior and has never been observed to be mistaken. %note: E.g., we can suppose Omega has run this experiment 73 times previously and predicted correctly each time. Since people do seem to form strongly held views about what they'd do in Newcomb's Problem, it's not implausible that Omega could get this level of predictive accuracy by looking at your brain a few hours previously.%

Do you take both boxes, or only Box B?

• Argument 1: People who take only Box B tend to walk away rich. People who two-box tend to walk away poor. It is better to be rich than poor.
• Argument 2: Omega has already made its prediction. Box B is already empty or already full. It would be [dt_rational irrational] to leave behind Box A for no reason. It's true that Omega has chosen to reward people with irrational [dt_disposition dispositions] in this setup, but Box B is now already empty, and irrationally leaving Box A behind would just [causal_counterfactual counterfactually] result in your getting \$0 instead of \$1,000.

This setup went on to generate an incredible amount of debate. Conventionally, Newcomb's Problem is seen as exhibiting a split between evidential decision theory and causal decision theory.

## Evidential and causal decision theory

Almost everyone in present and historical debate on decision theory has agreed that [dt_rational rational] agents choose by calculating the expected utility 'conditional' on each possible decision. The central question of decision theory turns out to be, "How exactly do we condition our probabilities on our possible decisions?"

Usually, when the expected utility formula is mentioned outside of decision theory, it is shown as follows:

$$\mathbb E[\mathcal U|a_x] = \sum_{o_i \in \mathcal O} \mathcal U(o_i) \cdot \mathbb P(o_i|a_x)$$

where

• $\mathbb E[\mathcal U|a_x]$ is our average expectation of utility, if action $a_x$ is chosen;
• $\mathcal O$ is the set of possible outcomes;
• $\mathcal U$ is our utility function, mapping outcomes onto real numbers;
• $\mathbb P(o_i|a_x)$ is the conditional probability of outcome $o_i$ if $a_x$ is chosen.

This formula is widely agreed to be wrong.

The problem is the use of standard evidential conditioning in $\mathbb P(o_i|a_x).$ On this formula we are behaving as if we're asking, "What would be my revised probability for $\mathbb P(o_i),$ if I was told the news or observed the evidence that my action had been $a_x$?"

Causal decision theory says we should instead use the counterfactual conditional $\ \mathbb P(a_x \ \square \! \! \rightarrow o_i).$

The difference between evidential and counterfactual conditioning is standardly contrasted by these two sentences:

• If Lee Harvey Oswald didn't shoot John F. Kennedy, somebody else did.
• If Lee Harvey Oswald hadn't shot John F. Kennedy, somebody else would have.

In the first sentence, we're being told as news that Oswald didn't shoot Kennedy, and updating our beliefs to integrate this with the rest of our observations.

In the second sentence, we're imagining how a counterfactual world would have played out if Oswald had acted differently. That is, to visualize the causal counterfactual:

• We imagine everything in the world being the same up until the point where Oswald decides to shoot Kennedy.
• We surgically intervene on our imagined world to change Oswald's decision to not-shooting, without changing any other facts about the past.
• We rerun our model of the world's mechanisms forward from the point of change, to determine what would have happened.

If $K$ denotes the proposition that somebody else shot Kennedy and $O$ denotes the proposition that Oswald shot him, then the first sentence and second sentence are respectively talking about:

• $\mathbb P(K| \neg O)$
• $\mathbb P(\neg O \ \square \!\! \rightarrow K)$

(Further formalizations of how to [causal_counterfactuals compute causal counterfactuals] are given by Judea Pearl et al.'s theory of [ causal models].)
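
To make the contrast concrete, here is a minimal sketch of a toy world model, not Pearl's full formalism. It assumes two independent binary variables (Oswald shoots, an independent backup shooter acts) with hypothetical priors chosen only for illustration, and it treats "Kennedy was shot" as background knowledge in the evidential case.

```python
from fractions import Fraction
from itertools import product

# Hypothetical priors, chosen only for illustration.
P_OSWALD = Fraction(9, 10)   # prior that Oswald shoots
P_BACKUP = Fraction(1, 100)  # prior that an independent second shooter acts


def worlds():
    """Yield (oswald_shoots, backup_shoots, probability) for each possible world."""
    for o, b in product([True, False], repeat=2):
        p = (P_OSWALD if o else 1 - P_OSWALD) * (P_BACKUP if b else 1 - P_BACKUP)
        yield o, b, p


def kennedy_shot(o: bool, b: bool) -> bool:
    """Mechanism: Kennedy is shot if either shooter acts."""
    return o or b


def evidential() -> Fraction:
    """P(K | not O), given that Kennedy was shot: ordinary conditioning on news."""
    consistent = [(b, p) for o, b, p in worlds() if not o and kennedy_shot(o, b)]
    return sum(p for b, p in consistent if b) / sum(p for _, p in consistent)


def counterfactual() -> Fraction:
    """P(not O []-> K): keep what actually happened (Oswald shot, Kennedy was shot),
    surgically set O to False, and rerun the mechanism."""
    actual = [(b, p) for o, b, p in worlds() if o and kennedy_shot(o, b)]
    total = sum(p for _, p in actual)
    # With O forced to False, somebody else shoots exactly when the backup acts.
    return sum(p for b, p in actual if b) / total
```

Under these assumed priors the evidential conditional comes out to 1 while the counterfactual conditional stays at the 1/100 prior for a second shooter, which is the gap between the two sentences.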

So according to causal decision theory, the expected utility formula should read:

$$\mathbb E[\mathcal U|a_x] = \sum_{o_i \in \mathcal O} \mathcal U(o_i) \cdot \mathbb P(a_x \ \square \!\! \rightarrow o_i)$$

## Newcomblike problems standardly seen as prying apart evidential and causal decisions

The current majority view of Newcomblike problems is that their distinctive feature is prying apart the verdicts of evidential and causal decision theory.

For example, on the conventional analysis of the original Newcomb's Problem, taking only Box B is good news about whether Box B contains \$1,000,000, but cannot cause Box B to contain \$1,000,000.
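
The split can be checked numerically by plugging Newcomb's Problem into the two expected utility formulas. The sketch below is illustrative only: the predictor accuracy of 0.99 and the CDT agent's credence of 0.5 that Box B is full are assumptions, not part of the problem statement.

```python
BOX_A = 1_000
BOX_B = 1_000_000
ACTIONS = ["one-box", "two-box"]


def payoff(action: str, box_b_full: bool) -> int:
    """Dollars received for an action, given the state of Box B."""
    return (BOX_B if box_b_full else 0) + (BOX_A if action == "two-box" else 0)


def edt_expected_utility(action: str, accuracy: float) -> float:
    """Evidential conditioning: the action is treated as news about Omega's prediction."""
    p_full = accuracy if action == "one-box" else 1 - accuracy
    return p_full * payoff(action, True) + (1 - p_full) * payoff(action, False)


def cdt_expected_utility(action: str, p_full: float) -> float:
    """Counterfactual conditioning: Box B is already fixed, so P(action []-> full)
    is the same p_full whichever action is taken."""
    return p_full * payoff(action, True) + (1 - p_full) * payoff(action, False)


# accuracy=0.99 and p_full=0.5 are hypothetical values for illustration.
edt_choice = max(ACTIONS, key=lambda a: edt_expected_utility(a, accuracy=0.99))
cdt_choice = max(ACTIONS, key=lambda a: cdt_expected_utility(a, p_full=0.5))
```

Whatever value is assumed for `p_full`, the causal calculation favors two-boxing by exactly \$1,000, while the evidential calculation favors one-boxing for any reasonably accurate predictor.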

In the case of Newcomb's Problem, EDT agents end up 'rich' in a certain sense, which may lighten the force of Newcomb's Problem seen as a critique of EDT. For a standard converse example in which the verdict of evidential decision theory seems less reasonable, consider the following dilemma:

Suppose that toxoplasmosis, a parasitic infection carried by cats, can cause toxoplasmosis-infected humans to become fonder of cats. %note: "Toxoplasmosis makes humans like cats" was formerly thought to actually be true. More recently, this result may have failed to replicate, which is unfortunate because we liked the really crisp example.% You are now faced with a cute cat that has been checked by a veterinarian who says this cat definitely does not have toxoplasmosis.

If you decide to pet the cat, an impartial observer watching you will conclude that you are 10% more likely to have toxoplasmosis, which can be a fairly detrimental infection. If you don't pet the cat, you'll miss out on the hedonic enjoyment of petting it. Do you pet the cat?

In this case, evidential decision theory says not to pet the cat, since a 10% increase in the probability that you have toxoplasmosis is a significantly larger downside than the one-time enjoyment of petting a single cat.

Causal decision theory says that petting the cat can't cause you to contract toxoplasmosis. After observing your own action, you may realize that you had toxoplasmosis all along, which isn't good; but in the counterfactual case where you didn't pet the cat, you would still have toxoplasmosis.
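
The two verdicts can be reproduced with a small calculation. All numbers here are hypothetical: the utilities for infection and for petting, and the conditional infection rates of 20% for petters and 10% for non-petters, are assumptions made only to illustrate the structure of the dilemma.

```python
U_TOXO = -1_000.0  # hypothetical disutility of having toxoplasmosis
U_PET = 1.0        # hypothetical enjoyment from petting one cat
ACTIONS = ["pet", "refrain"]

# Hypothetical evidential probabilities: petting is news about a prior infection.
P_TOXO_GIVEN = {"pet": 0.20, "refrain": 0.10}


def utility(action: str, has_toxo: bool) -> float:
    return (U_PET if action == "pet" else 0.0) + (U_TOXO if has_toxo else 0.0)


def edt_expected_utility(action: str) -> float:
    """Evidential conditioning on the action."""
    p = P_TOXO_GIVEN[action]
    return p * utility(action, True) + (1 - p) * utility(action, False)


def cdt_expected_utility(action: str, p_toxo: float) -> float:
    """Petting cannot cause the infection, so the same p_toxo applies to both actions."""
    return p_toxo * utility(action, True) + (1 - p_toxo) * utility(action, False)


edt_choice = max(ACTIONS, key=edt_expected_utility)
cdt_choice = max(ACTIONS, key=lambda a: cdt_expected_utility(a, p_toxo=0.15))
```

With these assumed values the evidential agent refrains and the causal agent pets the cat; the causal verdict does not depend on the particular `p_toxo` chosen.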

(In the standard philosophical literature this dilemma is usually presented under the heading of [ Solomon's Problem], in which King Solomon must decide whether to be the kind of person who commits adultery which makes his rule more likely to be overthrown. It may also be presented as the [ Smoking Lesion] dilemma, in which the impulse to smoke stems from a gene which also causes lung cancer, but lung cancer does not actually cause smoking. Our own discussion uses the Toxoplasmosis dilemma because this seems less liable to cause confusion.)

Evidential decision theory is widely viewed as leading to 'an irrational policy of managing the news'. %note: Logical decision theorists agree with this view.%

On the majority view within contemporary decision theory, this is the reply to the "If you're so rational, why aincha rich?" argument in favor of one-boxing on Newcomb's Problem. Somebody who actually takes only Box B is merely 'managing the news' about Box B, not actually acting to maximize the causal impacts of their actions. Omega choosing to reward people who only take Box B is akin to happening to already have toxoplasmosis at the start of the decision problem, or Omega deciding to reward only evidential decision theorists. Evidential agents only seem to win in 'Why aincha rich?' scenarios because they're managing the news in a way that an artificial problem setup declares to be news about wealth.

Since many philosophers continue to find two-boxing on Newcomb's Problem to be an exceptionally unappealing decision, multiple attempts have been made to 'hybridize' evidential and causal decision theory,[todo: put citations here, or a greenlink to a page on hybridization attempts] to construct a theory which behaves 'evidentially' in some cases (like Newcomb's Problem) and 'causally' in other cases (like the [ Toxoplasmosis Dilemma] / [ Solomon's Problem]). Robert Nozick once suggested that evidential and causal expected utilities be averaged together, so that an evidential gain of \$1,000,000 could outweigh a causal loss of \$1,000 on Newcomb's Problem.

## When wealth doesn't follow evidential reasoning

Logical decision theorists deny that decision theory ought to be analyzed as a conflict between evidential and causal utilities. Arguendo, it is possible to pry apart both theories from the behavior that corresponds to being the richest agent at the end of the problem. Consider e.g. Parfit's Hitchhiker:

### Parfit's Hitchhiker

You are lost in the desert, your water bottle almost exhausted, when somebody drives up in a lorry. The driver of this lorry is (a) entirely selfish, and (b) very good at detecting lies. %note: Maybe the driver went through Paul Ekman's training for reading facial microexpressions.%

The driver says that they will drive you into town, but only if you promise to give them \$1,000 on arrival. %note: We assume that relative to your situation and the local laws, there's no way that this contract would be enforceable apart from your goodwill.% If you value your life at \$1,000,000 and are otherwise motivated only by self-interest (e.g. you attach no utility to keeping promises as such), then this problem seems isomorphic to Gary Drescher's [ transparent Newcomb's Problem]: in which Box B is transparent, and Omega has already put \$1,000,000 into Box B iff Omega predicts that you will one-box when faced with a visibly full Box B. Both evidential decision theory and causal decision theory say to two-box in the transparent Newcomb's Problem. In particular, the evidential agent that has already updated on observing a full Box B will not update to an empty Box B after observing themselves two-box; they will instead conclude that Omega sometimes makes mistakes. Similarly, an evidential agent who has already reached the safety of the city will conclude that the driver has made an error of reading faces, if they observe themselves refuse to pay the \$1,000.

Thus the behavioral disposition that corresponds to ending up rich (the disposition to pay when you reach town, or to one-box after seeing a full Box B) has been pried apart from both causal and evidential decision theories.
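
A rough sketch of this gap, under the idealizing assumption that the driver reads your disposition perfectly (the problem only says "very good"), and with hypothetical function names:

```python
LIFE = 1_000_000  # value placed on survival, from the problem statement
FARE = 1_000      # payment the driver asks for on arrival


def outcome_of_disposition(pays_in_city: bool) -> int:
    """Payoff for an agent with the given disposition, assuming (as an idealization)
    that the driver detects it perfectly and only drives agents who will pay."""
    driver_gives_ride = pays_in_city
    if not driver_gives_ride:
        return 0  # left in the desert
    return LIFE - FARE


def in_city_comparison() -> dict:
    """What a CDT agent, or an EDT agent that has already updated on being in town,
    compares: the ride is already in the past, so only the fare differs."""
    return {"pay": LIFE - FARE, "refuse": LIFE}


local_best = max(in_city_comparison(), key=in_city_comparison().get)
```

The local comparison recommends refusing, yet the disposition to refuse is the one that ends with a payoff of zero, while the disposition to pay ends with \$999,000.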

We might also observe that the driver in Parfit's Hitchhiker is not behaving as an arbitrary alien philosopher-troll like Omega. The driver's reasoning seems entirely understandable in terms of self-interest.

### The Termites Dilemma

Suppose I have a strong reputation for being truthful, and also a strong reputation for being able to predict other people's behavior (especially the behavior of people who have publicly shown themselves to be evidential decision theorists). You are the owner of a large apartment complex. I send you the following letter:

### Toxoplasmosis Dilemma

There are complications in setting this up formally, since we need a background mechanic that succeeds in creating a correlation between "pet the cat" and "has toxoplasmosis"--if everyone is an LDT agent and everyone decides to pet the cat, then there won't be any correlation in the first place.

In [functional_dt functional decision theory], we assume that $\mathsf Q$ knows its own formula $\ulcorner \mathsf Q \urcorner$ (e.g. via [godelian_diagonalization]). So if we say that different agents have slightly different utility functions correlated with toxoplasmosis, then in functional decision theory, each agent ought to already know its own algorithm and to have [tickle_defense already updated about toxoplasmosis]. (FDT is not always a good descriptive theory of human behavior!) %note: It's an open problem to formulate a more realistic LDT that may not have full knowledge about its own quoted algorithm.%

To hack our way to a roughly similar setup, we can suppose that there's some mix of EDT agents and LDT agents encountering the problem; and that Omega has told us, "Through no fault or virtue of their own, it just so happens that in this particular random sample, agent types that don't pet the cat after being given this information already have toxoplasmosis with 10% frequency, and agent types who do pet the cat already have toxoplasmosis with 20% frequency."

Then our graph might look something like this:

• ($\mathsf Q(warning)$ | toxoplasmosis frequency) -> (Omega warning?), (pet cat?) -> (payoff)

In this setup, our decision to pet the cat and toxoplasmosis both affect Omega's warning, our decision to pet the cat affects whether we get cat hedons, and cat hedons and toxoplasmosis both affect our payoff.

We compute, "If-counterfactually people like me didn't pet the cat, then (a) I wouldn't have received cat-petting hedons, (b) I'd still have toxoplasmosis with the same probability, and (c) Omega would've given us a different statistical summary."

On the more regular toxoplasmosis problem, this might analogously work out to thinking, "If-counterfactually people like me didn't pet cats in this situation, then there wouldn't be any correlation between toxoplasmosis and petting in the first place; but actual toxoplasmosis wouldn't be reduced in any way."

### Termites Dilemma

• ($\mathsf Q(message)$) -> (agent's decision to try blackmail, Termites -> agent's message, whether we pay) -> ( | | Termites -> payoff)

Since an LDT agent doesn't pay in the Termites dilemma, nobody sends us a message in the first place.

But if we did get the message, we'd compute $\mathsf Q$ by noticing that the policy (message -> pay) results in our getting a message and our paying if there are no termites, while if there are termites, the agent wouldn't send us a message. Regardless of the prior probability of termites, this does more poorly than the policy of not paying.

# LDT as the principle of rational choice

It should now be clear that the family of logical decision theories gives different answers in Newcomblike problems compared to some widely-analyzed previous theories. Are these better answers? Are they more rational answers?

The argument for considering "Choose as if controlling the logical output of your decision algorithm" as the principle of rational choice--rather than being 'useful irrationality' or some such--rests on three main pillars:

• The argument that CDT counterfactuals are not inherently any more sensible than LDT counterfactuals, since it's not like there are actual counterfactual worlds floating out there or a previously God-given rule that we must decide based on a particular kind of counterfactual;
• The argument that 'Why aincha rich?' ought to have considerable force here, since Newcomblike problems are not especially unfair or unrealistic (e.g. voting in elections), and we can make our decision algorithm's output be anything we want, to just the same degree as we can control our actions.
• The argument from greater internal coherence and simplicity: CDT agents wistfully wish they were more LDT-ish agents. LDT agents prefer to be LDT, have no need for precommitments to dispute control of their future choices with their future selves, and don't predictably reverse their preferences between different times.

## Freedom of counterfactual imagination

The standard case for causal decision theory rests primarily on the assertion that it is prima facie irrational to act as if, e.g., one-boxing in Newcomb's Problem can cause box B to be full.

Is it not in some sense true, after Parfit's driver has conveyed the LDT agent to the city, that in the counterfactual world where the LDT agent does not choose to pay at the time, the LDT agent remains in the city and does not vanish away into the desert? In this sense, must not the LDT agent be deluded about some question of fact, or be acting as if so deluded?

The logical decision theorist's response has two major subthrusts:

• Since there are no actual counterfactual worlds floating out there in the void where I performed a different action, describing the world where I acted differently is just an act of imagination. It isn't false if I have some lawful, simple, coherent rule for imagining the conditional results of my actions that isn't a classical causal counterfactual, and this rule makes the rest of my decision theory work well. "Counterfactuals were made for humanity, not humanity for counterfactuals."
• I don't one-box on Newcomb's Problem because I think it physically causes Box B to be full. I one-box on Newcomb's Problem because I have computed this output in an entirely different way. It [petitio_principii begs the question] to assume a rational agent must make its decision by carrying out a particular ritual of cognition about which things physically cause other things, and then criticize me for "acting as if" I falsely believe that my choice physically causes Box B to be full.

The first element of the response says that there are not actually alternate Earths floating alongside our planet and clearly visible from here, letting us see with our naked eyes what our action-conditionals should be. Critiquing an action-conditional on the grounds, "That counterfactual is false," is not as straightforward as saying, e.g., "Your assertion that most humans on Earth have eight legs is false under a [ correspondence theory of truth], because we can look around the Earth and see that most people don't have eight legs."

This might be a jarring step in the argument, from the standpoint of a philosophical tradition that's accustomed to, e.g., considering statements about modal necessity to have a truth-value that is evaluated to some heaven-sent set of possible worlds. But again, on the [ Standard Model of physics], there are not actually any counterfactual worlds floating out there. %note: Even the many-worlds interpretation of quantum mechanics doesn't modify this. There is no rule saying that there must be a world floating out there for each kind of possible decision we could take, where nothing else has changed except that decision. And from an LDT agent's standpoint, we are asking about the decision-algorithm Q, and its alternate outputs are logical impossibilities; see below.%

We can fix some logical rule for evaluating a particular kind of $\operatorname{counterfactual}_1$, such as [ Pearl's intervention] $\operatorname {do}().$ It can then be a [logical_validity valid] deduction given the [correspondence_truth true] history of our Earth that "If-$\operatorname{counterfactual}_1$ Lee Harvey Oswald had not shot John F. Kennedy, nobody else would have." If we understand the fixed logical sense of "If… hadn't" in terms of $\operatorname{counterfactual}_1$, then it can be informative about the history of the actual world to be told, "If Oswald hadn't shot Kennedy, nobody else would've."

The LDT agent is thinking about a different rule, $\operatorname{counterfactual}_2$ (which happens to yield the same answer in the case of Kennedy and Oswald). The logical decision theorist observes both "There are no actual counterfactual worlds floating out there, at least not where we can see them, so critiquing my output isn't as simple as pointing to an actual-world statement being false" and "The point we're debating is exactly whether a rational agent ought to use $\operatorname{counterfactual}_1$ or $\operatorname{counterfactual}_2,$ so you can't point to $\operatorname{counterfactual}_2$'s outputs and declare them 'false' or 'irrational' by comparing them with $\operatorname{counterfactual}_1.$"

In fact, the logical decision theorist can turn around this argument and deliver a critique of classical causal decision theory: Any expected utility agent does calculate one conditional where a correspondence theory of truth directly applies to the answer, namely the conditional on the action it actually takes. CDT's calculation of this counterfactual conditional on Newcomblike problems is often wrong compared to the actual world.

For example, in Newcomb's Problem, suppose that the base rate of people one-boxing is 2/3. Suppose we start with a CDT agent not yet knowing its own decision, %note: If the CDT agent does already know its own decision, why would it still be trying to compute it?% that uses the standard $\operatorname {do}()$ rules for [counterfactual_do counterfactual surgery]. This agent will calculate that its expected value is (2/3 * \$1M + \$1K) if it takes both boxes and (2/3 * \$1M) otherwise. This yields the classic CDT answer of 'take both boxes', but it does so by calculating a conditional expected utility premised on 'take both boxes', which yields the quantitatively wrong answer. Even if afterwards the CDT agent realizes that Box B is empty, it will still have calculated an objectively false conditional in order to make its decision.

(As an obvious patch, causal decision theorists have suggested a patched CDT that can observe its own suspected action, update, and then recalculate expected utilities to choose again. But this patched algorithm is known to go into infinite loops on some Newcomblike problems! A twice-patched algorithm can prevent infinite loops by randomizing its actions in some cases, in which case a stable solution is guaranteed to exist. But then the expected utility, calculated conditional on that mixed strategy, is again wrong for the actual world! See the analysis of Death in Damascus.)

Taking a step back and looking at the issue from inside the perspective of LDT, some decision algorithm $\mathsf Q$ is asking about worlds conditional on various actions $a$ or policies $\pi.$ All but one of these worlds are logically impossible - it is no more possible for $\mathsf Q$ to have some different output than it actually has, than for 2 + 2 to equal 5. Usually, while we are deciding, we will not know which of our seemingly potential choices are logically impossible; but all except one of them are. %note: If you already know your decision, why are you still trying to decide? If you know you definitely won't do some action, why bother spending the computing power to evaluate its expected utility? Some Newcomblike dilemmas can pose [ apparent exceptions to this rule], but that's a longer story.%

Since we are asking about worlds that are mostly logically impossible in any case, we are free to visualize the logically impossible ones in a way that is conducive to ending up rich (see the next subsection) and that has good coherence properties (see the subsection after that).

But even if we're asking about the 'reasonableness' of the visualizations qua visualizations, a logical decision theorist might say at least the following:

• Our visualization of the conditional that is logically possible, and matches actual reality in that regard, ought to match the rest of actual reality (which CDT does not).
• If I'm similar to another 900 people deciding whether to vote using sufficiently similar algorithms to $\mathsf Q,$ then it is more 'reasonable' to visualize a world where all the outputs of $\mathsf Q$ move in lockstep, than to visualize only one output varying.

That is: If you must imagine a world where 91 is a prime number, at least have it be prime all the time, not prime on some occasions and composite on others. To imagine "91 is sometimes prime and sometimes composite" is wrong in an immediately visible way, much faster than we can think of the prime factors 7 and 13. Supposing "Maybe $\mathsf Q$ decides not to vote after all?" is imagining an 'opaque' impossibility that we haven't yet realized to be impossible. Supposing "Maybe my $\mathsf Q$ outputs 'don't vote' but all the other instances of $\mathsf Q$ output 'do vote'?" is transparently impossible.

(Of course this is all a viewpoint from within LDT. A causal decision theorist could reply that they are just imagining a physical variable changing, and not thinking of any logical algorithms at all.)
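
The difference between the two ways of visualizing a vote can be sketched numerically. This is only an illustration: it reuses the Kang and Kodos tallies from the opening example, and it assumes, hypothetically, that the 900 similar people are would-be Kang voters who are not already counted in those tallies.

```python
BASE_KANG, BASE_KODOS = 50_220, 50_833  # tallies without anyone running Q (assumption)
COHORT = 900  # other people running an algorithm sufficiently similar to Q


def kang_wins(kang_votes: int, kodos_votes: int) -> bool:
    return kang_votes > kodos_votes


def tally_single_output_varies(i_vote: bool) -> int:
    """Kang's tally in the visualization where only my own output changes."""
    return BASE_KANG + (1 if i_vote else 0)


def tally_lockstep(q_says_vote: bool) -> int:
    """Kang's tally in the visualization where every instance of Q gives the same output."""
    return BASE_KANG + (COHORT + 1 if q_says_vote else 0)


alone = kang_wins(tally_single_output_varies(True), BASE_KODOS)
together = kang_wins(tally_lockstep(True), BASE_KODOS)
```

Under these assumptions, varying one output leaves Kang at 50,221 and losing, while moving all 901 outputs together puts Kang at 51,121 and winning.
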
%%comment: %todo: Move this part to a longer discussion of 'reasonable' counterfactuals. It has caveats about trying to drive down the probability of worlds you're already inside, in order to resist blackmail and so on.% Although this point is still a bit controversial among logical decision theorists, some logical decision theorists would assert that *on any particular reasoning step,* there's no reason for a rational algorithm to visualize a world that algorithm already knows to be impossible. E.g., even if Parfit's driver has already conveyed you into the city, you are bothering to imagine 'What if I don't pay?' in order to *verify* that you can't get an even better outcome where you don't pay and are still in the city. If you're bothering to calculate your actions at all, then your algorithm $\mathsf Q$ doesn't already know this. A CDT agent imagines worlds where it two-boxes but still gets \$1,001,000; this is a decision rule that reasons about *transparently* impossible worlds on its intermediate steps.

%todo:
More points to talk about in a section on reasonable counterfactuals:  A CDT agent could reply that they're just imagining some local exception to the laws of physics; but then they're not exactly visualizing OMG MAGIC in those worlds, so they are trying to visualize the logical impossibility after all.
%
%%

## Newcomblike dilemmas are a fair problem class

The classic objection to causal decision theory has always been, "If you're so rational, why aincha rich?" The classic reply is summarized in e.g. "Foundations of Causal Decision Theory" by James Joyce: %note: No, not that James Joyce.%

Rachel has a perfectly good answer to the "Why ain't you rich?" question. "I am not rich," she will say, "because I am not the kind of person the psychologist thinks will refuse the money. I'm just not like you, Irene. Given that I know that I am the type who takes the money, and given that the psychologist knows that I am this type, it was reasonable of me to think that the \$1,000,000 was not in my account. The \$1,000 was the most I was going to get no matter what I did. So the only reasonable thing for me to do was to take it."

Irene may want to press the point here by asking, “But don’t you wish you were like me, Rachel? Don’t you wish that you were the refusing type?” There is a tendency to think that Rachel, a committed causal decision theorist, must answer this question in the negative, which seems obviously wrong (given that being like Irene would have made her rich). This is not the case. Rachel can and should admit that she does wish she were more like Irene. “It would have been better for me,” she might concede, “had I been the refusing type.” At this point Irene will exclaim, “You’ve admitted it! It wasn’t so smart to take the money after all.” Unfortunately for Irene, her conclusion does not follow from Rachel’s premise. Rachel will patiently explain that wishing to be a refuser in a Newcomb problem is not inconsistent with thinking that one should take the $1,000 whatever type one is. When Rachel wishes she was Irene’s type she is wishing for Irene’s options, not sanctioning her choice. This accuses Omega of simply being prejudiced against rational agents--by the time the experiment starts, rational agents have already been disadvantaged. But Omega is not per se disadvantaging rational agents, or causal decision agents in particular. We can imagine that Omicron goes about whacking CDT agents in the head with a sledgehammer, regardless of what they choose, and this would indeed seems 'unfair', in the sense that it seems we can't deduce anything about the fitness of CDT agents from their doing worse in Omicron's dilemma. We can imagine that Upsilon puts a million dollars in Box B only if you are a sort of agent that chooses between 'one-boxing' and 'two-boxing' by choosing the first option in alphabetical order, and if you one-box for any other reason, Upsilon empties the box. Omicron and Upsilon are indeed privileging particular algorithms; their rules make explicit mention of "CDT" or "alphabetical ordering"; they care why you behave a certain way and not just that you do so. 
But Omega does not care why you one-box. %note: In the original formulation of Newcomb's Problem, it was said that Omega leaves Box B empty if you try to choose by flipping a coin. This is one reason to prefer the formulation where Omega can predict coinflips.% You can one-box because of LDT, or EDT, or because you're the sort of agent that always prefers the option highest in alphabetical order; Omega will fill Box B just the same. %%comment: %note: On some views of the nature of rationality, it's exactly *because* Omega only cares what we do, and not why we do it, that Newcomb's Problem is a test of rationality; 'rationality' *is* exactly what we do when we only care about the results and not the reasoning we use to get them.% %%

According to a logical decision theorist, when a problem depends just on our behavior or "the type of decisions that we make, being the people that we are" and not on any other properties of our algorithm apart from that, then that seems like a sufficient condition to designate the problem as 'fair'.

Indeed, we can see decision theories as corresponding to a class of problems that they think are fair:

• CDT thinks a problem is 'fair' if your results depend only on your physical act, and not on anyone else's predictions about your physical act or any other logical correlations that don't stem from the physical act. On problems in this class, CDT agents always end up as rich as any other agent encountering the problem.
• [functional_dt Functional decision theory] thinks it's 'fair' for a problem to depend at any point, on any logical or physical consequence, of any disposition you have to behave in any way, in any situation; so long as the problem depends only on this behavioral disposition and not on any other aspect of your code or algorithm apart from that.
• In [proofbased_dt] and [modal_agents], your payoff may depend on whether other agents can prove that you behave a certain way, not just whether you actually behave that way.
For example, other agents may cooperate with you only if they can prove that you cooperate with them.

A causal decision theorist might argue that on an ideal version of Parfit's Hitchhiker, the driver is really making their decision by looking at our face; and that if we were ideal CDT agents, we'd be able to control this behavior and optimize the real, physical channel by which we are influencing the driver.

A logical decision theorist replies, "Okay, but maybe I don't have perfect control of my face, and therefore the output of my algorithm $\mathsf Q(city)$ is affecting both what I buy in the city and my current prediction of what I'll buy in the city, which in turn affects my facial expression. In real life, I'm not good enough at deception or self-deception to break this logical correlation. So the logical correlation is actually there and we need to use a decision theory that can handle the wider problem class. Why is that so terribly unfair?"

Or similarly in the case of voting in elections: maybe we just are in fact logically entangled with a cohort of people thinking similarly to ourselves, and nothing in particular is going to move us into the problem class where this correlation doesn't exist. Why is that unfair?

(The LDT class of 'fair' problems comes very close to dominating the CDT class, that is, it is very nearly true to say "CDT agents receive maximum rewards on non-Newcomblike problems, and LDT agents receive maximum rewards on both non-Newcomblike and Newcomblike problems." Or, "LDT agents think a strictly larger class of problems are 'fair' and so do well in a strictly wider set of situations." But since (on the LDT view) CDT agents are blind to some LDT-relevant correlations, it is possible to construct LDT-unfair dilemmas that CDT agents think are fair.
For example, suppose Omega would have bombed an orphanage two days ago if the LDT algorithm in particular yielded the output of picking up a certain \$20 bill in the street, but Omega doesn't similarly discriminate against the CDT algorithm. The CDT agent cheerfully walks over and picks up the \$20 bill, and believes that the superstitious LDT agent would have received just the same payoff for this same physical action, making the problem CDT-fair. For an arguably structurally similar, but much more natural-seeming problem, see here.)

Another way of seeing decision theories as corresponding to problem classes is by looking at the considerations that the decision theory allows itself to take into account; decision theories generally think that considerations they are not allowed to take into account are unfair. In the passage from before on Rachel and Irene, James Joyce continues:

Rational Rachel recognizes that, whether she is the type that was predicted (on Friday) to take the money or the type that was predicted to refuse it, there is nothing she can do now to alter her type. She thus has no reason to pass up the extra \$1,000. [emphasis added.]

An LDT agent sees Rachel dooming herself to two-box only because she believes herself to be powerless; if she believed herself to be in control of her type, she could [logical_control control] her type.

More precisely, a [timeless_dt TDT] agent becomes the one-boxing type because the logical algorithm $\mathsf Q$ computes its logical output--the 'type' that everything dependent on the logical output of $\mathsf Q$ depends on--by taking into account all the consequences of that 'type'. As for any disempowering thoughts about it being "too late" to control consequences of $\mathsf Q$ that occur before some particular time, $\mathsf Q$ isn't built to specially exclude those consequences from its calculation. If someone objects to the term 'control', it can at least be said that $\mathsf Q$ has a symmetry in that everything affected by its 'type' is being modeled in the calculation that determines the type.

If sometimes logical correlations do in fact exist--as in the case of voting; or as in the case of not being able to perfectly control our facial muscles or beliefs when facing the driver in Parfit's Hitchhiker; or as in the case of a machine agent e.g. running on Ethereum or whose code has become known by some other channel--then in what sense is it rational for an algorithm to wantonly exclude some consequences of the algorithm's 'type', from being weighed into the calculation that determines the algorithm's 'type'?

%%comment:

%todo: move this to a separate section on control, maybe in the defense of counterfactuals. %

Rachel's algorithm $\mathsf R$ is computing its answer right that very moment, deciding right then what type of agent to be.  While $\mathsf R$ is fated in some sense to two-box, it 'could' have been a one-boxing algorithm to just the same extent that any of us 'could' perform some action that is the better for us.  If we can regard ourselves as [logical_control controlling] our physical acts, we might as well regard ourselves as controlling 'the type of decision we make'.

As for the charge that the key proposition lies in the physical past, Gary Drescher observes that we can by raising our hand today, [logical_control control] a certain proposition about the state of the universe a billion years earlier: namely, 'The state of the universe today is such that its development under physical law will lead to (your name here) raising their hand a billion years later.'  Similarly, we can today, by an act of will control 'the sort of decision we will make a day later, being the person that we are' as it was true about our physical state yesterday. %%

## Consistency and elegance

A CDT agent, given the chance to make preparations before Omega's arrival, might pay a \$100 fee (or \$100,000 fee) to have an assistant stand nearby and threaten to shoot them if they don't leave behind Box A.

Then, given the chance later by surprise, the same CDT agent would pay \$100 to make the gun-toting assistant go away--even believing that Omega has accurately predicted this, and that Box B is therefore empty.

This is an example of what an economic psychologist or behavioral economist would call a [ dynamic inconsistency]. If Omega visits at 7:30am, then the CDT agent at 7am and the same CDT agent at 8am have different preferences about what they respectively want the CDT agent to do at 8am. The CDT agent at 7am will pay precommitment costs to try to wrench control away from the CDT agent at 8am; the CDT agent at 8am will pay costs to wrench control back.

If CDT is dynamically inconsistent then it is of necessity reflectively inconsistent; that is, the CDT algorithm does not want to use the CDT algorithm. If one were to suppose a self-modifying AI that started out using CDT, it would immediately modify itself to use a different theory instead (if it anticipated facing any Newcomblike problems).

Suppose you began life as a computational agent using CDT, and then, at 7am, gained the ability to modify your own algorithm or rewrite your own source code. You would then want your modified self to one-box on any Newcomb's Problem in which Omega gained information about your brain-state after 7am, but two-box on any Newcomb's Problem in which Omega had learned about your brain state before 7am.

That is, when as a CDT agent you are considering the causal effect of your decision to self-modify at 7am, you think you can causally affect any predictions Omega makes at 9am if those predictions result from Omega scanning your brain at 8am, but not any predictions that Omega makes at 9am that result from scanning your brain at 6am. This is not because Omega has different information about you; naturally Omega predicts your self-modifications. Omega expects you to encounter the problem with the same 9am algorithm either way.
But yourself of 7am thinks you can affect Omega's predictions in one case but not the other.

You do not need to know in advance which Newcomblike problems your future self will encounter after self-modifying--there's no need to include as a special case that your future self will one-box in Newcomb's Problem. You will simply want your future self to optimize its payoffs like LDT when it's making decisions that other entities might predict based on information about your code gained after 7am, or when you're moving in unison with logically similar entities that became physically correlated with you after 7am. If you were playing the Prisoner's Dilemma against a selfish clone, you would Cooperate if you had been cloned after 7am and Defect if you had been cloned before 7am. (Again, even if you were cloned at 6am, your clone likewise modifies to an identical decision theory to yours when it gains its own self-modification powers at 7am.)

Logical decision theorists use "[son_of_CDT Son-of-CDT]" to denote the algorithm that CDT self-modifies to; in general we think this algorithm works out to "LDT about correlations formed after 7am, CDT about correlations formed before 7am".

From the perspective of the currently mainstream view of rationality, notions of precommitment, 'useful irrationality' during bargaining, and so on, people have become accustomed and habituated to thinking that rational agents might not want to be rational. It sounds odd even to claim that rational agents should want to be rational!

But consider the scenario from the standpoint of a logical decision theorist. When it comes to identifying the principle of rationality, we have two major candidates:

• The CDT candidate for rationality: There are two separate subjects of study, "Rational decisions" (CDT) and "The complicated irrational algorithm a rational agent would try to force its future self to run" (Use LDT about correlations from after 7am, CDT about correlations from before 7am).
Precommitments, 'usefully irrational' bargaining behavior, etcetera are special cases of the algorithm in the second category.
• The LDT candidate for rationality. There is a single subject matter, the principle of rational choice, which describes how rational agents would want to choose in both the present and the future.

There is more to a candidate principle of rationality than coherence and simplicity. "Always choose the least option in alphabetical order" is very simple. It's even reflectively coherent, since Alphabetical Decision Theory is higher in the alphabet than Causal Decision Theory or Logical Decision Theory.

We might draw an analogy to epistemic rationality: since reality itself is consistent, sufficiently good maps will be coherent with each other; but this doesn't mean that maps agreeing with each other is a sufficient condition for them to be true. Occam's Razor says that simpler theories have higher prior probabilities, but this doesn't mean that a map with an equilateral triangle on it will always be the best map of the city.

If LDT agents needed to believe false things about the world--be poor epistemic rationalists--then this would weigh heavily against LDT as a principle of rationality. But the first pillar of the defense of LDT as the principle of rational choice replies that LDT agents do not need to believe falsely; indeed, it turns this charge around into a critique of the reasonableness of CDT's counterfactuals, in which conditioning on the actual action taken fails to match the actual real world.

If LDT agents tended to end up much poorer than CDT agents when being run through decision-theoretic dilemmas, then we would not be impressed with the mere coherence and mere simplicity of the LDT principle of rational choice. But on the second pillar of the defense of LDT as a principle of rationality, LDT agents end up rich on a systematically wider class of problem dilemmas than CDT agents.
This being the case, the greater coherence and simplicity of the situation under LDT does seem like a property that is standardly desired in principles of rationality.

When agents exhibit non-Bayesian behavior on the [allais_paradox Allais Paradox], paying \$1 to throw a switch and then paying another \$1 to throw it back, behavioral economists consider this dynamic inconsistency a sign of irrationality. We sometimes learn new information that changes our instrumental preferences, but an instrumentally rational agent with a constant utility function should not be able to predict preference reversals any more than an epistemically rational agent should be able to predict a net directional change in its probability estimates.

If there were a purported principle of epistemic rationality which said that Occam's Razor and Bayesian updating were the way to think about all factual matters except those dealing with Fred from accounting where we should believe everything Fred says, we would be suspicious of this strange non-simple exception to an otherwise simple epistemic principle.

It's worth noting historically that logical decision theory was initially developed by people approaching the issue from a computer-science perspective. From that standpoint especially, reflective inconsistency aka "the output of the code is fighting the code" aka "no sufficiently advanced AI would use that decision theory for longer than 500 milliseconds" looms as a severe objection. Concerns about [cdt_loops CDT going into infinite loops], or theories introducing complications that would make them needlessly difficult to code especially in the presence of other natural complications, also assume greater prominence.

# Interesting results and problems in LDT

This section is a lightning overview of some of the interesting consequences, ideas, and unsolved problems that become available as a result of adopting a generally LDT perspective.

## Cooperation on the oneshot Prisoner's Dilemma via common knowledge of code

%%!knows-requisite(Prisoner's Dilemma):

In the classic presentation of the Prisoner's Dilemma, you and your fellow bank robber have been arrested and imprisoned. You cannot communicate with each other. You are facing a prison sentence of one year each. Both of you have been offered a chance to betray the other (Defect); someone who Defects gets one year off their own prison sentence, but adds two years onto the other person's prison sentence. Alternatively, you can Cooperate with the other prisoner by remaining silent. Each year in prison corresponds to negative utility.

A structurally similar game has moves D and C and positive payoffs. Let \$X denote "X utility", and let $(o_1, o_2)$ be the outcome for Player 1 and Player 2 respectively. Then we can write the payoff matrix for a Prisoner's Dilemma as:

$$\begin{array}{r|c|c} & D_2 & C_2 \\ \hline D_1 & (\$1, \$1) & (\$3, \$0) \\ \hline C_1 & (\$0, \$3) & (\$2, \$2) \end{array}$$

When it is said e.g. by economists that it is 'rational' to Defect in the Prisoner's Dilemma, they are answering from within causal decision theory (understandably, since CDT is the current standard insofar as there is one). The overall situation is considered bothersome because the two 'rational' players get the outcome $(\$1, \$1)$ yet [pareto_dominated both players would prefer] the outcome $(\$2, \$2).$
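As a quick check on the matrix above, the following minimal Python sketch verifies both halves of the dilemma: Defect pays more than Cooperate whatever the other player does, and yet both players prefer mutual Cooperation to mutual Defection. The dictionary layout and the helper names are illustrative assumptions, not anything taken from the literature.

```python
# Payoffs (Player 1, Player 2), indexed by (Player 1's move, Player 2's move).
PAYOFF = {
    ("D", "D"): (1, 1), ("D", "C"): (3, 0),
    ("C", "D"): (0, 3), ("C", "C"): (2, 2),
}

def defect_dominates_for_player_1():
    # Defect strictly dominates if it pays more against each Player 2 move.
    return all(PAYOFF[("D", m2)][0] > PAYOFF[("C", m2)][0] for m2 in "DC")

def pareto_dominates(a, b):
    # Outcome a Pareto-dominates outcome b if both players strictly prefer a.
    return all(x > y for x, y in zip(PAYOFF[a], PAYOFF[b]))

print(defect_dominates_for_player_1())
print(pareto_dominates(("C", "C"), ("D", "D")))
```

By the symmetry of the matrix, the same dominance check holds for Player 2.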

%%

An enormous body of literature exists analyzing the Prisoner's Dilemma, including many attempts to defeat or escape the conclusion that two rational agents end up Defecting against each other.

One such response was Tennenholtz's notion of '[program_equilibrium program equilibrium]', a one-shot Prisoner's Dilemma between two computational agents that have [common_knowledge common knowledge] of each other's source code. Tennenholtz's initial paper observed that a program could check whether the other program was identical to itself, and if so, Cooperate.

With LDT, we can treat programs that know about other programs in a much more general way. Suppose that you are a computational agent, and in the Prisoner's Dilemma you find yourself facing a program like this:

    def FairBot1(otherAgent):
        if (otherAgent(FairBot1) == Cooperate):
            return Cooperate
        else:
            return Defect


On LDT, the obvious answer is to Cooperate with FairBot1. On standard CDT the reply is mutual Defection as usual, since your physical act of cooperation does not physically cause FairBot1 to Cooperate (we can imagine that FairBot1 has already moved before you).

Even on LDT, FairBot1 is not an optimal player in the oneshot Prisoner's Dilemma with common knowledge of code, although FairBot1 does induce mutual cooperation with optimal players. One reason is that FairBot1 cooperates with CooperateBot:

    def CooperateBot(otherAgent):
        return Cooperate


(Visualizing CooperateBot as a stone with the word "cooperate" written on it can help pump the intuition that you ought to defect when there's only a stone playing on the other side.)

One also notes that when FairBot1 plays against a copy of itself, both agents go into an infinite loop and never return a value.
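Both observations can be reproduced directly. The sketch below is a runnable Python rendering of the simulation-based FairBot1 and CooperateBot; the string constants, the extra `defect_bot`, and the `play` wrapper that reports a non-terminating match as `None` are illustrative assumptions. Python signals runaway recursion with `RecursionError`, which stands in here for the infinite loop.

```python
COOPERATE, DEFECT = "Cooperate", "Defect"

def fair_bot_1(other_agent):
    # Run the opponent against FairBot1 and mirror whatever it does.
    if other_agent(fair_bot_1) == COOPERATE:
        return COOPERATE
    return DEFECT

def cooperate_bot(other_agent):
    # Ignores the opponent entirely.
    return COOPERATE

def defect_bot(other_agent):
    # Ignores the opponent entirely.
    return DEFECT

def play(agent, opponent):
    # Return the agent's move, or None if simulating never bottoms out.
    try:
        return agent(opponent)
    except RecursionError:
        return None

print(play(fair_bot_1, cooperate_bot))
print(play(fair_bot_1, defect_bot))
print(play(fair_bot_1, fair_bot_1))
```

The last call never reaches a base case: each copy of FairBot1 starts simulating the other before deciding anything.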

There is a way of fixing this infinite-loop problem that is much more realistic than one might at first expect. First, fix a simple proof system, such as first-order arithmetic. We'll dub this proof system $\mathcal T$ for 'theory'.

Denote "$\mathcal T$ proves the quoted sentence $S$" by writing $\operatorname{Prov}(\mathcal T, \ulcorner S \urcorner)$.

Then rewrite FairBot1 as FairBot2, which says:

    def FairBot2(otherAgent):
        if (Prov(T, "otherAgent(FairBot2) == Cooperate")):
            return Cooperate
        else:
            return Defect


…that is, "I cooperate, if I prove the other agent cooperates."

This turns out not to lead to an infinite regress! It might seem at first equally consistent to suppose either "Two FairBot2s both defect, and therefore both fail to prove the other cooperates" or "Both FairBot2s cooperate, and therefore both prove the other cooperates" with the first equilibrium seeming more likely because of the chicken-and-egg problem. Actually, Löb's theorem implies that both agents prove Cooperation and therefore both Cooperate.

Also conveniently, when we have complex systems of "$A$ is true if $B$ is provable and $C$ is not provable" plus "$C$ is true if it's provable that the consistency of $\mathcal T$ implies $B$ and $D$" etcetera, we can compute exactly what is and isn't provable in polynomial time.

The upshot is that if we have complicated systems of agents that do X if they can prove other agents would do Y if they were playing against Z, etcetera, we can evaluate almost immediately what actually happens.

One of the early papers in logical decision theory, "Robust Cooperation in the Prisoner's Dilemma: Program Equilibrium Through Provability Logic" exhibited a proof (with running code) that there existed a simple agent PrudentBot which:

• Cooperated with another PrudentBot.
• Cooperated with FairBot2.
• Defected against CooperateBot (aka a rock with the word "Cooperate" written on it).
• Was not exploitable (PrudentBot never plays Cooperate when the other player plays Defect).

…and it was later shown that PrudentBot was a special case of something that looks like a more general, unified decision-making rule, [proof_based_decision_theory].
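The four PrudentBot properties listed above can be checked with a small evaluator. This is a minimal Python sketch under stated assumptions, not the paper's code. It uses the standard trick for provability logic of evaluating statements on a chain of 'worlds' 0, 1, 2, ..., where "provable" at world $k$ means "true at every world below $k$", and an agent's actual move is the value its move settles to. The two-list agent signature and all function names are illustrative. The definition of PrudentBot used here (Cooperate iff $\mathcal T$ proves the opponent cooperates with PrudentBot, and $\mathcal T$ plus the consistency of $\mathcal T$ proves the opponent defects against DefectBot) is an assumption based on the published paper; it is not stated in the text above.

```python
# An agent receives two histories (True = Cooperate), one entry per lower world:
#   opp_vs_me: the opponent's moves against this agent
#   opp_vs_db: the opponent's moves against DefectBot

def cooperate_bot(opp_vs_me, opp_vs_db):
    return True

def defect_bot(opp_vs_me, opp_vs_db):
    return False

def fair_bot(opp_vs_me, opp_vs_db):
    # "It is provable that the opponent cooperates with me."
    return all(opp_vs_me)

def prudent_bot(opp_vs_me, opp_vs_db):
    # Provable cooperation with me, plus provable defection against DefectBot
    # once world 0 is skipped (the 'assume T is consistent' strengthening).
    return all(opp_vs_me) and not any(opp_vs_db[1:])

def moves_vs_defect_bot(agent, worlds):
    # DefectBot defects at every world, against this agent and against itself.
    return [agent([False] * k, [False] * k) for k in range(worlds)]

def play(agent_a, agent_b, worlds=8):
    a_vs_db = moves_vs_defect_bot(agent_a, worlds)
    b_vs_db = moves_vs_defect_bot(agent_b, worlds)
    a_moves, b_moves = [], []
    for k in range(worlds):
        # Each move at world k depends only on what happened at worlds below k.
        a_next = agent_a(b_moves[:k], b_vs_db[:k])
        b_next = agent_b(a_moves[:k], a_vs_db[:k])
        a_moves.append(a_next)
        b_moves.append(b_next)
    name = {True: "Cooperate", False: "Defect"}
    return name[a_moves[-1]], name[b_moves[-1]]

for opponent in (prudent_bot, fair_bot, cooperate_bot, defect_bot):
    print(opponent.__name__, play(prudent_bot, opponent))
```

Eight worlds is more than these particular agents need; their moves stop changing after the first few worlds, which is why the last entry can be read off as the actual move.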

This strongly suggests that different agents assigning sufficiently high probability to their mutual rationality (that is, LDT-rationality) could indeed end up Cooperating on the Prisoner's Dilemma after reasoning abstractly about the behavior of each other's algorithms.

## Derivation of game theory from first principles

On standard game theory, not involving any coordination, players in multiplayer games are assumed to play at a Nash equilibrium; a stable point such that no player can do better by making a different move. Nash equilibria are guaranteed to exist if mixed strategies (probabilistic moves) are allowed. In the Prisoner's Dilemma, Defect-Defect is a Nash equilibrium because no player can do individually better by changing their individual move.

For a slightly more complicated example, consider a game with an asymmetric payoff matrix:

$$\begin{array}{r|c|c} & Y_2 & Z_2 \\ \hline W_1 & (\$0, \$49) & (\$49, \$0) \\ \hline X_1 & (\$1, \$0) & (\$0, \$1) \end{array}$$

If player 1 predictably chooses $W_1,$ player 2 chooses $Y_2$; if predictably $Y_2$ then $X_1,$ if predictably $X_1$ then $Z_2,$ if predictably $Z_2$ then $W_1.$ At the equilibrium point, Player 1 must be indifferent between $W_1$ and $X_1,$ and Player 2 must be indifferent between $Y_2$ and $Z_2.$ These respectively occur when Player 2 plays $Y_2$ with 98% probability and when Player 1 plays $X_1$ with 98% probability. This is the only equilibrium point at which neither player can do better by changing their strategy.
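The two 98% figures follow from the indifference conditions, and can be recomputed exactly. The sketch below is a minimal Python check using exact fractions; the payoff dictionary and the function names are illustrative assumptions, and the closed-form expressions are just the indifference equations solved for the mixing probability.

```python
from fractions import Fraction

# (Player 1 payoff, Player 2 payoff) for each pair of moves.
PAYOFF = {
    ("W", "Y"): (0, 49), ("W", "Z"): (49, 0),
    ("X", "Y"): (1, 0),  ("X", "Z"): (0, 1),
}

def prob_y_making_player_1_indifferent():
    # Solve y*u1(W,Y) + (1-y)*u1(W,Z) == y*u1(X,Y) + (1-y)*u1(X,Z) for y.
    wy, wz = PAYOFF[("W", "Y")][0], PAYOFF[("W", "Z")][0]
    xy, xz = PAYOFF[("X", "Y")][0], PAYOFF[("X", "Z")][0]
    return Fraction(xz - wz, (wy - wz) - (xy - xz))

def prob_x_making_player_2_indifferent():
    # Solve x*u2(X,Y) + (1-x)*u2(W,Y) == x*u2(X,Z) + (1-x)*u2(W,Z) for x.
    xy, wy = PAYOFF[("X", "Y")][1], PAYOFF[("W", "Y")][1]
    xz, wz = PAYOFF[("X", "Z")][1], PAYOFF[("W", "Z")][1]
    return Fraction(wz - wy, (xy - wy) - (xz - wz))

print(prob_y_making_player_1_indifferent())
print(prob_x_making_player_2_indifferent())
```

Both probabilities come out to 49/50, matching the 98% stated above.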

Even this simple equilibrium is hard to derive from first principles if the two players are expected utility maximizers who don't start out with the expectation that the other player plays at the Nash equilibrium. The optimal move of Player 1 depends on what Player 2 is thinking and doing; the optimal move of Player 2 depends on what Player 1 is thinking and doing.

Attempts to model this situation formally--from first principles, without either expected utility maximizer starting out with the presumption that the other player would play at a Nash equilibrium--did in fact get caught in an infinite regress. This was known as the "grain of truth" problem because it could be resolved if both players assigned non-zero probability to the other player's reasoning process. But owing to the infinite regress, if Player 1 had a correct model of Player 2 in its hypothesis space then Player 2 could not have a correct model of Player 1 in its hypothesis space, etcetera.

An LDT-inspired solution was given to this problem, using [ reflective oracles]. Reflective oracles are a class of program which can compute the outputs of other programs using reflective oracles, providing that some results can be randomized (a sort of end-run around [ Turing's Halting Problem]). The theory of reflective oracles was developed while investigating the theory of Reflective stability, one of the motivating considerations behind LDT, and looking for ways that an agent could model itself.

The result shows that expected utility maximizers using reflective oracles, which can entertain the theory that the other agent is an expected utility maximizer using a reflective oracle, will in full generality learn to play Nash equilibria against each other after playing each other and updating on the evidence.

(In other words, even if no attempts are made to coordinate or take into account correlated rational moves, some kind of LDT may be necessary to shortcut the infinite regress of 'expected utility maximizers modeling each other modeling each other' into vanilla game theory.)

## Bargaining in the Ultimatum Game?

In the Ultimatum bargaining game, the experimenter sets up a one-shot, two-player scenario for dividing \$10. One player, the Proposer, proposes how to split the \$10 with the other player, the Responder. If the Responder accepts the proposed split, it goes through. Otherwise both players get nothing. %note: The Ultimatum Game is an important ideal game in economics because it stands in for the problem of dividing gains from trade in non-liquid markets. If I am the only person in town selling a used car, whose use-value to me is \$5000; and you are the only person in town trying to buy a used car, whose use-value to you is \$8000; then in principle the trade could take place at any price between \$5001 and \$7999. In the former case, I have only gained \$1 from the trade and you have captured \$2999 of the value; in the latter case this is reversed. But if you try to offer \$5300, then if I am the sort of person who refuses a \$1 offer in the Ultimatum Game, I may refuse your offer of \$5300 as well; in which case both of us are worse off to some degree. This incentivises you to offer more than \$5300.%

On the standard analysis, a 'rational' (that is, CDT-rational) Responder should accept an offer of \$1 in the Ultimatum Game, since having \$1 is better than having \$0. Then a CDT-rational Proposer that thinks it is facing a CDT-rational Responder will offer only \$1. If the Responder were known to refuse all offers less than \$2, then it would be CDT-rational to offer \$2 to this irrational player; likewise if the Responder irrationally refuses all offers less than \$9. But from the perspective of a CDT-rational Responder, by the time the Proposer's offer is actually made, it is irrational to think that your current reasoning processes can retroactively affect the amount you were offered, etcetera. If an LDT agent plays a CDT agent with common knowledge of code, the LDT agent will of course end up with \$9 from either position. What else can you expect when rationality meets irrationality?

The situation with two LDT agents playing each other is stranger. In the same way that the Proposer could simply make an offer of \$1 and lock in that offer, the Responder could decide on a policy of "Refuse any offer lower than \$9" in such a way that the Proposer would know that was the policy the Responder would decide on. On some possible analyses, the Responder wants to look like a rock with the phrase "Reject all under \$9" written on it, and not an agent at all; the Proposer similarly wishes it could look like a rock with "Offer \$1" written on it. But if both try to deploy such weapons of precommitment warfare, they both end up with nothing.

A solution to this problem, formally derived from plausible first principles, does not yet exist. But it's been suggested that the equilibrium between LDT agents ought to end up looking like this:

Suppose Agent A thinks a 'fair' solution (where 'fair' is a term of art) is \$5 apiece for Proposer and Responder, e.g. because \$5 is the Shapley value.

However, Agent A is not certain that Agent B considers the Shapley value to be a 'fair' division of the gains.

Then:

• Agent A doesn't want to blanket-reject all offers below \$5, because in this case two agents with an even slightly different estimate of what is 'fair' (\$5 vs. \$4.99) will receive nothing.
• Agent A doesn't want to blanket-accept any offers below \$5, because then another LDT agent who predicts this will automatically offer Agent A an amount below \$5.

This suggests that Agent A should definitely accept any offer at or above what Agent A thinks is the 'fair' amount of \$5, and probabilistically accept or reject any offer below \$5 such that the value to the other agent slopes downward as that agent proposes lower splits. E.g., Agent A might accept an offer $q$ beneath \$5 with probability $p$:

$$p = \big ( \dfrac{\$5}{\$10 - q} \big ) ^ {1.01}$$
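To see how this response function behaves, here is a small Python sketch that evaluates it and the resulting expected gains. The function names and default arguments are illustrative assumptions; the \$5 'fair' point, the \$10 total, and the exponent 1.01 come from the formula above.

```python
def accept_probability(offer, fair=5.0, total=10.0, exponent=1.01):
    # Offers at or above the 'fair' amount are always accepted.
    if offer >= fair:
        return 1.0
    return (fair / (total - offer)) ** exponent

def expected_gains(offer, total=10.0):
    # Returns (Proposer's expected gain, Responder's expected gain).
    p = accept_probability(offer, total=total)
    return p * (total - offer), p * offer

for offer in range(0, 6):
    p = accept_probability(offer)
    proposer, responder = expected_gains(offer)
    print(f"offer ${offer}: accept {p:.3f}, "
          f"proposer ${proposer:.2f}, responder ${responder:.2f}")
```

The Proposer's expected take stays below \$5 for every offer under \$5 and rises as the offer approaches \$5, so shading the offer downward never pays.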

So, for example, an offer of \$4 would be accepted with 83% probability, for an expected gain to the Proposer of \$4.98, and an expected gain to the Responder of \$3.32. From the Responder's perspective, the important feature of this response function is not that the Responder gets the same amount as the Proposer, but that the Proposer experiences no incentive to give the Responder less than the 'fair' amount (and further-diminished returns as the Responder's returns decrease further).

This gives the notion of 'fair', while seeming arbitrary in terms of fundamentals (that is, we can't yet derive a 'fair' function starting from scratch), a very Schelling Point-like status:

• You cannot do better by estimating that your 'fair' share is higher than what others think is your fair share.
• You cannot do better by estimating that your 'fair' share is lower than what others think is your fair share.
• Other agents have no incentive to estimate that your 'fair' share is less than what you think is your fair share.

At present, this is still an informal theory, and formalizing 'fairness' further or trying to give some derivation of 'fairness' from plausible first principles remains an open problem.

## A no-blackmail equilibrium?

Consider a Prisoner's Dilemma in which both players have nuclear weapons, and Player 2 moves second after seeing Player 1's move.

$$\begin{array}{r|c|c|c} & Defect_2 & Cooperate_2 & Nuke_2 \\ \hline Defect_1 & (\$1, \$1) & (\$3, \$0) & (-\$100, -\$100) \\ \hline Cooperate_1 & (\$0, \$3) & (\$2, \$2) & (-\$100, -\$100) \\ \hline Nuke_1 & (-\$100, -\$100) & (-\$100, -\$100) & (-\$100, -\$100) \end{array}$$

If an LDT Player 2 plays a CDT Player 1 with common knowledge of code, we can expect the CDT agent, on simulating its environment, to find with dismay that the environment will irrationally respond with "Nuke!" to any Player 1 move except Cooperate, to which the environment responds with Defect.

Now consider two LDT agents playing each other.
What if one of those agents tries its hand at precommitment warfare? Player 1 as an LDT agent could do this as well; it could simulate Player 2 and then pre-emptively nuke if Player 2's algorithm says not to Cooperate. Clearly, if you don't want to be exploited by LDT agents that know your code, you had better not be the sort of agent that sighs and gives up when confronted with precommitment warfare. In fact, giving in to blackmail would introduce all sorts of further problems; e.g., you might start to compute that some evidence had negative information-value (Player 1 thinks it doesn't want to know Player 2's code because then it might be successfully blackmailed by Player 2).

It seems like this ought to resolve into one of two equilibria:

• Everyone simply ignores all attempts at 'blackmail', intuitively a situation in which somebody threatens to deal some harm to themselves (relative to the Nash equilibrium) in order to deal harm to you (relative to the Nash equilibrium).
  • This may need to be defined in such a way as to give nobody an incentive to try modifying their own utility function. ("No! Really! It's not blackmail! I just inherently enjoy nuking everything, it really is what I'd do by default, and fairness demands that you pay me not to do it!")
• As an extension of the hypothesized Ultimatum equilibrium on 'fairness', an LDT agent triggers the worst-case scenario with sufficient probability to render extortion unprofitable relative to the 'fair' equilibrium. E.g., in the Nuclear Prisoner's Dilemma the 'fair' value is presumably \$2. So an LDT agent responds to attempted precommitment blackmail by e.g. Cooperating with 98.8% probability and Nuking with 1.2% probability.
  • This becomes potentially more complicated if common knowledge of code is not perfect and there are other agents around who can with some probability be successfully blackmailed. A 'fair' policy equilibrium that provides only a very small disincentive to blackmail, could be outweighed by the probability of successfully blackmailing those other agents.
  • It also looks like this might still create negative values of information if the payoff of the 'negotiated' equilibrium ever falls below the payoff from the Nash-equilibrium move (you would prefer not to know the other player's code at all). Indeed, doing worse than the Nash equilibrium under any circumstances seems like a plausible generic sign of a dysfunctional decision theory.

Deriving either of these equilibria from plausible first premises remains an open problem.
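
To make the second candidate equilibrium concrete, here is a minimal sketch of the arithmetic behind the 98.8% / 1.2% figures. It assumes, as one reading of the payoff matrix, that a successful blackmailer Defects against a Cooperating victim (receiving \$3) and receives -\$100 if the victim Nukes, and that "unprofitable" means an expected payoff below the 'fair' \$2. The function names are illustrative and not part of any established formalism.

```python
def blackmailer_expected_payoff(p_cooperate, gain=3.0, nuke=-100.0):
    """Expected payoff to a blackmailer whose victim Cooperates with
    probability p_cooperate and Nukes otherwise (assumed payoffs)."""
    return p_cooperate * gain + (1.0 - p_cooperate) * nuke


def break_even_cooperation_probability(fair=2.0, gain=3.0, nuke=-100.0):
    """Cooperation probability at which blackmail pays exactly the 'fair' value.

    Solves p * gain + (1 - p) * nuke = fair for p.
    """
    return (fair - nuke) / (gain - nuke)


if __name__ == "__main__":
    print(blackmailer_expected_payoff(0.988))
    print(break_even_cooperation_probability())
```

Under these assumptions, Cooperating with 98.8% probability leaves the blackmailer with about \$1.76 in expectation, below the \$2 available from the 'fair' outcome; any cooperation probability below 102/103 (roughly 99.03%) would have the same effect.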

## Death in Damascus

In the city of Damascus, a man once encountered the skeletal visage of Death. Death, upon seeing the man, looked surprised; but then said, "I ᴀᴍ ᴄᴏᴍɪɴɢ ғᴏʀ ʏᴏᴜ ᴛᴏᴍᴏʀʀᴏᴡ." The terrified man at once bought a camel and fled to Aleppo. After being killed the next day by falling roof tiles, the man looked around and saw Death waiting.

"I thought you would be looking for me in Damascus," said the man.

"Nᴏᴛ ᴀᴛ ᴀʟʟ," said Death. "Tʜᴀᴛ ɪs ᴡʜʏ I ᴡᴀs sᴜʀᴘʀɪsᴇᴅ ᴛᴏ sᴇᴇ ʏᴏᴜ ʏᴇsᴛᴇʀᴅᴀʏ, ғᴏʀ I ᴋɴᴇᴡ I ʜᴀᴅ ᴀɴ ᴀᴘᴘᴏɪɴᴛᴍᴇɴᴛ ᴡɪᴛʜ ʏᴏᴜ ɪɴ Aʟᴇᴘᴘᴏ."

As Gibbard and Harper (1976) observe:

Death works from an appointment book which states time and place; a person dies if and only if the book correctly states in what city he will be at the stated time. The book is made up weeks in advance on the basis of highly reliable predictions. Suppose, on this basis, the man would take his being in Damascus the next day as strong evidence that his appointment with Death is in Damascus, and would take his being in Aleppo the next day as strong evidence that his appointment is in Aleppo… If… he decides to go to Aleppo, he then has strong grounds for expecting that Aleppo is where Death already expects him to be, and hence it is rational for him to prefer staying in Damascus. Similarly, deciding to stay in Damascus would give him strong grounds for thinking that he ought to go to Aleppo.

Death in Damascus potentially sends some versions of Causal decision theories into infinite loops; in particular it does this to any decision theory that tries to observe its first impulse to choose, and update the background variables on that impulse.

Again, if a causal decision theory does not try to update background variables on observing its first impulse, it will calculate quantitatively wrong expected utilities even when conditioning on the choice it actually makes. For example, in Newcomb's Problem, if the prior rate of full Box Bs is 2/3, then at the moment of making the decision, the CDT agent will expect a 2/3 chance of Box B being full.

If we introduce more complex choices, it seems easy to extract money from a non-self-updating CDT agent. For example, it could be the case that you must press one of four buttons to determine both whether to one-box or two-box, and also whether to pay an extra \$900 fee to make the money (if any) be tax-free. If your marginal tax rate is otherwise 50%, then the payoff chart in after-tax income might look like this:

$$\begin{array}{r|c|c}
& \text{One-boxing predicted} & \text{Two-boxing predicted} \\ \hline
\text{W: Take both boxes, no fee:} & \$500,500 & \$500 \\ \hline
\text{X: Take only Box B, no fee:} & \$500,000 & \$0 \\ \hline
\text{Y: Take both boxes, pay fee:} & \$1,000,100 & \$100 \\ \hline
\text{Z: Take only Box B, pay fee:} & \$999,100 & -\$900
\end{array}$$

A CDT-agent without self-observant updating, thinking that it has the 2/3 prior chance of Box B being full, will press the button Y. Of course, the dynamically inconsistent CDT agent will know both before and after this choice that agents which press Y tend to end up with \$100; but in the moment of decision, it will seem perfectly plausible to the CDT agent that pressing Y is probably better than pressing W.
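
The gap between what this agent expects and what it receives can be checked directly. The sketch below encodes the payoff chart above and compares the expected utility a non-updating CDT agent computes with a fixed 2/3 prior against the payoff an agent actually receives. It assumes, for illustration only, a perfectly accurate predictor; the names are hypothetical.

```python
# After-tax payoffs from the chart: (Box B full, Box B empty).
PAYOFFS = {
    "W": (500_500, 500),      # both boxes, no fee
    "X": (500_000, 0),        # only Box B, no fee
    "Y": (1_000_100, 100),    # both boxes, pay fee
    "Z": (999_100, -900),     # only Box B, pay fee
}
TWO_BOXING_BUTTONS = {"W", "Y"}


def naive_cdt_expected_utility(button, p_full=2 / 3):
    """Expected utility using the prior chance of a full Box B,
    without updating on the agent's own choice."""
    full, empty = PAYOFFS[button]
    return p_full * full + (1 - p_full) * empty


def naive_cdt_choice(p_full=2 / 3):
    """Button with the highest non-updated expected utility."""
    return max(PAYOFFS, key=lambda b: naive_cdt_expected_utility(b, p_full))


def payoff_with_accurate_predictor(button):
    """Payoff actually received if the predictor foresaw this button
    (assumption: the prediction is always correct)."""
    full, empty = PAYOFFS[button]
    return empty if button in TWO_BOXING_BUTTONS else full
```

With these numbers, `naive_cdt_choice()` selects Y with a computed expectation of about \$666,767, while an accurately predicted Y-presser receives \$100 and an accurately predicted W-presser receives \$500.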

But a CDT-agent with self-observation and updating will go into an infinite loop on Death in Damascus, as it concludes that it is better to be in Aleppo, realizes that Death is probably in Aleppo, decides to stay in Damascus, realizes that Death is probably in Damascus, etcetera.
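
The oscillation can be illustrated with a toy model. This is only a sketch, not a formalization of any published CDT variant: it assumes the agent treats its current impulse as certain evidence of where Death will be, and then recomputes its causal best response. A step limit is imposed so the code terminates; the function names are hypothetical.

```python
def preferred_city(believed_death_city):
    """Causal best response: be wherever Death is not believed to be."""
    return "Aleppo" if believed_death_city == "Damascus" else "Damascus"


def self_updating_trace(first_impulse, max_steps=6):
    """Record the agent's successive impulses as it updates on each one."""
    impulse = first_impulse
    trace = [impulse]
    for _ in range(max_steps):
        # Self-observation: Death presumably predicted this very impulse.
        believed_death_city = impulse
        # Re-decide given the updated belief about Death's location.
        impulse = preferred_city(believed_death_city)
        trace.append(impulse)
    return trace
```

In this model every impulse is overturned by the update it triggers, so no step ever equals the one before it and the process has no fixed point among the pure choices.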

Some causal decision theorists have proposed that Death in Damascus problems must resolve to a mixed-strategy equilibrium; the agent's choice is the mixed strategy of staying in Damascus with 50% probability and going to Aleppo with 50% probability. Needless to say, the personification of Death itself should be assumed to know the output of your random-number generator. The point is that in the instant of making the mixed-policy choice, Aleppo and Damascus seem equally appealing as destinations, and therefore the mixed-policy choice seems as appealing as any other.

One arguable problem with this reply is that after flipping the coin which says to stay in Damascus, a CDT agent might realize that it's staying in Damascus where Death will surely come, and try to decide again whether to start for Aleppo. After starting for Aleppo, the CDT agent might need to decide again a minute later whether to turn back.

Another issue is that the agent's calculated expected utility at the instant of making the decision is now wrong again. At the instant of choosing the mixed strategy, the agent still thinks it has a 50% chance of survival. If a compound decision includes a chance to e.g. buy for \$1 a ticket that pays out \$10 if the agent survives, the agent will want to buy that ticket at the instant of making the decision; e.g. if the options are "Aleppo, Damascus, Aleppo + ticket, Damascus + ticket", a CDT agent using a mixed strategy will choose 50% Aleppo+ticket and 50% Damascus+ticket.
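
The ticket arithmetic can be made explicit. As a sketch, take the 50% survival chance the agent believes in at the instant of decision, and assume (as an idealization) that Death's prediction is perfectly reliable, so that the agent's actual survival chance is zero:

$$\begin{aligned}
\text{believed value of ticket} &= 0.5 \cdot \$10 - \$1 = +\$4 \\
\text{actual value of ticket} &= 0 \cdot \$10 - \$1 = -\$1
\end{aligned}$$

On these assumptions the agent expects to gain \$4 from a purchase that in fact costs it \$1.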

So, arguendo by logical decision theorists:

• First-order CDT agents calculate the wrong expectations for the policies they actually output.
• Self-observing and updating CDT agents go into infinite loops.
• Self-observing CDT agents that can output loop-breaking mixed strategies… go back to calculating the wrong expectations for the policies they actually output.

An agent using updateless decision theory (which is now a standard feature in LDTs) would reply to Death in Damascus as follows: "How exactly does Death decide whether to speak to someone?"

After all, depending on the causal setup, Death speaking to someone might not produce a self-fulfilling prophecy. Suppose that Death meets someone in the Damascus marketplace who is fated to die in Damascus, but who will flee to Aleppo if spoken to. Death cannot tell that person they are fated to die, and be correct. So "Death always knows when you will die, always tells you when you will die, and then you always actually die" is not obviously a rule that Death can follow given every agent strategy for replying to Death; some agent policies will render this rule impossible. %note: Preventing impossible setups is also why, in the transparent Newcomb's problem, Omega filling Box B is said to be conditioned on your behavior if you see the box is full. If Omega is trying to fill Box B iff the agent one-boxes given what the agent actually sees, then the agent policy "Take both boxes if Box B full, take only Box B if Box B empty" obviously prevents that.%

Suppose Death follows the rule:

• Each day, check whether telling a person that they have an appointment with Death will cause them to die the next day.
• If so, tell them they have an appointment with Death the next day.
• If not, remain silent, even if this means the person dies with no warning.

Then the policy "In case of a warning by Death, stay in Damascus" is optimal--decisively superior to the policy of fleeing to Aleppo! If you always flee to Aleppo on a warning, then you are killed by any fatal event that could occur in Aleppo (Death gives you warning, you flee, you die) and killed by any fatal event in Damascus (Death stays silent and collects you). You will be aware that Death is coming for you in Damascus, but you will also be aware that if you were the sort of person who fled to Aleppo on warning, (a) you would be receiving no warning now, and (b) you would possibly have already died in Aleppo.

On the other hand, suppose Death follows the rule:

• Each day, check whether telling a person that they have an appointment with Death will cause them to die the next day.
• If so, tell them they have an appointment with Death the next day.
• If not, don't kill them.

In this case, you should, upon being warned by Death, hide yourself in the safest possible circumstances! You'll still expect to die--something odd will happen to you in your hospital bed despite all the attending doctors. On a causal counterfactual, you might well survive if-causal-counterfactually you stayed in Damascus, without this altered decision altering Death's predictions or behavior. But an LDT agent does not think this causal counterfactual is available as a real option. If you're the sort of agent that changes its mind and stays in Damascus, then Death warns you about any fatal event in Damascus and then promptly kills you the next day.

Both of these policy responses might seem counterintuitive as momentary behavior--you know Death is coming for you, and yet you don't change your policy--but agents who implement such policies are in fact the agents who survive the longest if Death is following the stated rules, and the sort of agent who tries to make a sudden exception dies earlier. (Besides the "Why aincha alive?" question that logical decision theorists give a much higher primacy, this also means that any other rule would not be reflectively consistent--if you were worried you might not act like an LDT agent in the face of Death, you would try to pay costs to precommit to acting like an LDT agent in the face of Death. CDT agents want to act like LDT agents if they are considering the issues in advance; they just need to pay extra costs to make sure they actually act like that.)
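
The comparison between the two rules can be sketched numerically. The model below is an illustration under stated assumptions, not part of the original argument: the daily fatality probabilities are made up, fatal events at different locations are treated as independent, an unwarned person remains in Damascus, and a "hospital" location stands in for "the safest possible circumstances". All names are hypothetical.

```python
from itertools import product

# Hypothetical daily chance of a fatal event at each location.
P_FATAL = {"Damascus": 0.02, "Aleppo": 0.02, "hospital": 0.001}
DEFAULT_LOCATION = "Damascus"  # where an unwarned person is tomorrow


def dies(policy_location, fatal, rule):
    """Whether the person dies tomorrow.

    policy_location: where the person goes if warned by Death.
    fatal: dict mapping location -> True if a fatal event occurs there.
    rule: "silent_collect" (first rule) or "spare" (second rule).
    """
    if fatal[policy_location]:
        return True  # the warning is self-fulfilling, so Death gives it
    if rule == "silent_collect":
        return fatal[DEFAULT_LOCATION]  # no warning; person stays put
    if rule == "spare":
        return False  # Death does not kill those it cannot truthfully warn
    raise ValueError(f"unknown rule: {rule}")


def death_probability(policy_location, rule):
    """Daily death probability for a given warned-response policy."""
    locations = list(P_FATAL)
    total = 0.0
    for outcome in product([True, False], repeat=len(locations)):
        fatal = dict(zip(locations, outcome))
        prob = 1.0
        for loc in locations:
            prob *= P_FATAL[loc] if fatal[loc] else 1.0 - P_FATAL[loc]
        if dies(policy_location, fatal, rule):
            total += prob
    return total
```

With these illustrative numbers, under the first rule staying in Damascus gives a daily death probability of 0.02 against 0.0396 for fleeing to Aleppo, while under the second rule retreating to the hospital gives 0.001 against 0.02 for staying. The policy that looks worst in the moment of being warned is the one that minimizes deaths under each rule.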

# Conclusion

Causal decision theory is very widely agreed, including by LDT advocates/researchers, to have been a crucial stage in the development of decision theory past the evidential formula for expected utility. There have been some attempts to reformulate LDT and logical conditionals without using counterfactuals (mostly filed under [-proof_based_dt]), but most LDT researchers think of the key conditionals as being "[ logical counterfactuals]", since they deal in outputs that are not actually output; some would describe LDT as "a causal decision theory about logical facts" or "CDT with some logical facts included in our causal model".

So almost everyone agrees that CDT was a key step in the development of decision theory; was it the end step?

A truly pure CDT agent--e.g. a computational agent whose sense-model-predict-act loop just was CDT--would be unmoved by all the considerations brought here; from that agent's perspective the rational choice in the moment just is the CDT answer. Conditionals that are not the CDT conditionals are giving the wrong picture of 'What happens if I do this?', which the CDT agent can tell because it has computed the 'rational' (aka CDT) conditionals and the other agent's conditionals must match that answer or be 'irrational'. The algorithm the CDT agent wants its future self to use--the algorithm one would rationally want to give an agent--is [son_of_cdt Son-of-CDT] indexed to the time that theory was adopted and enforced; choosing any other algorithm would be irrational. There is no particular reason to believe that rational agents are dynamically consistent or that they avoid precommitment costs; rational agents are simply those which rationally compute the conditionals in the expected utility formula. Some problems have been set up to penalize agents that act rationally; but the rational agent does not always end up rich.

But human beings aren't built with such crisp or simple code; to the extent any of us have already adopted CDT, we did so by explicitly endorsing an ideal theory that seemed to appeal to some pretheoretic viewpoint. The claim made by logical decision theorists is not that LDT seems best under CDT, but that if somebody with a typical pretheoretic viewpoint who had not yet heard of CDT or LDT was considering the arguments side-by-side, they would not find CDT at all convincing in that light. Even somebody who'd previously adopted CDT as their ideal, being able to retain to some extent the pretheoretic intuitions that led them to adopt CDT in the first place, could potentially 'snap out of it' after seeing arguments that would retrospectively have been very persuasive to their pretheoretic intuitions had they been considered at the time of originally deciding for CDT, even if their current explicit ideal of CDT simply computes a shrug.

LDT seems visibly unfinished in several regards: e.g. no truly general way of computing logical counterfactuals in an endogenously generated world-model, no proof yet of a no-blackmail equilibrium, no derivation from first principles of a 'fair' equilibrium in negotiation, etcetera. There may be some future decision theory that is to LDT what LDT is to CDT.

Even so, it seems plausible that this post-LDT decision theory will still talk about the logical outputs of decision algorithms, or look more like it's talking about logical facts than local physical acts. LDT did not discard causal reasoning and go back to evidential reasoning, at least on the interpretation of most theorists. Somebody who'd said, "There may someday be a post-CDT theory, but if so, I'm guessing it will be about some kind of counterfactuals, not reverting back to evidential updates," would currently look to have probably been correct. By analogy, it's not unreasonable to guess that a future post-LDT theory may still talk about choices as logical facts.

In any case, right now the principle of choosing as if controlling the logical output of your decision algorithm seems like the best current guess for the principle of rational choice (arguendo according to LDT). So at least for today, we should stop going around telling people that rational agents don't vote in elections, or that two rational agents must defect against each other in the one-shot Prisoner's Dilemma, or that only strategically irrational behavior can save us from being exploited at the bargaining table.

Laying out the situation side-by-side, an LDT advocate would argue that the situation looks like this:

LDT's view:

• Rational agents should end up rich; actually ending up with the goal accomplished is an important sign of instrumental rationality.
• Rational agents ought to end up standing alongside the richest agents of any kind in Newcomb's Problem, the transparent Newcomb's Problem, Parfit's Hitchhiker, the Prisoner's Dilemma with information about the other agent's code, the Nuclear Prisoner's Dilemma, the Absent-Minded Driver, Death in Damascus, etcetera. If this isn't happening, something is wrong with your notion of rationality.
• Rational agents should want to be rational, unless somebody is explicitly penalizing quoted algorithms independently of their behaviors.
• Rational agents should be internally coherent, and not end up fighting with past or future selves in a war of precommitments versus attempts to undo precommitments.
• Rational agents at the negotiating table should not need to try to look irrational in order to get fair deals; looking rational is fine.
• Rational agents computing their conditional expected utilities should mostly be visualizing scenarios that they do not yet know to be impossible.
• Rational agents visualizing the conditional for the choice they actually end up making, should find that they have well-modeled the actual world.
• Rational agents should not need to postulate libertarian free will as an explanation for why the conditionals they visualize are reasonable.
• A rational decision algorithm computes its choice in a single sweep, rather than repeatedly needing to observe itself and update and rethink until an equilibrium has been reached.
• The principle of rational choice is simple.

CDT's view:

• Rational agents end up with the best alternative that is being offered them, but many problems are unfair in that they will offer rational agents worse options than irrational agents receive.
• Playing the Prisoner's Dilemma against a Fairbot that knows your code, encountering a lorry-driver in the desert who can read your facial expressions well, and being absent-minded enough to not remember if this is the second time you've seen this intersection, are all unfair problems that present rational agents with poorer choices than those available to irrational agents.
• The thing that it is rational to do right now is computed in a very different way from the algorithm that we'd want our future selves to use, if any future events will be causally affected by our choice of algorithm (e.g. Omega is about to look at our brain, we can make a visible precommitment, somebody has a quantitatively non-zero ability to read facial expressions, our source code is visible on the Ethereum network).
• The 'algorithm that it is rational to want to run', if it can be implemented via self-modification, should always keep track of whether information about it has been obtained through causal processes that started after the moment of self-modification; if not, CDT behavior remains rational.
• Rational agents would not want to accumulate a reputation for being rational; if your choices now affect how people will model your future behavior, you want them to believe your behavior is determined by some non-rational algorithm, so you should behave consistently with that algorithm whenever you are being watched.
• The conditionals that it is rational to compute for expected utility are those which involve visualizing a single physical act as though that act had originated acausally, not changing any background variables or past events, and then visualizing the world running forward from there. There is no particular reason to think that this conditional should involve visualizing worlds not known to be impossible. Some odd circumstances may cause conditioning on our actual action to lead us to visualize something that isn't the actual world, but them's the rules.
• A rational decision algorithm, to incorporate as much information as possible, may need to repeatedly observe itself and update its view of the background variables based on that impulse; randomizing one's own actions may be necessary to prevent infinite loops. That may sound complicated, but such are the fine details of the principle of rational choice.

Arguendo by LDT advocates, this situation overwhelmingly favors LDT. Indeed, nobody considering the full stories side-by-side from a pretheoretic stance should see any factor favoring CDT (arguendo by LDT advocates). The only critique of LDT is that the conditionals LDT uses are 'irrational' in virtue of not being the same as CDT's local-act counterfactuals, and the only argument favoring CDT's conditionals is that they are the truly right conditionals because they match the local-act counterfactuals. This charge is not persuasive unless one has already adopted CDT; and from the standpoint of somebody considering both decision theories for the first time, there is no bootstrap reason to adopt CDT over LDT long enough for this anti-LDT charge to be convincing. Unless you were initially driven to adopt CDT via CDT being presented as a superior alternative to EDT without LDT having been considered, there is no way to end up in a standpoint where LDT seems worse than CDT. It does happen to be the case historically that CDT was adopted as a superior alternative to EDT without LDT being considered at the time, but this is a case of mere path dependency and we should snap out of it (arguendo according to LDT advocates).

# Further reading

[todo: add further links here, actually start filling out the redlinks]

## Comments

Ben Plommer

On the majority view within contemporary decision theory, this is the reply to the "If you're so rational, why aincha rich?" argument in favor of one-boxing on Newcomb's Problem. Somebody who actually takes only Box B is merely 'managing the news' about Box B, not actually acting to maximize the causal impacts of their actions. Omega choosing to reward people who only take Box B is akin to happening to already have toxoplasmosis at the start of the decision problem, or Omega deciding to reward only evidential decision theorists. Evidential agents only seem to win in 'Why aincha rich?' scenarios because they're managing the news in a way that an artificial problem setup declares to be news about wealth.

This isn't quite right as an exposition of Lewis's argument – it elides the distinction between the irrationality of "managing the news" and the way that (according to Lewis) the scenario pre-rewards an irrational choice. Evidential agents don't just "seem" to win – they really do win, because the scenario is set up to arbitrarily pre-reward them for being the kind of agents who one-box. Furthermore, it's claimed that the behaviour which is thereby arbitrarily pre-rewarded is irrational, because it amounts to managing the news.

The sense in which one-boxing is said to be irrational news-management is that doing so will give you evidence that you have been pre-rewarded, but won't causally affect the contents of the box – if you're an evidential agent, and have been pre-rewarded as such, you would still get the \$1m if you were to miraculously, unforeseeably switch to two-boxing; and if you're a causal agent, and have been pre-punished as such, you would still not get the \$1m if you were to miraculously, unforeseeably switch to one-boxing. The kind of agents that one-box really do do better, but once you've been rewarded for being that kind of person you may as well act contrary to your kind and two-box anyway, despite the negative news value of doing so.