High-speed intro to Bayes's rule

https://arbital.com/p/bayes_rule_fast_intro

by Eliezer Yudkowsky Sep 29 2016 updated Dec 25 2017

A high-speed introduction to Bayes's Rule on one page, for the impatient and mathematically adept.


(This is a high-speed introduction to Bayes' rule for people who want to get straight to it and are good at math. If you'd like a gentler or more thorough introduction, try starting at the Bayes' Rule Guide page instead.)

Percentages, frequencies, and waterfalls

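Suppose you're screening a set of patients for a disease, which we'll call Diseasitis.%note:Lit. "inflammation of the disease".% Your initial test is a tongue depressor containing a chemical strip, which usually turns black if the patient has Diseasitis. Among the patients you screen, 20% have Diseasitis; the strip turns black for 90% of sick patients and for 30% of healthy patients.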

What fraction of patients with black tongue depressors have Diseasitis?

%%hidden(Answer): 3/7 or 43%, quickly obtainable as follows: In the screened population, there's 1 sick patient for 4 healthy patients. Sick patients are 3 times more likely to turn the tongue depressor black than healthy patients. $~$(1 : 4) \cdot (3 : 1) = (3 : 4)$~$ or 3 sick patients to 4 healthy patients among those that turn the tongue depressor black, corresponding to a probability of $~$3/7 = 43\%$~$ that the patient is sick.%%

(Take your own stab at answering this question, then please click "Answer" above to read the answer before continuing.)

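Bayes' rule is a theorem which describes the general form of the operation we carried out to find the answer above. In the form we used above, we started from the prior odds of (1 : 4) for sick versus healthy patients, multiplied by the likelihood ratio of (3 : 1) for sick versus healthy patients turning the tongue depressor black, and arrived at posterior odds of (3 : 4) for sick versus healthy patients.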

Bayes' rule in this form thus states that the prior odds times the likelihood ratio equals the posterior odds.

We could also potentially see the positive test result as revising a prior belief or prior probability of 20% that the patient was sick, to a posterior belief or posterior probability of 43%.
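The odds calculation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `posterior_odds` and `odds_to_probability` are made up for this example, and the numbers are the ones used in the Diseasitis problem (20% sick, 90% versus 30% chance of a black strip).

```python
from fractions import Fraction

def posterior_odds(prior, likelihood):
    """Multiply prior odds by a likelihood ratio, component by component."""
    return tuple(p * l for p, l in zip(prior, likelihood))

def odds_to_probability(odds):
    """Probability of the first hypothesis, given odds (first : second)."""
    return Fraction(odds[0], sum(odds))

prior = (1, 4)        # sick : healthy, i.e. 20% : 80%
likelihood = (3, 1)   # black strip: 90% for sick vs 30% for healthy
posterior = posterior_odds(prior, likelihood)
p_sick = odds_to_probability(posterior)

print(posterior)      # odds of sick : healthy after a black strip
print(p_sick, float(p_sick))
```

Using exact fractions keeps 3/7 visible instead of a rounded decimal.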

To make it clearer that we did the correct calculation above, and further pump intuitions for Bayes' rule, we'll walk through some additional visualizations.

Frequency representation

The frequency representation of Bayes' rule would describe the problem as follows: "Among 100 patients, there will be 20 sick patients and 80 healthy patients."

(Figure: prior frequency)

"18 out of 20 sick patients will turn the tongue depressor black. 24 out of 80 healthy patients will blacken the tongue depressor."

(Figure: posterior frequency)

"Therefore, there are (18+24)=42 patients who turn the tongue depressor black, among whom 18 are actually sick. (18/42)=(3/7)=43%."

(Some experiments show %%note: E.g. "Probabilistic reasoning in clinical medicine" by David M. Eddy (1982).%% that this way of explaining the problem is the easiest for e.g. medical students to understand, so you may want to remember this format for future use. Assuming you can't just send them to Arbital!)
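As a rough sketch, the frequency version is just counting. The variable names below are illustrative; the population of 100 and the percentages are the ones quoted in the frequency description.

```python
from fractions import Fraction

patients = 100
sick = patients * 20 // 100           # 20 sick patients
healthy = patients - sick             # 80 healthy patients

sick_black = sick * 90 // 100         # sick patients with a black strip
healthy_black = healthy * 30 // 100   # healthy patients with a black strip
black = sick_black + healthy_black    # everyone with a black strip

fraction_sick = Fraction(sick_black, black)
print(sick_black, healthy_black, black, fraction_sick)
```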

Waterfall representation

The waterfall representation may make clearer why we're also allowed to transform the problem into prior odds and a likelihood ratio, and multiply (1 : 4) by (3 : 1) to get posterior odds of (3 : 4) and a probability of 3/7.

The following problem is isomorphic to the Diseasitis one:

"A waterfall has two streams of water at the top, a red stream and a blue stream. These streams flow down the waterfall, with some of each stream being diverted off to the side, and the remainder pools at the bottom of the waterfall."

(Figure: unlabeled waterfall)

"At the top of the waterfall, there's around 20 gallons/second flowing from the red stream, and 80 gallons/second flowing from the blue stream. 90% of the red water makes it to the bottom of the waterfall, and 30% of the blue water makes it to the bottom of the waterfall. Of the purplish water that mixes at the bottom, what fraction is from the red stream versus the blue stream?"

(Figure: labeled waterfall)

We can see from staring at the diagram that the prior odds and likelihood ratio are the only numbers we need to arrive at the answer:

(Figure: wide vs narrow waterfall)

So only the ratio of red to blue water at the top (prior odds of the proposition), and only the ratio between the percentages of red and blue water that make it to the bottom (likelihood ratio of the evidence), together determine the posterior ratio at the bottom: 3 parts red to 4 parts blue.
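One way to check that only the ratios matter is to compare two waterfalls with the same ratios but different absolute sizes. In the sketch below, the "wide" waterfall uses the numbers from the problem; the "narrow" one is a hypothetical waterfall invented for this example, with the same 1 : 4 ratio at the top and the same 3 : 1 ratio of pass-through fractions.

```python
from fractions import Fraction

def red_fraction_at_bottom(red_top, blue_top, red_through, blue_through):
    """Fraction of the water at the bottom that came from the red stream."""
    red_bottom = red_top * red_through
    blue_bottom = blue_top * blue_through
    return red_bottom / (red_bottom + blue_bottom)

# Waterfall from the problem: 20 and 80 gallons/second, 90% and 30% reach the bottom.
wide = red_fraction_at_bottom(Fraction(20), Fraction(80),
                              Fraction(90, 100), Fraction(30, 100))

# Hypothetical narrower waterfall: same 1 : 4 and 3 : 1 ratios, smaller numbers.
narrow = red_fraction_at_bottom(Fraction(1), Fraction(4),
                                Fraction(3, 10), Fraction(1, 10))

print(wide, narrow)
```

Both calls give the same fraction, because rescaling either the flows at the top or the pass-through percentages leaves the red-to-blue ratio at the bottom unchanged.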

Test problem

Here's another Bayesian problem to attempt. If you successfully solved the earlier problem on your first try, you might try doing this one in your head.

10% of widgets are bad and 90% are good. 4% of good widgets emit sparks, and 12% of bad widgets emit sparks. What percentage of sparking widgets are bad?

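%%hidden(Answer): 1/4 or 25%. There is 1 bad widget for every 9 good widgets, and bad versus good widgets have a 12 : 4 relative likelihood of sparking. $~$(1 : 9) \cdot (12 : 4) = (12 : 36) \cong (1 : 3)$~$ or 1 bad sparking widget for every 3 good sparking widgets, corresponding to a probability of $~$1/4 = 25\%$~$ that a sparking widget is bad.%%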

(If you're having trouble using odds ratios to represent uncertainty, see this intro or this page.)
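For readers who want to check their answer to the widget problem mechanically, here is a minimal sketch using the same prior-odds-times-likelihood-ratio recipe. The variable names are illustrative only.

```python
from fractions import Fraction

prior_bad, prior_good = 10, 90   # 10% bad, 90% good
spark_bad, spark_good = 12, 4    # 12% of bad and 4% of good widgets spark

# Relative weight of "bad and sparking" versus "good and sparking".
post_bad = prior_bad * spark_bad
post_good = prior_good * spark_good

p_bad_given_spark = Fraction(post_bad, post_bad + post_good)
print(post_bad, post_good, p_bad_given_spark)
```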

General equation and proof

To say exactly what we're doing and prove its validity, we need to introduce some notation from probability theory.

If $~$X$~$ is a proposition, $~$\mathbb P(X)$~$ will denote $~$X$~$'s probability, our quantitative degree of belief in $~$X.$~$

$~$\neg X$~$ will denote the negation of $~$X$~$ or the proposition "$~$X$~$ is false".

If $~$X$~$ and $~$Y$~$ are propositions, then $~$X \wedge Y$~$ denotes the proposition that both $~$X$~$ and $~$Y$~$ are true. Thus $~$\mathbb P(X \wedge Y)$~$ denotes "The probability that $~$X$~$ and $~$Y$~$ are both true."

We now define conditional probability:

$$~$\mathbb P(X|Y) := \dfrac{\mathbb P(X \wedge Y)}{\mathbb P(Y)} \tag*{(definition of conditional probability)}$~$$

We pronounce $~$\mathbb P(X|Y)$~$ as "the conditional probability of $~$X$~$, given $~$Y$~$". Intuitively, this is supposed to mean "The probability that $~$X$~$ is true, assuming that proposition $~$Y$~$ is true".

Defining conditional probability in this way means that to get "the probability that a patient is sick, given that they turned the tongue depressor black" we should put all the sick plus healthy patients with positive test results into a bag, and ask about the probability of drawing a patient who is sick and got a positive test result from that bag. In other words, we perform the calculation $~$\frac{18}{18+24} = \frac{3}{7}.$~$

(Figure: diseasitis frequency)

Rearranging the definition of conditional probability, $~$\mathbb P(X \wedge Y) = \mathbb P(Y) \cdot \mathbb P(X|Y).$~$ So to find "the fraction of all patients that are sick and get a positive result", we multiply "the fraction of patients that are sick" times "the probability that a sick patient blackens the tongue depressor".
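The definition and its rearrangement can be checked against a small joint distribution. The sketch below assumes the Diseasitis frequencies per 100 patients (18 sick and positive, 24 healthy and positive, and therefore 2 sick and negative, 56 healthy and negative); the helper names `marginal` and `conditional` are made up for this example.

```python
from fractions import Fraction

# Joint probabilities P(health AND result), from counts per 100 patients.
joint = {
    ("sick", "positive"): Fraction(18, 100),
    ("sick", "negative"): Fraction(2, 100),
    ("healthy", "positive"): Fraction(24, 100),
    ("healthy", "negative"): Fraction(56, 100),
}

def marginal(index, value):
    """P(health = value) if index is 0, P(result = value) if index is 1."""
    return sum(p for key, p in joint.items() if key[index] == value)

def conditional(health, result):
    """P(health | result) = P(health AND result) / P(result)."""
    return joint[(health, result)] / marginal(1, result)

p_sick_given_positive = conditional("sick", "positive")

# Rearranged definition: P(positive AND sick) = P(sick) * P(positive | sick).
p_sick = marginal(0, "sick")
p_positive_given_sick = joint[("sick", "positive")] / p_sick
product = p_sick * p_positive_given_sick

print(p_sick_given_positive, p_positive_given_sick, product)
```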

We're now ready to prove Bayes' rule in the form, "the prior odds times the likelihood ratio equals the posterior odds".

The "prior odds" is the ratio of sick to healthy patients:

$$~$\frac{\mathbb P(sick)}{\mathbb P(healthy)} \tag*{(prior odds)}$~$$

The "likelihood ratio" is how much more relatively likely a sick patient is to get a positive test result (turn the tongue depressor black), compared to a healthy patient:

$$~$\frac{\mathbb P(positive | sick)}{\mathbb P(positive | healthy)} \tag*{(likelihood ratio)}$~$$

The "posterior odds" is the odds that a patient is sick versus healthy, given that they got a positive test result:

$$~$\frac{\mathbb P(sick | positive)}{\mathbb P(healthy | positive)} \tag*{(posterior odds)}$~$$

Bayes' rule asserts that prior odds times likelihood ratio equals posterior odds:

$$~$\frac{\mathbb P(sick)}{\mathbb P(healthy)} \cdot \frac{\mathbb P(positive | sick)}{\mathbb P(positive | healthy)} = \frac{\mathbb P(sick | positive)}{\mathbb P(healthy | positive)}$~$$

We will show this by proving the general form of Bayes' rule. For any two hypotheses $~$H_j$~$ and $~$H_k$~$ and any piece of new evidence $~$e_0$~$:

$$~$ \frac{\mathbb P(H_j)}{\mathbb P(H_k)} \cdot \frac{\mathbb P(e_0 | H_j)}{\mathbb P(e_0 | H_k)} = \frac{\mathbb P(e_0 \wedge H_j)}{\mathbb P(e_0 \wedge H_k)} = \frac{\mathbb P(e_0 \wedge H_j)/\mathbb P(e_0)}{\mathbb P(e_0 \wedge H_k)/\mathbb P(e_0)} = \frac{\mathbb P(H_j | e_0)}{\mathbb P(H_k | e_0)} $~$$

In the Diseasitis example, this corresponds to performing the operations:

$$~$ \frac{0.20}{0.80} \cdot \frac{0.90}{0.30} = \frac{0.18}{0.24} = \frac{0.18/0.42}{0.24/0.42} = \frac{0.43}{0.57} $~$$
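Each step of this chain can be reproduced with exact fractions, which avoids the rounding in 0.43 and 0.57. This is only a numerical check of the Diseasitis instance, not a substitute for the proof; the names `step1`, `step2` and `step3` are illustrative.

```python
from fractions import Fraction

p_sick, p_healthy = Fraction(20, 100), Fraction(80, 100)
p_pos_given_sick, p_pos_given_healthy = Fraction(90, 100), Fraction(30, 100)

# Prior odds times likelihood ratio.
step1 = (p_sick / p_healthy) * (p_pos_given_sick / p_pos_given_healthy)

# Ratio of joint probabilities P(positive AND sick) : P(positive AND healthy).
joint_sick = p_sick * p_pos_given_sick
joint_healthy = p_healthy * p_pos_given_healthy
step2 = joint_sick / joint_healthy

# Divide both joints by P(positive) to get the posterior probabilities.
p_positive = joint_sick + joint_healthy
post_sick = joint_sick / p_positive
post_healthy = joint_healthy / p_positive
step3 = post_sick / post_healthy

print(step1, step2, step3, post_sick, post_healthy)
```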

Using red for sick, blue for healthy, grey for a mix of sick and healthy patients, and + signs for positive test results, the proof above can be visualized as follows:

(Figure: bayes venn)

Bayes' theorem

An alternative form, sometimes called "Bayes' theorem" to distinguish it from "Bayes' rule" (although not everyone follows this convention), uses absolute probabilities instead of ratios. The law of marginal probability states that for any set of mutually exclusive and exhaustive possibilities $~$\{X_1, X_2, …, X_n\}$~$ and any proposition $~$Y$~$:

$$~$\mathbb P(Y) = \sum_i \mathbb P(Y \wedge X_i) \tag*{(law of marginal probability)}$~$$

Then, for mutually exclusive and exhaustive hypotheses $~$H_i$~$, we can derive an expression for the absolute (non-relative) probability of a hypothesis $~$H_k$~$ after observing evidence $~$e_0$~$ as follows:

$$~$ \mathbb P(H_k | e_0) = \frac{\mathbb P(H_k \wedge e_0)}{\mathbb P(e_0)} = \frac{\mathbb P(e_0 \wedge H_k)}{\sum_i \mathbb P(e_0 \wedge H_i)} = \frac{\mathbb P(e_0 | H_k) \cdot \mathbb P(H_k)}{\sum_i \mathbb P(e_0 | H_i) \cdot \mathbb P(H_i)} $~$$

The equation of the first and last terms above is what you will usually see described as Bayes' theorem.
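A minimal sketch of the probability form, assuming the hypotheses are mutually exclusive and exhaustive and are given as parallel lists of priors and likelihoods. The function name `bayes_theorem` is just a label for this example.

```python
from fractions import Fraction

def bayes_theorem(priors, likelihoods):
    """Return P(H_i | e) for each hypothesis, given P(H_i) and P(e | H_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)   # law of marginal probability: P(e)
    return [j / evidence for j in joint]

# Diseasitis: hypotheses are (sick, healthy), evidence is a black strip.
priors = [Fraction(1, 5), Fraction(4, 5)]
likelihoods = [Fraction(9, 10), Fraction(3, 10)]
posterior = bayes_theorem(priors, likelihoods)

print(posterior)
```

The denominator is the same for every hypothesis, which is why it cancels when we take ratios and work with odds instead.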

To see why this decomposition might be useful, note that $~$\mathbb P(sick | positive)$~$ is an inferential step, a conclusion that we make after observing a new piece of evidence. $~$\mathbb P(positive | sick)$~$ is a piece of causal information we are likely to have on hand, for example by testing groups of sick patients to see how many of them turn the tongue depressor black. $~$\mathbb P(sick)$~$ describes our state of belief before making any new observations. So Bayes' theorem can be seen as taking what we already believe about the world (including our prior belief about how different imaginable states of affairs would generate different observations), plus an actual observation, and outputting a new state of belief about the world.

Vector and functional generalizations

Since the proof of Bayes' rule holds for any pair of hypotheses, it also holds for relative belief in any number of hypotheses. Furthermore, we can repeatedly multiply by likelihood ratios to chain together any number of pieces of evidence.

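Suppose there's a bathtub full of coins: half of the coins are fair, a third come up heads 25% of the time, and the remaining sixth come up heads 75% of the time.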

You randomly draw a coin, flip it three times, and get the result HTH. What's the chance this is a fair coin?

We can validly calculate the answer as follows:

$$~$ \begin{array}{rll} & (3 : 2 : 1) & \cong (\frac{1}{2} : \frac{1}{3} : \frac{1}{6}) \\ \times & (2 : 1 : 3) & \cong ( \frac{1}{2} : \frac{1}{4} : \frac{3}{4} ) \\ \times & (2 : 3 : 1) & \cong ( \frac{1}{2} : \frac{3}{4} : \frac{1}{4} ) \\ \times & (2 : 1 : 3) & \\ = & (24 : 6 : 9) & \cong (8 : 2 : 3) \cong (\frac{8}{13} : \frac{2}{13} : \frac{3}{13}) \end{array} $~$$

So the posterior probability the coin is fair is 8/13 or ~62%.

This is one reason it's good to know the odds form of Bayes' rule, not just the probability form in which Bayes' theorem is often given.%%note: Imagine trying to do the above calculation by repeatedly applying the form of the theorem that says: $$~$\mathbb P(H_k | e_0) = \frac{\mathbb P(e_0 | H_k) \cdot \mathbb P(H_k)}{\sum_i \mathbb P(e_0 | H_i) \cdot \mathbb P(H_i)}$~$$ %%
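The coin calculation is easy to mechanize as elementwise multiplication of odds vectors. The sketch below takes the prior (3 : 2 : 1) and the per-flip likelihoods (2 : 1 : 3) for heads and (2 : 3 : 1) for tails from the calculation above; the helper names `update` and `simplify` are invented for this example.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def update(odds, likelihood):
    """Multiply relative odds by relative likelihoods, elementwise."""
    return [o * l for o, l in zip(odds, likelihood)]

def simplify(odds):
    """Divide integer odds by their greatest common divisor."""
    divisor = reduce(gcd, odds)
    return [o // divisor for o in odds]

prior = [3, 2, 1]   # fair : 25%-heads coin : 75%-heads coin
heads = [2, 1, 3]   # relative likelihood of heads: 1/2 : 1/4 : 3/4
tails = [2, 3, 1]   # relative likelihood of tails: 1/2 : 3/4 : 1/4

odds = prior
for flip in "HTH":
    odds = update(odds, heads if flip == "H" else tails)

simplified = simplify(odds)
p_fair = Fraction(simplified[0], sum(simplified))
print(odds, simplified, p_fair)
```

No normalization is needed until the very end, which is what makes the odds form convenient for chaining evidence.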

We can generalize further by writing Bayes' rule in a functional form. If $~$\mathbb O(H_i)$~$ is a relative belief vector or relative belief function on the variable $~$H,$~$ and $~$\mathcal L(e_0 | H_i)$~$ is the likelihood function giving the relative chance of observing evidence $~$e_0$~$ given each possible state of affairs $~$H_i,$~$ then relative posterior belief $~$\mathbb O(H_i | e_0)$~$ is given by:

$$~$\mathbb O(H_i | e_0) = \mathcal L(e_0 | H_i) \cdot \mathbb O(H_i)$~$$

If we normalize the relative odds $~$\mathbb O$~$ into absolute probabilities $~$\mathbb P$~$ - that is, divide through $~$\mathbb O$~$ by its sum or integral so that the new function sums or integrates to $~$1$~$ - then we obtain Bayes' rule for probability functions:

$$~$\mathbb P(H_i | e_0) \propto \mathcal L(e_0 | H_i) \cdot \mathbb P(H_i) \tag*{(functional form of Bayes' rule)}$~$$
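To illustrate the functional form, here is a sketch over a discretized hypothesis space. Everything specific in it is an assumption made for the example rather than something from the text: the hypotheses are possible heads-probabilities on a grid of 101 values, the prior is uniform, and the evidence is the sequence HTH.

```python
from fractions import Fraction

# Hypothetical hypothesis space: the coin's heads-probability is k/100.
grid = [Fraction(k, 100) for k in range(101)]
prior = [Fraction(1, len(grid))] * len(grid)   # assumed uniform prior

def likelihood(theta, flips):
    """Probability of the observed flips if heads has probability theta."""
    result = Fraction(1)
    for flip in flips:
        result *= theta if flip == "H" else 1 - theta
    return result

def normalize(weights):
    """Divide through by the sum so that the result sums to 1."""
    total = sum(weights)
    return [w / total for w in weights]

# Relative posterior belief: likelihood times prior, pointwise.
relative = [likelihood(theta, "HTH") * p for theta, p in zip(grid, prior)]
posterior = normalize(relative)

most_probable = grid[max(range(len(grid)), key=lambda i: posterior[i])]
print(float(most_probable))
```

With a finer grid the sum in `normalize` approaches an integral, which is the continuous version of the same operation.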

Applications of Bayesian reasoning

This general Bayesian framework - prior belief, evidence, posterior belief - is a lens through which we can view a lot of formal and informal reasoning plus a large amount of entirely nonverbal cognitive-ish phenomena.%%note:This broad statement is widely agreed. Exactly which phenomena are good to view through a Bayesian lens is sometimes disputed.%%

Examples of people who might want to study Bayesian reasoning include:

The third application is probably of the widest general interest.

Example human applications of Bayesian reasoning

Philip Tetlock found when studying "superforecasters", people who were especially good at predicting future events:

"The superforecasters are a numerate bunch: many know about Bayes' theorem and could deploy it if they felt it was worth the trouble. But they rarely crunch the numbers so explicitly. What matters far more to the superforecasters than Bayes' theorem is Bayes' core insight of gradually getting closer to the truth by constantly updating in proportion to the weight of the evidence." — Philip Tetlock and Dan Gardner, Superforecasting

This is some evidence that knowing about Bayes' rule and understanding its qualitative implications is a factor in delivering better-than-average intuitive human reasoning. This pattern is illustrated in the next couple of examples.

The OKCupid date.

One realistic example of Bayesian reasoning was deployed by one of the early test volunteers for a much earlier version of a guide to Bayes' rule. She had scheduled a date with a 96% OKCupid match, who had then canceled that date without further explanation. After spending some mental time bouncing back and forth between "that doesn't seem like a good sign" versus "maybe there was a good reason he canceled", she decided to try looking at the problem using that Bayes thing she'd just learned about. She estimated:

This implied posterior odds of 2 : 15 for a desirable versus an undesirable date, which was unfavorable enough not to pursue him further.%%note: She sent him what might very well have been the first explicitly Bayesian rejection notice in dating history, reasoning that if he wrote back with a Bayesian counterargument, this would promote him to being interesting again. He didn't write back.%%

The point of looking at the problem this way is not that she knew exact probabilities and could calculate that the man had an exactly 88% chance of being undesirable. Rather, by breaking up the problem in that way, she was able to summarize what she thought she knew in compact form, see what those beliefs already implied, and stop bouncing back and forth between imagined reasons why a good date might cancel versus reasons to protect herself from potential bad dates. An answer roughly in the range of 15/17 made the decision clear.
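The conversion from odds to the rough probability quoted above is a one-liner. The sketch reads the 2 : 15 odds as desirable versus undesirable, which is the reading consistent with the 88% and 15/17 figures in the text.

```python
from fractions import Fraction

desirable, undesirable = 2, 15   # posterior odds, desirable : undesirable

p_undesirable = Fraction(undesirable, desirable + undesirable)
print(p_undesirable, round(float(p_undesirable), 2))
```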

Internment of Japanese-Americans during World War II

From Robyn Dawes's Rational Choice in an Uncertain World:

Post-hoc fitting of evidence to hypothesis was involved in a most grievous chapter in United States history: the internment of Japanese-Americans at the beginning of the Second World War. When California governor Earl Warren testified before a congressional hearing in San Francisco on February 21, 1942, a questioner pointed out that there had been no sabotage or any other type of espionage by the Japanese-Americans up to that time. Warren responded, "I take the view that this lack [of subversive activity] is the most ominous sign in our whole situation. It convinces me more than perhaps any other factor that the sabotage we are to get, the Fifth Column activities we are to get, are timed just like Pearl Harbor was timed… I believe we are just being lulled into a false sense of security."

You might want to take your own shot at guessing what Dawes had to say about a Bayesian view of this situation, before reading further.

%%hidden(Answer):Suppose we put ourselves into the shoes of this congressional hearing, and imagine ourselves trying to set up this problem.

As soon as we set up this problem, we realize that, whatever the probability of "no sabotage" being observed if there is a conspiracy, the likelihood of observing "no sabotage" if there isn't a conspiracy must be even higher. This means that the likelihood ratio:

$$~$\frac{\mathbb P(\neg \text{sabotage} | \text {conspiracy})}{\mathbb P(\neg \text {sabotage} | \neg \text {conspiracy})}$~$$

…must be less than 1, and accordingly:

$$~$ \frac{\mathbb P(\text {conspiracy} | \neg \text{sabotage})}{\mathbb P(\neg \text {conspiracy} | \neg \text{sabotage})} = \frac{\mathbb P(\text {conspiracy})}{\mathbb P(\neg \text {conspiracy})} \cdot \frac{\mathbb P(\neg \text{sabotage} | \text {conspiracy})}{\mathbb P(\neg \text {sabotage} | \neg \text {conspiracy})} < \frac{\mathbb P(\text {conspiracy})}{\mathbb P(\neg \text {conspiracy})} $~$$

Observing the total absence of any sabotage can only decrease our estimate that there's a Japanese-American Fifth Column, not increase it. (It definitely shouldn't be "the most ominous" sign that convinces us "more than any other factor" that the Fifth Column exists.)

Again, what matters is not the exact likelihood of observing no sabotage given that a Fifth Column actually exists. As soon as we set up the Bayesian problem, we can see there's something qualitatively wrong with Earl Warren's reasoning.%%
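The qualitative point can be illustrated numerically. All numbers in the sketch below are hypothetical placeholders chosen for the example, not estimates from Dawes or from the historical record; the only structural assumption is the one argued above, that "no sabotage" is more likely without a conspiracy than with one.

```python
from fractions import Fraction

def posterior_odds(prior_odds, p_evidence_if_true, p_evidence_if_false):
    """Odds form of Bayes' rule: prior odds times likelihood ratio."""
    return prior_odds * p_evidence_if_true / p_evidence_if_false

prior = Fraction(1, 10)                     # hypothetical prior odds of a conspiracy
p_quiet_if_no_conspiracy = Fraction(99, 100)  # hypothetical

results = []
# Try several hypothetical values of P(no sabotage | conspiracy), all below 99/100.
for p_quiet_if_conspiracy in (Fraction(1, 100), Fraction(1, 2), Fraction(9, 10)):
    results.append(
        posterior_odds(prior, p_quiet_if_conspiracy, p_quiet_if_no_conspiracy)
    )

print(results)
```

However the placeholder values are chosen, a likelihood ratio below 1 leaves the posterior odds below the prior odds.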

Further reading

This has been a very brief and high-speed presentation of Bayes and Bayesianism. It should go without saying that a vast literature, nay, a universe of literature, exists on Bayesian statistical methods and Bayesian epistemology and Bayesian algorithms in machine learning. Staying inside Arbital, you might be interested in moving on to read:

More on the technical side of Bayes' rule

More on intuitive implications of Bayes' rule