Bayes' rule: Probability form

https://arbital.com/p/bayes_rule_probability

by Nate Soares Jul 6 2016 updated Aug 13 2017

The original formulation of Bayes' rule.


[summary: The formulation of Bayes' rule you are most likely to see in textbooks says:

$$~$\mathbb P(H_i\mid e) = \dfrac{\mathbb P(e\mid H_i) \cdot \mathbb P(H_i)}{\sum_k \mathbb P(e\mid H_k) \cdot \mathbb P(H_k)}$~$$

This follows from the definition of conditional probability, which states that $~$\mathbb P(X \mid Y) = \frac{\mathbb P(X \wedge Y)}{\mathbb P (Y)}$~$, and the [law_marginal_probability law of marginal probability], which says that $~$\mathbb P(Y) = \sum_k \mathbb P(Y \wedge X_k)$~$.

We can think of the corresponding advice as saying, "Think of how much each hypothesis in $~$H$~$ contributed to our expectation of seeing the evidence $~$e$~$, including both the likelihood of seeing $~$e$~$ if $~$H_k$~$ is true, and the prior probability of $~$H_k$~$. The posterior of $~$H_i$~$ after seeing $~$e,$~$ is the amount $~$H_i$~$ contributed to our expectation of seeing $~$e,$~$ within the total expectation of seeing $~$e$~$ contributed by every hypothesis in $~$H.$~$"]

The formulation of Bayes' rule you are most likely to see in textbooks runs as follows:

$$~$\mathbb P(H_i\mid e) = \dfrac{\mathbb P(e\mid H_i) \cdot \mathbb P(H_i)}{\sum_k \mathbb P(e\mid H_k) \cdot \mathbb P(H_k)}$~$$

Where:

- $~$H_1, H_2, \ldots$~$ are the hypotheses under consideration, assumed to be mutually exclusive and exhaustive;
- $~$e$~$ is the evidence observed;
- $~$\mathbb P(H_i)$~$ is the prior probability of $~$H_i$~$;
- $~$\mathbb P(e\mid H_i)$~$ is the likelihood of seeing $~$e$~$ if $~$H_i$~$ is true;
- $~$\mathbb P(H_i\mid e)$~$ is the posterior probability of $~$H_i$~$ after seeing $~$e$~$; and
- the sum $~$\sum_k$~$ in the denominator runs over every hypothesis $~$H_k$~$.

As a quick example, let's say there's a bathtub full of potentially biased coins: 40% of the coins are of type 1 and produce heads 50% of the time; 35% are of type 2 and produce heads 70% of the time; and 25% are of type 3 and produce heads only 20% of the time.

We want to know the posterior probability that a randomly drawn coin is of type 2, after flipping the coin once and seeing it produce heads once.

Let $~$H_1, H_2, H_3$~$ stand for the hypotheses that the coin is of types 1, 2, and 3 respectively. Then using conditional probability notation, we want to know the probability $~$\mathbb P(H_2 \mid heads).$~$

The probability form of Bayes' theorem says:

$$~$\mathbb P(H_2 \mid heads) = \frac{\mathbb P(heads \mid H_2) \cdot \mathbb P(H_2)}{\sum_k \mathbb P(heads \mid H_k) \cdot \mathbb P(H_k)}$~$$

Expanding the sum:

$$~$\mathbb P(H_2 \mid heads) = \frac{\mathbb P(heads \mid H_2) \cdot \mathbb P(H_2)}{[\mathbb P(heads \mid H_1) \cdot \mathbb P(H_1)] + [\mathbb P(heads \mid H_2) \cdot \mathbb P(H_2)] + [\mathbb P(heads \mid H_3) \cdot \mathbb P(H_3)]}$~$$

Computing the actual quantities:

$$~$\mathbb P(H_2 \mid heads) = \frac{0.70 \cdot 0.35 }{[0.50 \cdot 0.40] + [0.70 \cdot 0.35] + [0.20 \cdot 0.25]} = \frac{0.245}{0.20 + 0.245 + 0.05} = 0.\overline{49}$~$$

This calculation was big and messy, which is fine: the probability form of Bayes' theorem is okay for directly grinding through the numbers, but not so good for doing things in your head.
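If you'd rather let a computer do the grinding, here's a minimal Python sketch of the same calculation, using the priors and likelihoods from the example above:

```python
# Priors P(H_k): the fraction of coins of each type in the bathtub.
priors = {1: 0.40, 2: 0.35, 3: 0.25}
# Likelihoods P(heads | H_k): how often each coin type produces heads.
heads_given = {1: 0.50, 2: 0.70, 3: 0.20}

# Denominator: our total prior expectation of seeing heads.
p_heads = sum(heads_given[k] * priors[k] for k in priors)

# Bayes' rule: posterior probability of each coin type after one heads.
posterior = {k: heads_given[k] * priors[k] / p_heads for k in priors}

print(p_heads)       # 0.495
print(posterior[2])  # 0.4949..., matching the answer above
```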

Meaning

We can think of the advice of Bayes' theorem as saying:

"Think of how much each hypothesis in $~$H$~$ contributed to our expectation of seeing the evidence $~$e$~$, including both the likelihood of seeing $~$e$~$ if $~$H_k$~$ is true, and the prior probability of $~$H_k$~$. The posterior of $~$H_i$~$ after seeing $~$e,$~$ is the amount $~$H_i$~$ contributed to our expectation of seeing $~$e,$~$ within the total expectation of seeing $~$e$~$ contributed by every hypothesis in $~$H.$~$"

Or to say it at somewhat greater length:

Imagine each hypothesis $~$H_1,H_2,H_3\ldots$~$ as an expert who has to distribute the probability of their predictions among all possible pieces of evidence. We can imagine this more concretely by visualizing "probability" as a lump of clay.

The total amount of clay is one kilogram (probability $~$1$~$). Each expert $~$H_k$~$ has been allocated a fraction $~$\mathbb P(H_k)$~$ of that kilogram. For example, if $~$\mathbb P(H_4)=\frac{1}{5}$~$ then expert 4 has been allocated 200 grams of clay.

We're playing a game with the experts to determine which one is the best predictor.

Each time we're about to make an observation $~$E,$~$ each expert has to divide up all their clay among the possible outcomes $~$e_1, e_2, \ldots.$~$

After we observe that $~$E = e_j,$~$ we take away all the clay that wasn't put onto $~$e_j.$~$ And then our new belief in all the experts is the relative amount of clay that each expert has left.

So to know how much we now believe in expert $~$H_4$~$ after observing $~$e_3,$~$ say, we need to know two things: First, the amount of clay that $~$H_4$~$ put onto $~$e_3,$~$ and second, the total amount of clay that all experts (including $~$H_4$~$) put onto $~$e_3.$~$

In turn, to know that, we need to know how much clay $~$H_4$~$ started with, and what fraction of its clay $~$H_4$~$ put onto $~$e_3.$~$ And similarly, to compute the total clay on $~$e_3,$~$ we need to know how much clay each expert $~$H_k$~$ started with, and what fraction of their clay $~$H_k$~$ put onto $~$e_3.$~$

So Bayes' theorem here would say:

$$~$\mathbb P(H_4 \mid e_3) = \frac{\mathbb P(e_3 \mid H_4) \cdot \mathbb P(H_4)}{\sum_k \mathbb P(e_3 \mid H_k) \cdot \mathbb P(H_k)}$~$$
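Here's a small Python sketch of one round of the clay game, with made-up priors and predictions (none of these particular numbers come from the article). Notice that the last two steps compute exactly the formula above:

```python
# Each expert starts with clay proportional to their prior probability.
clay = {"H1": 0.25, "H2": 0.50, "H3": 0.25}  # assumed priors

# Assumed predictions: each expert divides all their clay among the outcomes.
predictions = {
    "H1": {"e1": 0.8, "e2": 0.1, "e3": 0.1},
    "H2": {"e1": 0.3, "e2": 0.3, "e3": 0.4},
    "H3": {"e1": 0.1, "e2": 0.2, "e3": 0.7},
}

observed = "e3"  # suppose e3 is what we actually observe

# Take away all the clay that wasn't put on the observed outcome:
# each expert keeps P(e3 | H_k) * P(H_k) worth of clay.
clay = {h: clay[h] * predictions[h][observed] for h in clay}

# Our new belief in each expert is their relative share of the remaining clay.
total = sum(clay.values())
posterior = {h: clay[h] / total for h in clay}
print(posterior)  # {'H1': 0.0625, 'H2': 0.5, 'H3': 0.4375}
```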

What are the incentives of this game of clay?

On each round, the experts who gain the most are the experts who put the most clay on the observed $~$e_j,$~$ so if you know for certain that $~$e_3$~$ is about to be observed, your incentive is to put all your clay on $~$e_3.$~$

But putting literally all your clay on $~$e_3$~$ is risky; if $~$e_5$~$ is observed instead, you lose all your clay and are out of the game. Once an expert's amount of clay goes all the way to zero, there's no way for them to recover over any number of future rounds. That hypothesis is done, dead, and removed from the game. ("Falsification," some people call that.) If you're not certain that $~$e_5$~$ is literally impossible, you'd be wiser to put at least a little clay on $~$e_5$~$ instead. That is to say: if your mind puts some probability on $~$e_5,$~$ you'd better put some clay there too!

([bayes_score As it happens], if at the end of the game we score each expert by the logarithm of the amount of clay they have left, then each expert is incentivized to place clay exactly proportionally to their honest probability on each successive round.)
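To see the incentive at work, here's a quick numeric check in Python, with assumed probabilities: an expert whose honest beliefs about the next outcome are `truth` maximizes their expected log-score by placing clay in exactly those proportions, and any distortion scores worse on average.

```python
import math

truth = [0.2, 0.5, 0.3]  # the expert's honest probabilities (assumed numbers)

def expected_log_score(report):
    # Expected log of the clay remaining, averaged over outcomes
    # according to the expert's honest beliefs.
    return sum(p * math.log(r) for p, r in zip(truth, report))

print(expected_log_score([0.2, 0.5, 0.3]))  # -1.0297 (honest report: best)
print(expected_log_score([0.1, 0.7, 0.2]))  # -1.1217 (worse)
print(expected_log_score([0.3, 0.4, 0.3]))  # -1.0601 (worse)
```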

It's an important part of the game that we make the experts put down their clay in advance. If we let the experts put down their clay afterwards, they might be tempted to cheat by putting down all their clay on whichever $~$e_j$~$ had actually been observed. But since we make the experts put down their clay in advance, they have to divide up their clay among the possible outcomes: to give more clay to $~$e_3,$~$ that clay has to be taken away from some other outcome, like $~$e_5.$~$ To put a very high probability on $~$e_3$~$ and gain a lot of relative credibility if $~$e_3$~$ is observed, an expert has to stick their neck out and risk losing a lot of credibility if some other outcome like $~$e_5$~$ happens instead. If we force the experts to make advance predictions, that is!

We can also derive from this game that the question "does evidence $~$e_3$~$ support hypothesis $~$H_4$~$?" depends on how well $~$H_4$~$ predicted $~$e_3$~$ compared to the competition. It's not enough for $~$H_4$~$ to predict $~$e_3$~$ well if every other hypothesis also predicted $~$e_3$~$ well--your amazing new theory of physics gets no points for predicting that the sky is blue. $~$H_k$~$ only goes up in probability when it predicts $~$e_j$~$ better than the alternatives. And that means we have to ask what the alternative hypotheses predicted, even if we think those hypotheses are false.

If you get in a car accident, and don't want to relinquish the hypothesis that you're a great driver, then you can find all sorts of reasons ("the road was slippery! my car freaked out!") why $~$\mathbb P(e \mid GoodDriver)$~$ is not too low. But $~$\mathbb P(e \mid BadDriver)$~$ is also part of the update equation, and the "bad driver" hypothesis better predicts the evidence. Thus, your first impulse, when deciding how to update your beliefs in the face of a car accident, should not be "But my preferred hypothesis allows for this evidence!" It should instead be "Points to the 'bad driver' hypothesis for predicting this evidence better than the alternatives!" (And remember, you're allowed to [updatebeliefsincrementally increase $~$\mathbb P(BadDriver)$~$ a little bit], while still thinking that it's less than 50% probable.)
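To make that last point concrete, here's a tiny Python sketch with assumed numbers (a 10% prior on being a bad driver, and accident rates of 8% for bad drivers versus 2% for good drivers; none of these figures come from the article):

```python
# Assumed prior and likelihoods, purely for illustration.
p_bad, p_good = 0.10, 0.90          # P(BadDriver), P(GoodDriver)
p_acc_bad, p_acc_good = 0.08, 0.02  # P(accident | BadDriver), P(accident | GoodDriver)

# Bayes' rule: posterior probability of being a bad driver after one accident.
posterior_bad = (p_acc_bad * p_bad) / (p_acc_bad * p_bad + p_acc_good * p_good)
print(posterior_bad)  # ~0.31: up from 0.10, but still below 50%
```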

Proof

The proof of Bayes' theorem follows from the definition of conditional probability:

$$~$\mathbb P(X \mid Y) = \frac{\mathbb P(X \wedge Y)}{\mathbb P (Y)}$~$$

And from the [law_marginal_probability law of marginal probability]:

$$~$\mathbb P(Y) = \sum_k \mathbb P(Y \wedge X_k)$~$$

Therefore:

$$~$ \mathbb P(H_i \mid e) = \frac{\mathbb P(H_i \wedge e)}{\mathbb P (e)} \tag{defn. conditional prob.} $~$$

$$~$ \mathbb P(H_i \mid e) = \frac{\mathbb P(e \wedge H_i)}{\sum_k \mathbb P (e \wedge H_k)} \tag {law of marginal prob.} $~$$

$$~$ \mathbb P(H_i \mid e) = \frac{\mathbb P(e \mid H_i) \cdot \mathbb P(H_i)}{\sum_k \mathbb P (e \mid H_k) \cdot \mathbb P(H_k)} \tag {defn. conditional prob.} $~$$

QED.
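Since the proof leans on just those two laws, here's a quick Python check, reusing the coin example's numbers: computing the posterior directly from the joint distribution, via the law of marginal probability and the definition of conditional probability, agrees with the probability form of Bayes' rule.

```python
# Full joint distribution P(flip ∧ H_k) for the coin example above.
joint = {
    ("heads", 1): 0.50 * 0.40, ("tails", 1): 0.50 * 0.40,
    ("heads", 2): 0.70 * 0.35, ("tails", 2): 0.30 * 0.35,
    ("heads", 3): 0.20 * 0.25, ("tails", 3): 0.80 * 0.25,
}

# Law of marginal probability: P(heads) = sum over k of P(heads ∧ H_k).
p_heads = sum(p for (flip, k), p in joint.items() if flip == "heads")

# Definition of conditional probability: P(H_2 | heads) = P(heads ∧ H_2) / P(heads).
posterior_h2 = joint[("heads", 2)] / p_heads
print(posterior_h2)  # 0.4949..., the same answer as before
```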


Comments

Adam King

This experts-with-clay analogy I found EXTREMELY helpful. I appreciate different explanations work for different people, but I really do think this could have come a LOT earlier in the essay.

Dewi Morgan

"you're allowed to increase P(BadDriver) a little bit,"

No, you're really not.

You're only allowed to replace P(BadDriver) with P(BadDriver|HadOneAccident).

If you have a second accident, you replace that in turn with P(BadDriver|HadOneAccident^HadASecondAccident), which if you are rational you might reexamine and update to P(BadDriver|HadTwoAccidents^HadQuiteALotOfNearMissesIfWeAreBeingHonest)

But my point is, when applying each new piece of evidence, you have to remember the conditions that caused you to get your current probability, or you end up with naive Bayes and after seeing a few new bookcases you believe in aliens.