The log-odds form of Bayes' rule says that strength of belief and strength of evidence can both be measured in bits. These evidence-bits can also be used to measure a quantity called "Bayesian surprise", which yields yet another intuition for understanding Bayes' rule.
Roughly speaking, we can measure how surprised a hypothesis $H$ was by the evidence $e$ by measuring how much probability $H$ put on $e$. If $H$ put 100% of its probability mass on $e$, then $e$ is completely unsurprising (to $H$). If $H$ put 0% of its probability mass on $e$, then $e$ is as surprising as possible. Any measure of the probability that $H$ assigned to $e$ that obeys this property is worthy of the label "surprise." Bayesian surprise is $-\log_2 \mathbb P(e \mid H)$, which is a quantity that obeys these intuitive constraints and has some other interesting features.
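As a quick illustration, here is a minimal Python sketch of this measure (the function name `surprise_bits` is invented for the example):

```python
import math

def surprise_bits(p_e_given_h):
    """Bayesian surprise in bits: -log2 of the probability
    that the hypothesis assigned to the observed evidence."""
    if p_e_given_h == 0:
        return math.inf  # the hypothesis said e was impossible: maximally surprising
    return -math.log2(p_e_given_h)

print(surprise_bits(1.0))  # 0.0 -- a certain prediction is completely unsurprising
print(surprise_bits(0.5))  # 1.0 -- one bit of surprise
print(surprise_bits(0.0))  # inf -- as surprising as possible
```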
Consider again the Blue oysters problem, with the hypotheses $H$ and $\lnot H$, which say "the oyster will contain a pearl" and "no it won't", respectively. To keep the numbers easy, let's say we draw an oyster from a third bay, where $\frac{1}{8}$ of pearl-carrying oysters are blue and $\frac{1}{4}$ of empty oysters are blue.
Imagine what happens when the oyster is blue. $H$ predicted blueness with $\frac{1}{8}$ of its probability mass, while $\lnot H$ predicted blueness with $\frac{1}{4}$ of its probability mass. Thus, $\lnot H$ did better than $H$, and goes up in probability. Previously, we've been combining both $\frac{1}{8}$ and $\frac{1}{4}$ into unified likelihood ratios, like $\left(\frac{1}{8} : \frac{1}{4}\right) = (1 : 2)$, which says that the 'blue' observation carries 1 bit of evidence against $H$. However, we can also take the logs first, and combine second.
Because $H$ assigned only an eighth of its probability mass to the 'blue' observation, and because a Bayesian update works by eliminating incorrect probability mass, we have to adjust our belief in $H$ by $\log_2\left(\frac{1}{8}\right) = -3$ bits, away from $H$. (Each negative bit means "throw away half of $H$'s probability mass," and we have to do that 3 times in order to remove the $\frac{7}{8}$ of probability that $H$ failed to assign to 'blue'.)
Similarly, because $\lnot H$ assigned only a quarter of its probability mass to the 'blue' observation, we have to adjust our belief in $\lnot H$ by $\log_2\left(\frac{1}{4}\right) = -2$ bits, away from $\lnot H$.
Thus, when the 'blue' observation comes in, we move our belief (measured in bits) 3 notches away from $H$, and then two notches back towards $H$. On net, our belief shifts 1 notch away from $H$.
$H$ assigned $\frac{1}{8}$ of its probability mass to blueness, so it emits $-\log_2\left(\frac{1}{8}\right) = 3$ bits of surprise pushing away from $H$. $\lnot H$ assigned $\frac{1}{4}$ of its probability mass to blueness, so it emits $-\log_2\left(\frac{1}{4}\right) = 2$ bits of surprise pushing away from $\lnot H$ (and towards $H$). Thus, belief in $H$ moves 1 bit towards $\lnot H$, on net.
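A quick numeric check, in plain Python (illustrative only), confirms that taking the logs first and combining second gives the same net shift as the unified likelihood ratio:

```python
import math

p_blue_given_H    = 1 / 8  # H: "the oyster will contain a pearl"
p_blue_given_notH = 1 / 4  # not-H: "no it won't"

# Combine first: log of the likelihood ratio (1/8 : 1/4) = (1 : 2).
ratio_bits = math.log2(p_blue_given_H / p_blue_given_notH)  # -1.0

# Take logs first, combine second: dock bits from each hypothesis separately.
bits_docked_from_H    = math.log2(p_blue_given_H)       # -3.0: 3 notches away from H
bits_docked_from_notH = math.log2(p_blue_given_notH)    # -2.0: 2 notches back towards H
net_shift = bits_docked_from_H - bits_docked_from_notH  # -1.0: 1 notch away from H

assert ratio_bits == net_shift
```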
If instead $H$ predicted blue with probability 4% (penalty $\log_2(0.04) \approx -4.64$ bits) and $\lnot H$ predicted blue with probability 8% (penalty $\log_2(0.08) \approx -3.64$ bits), then we would have shifted a bit over 4.6 notches away from $H$ and a bit over 3.6 notches back towards $H$, but we would have shifted the same number of notches on net. This is why only the difference between the number of bits docked from $H$ and the number of bits docked from $\lnot H$ matters.
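The same check works for these numbers (again, just illustrative Python):

```python
import math

penalty_H    = math.log2(0.04)  # ~ -4.64 bits docked from H
penalty_notH = math.log2(0.08)  # ~ -3.64 bits docked from not-H

print(penalty_H - penalty_notH)  # ~ -1.0: the same net one-bit shift away from H
```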
In general, given an observation $e$ and a hypothesis $H$, the number of bits we need to dock from our belief in $H$ is $\log_2(\mathbb P(e \mid H))$, that is, the log of the probability that $H$ assigned to $e$. This quantity is never positive, because the logarithm of $x$ for $x \in [0, 1]$ is in the range $[-\infty, 0]$. If we negate it, we get a non-negative quantity $-\log_2(\mathbb P(e \mid H))$ that relates $H$ to $e$, which is 0 when $H$ was certain that $e$ was going to happen, which is infinite when $H$ was certain that $e$ wasn't going to happen, and which is measured in the same units as evidence and belief. Thus, this quantity is often called "surprise," and intuitively, it measures how surprised the hypothesis $H$ was by $e$ (in bits).
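To tie this back to a full update, here is a short Python consistency check; the prior of $0.2$ is an arbitrary choice made for illustration:

```python
import math

prior_H = 0.2                   # arbitrary prior P(H), for illustration only
p_e_H, p_e_notH = 1 / 8, 1 / 4  # likelihoods from the oyster example

# Direct application of Bayes' rule:
posterior_H = (prior_H * p_e_H) / (prior_H * p_e_H + (1 - prior_H) * p_e_notH)

# Log-odds update: shift the prior log-odds by the difference in surprises.
def log_odds(p):
    return math.log2(p / (1 - p))

surprise_H, surprise_notH = -math.log2(p_e_H), -math.log2(p_e_notH)
new_log_odds = log_odds(prior_H) - (surprise_H - surprise_notH)
posterior_from_bits = 2 ** new_log_odds / (1 + 2 ** new_log_odds)

print(posterior_H, posterior_from_bits)  # both ~ 0.1111 (= 1/9)
```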
There is some correlation between Bayesian surprise and the times when a human would feel surprised (at seeing something that they thought was unlikely), but, of course, the human emotion is quite different. (A human can feel surprised for other reasons than "my hypotheses failed to predict the data," and humans are also great at ignoring evidence instead of feeling surprised.)
Given this definition of Bayesian surprise, we can view Bayes' rule as saying that surprise repels belief. When you make an observation $e$, each hypothesis emits a repulsive "surprise" signal, which shifts your belief. Referring again to the example above, when $H$ predicts the observation you made with $\frac{1}{8}$ of its probability mass, and $\lnot H$ predicts it with $\frac{1}{4}$ of its probability mass, we can imagine $H$ emitting a surprise signal with a strength of 3 bits pushing away from $H$, and $\lnot H$ emitting a surprise signal with a strength of 2 bits pushing away from $\lnot H$. Both signals push the belief in $H$ in different directions, and it ends up 1 bit closer to $\lnot H$ (which emitted the weaker surprise signal).
In other words, whenever you find yourself feeling surprised by something you saw, think of the least surprising explanation for that evidence — and then award that hypothesis a few bits of belief.