Strictly confused

https://arbital.com/p/strictly_confused

by Eliezer Yudkowsky Feb 21 2016 updated Jul 4 2016

A hypothesis is strictly confused by the raw data if it did much worse at predicting that data than the hypothesis itself expected to do.


[summary: A hypothesis is "strictly confused" by the data if the hypothesis does much worse at predicting the data than it expected to do. If, on average, you expect to assign around 1% likelihood to the exact observation you see, and you actually see something to which you assigned 0.000001% likelihood, you are strictly confused.]

A hypothesis is "strictly confused" by the data if the hypothesis does much worse at predicting the data than it expected to do. If, on average, you expect to assign around 1% likelihood to the exact observation you see, and you actually see something to which you assigned 0.000001% likelihood, you are strictly confused.

%%knows-requisite(Math 2): I.e., letting $~$H$~$ be a hypothesis and $~$e_0$~$ be the data observed from some set $~$E$~$ of possible observations, we say that $~$H$~$ is "strictly confused" when

$$~$ \log \mathbb P(e_0 \mid H) \ll \sum_{e \in E} \mathbb P(e \mid H) \cdot \log \mathbb P(e \mid H)$~$$

%%
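
As a rough numerical restatement of this definition, one can compare the log-probability a hypothesis assigned to the actual observation against the log-probability it expected to assign. The sketch below is added for illustration; the function names and the choice of what counts as "much worse" are arbitrary, since the definition itself only says "much worse".

```python
import math

def expected_log_prob(dist):
    """Expected log-probability the hypothesis assigns to its own data:
    the sum over possible observations e of P(e|H) * log P(e|H)."""
    return sum(p * math.log(p) for p in dist.values() if p > 0)

def strictly_confused(dist, observed, factor=1e6):
    """True if the actual observation received at least `factor` times less
    probability than the hypothesis expected to assign; the choice of
    `factor` is arbitrary."""
    actual = math.log(dist[observed])
    return actual < expected_log_prob(dist) - math.log(factor)
```

For sequences of many independent observations, a dictionary over every possible sequence becomes unmanageably large, so in practice both sides of the comparison are computed analytically, as in the coin examples below.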

Motivation and examples

In Bayesian reasoning, the main reason to reject a hypothesis is that we've found a better hypothesis. Suppose we think a coin is fair, we flip it 100 times, and the coin comes up "HHHHHHH…", that is, all heads. After those 100 flips, the hypothesis "This is a double-headed coin" has a likelihood ratio of $~$2^{100} : 1$~$ favoring it over the "fair coin" hypothesis, and the "double-headed coin" hypothesis isn't more improbable than $~$2^{-100}$~$ a priori.
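
To spell out the arithmetic, a quick computation (a sketch added for illustration; the million-to-one prior is an invented number, not one from the text) shows that even a quite improbable alternative hypothesis is overwhelmed by a $~$2^{100} : 1$~$ likelihood ratio:

```python
from fractions import Fraction

prior_odds = Fraction(1, 10**6)         # illustrative prior: a million to one against "double-headed"
likelihood_ratio = Fraction(2**100, 1)  # P(100 heads | double-headed) / P(100 heads | fair)
posterior_odds = prior_odds * likelihood_ratio
print(float(posterior_odds))            # ~1.3e24 to 1 in favor of "double-headed"
```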

But this relies on the insight that there's a simple / a priori plausible alternative hypothesis that does better. What if the coin is producing TTHHTTHHTTHH and we just never happen to think of 'alternating pairs of tails and heads' as a hypothesis? It's possible to do better by thinking of a better hypothesis, but so far as the 'fair coin' hypothesis sees the world, TTHHTTHH… is no more or less likely than any other possible sequence it could encounter; the first eight coinflips have a probability of $~$2^{-8}$~$ and this would have been true no matter which eight coinflips were observed. After observing 100 coinflips, the fair coin hypothesis will assign them a collective probability of $~$2^{-100},$~$ and in this sense, no sequence of 100 coinflips is any more 'surprising' or 'confusing' than any other from within the perspective of the fair coin hypothesis.

We can't say that we're 'confused' or 'surprised' on seeing a long sequence of coinflips to which we assigned some very low probability on the order of $~$2^{-100} \approx 10^{-30},$~$ because we expected to assign a probability that low.

On the other hand, suppose we think that a coin is biased to produce 90% heads and 10% tails, and we flip it 100 times and get some fair-looking sequence like "THHTTTHTTTTHTHTHHH…" (courtesy of random.org). Then we expected to assign the observed sequence a probability in the range of $~$0.9^{90} \cdot 0.1^{10} \approx 7\cdot 10^{-15},$~$ but we actually saw a sequence we assigned probability around $~$0.9^{50} \cdot 0.1^{50} \approx 5 \cdot 10^{-53}.$~$ We don't need to consider any other hypotheses to realize that we are very confused. We don't need to have invented the concept of a 'fair coin', or know that the 'fair coin' hypothesis would have assigned a much higher likelihood in the region of $~$7 \cdot 10^{-31},$~$ to realize that there's something wrong with the current hypothesis.
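
The figures in the last two paragraphs can be reproduced in a few lines (a sketch added for illustration; it uses the fact that, for independent flips, the expected log-probability of a whole sequence is the number of flips times the per-flip expectation):

```python
import math

def log10_prob_of_sequence(p_heads, n_heads, n_tails):
    """log10 of the probability the hypothesis assigns to one exact sequence."""
    return n_heads * math.log10(p_heads) + n_tails * math.log10(1 - p_heads)

def expected_log10_prob(p_heads, n_flips):
    """Expected log10 probability of an n-flip sequence under the hypothesis."""
    per_flip = p_heads * math.log10(p_heads) + (1 - p_heads) * math.log10(1 - p_heads)
    return n_flips * per_flip

# Fair coin: every 100-flip sequence gets 2^-100, so actual equals expected
# and the fair coin is never strictly confused by coinflip data.
print(expected_log10_prob(0.5, 100))         # ~ -30.1
print(log10_prob_of_sequence(0.5, 50, 50))   # ~ -30.1

# 90%-heads coin looking at a roughly 50/50 sequence: strictly confused.
print(expected_log10_prob(0.9, 100))         # ~ -14.1  (about 7 * 10^-15)
print(log10_prob_of_sequence(0.9, 50, 50))   # ~ -52.3  (about 5 * 10^-53)
```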

In the case of the supposed fair coin that produces HHHHHHH, we only do poorly relative to a better hypothesis 'all heads' that makes a superior prediction. In the case of the supposed 90%-heads coin that produces a random-looking sequence, we do worse than we expected to do from inside the 90%-heads hypothesis, so we are doing poorly in an absolute, non-relative sense.

Being strictly confused is a sign that tells us to look for some alternative hypothesis in advance of our having any idea whatsoever what that alternative hypothesis might be.

Distinction from frequentist p-values

The classical frequentist test for rejecting the null hypothesis involves considering the probability assigned to particular 'obvious'-seeming partitions of the data, and asking if we ended up inside a low-probability partition.

Suppose you think some coin is fair, and you flip the coin 100 times and see a random-looking sequence "THHTTTHTT…"

Someone comes along and says, "You know, this result is very surprising, given your 'fair coin' theory. You really didn't expect that to happen."

"How so?" you reply.

They say, "Well, among all sequences of 100 coinflips, only 1 in 16 such sequences starts with a string like THHT TTHTT, a palindromic quartet followed by a palindromic quintet. You confidently predicted that this had a 15/16 chance of not happening, and then you were surprised."

"Okay, look," you reply, "if you'd written down that particular prediction in advance and not a lot of others, I might be interested. Like, if I'd already thought that way of partitioning the data — namely, 'palindrome quartet followed by palindrome quintet' vs. 'not palindrome quartet followed by palindrome quintet' — was a specially interesting and distinguished one, I might notice that I'd assigned the second partition 15/16 probability and then it failed to actually happen. As it is, it seems like you're really reaching."

We can think of the frequentist tests for rejecting the fair-coin hypothesis as a small set of 'interesting partitions' that were written down in advance, which are supposed to have low probability given the fair coin. For example, if a coin produces HHHHH HTHHH HHTHH, a frequentist says, "Partitioning by number of heads, the fair coin hypothesis says that on 15 flips we should get between 3 and 12 heads, inclusive, with a probability of 99.3%. You are therefore surprised because this event you assigned 99.3% probability failed to happen. And yes, we're just checking the number of heads and a few other obvious things, not for palindromic quartets followed by palindromic quintets."
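
For the record, the partition probability in this example can be recomputed directly (a sketch added for illustration): under the fair-coin hypothesis the number of heads in 15 flips is Binomial(15, 1/2), and the observed sequence has 13 heads, which falls outside the 3-to-12 band.

```python
from math import comb

# Probability, under the fair-coin hypothesis, of between 3 and 12 heads
# (inclusive) in 15 flips.
p_band = sum(comb(15, k) for k in range(3, 13)) / 2**15
print(round(p_band, 4))                  # 0.9926, i.e. about 99.3%

# The observed sequence has 13 heads, outside that band.
print("HHHHH HTHHH HHTHH".count("H"))    # 13
```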

Part of the point of being a Bayesian, however, is that we try to reason only from the data we actually observed, rather than putting that data into particular partitions and reasoning about those partitions. The partitioning process introduces potential subjectivity, especially in an academic setting fraught with powerful incentives to produce 'statistically significant' data - the equivalent of somebody insisting that palindromic quartets and quintets are special, or that counting heads isn't special.

E.g., if we flip a coin six times and get HHHHHT, this is "statistically significant, p < 0.05" if the researcher decided to flip coins until they got at least one T and then stop, in which case a fair coin has only a 1/32 probability of requiring six or more flips to produce a T. If, on the other hand, the researcher decided to flip the coin six times and then count the number of tails, the probability of getting 1 or fewer T in six flips is 7/64, which is not 'statistically significant'.
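
Both p-values follow from simple counting (a sketch added for illustration): under the "flip until the first tail" design the relevant event is "the first five flips are all heads", while under the "flip exactly six times" design it is "one or fewer tails in six flips".

```python
from fractions import Fraction
from math import comb

# Design 1: flip until the first T.  Needing six or more flips means the
# first five flips were all heads.
print(Fraction(1, 2) ** 5)                      # 1/32  -> "significant" at p < 0.05

# Design 2: flip exactly six times and count tails.  One or fewer tails:
print(Fraction(comb(6, 0) + comb(6, 1), 2**6))  # 7/64  -> not significant
```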

The Bayesian says, "If I use the Rule of Succession to denote the hypothesis that the coin has an unknown bias between 0 and 1, then the sequence HHHHHT is assigned 1/42 probability by the Rule of Succession and 1/64 probability by 'fair coin', so this is evidence with a likelihood ratio of ~ 1.5 : 1 favoring the hypothesis that the coin is biased - not enough to overcome any significant prior improbability."

The Bayesian arrives at this judgment by only considering the particular, exact data that was observed, and not any larger partitions of data. To compute the probability flow between two hypotheses $~$H_1$~$ and $~$H_2$~$ we only need to know the likelihoods of our exact observation given those two hypotheses, not the likelihoods the hypotheses assign to any partitions into which that observation can be put, etcetera.

Similarly, the Bayesian looks at the sequence HHHHH HTHHH HHTHH and says: this specific, exact data that we observed gives us a likelihood ratio of (1/1680 : 1/32768) ~ (19.5 : 1) favoring "The coin has an unknown bias between 0 and 1" over "The coin is fair". With that already said, the Bayesian doesn't see any need to talk about the total probability of the fair coin hypothesis producing data inside a partition of similar results that could have been observed but weren't.
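
Both exact-sequence likelihoods can be obtained by chaining the Rule of Succession flip by flip, or equivalently from the closed form $~$\frac{h! \, t!}{(h+t+1)!}$~$ for a coin with a uniformly distributed unknown bias, $~$h$~$ observed heads, and $~$t$~$ observed tails. The sketch below is added for illustration; it reproduces the 1/42, 1/1680, ~1.5 : 1, and ~19.5 : 1 figures quoted above.

```python
from fractions import Fraction

def rule_of_succession_prob(sequence):
    """Probability of an exact H/T sequence under 'the coin has an unknown
    bias, uniform between 0 and 1', built up flip by flip using the Rule of
    Succession; equals h! * t! / (h + t + 1)! for h heads and t tails."""
    p, heads, tails = Fraction(1), 0, 0
    for flip in sequence:
        if flip == "H":
            p *= Fraction(heads + 1, heads + tails + 2)
            heads += 1
        else:
            p *= Fraction(tails + 1, heads + tails + 2)
            tails += 1
    return p

def fair_coin_prob(sequence):
    return Fraction(1, 2 ** len(sequence))

# The second entry is the 15-flip sequence from the text, spaces removed.
for seq in ["HHHHHT", "HHHHHHTHHHHHTHH"]:
    ros, fair = rule_of_succession_prob(seq), fair_coin_prob(seq)
    print(seq, ros, fair, float(ros / fair))
# HHHHHT:           1/42   vs 1/64    -> likelihood ratio ~1.5 : 1
# HHHHHHTHHHHHTHH:  1/1680 vs 1/32768 -> likelihood ratio ~19.5 : 1
```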

But even though Bayesians usually try to avoid thinking in terms of rejecting a null hypothesis using partitions, saying "I'm strictly confused!" gives a Bayesian a way of saying "Well, I know something's wrong…" that doesn't require already having the insight to propose a better alternative, or even the insight to realize that some particular partitioning of the data is worth special attention.


Comments

Leon D

  1. I propose that this concept be called "unexpected surprise" rather than "strictly confused".
  2. The section on "Distinction from frequentist p-values" is, I think, both technically incorrect and a bit uncharitable.

    • It's technically incorrect because the following isn't true:

      The classical frequentist test for rejecting the null hypothesis involves considering the probability assigned to particular 'obvious'-seeming partitions of the data, and asking if we ended up inside a low-probability partition.

      Actually, the classical frequentist test involves specifying an obvious-seeming measure of surprise $~$t(d)$~$, and seeing whether $~$t$~$ is higher than expected on $~$H$~$. This is even more arbitrary than the above.

    • On the other hand, it's uncharitable because it's widely acknowledged one should try to choose $~$t$~$ to be sufficient, which is exactly the condition that the partition induced by $~$t$~$ is "compatible" with $~$\Pr(d \mid H)$~$ for different $~$H$~$, in the sense that $$~$\Pr(H \mid d) = \Pr(H \mid t(d))$~$$ for all the considered $~$H$~$.

      Clearly $~$s$~$ is sufficient in this sense. But there might be simpler functions of $~$d$~$ that do the job too ("minimal sufficient statistics").

      Note that $~$t$~$ being sufficient doesn't make it non-arbitrary, as it may not be a monotone function of $~$s$~$.

  3. Finally, I think that this concept is clearly "extra-Bayesian", in the sense that it's about non-probabilistic ("Knightian") uncertainty over $~$H$~$, and one is considering probabilities attached to unobserved $~$d$~$ (i.e., not conditioning on the observed $~$d$~$).

    I don't think being "extra-Bayesian" in this sense is problematic. But I think it should be owned-up to.

    Actually, "unexpected surprise" reveals a nice connection between Bayesian and sampling-based uncertainty intervals:

    • To get a (HPD) credible interval, exclude those $~$H$~$ that are relatively surprised by the observed $~$d$~$ (or which are a priori surprising).
    • To get a (nice) confidence interval, exclude those $~$H$~$ that are "unexpectedly surprised" by $~$d$~$.

Javier Ivona

In the paragraph where the Bayesian invokes the Rule of Succession, the page says the sequence HHHHHT "is assigned 1/42 probability by the Rule of Succession". Where does this number come from? They don't explain. I do understand the part about that same sequence being assigned 1/64 by the fair coin hypothesis, but the part about the Rule of Succession isn't so clear to me.

The second example is also confusing to me: the part that says the sequence HHHHH HTHHH HHTHH gives the Bayesian a 19.5 : 1 likelihood ratio favoring the coin being biased vs. it being fair.