Report likelihoods, not p-values

https://arbital.com/p/likelihoods_not_pvalues

by Nate Soares Jul 1 2016 updated Apr 29 2017


[summary: If scientists reported likelihood functions instead of p-values, this could help science avoid p-hacking, publication bias, the decline effect, and other hazards of standard statistical techniques. Furthermore, it could make it easier to combine results from multiple studies and perform meta-analyses, while making statistics intuitively easier to understand. (This is a bold claim, but a claim which is largely supported by Probability theory.)]

This page advocates for a change in the way that statistics is done in standard scientific journals. The key idea is to report likelihood functions instead of p-values, a change that could have many benefits.

(Note: This page is a personal opinion page.)

[toc:]

What's the difference?

The status quo across scientific journals is to test data for "[statistically_significant statistical significance]" using functions such as [p_value p-values]. A p-value is a number calculated from a hypothesis (called the "null hypothesis"), an experiment, a result, and a [-summary_statistic]. For example, if the null hypothesis is "this coin is fair," and the experiment is "flip it 6 times", and the result is HHHHHT, and the summary statistic is "the sequence has at least five H values," then the p-value is 0.11, which means "if the coin were fair, and we did this experiment a lot, then only 11% of the sequences generated would have at least five H values."%%note:This does not mean that the coin is 89% likely to be biased! For example, if the only alternative is that the coin is biased towards tails, then HHHHHT is evidence that it's fair. This is a common source of confusion with p-values.%% If the p-value is lower than an arbitrary threshold (usually $~$p < 0.05$~$) then the result is called "statistically significant" and the null hypothesis is "rejected."
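For concreteness, here is a minimal sketch (in Python, with a hypothetical helper name; not part of the original article) of the p-value calculation in the coin example: the probability, under the null hypothesis "the coin is fair," of a summary statistic at least as extreme as the one observed.

```python
from math import comb

def p_value_at_least_k_heads(n, k, bias=0.5):
    """Probability of seeing at least k heads in n flips of a coin with the given bias."""
    return sum(comb(n, j) * bias**j * (1 - bias)**(n - j) for j in range(k, n + 1))

# "At least five heads in six flips of a fair coin":
print(p_value_at_least_k_heads(6, 5))  # 7/64 ~= 0.109, the 0.11 quoted above
```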

This page advocates that scientific articles should report likelihood functions instead of p-values. A likelihood function for a piece of evidence $~$e$~$ is a function $~$\mathcal L$~$ which says, for each hypothesis $~$H$~$ in some set of hypotheses, the probability that $~$H$~$ assigns to $~$e$~$, written [51n $~$\mathcal L_e(H)$~$].%%note: Many authors write $~$\mathcal L(H \mid e)$~$ instead. We think this is confusing, as then $~$\mathcal L(H \mid e) = \mathbb P(e \mid H),$~$ and it's hard enough for students of statistics to keep "probability of $~$H$~$ given $~$e$~$" and "probability of $~$e$~$ given $~$H$~$" straight as it is if the notation isn't swapped around every so often.%% For example, if $~$e$~$ is "this coin, flipped 6 times, generated HHHHHT", and the hypotheses are $~$H_{0.25} =$~$ "the coin only produces heads 25% of the time" and $~$H_{0.5}$~$ = "the coin is fair", then $~$\mathcal L_e(H_{0.25})$~$ $~$=$~$ $~$0.25^5 \cdot 0.75$~$ $~$\approx 0.07\%$~$ and $~$\mathcal L_e(H_{0.5})$~$ $~$=$~$ $~$0.5^6$~$ $~$\approx 1.56\%,$~$ for a likelihood ratio of about $~$21 : 1$~$ in favor of the coin being fair (as opposed to biased 75% towards tails).

In fact, with a single likelihood function, we can report the amount of support $~$e$~$ gives to every hypothesis $~$H_b$~$ of the form "the coin has bias $~$b$~$ towards heads":%note:To learn how this graph was generated, see Bayes' rule: Functional form.%
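A minimal sketch of that function (the grid and helper name here are illustrative, not the article's own plotting code): for $~$e =$~$ HHHHHT, each hypothesis $~$H_b$~$ assigns $~$e$~$ probability $~$b^5 (1 - b)$~$.

```python
def L(b):
    """Likelihood that 'the coin comes up heads with probability b' assigns to HHHHHT."""
    return b**5 * (1 - b)

# The "graph": the likelihood at each bias from 0.00 to 1.00 in steps of 0.01.
curve = [(i / 100, L(i / 100)) for i in range(101)]

print(L(0.75) / L(0.5))                   # ~3.8: relative support for H_0.75 over H_0.5
print(L(0.75) / L(0.25))                  # ~81:  relative support for H_0.75 over H_0.25
print(max(curve, key=lambda p: p[1])[0])  # the curve peaks near b = 5/6 ~= 0.83
```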

Note that this likelihood function is not telling us the probability that the coin is actually biased; it is only telling us how much the evidence supports each hypothesis. For example, this graph says that HHHHHT provides about 3.8 times as much evidence for $~$H_{0.75}$~$ over $~$H_{0.5}$~$, and about 81 times as much evidence for $~$H_{0.75}$~$ over $~$H_{0.25}.$~$

Note also that the likelihood function doesn't necessarily contain the right hypothesis; for example, the function above shows the support of $~$e$~$ for every possible bias on the coin, but it doesn't consider hypotheses like "the coin alternates between H and T". Likelihood functions, like p-values, are essentially a mere summary of the raw data — there is no substitute for the raw data when it comes to allowing people to test hypotheses that the original researchers did not consider. (In other words, even if you report likelihoods instead of p-values, it's still virtuous to share your raw data.)

Where p-values let you measure (roughly) how well the data supports a single "null hypothesis", with an arbitrary 0.05 "not well enough" cutoff, the likelihood function shows the support of the evidence for lots and lots of different hypotheses at once, without any need for an arbitrary cutoff.

Why report likelihoods instead of p-values?

1. Likelihood functions are less arbitrary than p-values. To report a likelihood function, all you have to do is pick which hypothesis class to generate the likelihood function for. That's your only degree of freedom. This introduces one source of arbitrariness, and if someone wants to check some other hypothesis they still need access to the raw data, but it is better than the p-value case, where you only report a number for a single "null" hypothesis.

Furthermore, in the p-value case, you have to pick not only a null hypothesis but also an experiment and a summary statistic, and these degrees of freedom can have a huge impact on the final report. These extra degrees of freedom are both unnecessary (to carry out a probabilistic update, all you need are your own personal beliefs and a likelihood function) and exploitable, and empirically, they're actively harming scientific research.

2. Reporting likelihoods would solve p-hacking. If you're using p-values, then you can game the statistics via your choice of experiment and summary statistics. In the example with the coin above, if you say your experiment and summary statistic are "flip the coin 6 times and count the number of heads" then the p-value of HHHHHT with respect to $~$H_{0.5}$~$ is 0.11, whereas if you say your experiment and summary statistic are "flip the coin until it comes up tails and count the number of heads" then the p-value of HHHHHT with respect to $~$H_{0.5}$~$ is 0.03, which is "significant." This is called "p-hacking", and it's a serious problem in modern science.
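To make the contrast concrete, here is a sketch (not from the article) of both calculations on the same data, plus a likelihood ratio, which doesn't care which experiment was intended:

```python
from math import comb

# Experiment 1: "flip 6 times"; summary statistic: "at least 5 heads".
p_binomial = sum(comb(6, k) * 0.5**6 for k in range(5, 7))

# Experiment 2: "flip until the first tail"; summary statistic: "at least
# 5 heads before the first tail", i.e. the first five flips are all heads.
p_stopping = 0.5**5

print(p_binomial)  # ~0.109 -> "not significant"
print(p_stopping)  # ~0.031 -> "significant" at p < 0.05

# The likelihood ratio for H_0.75 vs H_0.5 depends only on the data HHHHHT:
print((0.75**5 * 0.25) / 0.5**6)  # ~3.8 under either experiment
```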

In a likelihood function, the amount of support a piece of evidence gives to a hypothesis does not depend on which experiment the researcher had in mind. Likelihood functions depend only on the data you actually saw and the hypotheses you chose to report. The only way to cheat a likelihood function is to lie about the data you collected, or to refuse to report likelihoods for a particular hypothesis.

If your paper fails to report likelihoods for some obvious hypotheses, then (a) that's precisely analogous to you choosing the wrong null hypothesis to consider; (b) it's just as easily noticeable as when your paper considers the wrong null hypothesis; and (c) it can be easily rectified given access to the raw data. By contrast, p-hacking can be subtle and hard to detect after the fact.

3. Likelihood functions are very difficult to game. There is no analog of p-hacking for likelihood functions. This is a theorem of probability theory known as [-conservation_of_expected_evidence], which says that likelihood functions can't be gamed unless you're falsifying or omitting data (or screwing up the likelihood calculations).%note:Disclaimer: the theorem says likelihood functions can't be gamed, but we still shouldn't underestimate the guile of dishonest researchers struggling to make their results look important. Likelihood functions have not been put through the gauntlet of real scientific practice; p-values have. That said, when p-values were put through that gauntlet, they failed in a spectacular fashion. When rebuilding, it's probably better to start from foundations that provably cannot be gamed.%
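As an illustration of that theorem (a simulation sketch with made-up numbers, not part of the original article): sample a hypothesis from a 50/50 prior, generate data from it using a deliberately adversarial stopping rule, and compute the posterior from the likelihoods. The average posterior comes back to the prior, no matter the stopping rule.

```python
import random

def run_trial(rng):
    # Sample a hypothesis from the prior: fair coin vs. 75%-heads coin.
    bias = 0.5 if rng.random() < 0.5 else 0.75
    # Adversarial stopping rule: keep flipping (up to 30 flips) until the
    # data "looks biased", i.e. heads outnumber tails by 3.
    heads = tails = 0
    while heads - tails < 3 and heads + tails < 30:
        if rng.random() < bias:
            heads += 1
        else:
            tails += 1
    # The likelihoods depend only on the flips observed, not on the stopping rule.
    L_fair = 0.5 ** (heads + tails)
    L_biased = 0.75**heads * 0.25**tails
    return L_biased / (L_fair + L_biased)  # posterior P(biased | data) with a 50/50 prior

rng = random.Random(0)
posteriors = [run_trial(rng) for _ in range(100_000)]
print(sum(posteriors) / len(posteriors))  # ~0.5: the stopping rule can't shift the average posterior
```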

4. Likelihood functions would help stop the "vanishing effect sizes" phenomenon. The decline effect occurs when studies which reject a null hypothesis $~$H_0$~$ have effect sizes that get smaller and smaller and smaller over time (the more someone tries to replicate the result). This is usually evidence that there is no actual effect, and that the initial "large effects" were a result of publication bias.

Likelihood functions help avoid the decline effect by treating different effect sizes differently. The likelihood function for coins of different biases shows that the evidence HHHHHT gives a different amount of support to $~$H_{0.52},$~$ $~$H_{0.61}$~$, and $~$H_{0.8}$~$ (which correspond to small, medium, and large effect sizes, respectively). If three different studies find low support for $~$H_{0.5},$~$ and one of them gives all of its support to the large effect, another gives all its support to the medium effect, and the third gives all of its support to the smallest effect, then likelihood functions reveal that something fishy is going on (because they're all peaked in different places).

If instead we only use p-values, and always decide whether or not to "keep" or "reject" the null hypothesis (without specifying how much support goes to different alternatives), then it's hard to notice that the studies are actually contradictory (and that something very fishy is going on). Instead, it's very tempting to exclaim "3 out of 3 studies reject $~$H_{0.5}$~$!" and move on.
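Here is a sketch of what "peaked in different places" looks like numerically (the study counts are made up for illustration): three studies that each flip the same coin 100 times, with likelihood functions peaking at a large, medium, and small effect respectively. The best single bias explains the pooled data thousands of times worse than letting each study keep its own peak, and that gap is the fishiness signal.

```python
from math import log, exp

studies = [(80, 100), (61, 100), (52, 100)]  # (heads, flips): peaks at 0.80, 0.61, 0.52

def log_L(heads, flips, b):
    """Log-likelihood that a coin with bias b assigns to the observed flips."""
    return heads * log(b) + (flips - heads) * log(1 - b)

# The single bias best supported by all the data pooled together...
b_common = sum(h for h, n in studies) / sum(n for h, n in studies)
at_common_peak = sum(log_L(h, n, b_common) for h, n in studies)

# ...versus each study evaluated at its own peak.
at_own_peaks = sum(log_L(h, n, h / n) for h, n in studies)

# If the studies were measuring the same coin, this ratio would be close to 1.
print(exp(at_common_peak - at_own_peaks))  # ~1e-4: no single bias fits all three studies
```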

5. Likelihood functions would help stop publication bias. When using p-values, if the data yields a p-value of 0.11 using a null hypothesis $~$H_0$~$, the study is considered "insignificant," and many journals have a strong bias towards positive results. When reporting likelihood functions, there is no arbitrary "significance" threshold: a study that reports a relative likelihood of $~$21 : 1$~$ in favor of $~$H_a$~$ over $~$H_0$~$ carries exactly the same strength of evidence as a study that reports $~$21 : 1$~$ odds against $~$H_a$~$ relative to $~$H_0.$~$ It's all just evidence, and it can all be added to the corpus.

6. Likelihood functions make it trivially easy to combine studies. When combining studies that used p-values, researchers have to perform complex meta-analyses with dozens of parameters to tune, and they often find exactly what they were expecting to find. By contrast, the way you combine multiple studies that reported likelihood functions is… (drumroll) …you just multiply the likelihood functions together. If study A reports that $~$H_{0.75}$~$ was favored over $~$H_{0.5}$~$ with a relative likelihood of $~$3.8 : 1$~$, and study B reports that $~$H_{0.75}$~$ was favored over $~$H_{0.5}$~$ at $~$5 : 1$~$, then the combined likelihood functions of both studies favor $~$H_{0.75}$~$ over $~$H_{0.5}$~$ at $~$(3.8 \cdot 5) : 1$~$ $~$=$~$ $~$19 : 1.$~$

Want to combine a hundred studies on the same subject? Multiply a hundred functions together. Done. No parameter tuning, no degrees of freedom through which bias can be introduced — just multiply.
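A sketch of what "just multiply" means in practice (study B's data here is hypothetical, chosen only to give it a likelihood function of its own):

```python
def study_A(b):
    return b**5 * (1 - b)       # study A observed HHHHHT

def study_B(b):
    return b**7 * (1 - b)**2    # hypothetical study B observed 7 heads and 2 tails

def combined(b):
    return study_A(b) * study_B(b)  # the combined likelihood function

print(study_A(0.75) / study_A(0.5))    # ~3.8 : 1 in favor of H_0.75 over H_0.5
print(study_B(0.75) / study_B(0.5))    # ~4.3 : 1
print(combined(0.75) / combined(0.5))  # ~16 : 1 -- the ratios simply multiply
```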

7. Likelihood functions make it obvious when something has gone wrong. If, when you multiply all the likelihood functions together, all hypotheses have extraordinarily low likelihoods, then something has gone wrong. Either a mistake has been made somewhere, or fraud has been committed, or the true hypothesis wasn't in the hypothesis class you're considering.

The actual hypothesis that explains all the data will have decently high likelihood across all the data. If none of the hypotheses fit that description, then either you aren't considering the right hypothesis yet, or some of the studies went wrong. (Try looking for one study that has a likelihood function very very different from all the other studies, and investigate that one.)

Likelihood functions won't do your science for you — you still have to generate good hypotheses, and be honest in your data reporting — but they do make it obvious when something went wrong. (Specifically, each hypothesis can tell you how low its likelihood is expected to be on the data, and if every hypothesis has a likelihood far lower than expected, then something's fishy.)


8. A scientific community using likelihood functions would produce scientific research that's easier to use. If everyone's reporting likelihood functions, then all you personally need to do in order to figure out what to believe is take your own personal (subjective) prior probabilities and multiply them by all the likelihood functions in order to get your own personal (subjective) posterior probabilities.

For example, let's say you personally think the coin is probably fair, with $~$10 : 1$~$ odds of being fair as opposed to 75% biased in favor of heads. Now let's say that study A reports a likelihood function which favors $~$H_{0.75}$~$ over $~$H_{0.5}$~$ with a likelihood ratio of $~$3.8 : 1,$~$ and study B reports a $~$5 : 1$~$ likelihood ratio in the same direction. Multiplying all these together, your personal posterior beliefs should be $~$19 : 10$~$ in favor of $~$H_{0.75}$~$ over $~$H_{0.5}$~$. This is simply Bayes' rule. Reporting likelihoods instead of p-values lets science remain objective, while allowing everyone to find their own personal posterior probabilities via a simple application of Bayes' theorem.
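In odds form (a sketch of the arithmetic above, not new methodology), that update is a single line of multiplication:

```python
prior_odds = 1 / 10       # H_0.75 : H_0.5 -- you start out 10 : 1 in favor of the coin being fair
study_A_ratio = 3.8       # likelihood ratio favoring H_0.75 over H_0.5
study_B_ratio = 5.0

posterior_odds = prior_odds * study_A_ratio * study_B_ratio
print(posterior_odds)                         # 1.9, i.e. 19 : 10 in favor of H_0.75
print(posterior_odds / (1 + posterior_odds))  # ~0.66 as a probability, if these were the only two hypotheses
```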

Why should we think this would work?

This may all sound too good to be true. Can one simple change really solve that many problems in modern science?

First of all, you can be assured that reporting likelihoods instead of p-values would not "solve" all the problems above, and it would surely not solve all problems with modern experimental science. Open access to raw data, preregistration of studies, a culture that rewards replication, and many other ideas are also crucial ingredients to a scientific community that zeroes in on truth.

However, reporting likelihoods would help solve lots of different problems in modern experimental science. This may come as a surprise. Aren't likelihood functions just one more statistical technique, just another tool for the toolbox? Why should we think that one single tool can solve that many problems?

The reason lies in Probability theory. According to the axioms of probability theory, there is only one good way to account for evidence when updating your beliefs, and that way is via likelihood functions. Any other method is subject to inconsistencies and pathologies, as per the [probability_coherence_theorems coherence theorems of probability theory].

If you're manipulating equations like $~$2 + 2 = 4,$~$ and you're using methods that may or may not let you throw in an extra 3 on the right hand side (depending on the arithmetician's state of mind), then it's no surprise that you'll occasionally get yourself into trouble and deduce that $~$2 + 2 = 7.$~$ The laws of arithmetic show that there is only one correct set of tools for manipulating equations if you want to avoid inconsistency.

Similarly, the laws of probability theory show that there is only one correct set of tools for manipulating uncertainty if you want to avoid inconsistency. According to those rules, the right way to represent evidence is through likelihood functions.

These laws (and a solid understanding of them) are younger than the experimental science community, and the statistical tools of that community predate a modern understanding of probability theory. Thus, it makes a lot of sense that the existing literature uses different tools. However, now that humanity does possess a solid understanding of probability theory, it should come as no surprise that many diverse pathologies in statistics can be cleaned up by switching to a policy of reporting likelihoods instead of p-values.

What are the drawbacks?

The main drawback is inertia. Experimental science today reports p-values almost entirely across the board. Modern statistical toolsets have built-in support for p-values (and other related statistical tools) but very little support for reporting likelihood functions. Experimental scientists are trained mainly in [-frequentist_statistics], and thus most are much more familiar with p-value-type tools than likelihood-function-type tools. Making the switch would be painful.

Barring the switching costs, though, making the switch could well be a strict improvement over modern techniques, and would help solve some of the biggest problems facing science today.

See also the Likelihoods not p-values FAQ and Likelihood functions, p-values, and the replication crisis.


Comments

Eric Rogstad

Likelihood functions help avoid the decline effect by treating different effect sizes differently. The likelihood function for coins of different biases shows that the evidence HHHHHT gives a different amount of support to $~$H_{0.55},$~$ $~$H_{0.6}$~$, and $~$H_{0.8}.$~$ If three different studies find low support for $~$H_{0.5},$~$ and one of them gives all of its support to the large effect, another gives all its support to the medium effect, and the third gives all of its support to the smallest effect, then likelihood functions reveal that something fishy is going on (because they're all peaked in different places).

Do the different biases of coin correspond to different effect sizes? (E.g. large effect corresponds to H0.8, medium to H0.6, small effect corresponds to H0.55)