Why It Is Problematic to Calculate Probabilities of Findings Given Range Null Hypotheses

An important problem with null hypothesis significance testing, as it is normally performed, is that it is uninformative to reject a point null hypothesis [1]. A way around this problem is to use range null hypotheses [2]. But the use of range null hypotheses also is problematic. Aside from the usual issues of whether null hypothesis significance tests can be justified at all, there is an issue that is specific to range null hypotheses. It is not straightforward how to calculate the probability of the data given a range null hypothesis. The traditional way is to use the single point that maximizes the obtained p-value. The Bayesian alternative is to propose a prior probability distribution and integrate across it. Because frequentists and Bayesians disagree about a variety of issues, especially those pertaining to whether it is permissible to assign probabilities to hypotheses, what gets lost in the shuffle is that the two camps actually come to different answers for the probability of the data given a range null hypothesis. Because the probability of the data given the hypothesis is, for both camps, a precursor to drawing conclusions about hypotheses, the fact that the camps obtain different values for this probability is crucial but seldom acknowledged. The goal of the present article is to bring out the problem in a manner accessible to researchers without strong mathematical or statistical backgrounds.


Introduction
Frequentists and Bayesians disagree about how to handle the inverse inference issue. How does a researcher traverse a pathway from the calculated probability of the finding given a hypothesis (such as the null hypothesis) to the probability of the hypothesis given the finding? Bayesians argue that direct inverse inferences are invalid, thereby similarly invalidating the null hypothesis significance testing procedure. In contrast, frequentists criticize Bayesians for having to make unjustified assumptions about prior probabilities of null hypotheses to allow the Bayesian machinery to run. In marked contrast to this issue, there is little literature on what would seem to be an issue that precedes the inverse inference issue; namely, how does one calculate the probability of the finding given a hypothesis in the first place? It might seem that this is straightforward, and it is straightforward in the context of point hypotheses. But it is not straightforward in the context of range hypotheses, which provide the present focus. Put simply, the question of interest is: Given a range hypothesis, how can one calculate the probability of the finding given it?
To understand why we should care about range null hypotheses at all, it is necessary to consider, in detail, that which is so well known that few consider it carefully. First, there is a preliminary issue about whether hypotheses can have probabilities at all. Second, there is an additional preliminary issue about precisely the logic by which frequentists decide between competing hypotheses. My immediate goal in these sections is not to take sides but rather to bring out the disagreements. My more general goal is to show that both sides can be faulted not just on the difficult issue of inverse inference, but even on the more basic issue of calculating the probability of the finding given a range hypothesis. If a calculation this basic already is problematic, the inverse inference issue may be even more intractable.

Probabilities of Hypotheses
Bayesians and frequentists disagree with each other with respect to how they draw conclusions about hypotheses. Bayesians are willing to assume that hypotheses have probabilities anywhere between 0 and 1, whereas the furthest frequentists are willing to go is to allow that hypotheses can have probabilities of 0 (hypothesis is false) or 1 (hypothesis is true), but nothing between these extreme values. And of course, most frequentists will freely admit that they do not know whether to assign a probability of 0 or 1 to a particular hypothesis. This admission causes most frequentists to focus on procedures for controlling the error rate, rather than assigning values to particular hypotheses [3] [4]. From a Bayesian perspective, frequentists might be considered to be too "conservative" but from a frequentist perspective, Bayesians can be considered to be too "liberal." Graduate training in the sciences tends to stress scientific conservativism; scientists should demand reasonably impressive evidence before being willing to draw a conclusion. From this perspective, the fact that frequentists are more conservative than Bayesians easily can be taken as evidence that frequentists are more "scientific" than Bayesians. When contrasting the two perspectives against each other at the level of drawing conclusions about hypotheses, it is relatively easy to make this argument. It clearly is more conservative to admit to not knowing how to assign a probability to a hypothesis than to insist that one does know how to do this. Saying "I do not know" is more conservative than assigning numbers to hypotheses.
In the other direction, however, Bayesians could claim that frequentists are too liberal because they use 0.05 as the alpha level for deciding statistical significance. In general, Bayesians claim to be more conservative than frequentists because they insist that the probability of the favored hypothesis given the finding, when they get to that point, be at least 8 or even 10 times greater than the probability of the hypothesis that is not favored, before believing the favored hypothesis [5]. It is possible to suggest that the two groups are talking about apples and oranges because of the difficulty of comparing p-values against posterior ratios.
Thus, the two sides disagree on whether the notion of a probability of a hypothesis (other than zero or one) makes sense at all, whether it is possible to calculate such an entity even if it did make sense, on the importance of the probabilities of findings given hypotheses, and on whether liberalism or conservativism is about ratios of hypothesis probabilities or about probabilities of findings given hypotheses. But the main issue of interest here is yet to come, and it falls out of the issue of the plausibility of point null hypotheses.

Are Null Hypotheses Plausible?
Over many decades, there has accumulated much criticism pertaining to the null hypothesis significance testing procedure (NHSTP) [1] [6]-[23]. The criticism that is of particular present relevance is that because the null hypothesis specifies an exact value when there is an infinitude of possible values, the null hypothesis almost certainly is not true [1]. Therefore, the NHSTP is a pointless exercise because it results in the rejection of a null hypothesis that is not plausible anyhow.
Although there have been a couple of attempts to argue that the null hypothesis is plausible under particular circumstances [24] [25], this is not the main defense. The main defense is that one does not have to settle for using a point null hypothesis. It is possible to let the null hypothesis specify a range of values, as is the case when one performs a one-tailed test, and so the rejection of the null hypothesis is meaningful after all [26] [27] [28].
My goal is to examine this argument carefully to see where it leads. However, it is first necessary to review the syllogisms that come into play in discussions of this sort.

The Syllogisms
Let us commence with the usual logic that accompanies traditional two-tailed significance tests. In such cases, researchers define a point null hypothesis to be contrasted against a range alternative hypothesis. Because the arguments to be developed do not depend on the idiosyncrasies of any particular type of study, let us consider the simplest possible case of coin tosses and whether or not the coin is fair. We might define null and alternative hypotheses as follows, where P(H) refers to the probability of heads.
Case 1: $H_0: P(H) = 0.5$; $H_1: P(H) \neq 0.5$
In the foregoing case, the logic is simple and based on the ability to use a small p-value to reject the null hypothesis.¹
Syllogism 1: Either the null hypothesis or the alternative hypothesis is true {Premise 1}. The null hypothesis is not true {Premise 2}. Therefore, the alternative hypothesis is true {Conclusion}.
Now consider a directional alternative hypothesis.
Case 2: $H_0: P(H) = 0.5$; $H_1: P(H) > 0.5$
Syllogism 2: P(H) > 0.5, P(H) = 0.5, or P(H) < 0.5 {Premise 1}. P(H) = 0.5 is not true {Premise 2}. Therefore, P(H) > 0.5 {Conclusion}.
Syllogism 2 has a rather obvious flaw that stems from the fact that Premise 1 states three possibilities, which is necessitated by the fact that Case 2 leaves open the possibility that P(H) can be greater than 0.5 (alternative hypothesis), equal to 0.5 (null hypothesis), or less than 0.5 (unstated hypothesis). Therefore, rejecting the null hypothesis that the probability of heads is equal to 0.5 does not allow an unambiguous conclusion about whether this probability is less than or greater than 0.5. Put simply, Syllogism 2 is blatantly invalid when based on Case 2. Perhaps it is a recognition of this invalidity that is responsible for some statistical authorities favoring range null hypotheses. As an example, consider Case 3 and Syllogism 3.²
Case 3: $H_0: P(H) \leq 0.5$; $H_1: P(H) > 0.5$
Syllogism 3: P(H) ≤ 0.5 or P(H) > 0.5 {Premise 1}. P(H) ≤ 0.5 is not true {Premise 2}. Therefore, P(H) > 0.5 {Conclusion}.
The combination of Case 3 and Syllogism 3 seems beautiful. It is logically valid and, at the same time, solves the problem that we had earlier of rejecting a non-plausible null hypothesis. Rejecting the null hypothesis in Case 3 is quite informative because doing so also causes half of the possibilities to be rejected, thereby allowing a directional hypothesis to be supported. Therefore, it is worth examining this combination in more detail.

¹ In the interest of full disclosure, I do not believe that a small p-value justifies rejecting the null hypothesis. However, let us accept this premise for the sake of argument.
² Note that Case 2 and Case 3 both would be tested using a one-tailed test according to the traditional null hypothesis significance testing procedure.

Getting the p-Value
The standard way to handle the combination of Case 3 and Syllogism 3 is to use a one-tailed test. For coin tosses, one would use the binomial theorem. Suppose that one has obtained k heads out of N tosses. The one-tailed probability is simply the probability of having obtained k heads out of N tosses, plus the probability of having obtained k + 1 heads out of N tosses, and so on, up to N heads out of N tosses. The binomial theorem is presented below as Equation (1), where p is the population probability of heads:
$$P(k \mid N, p) = \binom{N}{k} p^{k} (1 - p)^{N - k} \tag{1}$$
Suppose that an investigator performed a study that involved N = 20 coin tosses and k = 17 heads. The normal procedure would be to use Equation (1) as follows. Set p at the "fair coin" level of 0.5 (note that this p is not the same p as in p-value), and substitute 20 and 17 for N and k in Equation (1), respectively, but with three more iterations where 18, 19, and 20 are substituted for k. This is performed below:
$$\sum_{k=17}^{20} \binom{20}{k} (0.5)^{k} (0.5)^{20 - k} = \frac{1140 + 190 + 20 + 1}{2^{20}} \approx 0.001$$
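The one-tailed calculation described above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `binomial_pmf` and `upper_tail` are illustrative choices, not part of the original article.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n tosses when P(heads) = p (Equation (1))."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(k: int, n: int, p: float) -> float:
    """One-tailed probability: k heads, plus k + 1 heads, and so on, up to n heads."""
    return sum(binomial_pmf(j, n, p) for j in range(k, n + 1))


if __name__ == "__main__":
    # The four terms for 17, 18, 19, and 20 heads out of 20 tosses of a fair coin.
    for heads in range(17, 21):
        print(heads, binomial_pmf(heads, 20, 0.5))
    # Their sum, rounded to three decimals as in the text.
    print("17 or more heads:", round(upper_tail(17, 20, 0.5), 3))
```

The sum equals 1351/1048576, which rounds to the value of 0.001 used in the rest of the article.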

Simplifying the Problem
Let us commence with a null hypothesis that specifies only two values, rather than dealing with a range of values. Later, we will add more values.

The Example of a Null Hypothesis with Two Values
Suppose that we have a null hypothesis with two values instead of a null hypothesis with a range of values. This is shown in Case 4.
Case 4: $H_0: p = 0$ or $p = 0.50$
How can one calculate the probability of the finding given such a null hypothesis? To answer this question, let us put aside the null hypothesis for a moment and consider the abstract case where we are concerned with the probability of A given that C or D is true. For example, imagine we are invited to dinner and we are interested in the probability that our hostess will serve chocolate for dessert (A) given that she serves chicken (C) or fish (D) for dinner. In symbols, we are interested in $P(A \mid C \cup D)$. Let us assume that C and D are mutually exclusive ($P(C \cap D) = 0$). In the dinner example, our hypothetical hostess would never serve both chicken and fish for dinner, though she might serve either one or something else entirely. It is possible to rewrite the expression of interest so that we have only conditional and unconditional probabilities (see Equation (5) below):
$$P(A \mid C \cup D) = \frac{P(A \mid C) P(C) + P(A \mid D) P(D)}{P(C) + P(D)} \tag{5}$$
Equation (5) makes clear that we need not only the conditional probabilities of A given C and of A given D, but also the unconditional probability of C and the unconditional probability of D, in order to calculate the conditional probability of A given that C or D is true. Returning to our hostess, if we wish to calculate the probability that she will serve chocolate for dessert given that she serves chicken or fish for dinner, we need to know the unconditional probability that she will serve chicken for dinner and the unconditional probability that she will serve fish for dinner. If we do not know these two unconditional probabilities, there is no way for us to calculate the conditional probability that our hostess will serve chocolate for dessert given that she serves chicken or fish for dinner.
Let us now apply what we learned from our hostess to consider again the combination of Case 4 and Syllogism 4, where the null hypothesis specifies that p = 0 or p = 0.50. Equation (5) tells us that in order to calculate the probability of getting, say, 17 heads out of 20 tosses, given that the population proportion of heads is 0 or 0.50, we would need to know the unconditional probability that the population proportion of heads is 0.50 and the unconditional probability that the population proportion of heads is 0. We saw earlier how to calculate the conditional probability of 17 heads out of 20 tosses given a single value (p = 0.50), but Equation (5) shows that this is insufficient when there are two population values to consider. Again, although we do need the conditional probabilities of 17 heads out of 20 tosses given p = 0 and given p = 0.50, we also need the two unconditional probabilities concerning the 0 and 0.50 population values. Are we stuck?
It depends, to some extent, on one's philosophical position pertaining to whether hypothesized population values can take on probabilities. If one's answer is "yes," as would be the case with most Bayesians, then we are not stuck. The researcher would find an arbitrary way of assigning probabilities to p = 0 and p = 0.50, and then it would be easy to carry the calculation through. An example of an arbitrary approach would be to say that because we have no reason to favor p = 0 over p = 0.50, or to favor p = 0.50 over p = 0, we can assign a probability of 0.5 to each of these. Using this arbitrary system, we might perform the following calculation, where F denotes the finding:
$$P(F \mid p = 0 \cup p = 0.50) = \frac{P(F \mid p = 0)(0.5) + P(F \mid p = 0.50)(0.5)}{0.5 + 0.5}$$
Because heads cannot occur when p = 0, the first term in the numerator is 0, and the result is half of the value obtained from the binomial calculation alone.
There is, in fact, another way to proceed, though it also depends on arbitrariness [29] [30]. To move in this direction, consider that in the foregoing reasoning, we commenced with the assumption that it is reasonable to assign probabilities to hypotheses. But it is possible not to assume this, and there are two possibilities for not making the assumption. One possibility is to assert that hypotheses do not have any probabilities whatsoever. If we make this assertion, then there obviously is no way to use Equation (5). The other possibility is to allow hypotheses to have probabilities of only 0 or 1; assigning a probability of 1 to p = 0.50 and a probability of 0 to p = 0 reduces Equation (5) to the binomial calculation.
So what is correct? If one assumes that hypotheses about population values can have probabilities other than 0 or 1, this results in a dilemma for those who wish to perform one-tailed tests to reject null hypotheses. That is, to make the logic work out so that rejecting the null hypothesis is both meaningful (because it specifies a range rather than a point) and also really does force acceptance of the alternative hypothesis, it is necessary to have a range null hypothesis rather than a point null hypothesis. However, again from the perspective that it is reasonable to assign probabilities to hypotheses, Equation (5) shows that the mathematics typically used gives wrong answers! That is, the mathematics of using only the binomial theorem, without taking unconditional probabilities into account, is blatantly wrong by Equation (5).
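As a rough numerical illustration of Equation (5) applied to the two-valued null hypothesis, the sketch below compares two ways of assigning unconditional probabilities to p = 0 and p = 0.50. It assumes the finding is "17 or more heads out of 20 tosses" (the one-tailed version), and the function names are illustrative rather than taken from the article.

```python
from math import comb


def upper_tail(k: int, n: int, p: float) -> float:
    """Probability of k or more heads in n tosses when P(heads) = p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def prob_given_either(like_c: float, like_d: float,
                      prior_c: float, prior_d: float) -> float:
    """Equation (5): P(A | C or D) for mutually exclusive C and D."""
    return (like_c * prior_c + like_d * prior_d) / (prior_c + prior_d)


# Conditional probabilities of the finding under each population value.
like_p0 = upper_tail(17, 20, 0.0)    # heads never occur when p = 0
like_p50 = upper_tail(17, 20, 0.5)   # the usual fair-coin binomial calculation

# Assumption: equal unconditional probabilities of 0.5 for each value.
equal_priors = prob_given_either(like_p0, like_p50, 0.5, 0.5)

# Assumption: probability 1 for p = 0.50 and probability 0 for p = 0.
point_mass = prob_given_either(like_p0, like_p50, 0.0, 1.0)

if __name__ == "__main__":
    print("equal priors:", equal_priors)
    print("all mass on p = 0.50:", point_mass)
```

Under the equal assignment the result is half of the binomial value; under the second assignment it is the binomial value itself. The calculation therefore depends on the unconditional probabilities, which is the point of Equation (5).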
On the other hand, if one uses the argument that it only is permissible to assign a value of 0 or 1 to each population value, perhaps it is possible to justify the binomial calculation that results in a value of 0.001 for the probability of 17 heads out of 20 tosses. We will discuss this further, but let us come closer to considering the whole range of population values first.

Using 0, 0.01, 0.02, ..., 0.50 as Population Values for the Null Hypothesis
Extending Equation (5) to these 51 population values gives Equation (6), where F denotes the finding:
$$P(F \mid H_0) = \frac{\sum_{i} P(F \mid p_i) P(p_i)}{\sum_{i} P(p_i)}, \quad p_i \in \{0, 0.01, 0.02, \ldots, 0.50\} \tag{6}$$
We might ask what the probability is of getting 17 or more heads out of 20 trials given Equation (6). The answer depends on how probabilities are assigned to the population values, for example uniformly or according to a normal-like distribution. Two points should be apparent. First, different decisions about assigning probabilities to population values will render different probabilities of getting 17 or more heads out of 20 tosses. Second, although it is possible to make decisions that would result in findings similar to the binomial calculation with which we commenced, it also is possible to make decisions that would result in findings that differ markedly from that obtained by the binomial calculation.
Or, we could resort again to the strategy of maximizing the calculated probability of obtaining 17 or more heads out of 20 tosses. In this case, we would assign a probability of 1 to the population value of 0.5, and a probability of 0 to all of the other population values. In this case, Equation (6) would reduce down to a single term that is equivalent to the binomial calculation with which we commenced; namely, the probability of 17 or more heads out of 20 tosses is 0.001.

The distribution is "normal-like" rather than "normal" because it is not continuous, nor can the tails extend infinitely.
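The dependence on the assigned probabilities can be sketched over the grid 0, 0.01, ..., 0.50. The three weighting schemes below are illustrative assumptions: a uniform assignment, a normal-like assignment whose center (0.25) and spread (0.10) are hypothetical values chosen only for the example, and the traditional assignment of all mass to 0.5.

```python
from math import comb, exp


def upper_tail(k: int, n: int, p: float) -> float:
    """Probability of k or more heads in n tosses when P(heads) = p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# Population values 0, 0.01, ..., 0.50 and the probability of the finding at each.
values = [i / 100 for i in range(51)]
likelihoods = [upper_tail(17, 20, p) for p in values]


def weighted_probability(weights: list) -> float:
    """Weighted average of the likelihoods; weights need not sum to 1."""
    total = sum(like * w for like, w in zip(likelihoods, weights))
    return total / sum(weights)


uniform = [1.0] * len(values)
# Hypothetical normal-like weights centered at 0.25 with spread 0.10.
normal_like = [exp(-0.5 * ((p - 0.25) / 0.10) ** 2) for p in values]
# Traditional choice: probability 1 at 0.5 and 0 everywhere else.
point_mass = [0.0] * (len(values) - 1) + [1.0]

if __name__ == "__main__":
    print("uniform:     ", weighted_probability(uniform))
    print("normal-like: ", weighted_probability(normal_like))
    print("point at 0.5:", weighted_probability(point_mass))
    # The probability of the finding rises with p, so 0.5 maximizes it.
    print("maximum at:", values[likelihoods.index(max(likelihoods))])
```

Because the probability of 17 or more heads increases with the population value, putting all of the mass on 0.5 gives the largest result, and any assignment that spreads mass over smaller values gives a smaller one.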

The Continuous Null Hypothesis
Let us now return to Case 3 and Syllogism 3, copied below for the reader's convenience.
Case 3: $H_0: P(H) \leq 0.5$; $H_1: P(H) > 0.5$
Syllogism 3: P(H) ≤ 0.5 or P(H) > 0.5 {Premise 1}. P(H) ≤ 0.5 is not true {Premise 2}. Therefore, P(H) > 0.5 {Conclusion}.
In the combination of Case 3 and Syllogism 3, we have a continuous range, with an infinite number of points contained within the range from 0 to 0.5. Nevertheless, the issues that were raised still apply. From the philosophical point of view that hypothesized population values can have probabilities, the question in this continuous case is: What is the density distribution that the researcher should assign to the range going from 0 to 0.5? Here, if one desired, one could apply a uniform distribution, an approximation of the normal distribution, a triangular distribution, or many others. Further, the researcher might wish to define the peak, if there is one, at 0.5 but might instead choose 0.25, or even 0, with the choice being influenced by the shape of the distribution one wishes to assume. To find the probability of the finding given the range null hypothesis, the researcher would have to integrate across the range of values for the assumed distribution. The philosophical weak point, perhaps, is that it is difficult to know what distribution to use and also, if the chosen distribution has a peak, it may be difficult to know where that peak should be.
Or, we can be traditional, and again assign a prior probability of 1 to the population value of 0.5 and a probability of 0 to everything else. As usual, this results in a binomial calculation for the probability of 17 or more heads out of 20 tosses as being 0.001. If a researcher decided to assign probabilities other than 0 or 1 to hypotheses, there are many ways of integrating across the range from 0 to 0.5, depending on the assumed distribution, that would result in values for the probability of 17 heads out of 20 tosses that differ markedly from each other, and also from the strict binomial calculation based on a probability of 1 for the 0.5 population value.
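The continuous case can be sketched by numerical integration. The example below is only an illustration under stated assumptions: it uses a simple midpoint rule from scratch, takes the finding to be 17 or more heads out of 20 tosses, and tries three hypothetical densities on the range from 0 to 0.5 (uniform, a triangle peaking at 0.5, and a triangle peaking at 0). None of these choices comes from the article.

```python
from math import comb


def upper_tail(k: int, n: int, p: float) -> float:
    """Probability of k or more heads in n tosses when P(heads) = p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def integrate(f, a: float, b: float, steps: int = 2000) -> float:
    """Midpoint-rule approximation of the integral of f from a to b."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def probability_of_finding(density) -> float:
    """Integrate P(finding | p) against a density on [0, 0.5], then normalize."""
    numerator = integrate(lambda p: upper_tail(17, 20, p) * density(p), 0.0, 0.5)
    return numerator / integrate(density, 0.0, 0.5)


# Three hypothetical densities on [0, 0.5]; each integrates to 1.
def uniform_density(p: float) -> float:
    return 2.0


def rising_triangle(p: float) -> float:
    return 8.0 * p            # peak at 0.5


def falling_triangle(p: float) -> float:
    return 8.0 * (0.5 - p)    # peak at 0


if __name__ == "__main__":
    print("uniform:          ", probability_of_finding(uniform_density))
    print("triangle, peak 0.5:", probability_of_finding(rising_triangle))
    print("triangle, peak 0:  ", probability_of_finding(falling_triangle))
    print("point mass at 0.5: ", upper_tail(17, 20, 0.5))
```

All three integrated values fall below the point-mass value, and they differ from one another according to how much density sits near 0.5, which illustrates why the choice of distribution and peak location matters.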

Discussion
Having taken considerable trouble to mark out the issues, does it make sense to compute the probabilities of findings given range null hypotheses? As we will see below, there is more than one way to think about it.

Three Perspectives on Probabilities of Findings Given Hypotheses
Calculations of probabilities of findings given hypotheses play a role in three perspectives.
[...] p-values should be used in concert with other information, rather than being the sole piece of information used to decide whether or not to reject hypotheses [37].
But a researcher who wishes to use a p-value as a preliminary indication of the strength of the empirical evidence would not undergo the exercise of constructing a syllogism to reject or accept hypotheses. If the p-value is not used to accept or reject hypotheses, the worry explored earlier, about implausible null hypotheses, is no longer relevant. That is, because the researcher is not accepting or rejecting a null hypothesis, there need not be concern pertaining to whether or not one is rejecting an implausible null hypothesis. Three conclusions are consistent with this point of view. First, a point null hypothesis is fine because it does not have to be plausible. It only has to be useful in coming up with a p-value that, in turn, is merely a preliminary way to assess the strength of the empirical evidence. Second, because the purpose of the p-value is merely to help the researcher form a preliminary assessment of the strength of the empirical evidence, its importance should be much less than it typically is taken as having, by researchers, journal reviewers, and journal editors. Third, if the computed p-value is to be used for the purpose of a preliminary assessment of the strength of the empirical evidence, it seems desirable to have as precise a value as possible. Therefore, the use of a range null hypothesis, along with the strategy of overestimating the p-value by an amount that is impossible to determine, is far from being the best possible practice. With all of this having been said, however, it is worth noting that Vieland and Hodge have shown that no existing statistical procedure validly indexes the state of the empirical evidence, and this includes p-values, whether they are one-tailed or two-tailed [38].
The probability of the null hypothesis, given the finding, is not the same as the probability of the finding given the null hypothesis (Trafimow & Marks, 2015). Well then, from the first and second points of view above, it is silly to engage in null hypothesis significance testing in the first place. That is, the goal either should be to come to a conclusion about the probability of the hypothesis given the finding, in which case one assumes that it is reasonable to assign probabilities to hypotheses; or to form a preliminary assessment of the strength of the evidence. In the former case, to calculate the conditional probability of the finding, one needs to assign unconditional probabilities of hypotheses to carry the calculation through. From this perspective, the usual one-tailed calculation gives blatantly wrong answers (again, not just conservative answers but wrong answers). And from the perspective of using the calculation as a preliminary assessment [...]
[...] scientists should consider a variety of factors, rather than just using p-values [37]. Clearly, if one is to use a cutoff, as is necessitated by the insistence on controlling Type I error, there is no room left to consider other factors. Either the computed conditional probability of the finding, given the null hypothesis, is under or over the cutoff and that is it! Second, few philosophers of science would agree that it is the job of scientists to reject or not reject hypotheses after obtaining single findings. Rather, scientists are supposed to propose and test larger theories, and each individual hypothesis is part of a larger network of theoretical assumptions, auxiliary assumptions, substantive hypotheses, and statistical hypotheses [12] [13] [19] [40]. From this perspective, a strong focus on controlling Type I error, and the cutoff strategy that goes with it, seems philosophically naïve.

Conclusions
We have seen that, depending on one's perspective, the traditional calculation for conditional probabilities of findings given range null hypotheses via maximization at the largest value in the range, is either blatantly wrong, quite imprecise, a conservative overestimate of the actual probability of the finding (in which case the risk of Type II error is quite large), or a quite liberal underestimate of the actual probability of the finding (relative to two-tailed calculations with point hypotheses). I underscore that these contradictory assessments pertain to evaluating probabilities of findings given hypotheses. This contrasts with the usual demonstration that different points of view give contradictory assessments about how researchers should evaluate hypotheses given findings. To my knowledge, this is the first demonstration to emphasize that these contradictions occur at the level of data evaluation rather than just at the level of hypothesis evaluation.
It is interesting that if researchers only used point null hypotheses, although different philosophical perspectives would still demand differences in hypothesis evaluation, at least the calculations would not differ pertaining to data evaluation. That is, for example, the probability of 17 or more heads out of 20 tosses, given a fair coin, would be calculated the same way by everybody and in accordance with the binomial theorem as we saw earlier (probability = 0.001). Thus, data evaluation, at least, would not be controversial, though hypothesis evaluation would remain controversial. But matters change when range null hypotheses are used, which places researchers in the Neyman-Pearson tradition in a dilemma. On the one hand, they can use point null hypotheses, where the calculations pertaining to data evaluation are not controversial, but at the cost of rejecting null hypotheses that are not plausible anyway. Or, they can use range null hypotheses that have better plausibility; but where the calculation of the probability of the obtained findings, given the null hypothesis, is potentially quite problematic. More generally, whatever the philosophical perspective, range null hypotheses are problematic even from a data evaluation point of view prior to hypothesis evaluation. Unfortunately, it is not obvious how to defend the computational technique chosen to calculate the probability of the obtained finding given a range null hypothesis. A balanced assessment might be that all of them stand on rickety foundations.