12 myths about one-tailed vs. two-tailed tests of significance
Author: Georgi Z. Georgiev, Published: Aug 6, 2018
One of the purposes of this project is to dispel commonly encountered myths about one-sided tests which are usually framed in relation to their two-sided counterparts. Here we will talk about one-sided tests of significance and the resulting p-values, but all the conclusions are equally valid for one-sided confidence intervals versus two-sided confidence intervals.
Table of contents:
- One-sided tests of significance are "biased"
- One-sided tests result in higher actual type I error than their nominal one
- Performing a one-sided test amounts to or requires a prediction or expectation
- One should not do a one-sided test after a two-tailed test or after a one-tailed test in the opposite direction
- One should only perform a one-sided test if a potential outcome in the opposite direction is of no interest
- With a fixed sample size, a one-sided test provides greater statistical power vs. a two-sided test with the same error guarantees
- One-sided tests lead to results that are not replicable or otherwise questionable
- One-sided tests are acceptable only if the outcome variable can change only in one direction
- χ², Fisher’s exact and other such tests cannot accommodate one-tailed tests
- Using a one-sided test is "tampering" with or relabeling of Z-scores with lower p-values
- One-sided tests have more assumptions and restrict one’s inquiry
- Using one-tailed tests is controversial
Myth #1: One-sided tests of significance are "biased"
Critics of one-sided tests attach different meanings to the term "bias". Some use it to mean that a one-sided test allows for more actual type I errors than the nominal significance level or reported p-value. Others compare it to a two-sided test using the same critical region. Yet others mean that, due to a prediction or expectation regarding the outcome, the test somehow favors results with one sign over results in the opposite direction. I have yet to encounter anyone using the word "bias" in its statistical meaning of an "asymptotically skewed estimate", that is, one that over- or under-estimates the true value, on average.
All but the last of these are reviewed separately below.
Myth #2: One-sided tests result in higher actual type I error than their nominal one
This myth is more often encountered as something like "one-tailed tests result in a higher probability of type I errors" or, as it is sometimes put, "one-tailed tests result in a higher rate of 'false discoveries'". Some go as far as calling this "cheating" or "fraud" (e.g. V.15 & V.16 in "Examples of negative portrayal of one-sided significance tests").
Once explicitly stated, under a strict definition of "type I error" and "discovery" or "false discovery" these are easily demonstrated as false.
The reason is simple: the nominal p-values and confidence bounds of one-sided statistical tests offer the same error probabilities as those from two-sided ones under their respective null hypotheses. If the acceptable risk is 5% against undesirable outcomes (usually lack of efficiency or negative efficiency of some sort) then in fact the one-sided tests control this risk conservatively at the 5% level, while a two-sided test calculated under the same value of the critical boundary would control it conservatively at the 2.5% level (assuming symmetrical distribution).
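The error rates in the paragraph above can be checked with a few lines of Python. This is a minimal sketch, assuming a Z-test whose statistic is normally distributed with unit variance and a mean equal to the true standardized effect; the function names and the `shift` parameter are illustrative and do not come from the article.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def reject_prob_one_sided(alpha, shift):
    """P(reject H0: effect <= 0) when the true standardized effect is `shift`."""
    c1 = Z.inv_cdf(1 - alpha)  # one-sided critical value
    return 1 - Z.cdf(c1 - shift)


def reject_prob_two_sided_upper(alpha, shift):
    """P(a two-sided test at level alpha rejects in its upper tail)."""
    c2 = Z.inv_cdf(1 - alpha / 2)  # two-sided critical value (absolute)
    return 1 - Z.cdf(c2 - shift)


alpha = 0.05
# At the boundary of the one-sided null the error rate equals alpha.
print(round(reject_prob_one_sided(alpha, 0.0), 4))
# Inside the one-sided null (true effect below zero) it is smaller than alpha.
print(round(reject_prob_one_sided(alpha, -0.5), 4))
# The two-sided test reaches a directional rejection only half as often.
print(round(reject_prob_two_sided_upper(alpha, 0.0), 4))
```

Under these assumptions the one-sided test holds the directional risk at no more than 5%, while the two-sided test with nominal α = 0.05 holds it at 2.5%.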
The confusion comes from the fact that a two-sided test rejects its null hypothesis at a more extreme boundary value than a corresponding one-sided test while forgetting that the null hypothesis is different in both cases. In the case of a two-sided test it is very specific thus there is higher uncertainty in rejecting it versus a one-sided test where a broader null hypothesis can be rejected with lower uncertainty given the same data. This is something I discuss in more detail in "The paradox of one-sided vs. two-sided tests of significance".
A one-sided hypothesis provides the proper error probability whenever the claim or conclusion we make is directional as explained in detail in "Directional claims require directional (statistical) hypotheses". Using a two-sided test with a point null results in overestimation of the probability of observing such a discrepancy in the specific direction and comes at the cost of significant increase in the required sample size (20%-60% increase) or a corresponding decrease in the power of the test.
Therefore, the problem is not that one-sided tests are somehow less stringent / more relaxed, but that often two-sided tests are used where they do not belong (see "Examples of improper use of two-sided hypotheses").
Myth #3: Performing a one-sided test amounts to or requires a prediction or expectation
This is one part of the "one-sided tests are biased" myth. There is no prediction or expectation involved in computing a one-sided test of significance or confidence interval bound.
While there might be a prediction, expectation or hope in the researcher’s head, it does not enter the data-generating procedure in any form. In other words, it does not affect the sampling space. If the claim is broader it limits the possible outcomes that would lead to a negative answer to that claim (rejection of the null hypothesis). It has no effect on the error guarantees of the test.
Contrary to popular opinion, there is also no need to predict the direction of the experiment outcome or that we would want to conduct a one-sided test in a given direction before the data is gathered. It is perfectly fine to make a directional claim and support it with the proper left-sided or right-sided test without any prior prediction or expectation of it whatsoever.
The reason is that a one-sided test amounts to nothing more or less than the answer to a question, a claim expressed in probabilistic terms. There is no requirement to predict or expect the claim for the statistic to be valid.
We can test this easily by writing down our prediction or expectation and then gathering data and performing a one-sided test in one direction or the other (or both). It will always provide a conservative bound on the probable error, given the assumptions for its application are met.
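The check described above can also be run as a simulation. The following is only a sketch under simplifying assumptions: the "prediction" is a coin flip recorded before the data, the true effect is exactly zero, and the test statistic is drawn directly from a standard normal distribution. The function name `simulate` and its parameters are made up for illustration.

```python
import random
from statistics import NormalDist


def simulate(n_sims=200_000, alpha=0.05, seed=1):
    """Rejection rate of a right-sided test, split by the recorded prediction."""
    rng = random.Random(seed)
    c1 = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    counts = {"up": [0, 0], "down": [0, 0]}  # [rejections, trials]
    for _ in range(n_sims):
        prediction = rng.choice(("up", "down"))  # written down before the data
        z = rng.gauss(0.0, 1.0)  # test statistic when the true effect is zero
        counts[prediction][1] += 1
        if z > c1:
            counts[prediction][0] += 1
    return {key: rej / trials for key, (rej, trials) in counts.items()}


rates = simulate()
print(rates)  # both rates should sit close to alpha
```

Since the prediction never touches the data-generating step, the rejection rate should be close to α whichever prediction was written down.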
Myth #4: One should not do a one-sided test after a two-tailed test or after a one-tailed test in the opposite direction
This claim is encountered in much of the literature and is enshrined in statistical guidelines and requirements in the form of requiring pre-specification. I believe it stems either from the paradox mentioned in Myth #2 or from the false belief that it somehow amounts to multiple testing or cherry-picking. It simply does not, since no kind of prediction or expectation, nor any kind of prior two-sided tests or tests in the opposite direction, have any effect on the data generation procedure.
All we do is alter the question being asked from the data by selecting from a mutually-exclusive set of questions in the case of the two one-sided tests. It would be equivalent to saying that one cannot calculate both a 95% and a 99% confidence interval on the same data. Or that one cannot use the same set of data to claim that a drug likely has an effect (p=0.04) and that it likely has a positive effect greater than zero (p=0.02) and that it likely has a positive effect greater than δ (p=0.05). Altering the claim we make while correspondingly altering the probability calculation requires no adjustments, precautions, etc.
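To make the drug example concrete, here is a small sketch that computes all three p-values from the same estimate. It assumes a normally distributed estimate with a known standard error; the values of `estimate`, `se` and `delta` are hypothetical and were chosen only so that the results land near the p-values quoted above.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def p_two_sided(estimate, se):
    """p-value for the point null H0: effect == 0."""
    return 2 * (1 - Z.cdf(abs(estimate / se)))


def p_greater_than(estimate, se, bound=0.0):
    """p-value for the one-sided null H0: effect <= bound."""
    return 1 - Z.cdf((estimate - bound) / se)


# Hypothetical numbers, not taken from any real study.
estimate, se, delta = 2.054, 1.0, 0.409

print(round(p_two_sided(estimate, se), 3))            # claim: there is an effect
print(round(p_greater_than(estimate, se), 3))         # claim: effect > 0
print(round(p_greater_than(estimate, se, delta), 3))  # claim: effect > delta
```

Each call changes the claim and the matching probability calculation while the data stay the same, so none of the three results needs adjusting for the others.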
Myth #5: One should only perform a one-sided test if a potential outcome in the opposite direction is of no interest
This one is espoused in several critiques of one-sided tests with slightly different wording. It has a lot to do with Myth #4 according to which it is somehow bad, cheating, etc. to say that we will perform, or to actually perform a one-sided test in one direction and then to do so in the other. There is no issue in doing so since the fact that we have asked one question has no effect on the answer (statistic) given to the opposite question.
There are no issues in making multiple inferences out of a given set of data, other than the question: would a potential consumer of the data be interested in all or just some of them?
Myth #6: With a fixed sample size, a one-sided test provides greater statistical power vs. a two-sided test with the same error guarantees
This is a fairly prevalent claim in criticism and negative portrayals of one-sided tests, but it is also encountered among literature supporting one-sided tests, so I’ve given it detailed consideration here. I’ve seen it used in some cases as if power is not the probability of correctly rejecting the null hypothesis if a given point alternative is true, but as if the "if a given point alternative is true" part is dropped, making it similar to myth #2. In other versions of it "with the same error guarantees" is dropped instead, clearly making it a version of myth #2.
I reject the myth altogether due to simple logic. Changing the null hypothesis changes the power against a set alternative given fixed sample size and error probability, therefore comparing the results of the two power calculations is like comparing apples to oranges. While it is possible to compare them numerically, at best it tells us nothing that we don’t already know from the definition of the tests and at worst it has the potential to confuse us unnecessarily. This becomes obvious when we express the power calculations in full notation.
Let T1(α) be the statistical test corresponding to the one-sided test and T2(α) the statistical test corresponding to the two-sided test. Let c1(α) be the critical Z value for the one-sided test and c2(α) the absolute value of the critical Z value for the two-sided test, so that |c1(α)| < |c2(α)|. Denote the point alternative by μ1. Then:
POW(T1(α); μ1) = P(d(X) > c1(α); μ1)
POW(T2(α); μ1) = P(d(X) > c2(α); μ1)
It should now be clear that while the powers of the two tests are comparable numerically they are not comparable in practice if we account for what they represent. Since a two-sided test is two one-sided tests back to back it is powered for |μ1| and will therefore maintain its power against -μ1.
In notation that would be POW(T2(α); |μ1|) = P(d(X) > c2(α); μ1) ∪ P(d(X) < -c2(α); -μ1).
We must consider -μ1 if we want a meaningful comparison, since the two-sided test can also reject its null when the true value is -μ1. Since POW(T2(α); -μ1) > 0 while POW(T1(α); -μ1) is effectively 0, things even out once we compare the two power functions in a meaningful way. However, if the question is genuinely two-sided, then a one-sided test is inadmissible and so the comparison is of no practical interest.
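The behavior of the two power functions at μ1 and -μ1 can be tabulated numerically. The sketch below assumes a Z-test with unit variance and expresses μ1 in standard-error units as `shift`; that parameterization and the function names are assumptions made for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_one_sided(alpha, shift):
    """P(d(X) > c1) when the true standardized effect is `shift`."""
    c1 = Z.inv_cdf(1 - alpha)
    return 1 - Z.cdf(c1 - shift)


def power_two_sided(alpha, shift):
    """P(|d(X)| > c2), counting rejections in either tail."""
    c2 = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(c2 - shift)) + Z.cdf(-c2 - shift)


alpha = 0.05
# mu1 in standard-error units, chosen so that the one-sided power is 0.80.
shift = Z.inv_cdf(1 - alpha) + Z.inv_cdf(0.80)

for s in (shift, -shift):
    print(f"shift={s:+.3f}  one-sided={power_one_sided(alpha, s):.5f}  "
          f"two-sided={power_two_sided(alpha, s):.5f}")
```

The two-sided test keeps the same power at -μ1 as at μ1, while the one-sided test has practically none at -μ1, which is the asymmetry the comparison has to account for.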
If we wrongly apply a two-sided test to a one-sided question it will have power POW(T2(α); μ1) < POW(T1(α); μ1). A brief examination of the cost of that mistake might be helpful, for example with α = 0.05. If POW(T1(α); μ1) is 0.80, then POW(T2(α); μ1) will be 0.70, meaning there is a 50% increase in the type II error limit β for μ ≥ μ1 (β ≤ 0.2 vs β ≤ 0.3). If POW(T1(α); μ1) is 0.90 then POW(T2(α); μ1) will be approximately 0.833 (a 66.6% increase in β). If POW(T1(α); μ1) is 0.95 then POW(T2(α); μ1) will be 0.908 (an 84% increase in β). We can see the cost of applying the wrong test in terms of loss of test sensitivity and a great increase in type II errors. To maintain power we would need a 20-60% larger sample size. Note that α = 0.05 is maintained conservatively in all cases.
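The figures above can be reproduced with a short calculation. This sketch assumes a Z-test in which the standardized effect grows with the square root of the sample size, and it ignores the negligible chance of the two-sided test rejecting in the far tail; the function names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def two_sided_power_given_one_sided(alpha, power_one_sided):
    """Power of the two-sided test (upper tail) at the alternative for which
    the one-sided test has the stated power."""
    shift = Z.inv_cdf(1 - alpha) + Z.inv_cdf(power_one_sided)
    c2 = Z.inv_cdf(1 - alpha / 2)
    return 1 - Z.cdf(c2 - shift)


def sample_size_ratio(alpha, power):
    """Two-sided over one-sided sample size needed for the same power."""
    z_beta = Z.inv_cdf(power)
    return ((Z.inv_cdf(1 - alpha / 2) + z_beta)
            / (Z.inv_cdf(1 - alpha) + z_beta)) ** 2


alpha = 0.05
for power in (0.80, 0.90, 0.95):
    p2 = two_sided_power_given_one_sided(alpha, power)
    beta_increase = ((1 - p2) / (1 - power) - 1) * 100
    print(f"one-sided power {power:.2f}: two-sided power {p2:.3f}, "
          f"beta +{beta_increase:.0f}%, "
          f"sample size x{sample_size_ratio(alpha, power):.2f}")
```

Under these assumptions the required sample size for the three power levels is roughly 1.27, 1.23 and 1.20 times that of the one-sided test, and the ratio grows as the target power decreases.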
Once we see that "power" refers to different null hypotheses and thus probability calculations it becomes easy to see through the fallacy. While one is free to call both "power" the two power calculations above are for rejecting different claims. Since the null hypotheses under which d(X) is calculated in T1(α) and in T2(α) are different the results of the power calculations refer to different things.
Put otherwise, changing the question does not replace or alter the measuring tape. What changes is what we limit our claims to. There are no tricks or magic in devising a different test that answers different questions and thus has different power given some fixed constants.
It is thus revealed that the claim that a one-sided test has greater statistical power than a two-sided test can only be sustained if we ignore the difference between the null hypotheses and act as if both are calculated under the same null hypothesis or if we do not allow for directional conclusions (too wild a claim to entertain at all).
This myth is, upon closer examination, resolved by understanding that it is making an invalid comparison or wrongly applying a two-sided test to a one-sided research claim, or vice versa.
Myth #7: One-sided tests lead to results that are not replicable or otherwise questionable
(See IV.2 in here.) This is said to result from "Using statistical tests inappropriately", with examples of choosing a one-tailed test because it leads to attaining significance and of performing a one-tailed test after performing a two-tailed one. While I agree that one should use statistical tests appropriately, I do not see any issue in using any statistical test as long as it is properly documented. The reader can then come to their own conclusion whether the test result supports the conclusions or claims it is meant to warrant.
In most cases I have seen it is the improper use of two-sided tests that is the issue. Performing a one-sided calculation and properly reporting its p-value or confidence bound is not, by itself, an issue.
Myth #8: One-sided tests are acceptable only if the outcome variable can change only in one direction
Even when the variable in question can only move in one direction, e.g. number or proportion of events versus non-events in a given sample (number of events cannot be less than 0, proportion is 0..1), the practical question of interest usually concerns the difference between two numbers or two proportions. The difference can almost always go both ways and be either positive or negative.
A one-sided test is not only allowed in such situations, but actually required when the resulting probability or interval bound is to support a directional claim such as "number of events in group A is greater than those in group B" or "the treatment results in a reduction of relative risk in the treatment group versus the control".
Myth #9: χ², Fisher’s exact and other such tests cannot accommodate one-tailed tests
This is a strange one, I’ll admit, but I decided to mention it anyways. It comes from example V.6 in here: "F tests, Chi-square tests, etc. can’t accommodate one-tailed tests because their distributions are not symmetric."
The argument is that since the statistic’s distribution is asymmetric there is no way to calculate a one-tailed rejection region corresponding to a given cumulative probability. However, I cannot see why it would be a problem to calculate a rejection region in one side of an asymmetric distribution, be it chi-squared or F distribution.
Fisher himself performed one-sided hypothesis tests using both the chi-square and a Fisher’s exact test in an example in his 1925 "Statistical Methods for Research Workers" and elsewhere. Other examples are available as well. The fact that the statistic has only one tail or that it is asymmetric does not mean that its cumulative probability cannot be calculated using a one-sided hypothesis.
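As an illustration that a one-sided calculation poses no difficulty here, the sketch below computes a right-sided Fisher’s exact p-value for a 2x2 table directly from the hypergeometric distribution with fixed margins. The function name and the example table are hypothetical and are not taken from Fisher’s book.

```python
from math import comb


def fisher_exact_one_sided(a, b, c, d):
    """Right-sided p-value for the 2x2 table [[a, b], [c, d]].

    With all margins fixed, this is the probability that the top-left
    cell is at least as large as the observed value `a`.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)  # number of tables with these margins
    upper = min(row1, col1)          # largest possible top-left cell
    favorable = sum(comb(row1, k) * comb(row2, col1 - k)
                    for k in range(a, upper + 1))
    return favorable / total


# Hypothetical table: 3 of 4 events in group A versus 1 of 4 in group B.
p_value = fisher_exact_one_sided(3, 1, 1, 3)
print(round(p_value, 4))
```

The asymmetry of the underlying distribution plays no role in the calculation: the rejection region is simply the set of tables at least as extreme in the direction of the claim.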
Myth #10: Using a one-sided test is "tampering" with or relabeling of Z-scores with lower p-values
These come from Hick 1952 and Goodman 1988 in the first part of here (I.3, I.2). It is another manifestation of "The paradox of one-sided vs. two-sided tests of significance". In both cases the root of the problem is equating the null hypothesis with the nil hypothesis and using improper tests as a result.
Myth #11: One-sided tests have more assumptions and restrict one’s inquiry
This is a variation of the prediction/expectation myth above (#3). It is false since asking a directional question does not involve any more assumptions and does not impose any more restrictions on the possible values of the outcome variable and does not affect it in any way when compared to a two-sided equivalent.
In court, asking the witness "did you see the defendant at 2 P.M.?" might be considered subtly leading the witness, while a phrasing like "at what time did you see the defendant?" is considered more neutral. However, the data will not alter its mean, median or variance based on what we are hoping these values to be. It is there, sitting unamused by our inner struggles and attempts to influence it, and has a precise amount of information relative to any question we ask.
Doing a one-sided test imposes a restriction only on the outcome space, that is: on the set of possible values that will result in a rejection of a null hypothesis. It does not alter the sampling space (all possible values of the outcome variable) or the probability of observing a given outcome.
Myth #12: Using one-tailed tests is controversial
This is a bit of a tongue-in-cheek "myth" or a "wish for the future". While it might still be factually true, I hope this will become a myth sooner rather than later with the help of the OneSided Project and by applying sound logic and sound math to statistical inference.
Cite this article:
If you'd like to cite this online article you can use the following citation:
Georgiev G.Z., "12 myths about one-tailed vs. two-tailed tests of significance", [online] Available at: https://www.onesided.org/articles/12-myths-one-tailed-vs-two-tailed-tests-of-significance.php URL [Accessed Date: 11 Dec, 2018].