Proponents of one-sided statistical tests

Author: Georgi Z. Georgiev, Published: Aug 6, 2018

We already covered that the giants Fisher, Neyman and Pearson supported one-sided tests theoretically and used them in practice and in famous examples of statistical analysis. They are, of course, not the only ones, so here I provide a brief account of the many other proponents of one-sided tests, with citations and short summaries of their positions.

I. Positive portrayals in published papers in scientific journals

1. Deborah Mayo, Aris Spanos

I begin with someone who is, in my opinion, one of the most important modern philosophers of science and statisticians: Deborah Mayo. In her work with Aris Spanos on severe testing / error statistics she uses one-sided tests exclusively, and her SEV statistic is defined mathematically as a test for a one-sided statistical hypothesis. Her views on the relation between one-sided and two-sided tests can be seen in the following citation from Mayo & Spanos (2006) [1]:

"To keep the focus on the main logic, we assume that the null and alternative hypotheses of interest will concern the mean µ:

H0: µ ≤ µ0 vs. H1: µ > µ0.

[…]

Note that while test T(α) describes a familiar ‘one-sided’ test, our discussion easily extends to the case where one is interested in ‘two-sided’ departures: One simply combines two tests, ‘one to examine the possibility that µ1 > µ0, the other for µ1 < µ0’ (Cox and Hinkley [1974], p. 106, replaced u with m). In this case, the α level two-sided test combines both one-sided tests, each with significance level 0.5α."

Since this is a joint work between Mayo & Spanos, it is obvious that the latter is also in favor of one-sided tests. This is supported in their later work as well: Mayo & Spanos (2010) [2].
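The relation described in the citation above can be sketched numerically. The snippet below is only an illustration under assumed conditions: a z statistic that is standard normal under the null, a hypothetical observed value z = 2.1, and helper names (`z_test_p_values` and the variables) that are mine, not Mayo & Spanos's.

```python
from statistics import NormalDist


def z_test_p_values(z):
    """Return (upper, lower, two_sided) p-values for an observed z statistic."""
    std_normal = NormalDist()
    p_upper = 1 - std_normal.cdf(z)  # H0: mu <= mu0 vs H1: mu > mu0
    p_lower = std_normal.cdf(z)      # H0: mu >= mu0 vs H1: mu < mu0
    # Two-sided p-value built from the two one-sided tests.
    p_two = 2 * min(p_upper, p_lower)
    return p_upper, p_lower, p_two


alpha = 0.05
z = 2.1  # hypothetical observed statistic
p_upper, p_lower, p_two = z_test_p_values(z)

# The two-sided test at level alpha rejects exactly when one of the
# one-sided tests rejects at level 0.5 * alpha.
reject_two_sided = p_two <= alpha
reject_either_one_sided = min(p_upper, p_lower) <= alpha / 2
print(p_upper, p_lower, p_two, reject_two_sided, reject_either_one_sided)
```

The two rejection decisions always agree, which is the sense in which the two-sided test "combines both one-sided tests, each with significance level 0.5α".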

2. Karl Peace

Karl Peace is a staunch defender of the use of one-sided tests of significance, as can be seen in multiple papers and letters to journal editors on the issue of one-tailed tests.

For example, in a letter to the editor of "Biometrics" in 1988 Peace writes [3]: "However, as long as the regulatory risk is 5%, the confidence intervals would be 90% (two-sided) ones rather than 95% ones."
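The correspondence between a one-sided test at the 5% level and a two-sided 90% confidence interval can be illustrated with a short sketch. It assumes a normally distributed estimate with known standard error; the numbers and function names are hypothetical and chosen only to make the duality visible.

```python
from statistics import NormalDist


def two_sided_ci(estimate, std_error, confidence):
    """Symmetric two-sided normal-theory confidence interval."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return estimate - z * std_error, estimate + z * std_error


def one_sided_p_superiority(estimate, std_error):
    """p-value for H0: difference <= 0 vs H1: difference > 0."""
    return 1 - NormalDist().cdf(estimate / std_error)


# Hypothetical treatment difference and its standard error.
estimate, std_error = 1.8, 1.0

ci90 = two_sided_ci(estimate, std_error, 0.90)
ci95 = two_sided_ci(estimate, std_error, 0.95)
p_one_sided = one_sided_p_superiority(estimate, std_error)

# The lower bound of the 90% interval excludes zero exactly when the
# one-sided p-value is below 0.05; the 95% interval corresponds to 0.025.
print(ci90, ci95, p_one_sided)
```

With these assumed numbers the one-sided test is significant at the 5% level and the 90% interval excludes zero, while the 95% interval does not.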

In the same paper we see three arguments for one-sided tests. The first is: "…if the question, the research is directed toward is unidirectional, then significance tests should be onesided." Peace then makes a second point, about the consistency between the research hypothesis and the statistical hypothesis: "A second point is that we should have internal consistency to our construction of the alternative hypothesis. An example of what I mean here is the dose response or dose comparison trial. Few (none that I have asked) statisticians would disagree that dose response as a research objective, captured in the hypothesis specification framework, is Ha : µp ≤ µd1 ≤ µd2, where for simplicity I have assumed that there are two doses d1 and d2 of the test drug, a placebo (p) control, and the effect of drug is expected to be nondecreasing. If this is the case and if for some reason, the research is conducted in the absence of the d1 group, then why would Ha : µp ≤ µd2 become Ha : µp ≠ µd2?"

His third point is that in a confirmatory (also "Phase III") trial it makes no sense to have a two-sided alternative.

Peace makes these same points in an extended form in a separate paper Peace (1989) [4]: "The appropriateness of a one-sided alternative hypothesis rather than the more conservative, boiler-plate, two-sided hypothesis is discussed and examples provided. It is concluded that confirmatory efficacy clinical trials of pharmaceutical compounds should always be viewed within the one-sided alternative hypothesis testing framework."

Despite making the erroneous claim that a two-sided hypothesis is "more conservative", he is still clearly in favor of using one-sided tests and confidence intervals.

In a third paper, Peace (1991) [5], he elaborates on the same points and states: "If hypothesis testing provides the most meaningful framework to address the objective, then the alternative hypothesis should embody the objective in both substance and direction, and the p value should be consistent with the direction.", essentially restating the need for the statistical hypothesis to correspond to the research hypothesis and to any claims made.

Further down I was pleased to find support for something I claim as well:

"There may be situations where two-sided p values are appropriate. But my experience in the clinical development of drugs suggest few-if any."

Finally, the point about the proper interpretation of a two-sided interval when directional claims are made is reiterated:

"Two-sided 90% confidence intervals could be constructed, which would permit an inference at the 5% level for the superiority of either group in a pairwise comparison."

3. Kaiser, Boissel, Wolterbeek & Enkin

In Kaiser (1960) [6] we find full-fledged support for one-sided tests and opposition to two-sided tests: "It seems obvious that the traditional two-sided test should almost never be used".

The points made by Boissel in his brief paper (Boissel, 1988) [7] are more moderate, and he does see some places where two-sided tests are appropriate. He makes a distinction between factual and decisional trials, which I find artificial, given that it does not matter what decision, if any, follows from a claim. What matters is whether the claim is directional or not. He does make a good point in stating: "…the problem is not that one-sided approaches are too frequently used; it is whether the design is consistent with the question (and the conclusion as well).", which, if taken to its logical conclusion, would lead to the adoption of one-sided tests on most if not all occasions.

In letters to the editor of the British Medical Journal both Wolterbeek (1994) [8] and Enkin (1994) [9] express support for the usage of one-sided tests. Wolterbeek is more permissive of two-tailed tests, erroneously allowing them in cases where there is no prediction or expectation of the direction of the effect; however, he defends and recommends the usage of one-sided tests in medical trials.

The endorsement of one-sided tests by Enkin is more decisive and supported by the obvious logic of a clinical trial: "In this case we are interested only in whether the experimental treatment is better than the less expensive or more invasive form. We are indifferent as to whether it is equal or worse."

4. Freedman and the three-decision rule

Freedman (2008) [10] makes a number of arguments for the use of one-sided tests of significance, many of them from a Bayesian perspective. A curious argument is presented in the subsection "A classical argument for not doubling the p-value when moving from a one-sided to a two-sided test", wherein he makes the case that two one-sided tests with an arbitrarily small discrepancy from the nil have no intersection, and thus there is no need to take 2 times alpha for the two-sided test: "The intersection of two one-sided tests is zero for arbitrarily small region around the null – no need to half the alpha or to take 2 times alpha for a two-sided test."

He then proceeds to advocate for a three-decision rule: "we decide on the sign of δ through tests (a2) and (b2), declaring δ positive if test (a2) is significant, negative if test (b2) is significant, making no decision otherwise" wherein a2 and b2 are complementary one-sided tests. Freedman credits this rule to Neyman and Wald and lists Kaiser and Tukey as supporters.

While this is a clear endorsement of one-sided tests, I have to say I do not agree that there is no need to double the p-value when one wants to consider a two-sided alternative. Simple simulations reveal that, if one views a two-sided test as two one-sided tests back-to-back, then for the probabilities to be correct each one-sided test has to be conducted at half the significance level of the two-sided test (equivalently, the two-sided p-value is double the smaller of the two one-sided p-values). How often and in which cases a two-sided test is called for is an entirely different question.
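A simulation of the kind mentioned above can be sketched as follows. It assumes a z statistic that is standard normal when the nil hypothesis is true; the function name, the seed and the number of simulated experiments are illustrative choices of mine and are not taken from Freedman's paper.

```python
import random
from statistics import NormalDist


def either_direction_rejection_rate(alpha_per_test, n_sims=200_000, seed=1):
    """Share of null experiments in which either one-sided test rejects."""
    rng = random.Random(seed)
    # Critical value of a single one-sided z test at level alpha_per_test.
    z_crit = NormalDist().inv_cdf(1 - alpha_per_test)
    rejections = 0
    for _ in range(n_sims):
        z = rng.gauss(0.0, 1.0)  # test statistic when the nil is true
        if z > z_crit or z < -z_crit:
            rejections += 1
    return rejections / n_sims


# Two back-to-back one-sided tests, each run at the full 0.05 level,
# versus each run at half of it.
rate_full = either_direction_rejection_rate(0.05)
rate_half = either_direction_rejection_rate(0.025)
print(rate_full, rate_half)
```

Under these assumptions the first rate comes out near 0.10 and the second near 0.05, so the pair of tests keeps a 5% overall error rate only when each one-sided test is run at 2.5%.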

5. Cho & Abe, Westlake

Most recently a strong defense of the adoption of one-sided tests in economics/business/marketing research is mounted by Cho & Abe (2013) [11].

"This paper demonstrates that there is currently a widespread misuse of two-tailed testing for directional research hypotheses tests. One probable reason for this overuse of two-tailed testing is the seemingly valid beliefs that two-tailed testing is more conservative and safer than one-tailed testing. However, the authors examine the legitimacy of this notion and find it to be flawed. A second and more fundamental cause of the current problem is the pervasive oversight in making a clear distinction between the research hypothesis and the statistical hypothesis. Based upon the explicated, sound relationship between the research and statistical hypotheses, the authors propose a new scheme of hypothesis classification to facilitate and clarify the proper use of statistical hypothesis testing in empirical research."

The remark about the lack of a clear distinction between the research and the statistical hypothesis is spot on. Cho & Abe allow for two-sided tests when there is a lack of prior knowledge of the subject being tested; however, I do not accept that case as legitimate. The second case where they would allow for a two-sided test is testing for "nonexistence" of a relationship. However, this one is flawed as well. In medical trials these are known as equivalence tests and they are performed using TOST: Two One-Sided Tests of significance (following Westlake’s response to Kirkwood’s 1981 "Bioequivalence Testing - A Need to Rethink"), and rightly so, since there will likely be some observed discrepancy and thus we’d like to say that if the null hypothesis in the opposite direction were true, such a discrepancy would not have been observed. A discrepancy in just one direction will be sufficient to state that there is some discrepancy.

While mathematically the two one-sided tests are equivalent to a two-sided test with a higher uncertainty level (e.g. two one-sided at 0.025 are one two-sided at 0.05) reporting the procedure as two one-sided tests allows one to better align the statistical hypothesis and the claims they’d like to be able to make after the experiment (e.g. the two are not equivalent since there is a positive difference (p=0.01)). If the outcome of the test is not statistically significant then what matters is the statistical power and/or the width of the corresponding confidence intervals.
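For readers unfamiliar with TOST, here is a minimal sketch of the procedure in its usual textbook form. It assumes a normally distributed estimate of the difference with known standard error and a symmetric equivalence margin; the margin, the numbers and the function name are hypothetical and serve only as an illustration.

```python
from statistics import NormalDist


def tost_p_values(diff, std_error, margin):
    """Return the two one-sided p-values of a TOST equivalence test.

    p_lower tests H0: true difference <= -margin vs H1: difference > -margin.
    p_upper tests H0: true difference >= +margin vs H1: difference < +margin.
    """
    std_normal = NormalDist()
    p_lower = 1 - std_normal.cdf((diff + margin) / std_error)
    p_upper = std_normal.cdf((diff - margin) / std_error)
    return p_lower, p_upper


alpha = 0.05
# Hypothetical observed difference, standard error and equivalence margin.
p_lower, p_upper = tost_p_values(diff=0.2, std_error=0.5, margin=1.5)

# Equivalence is claimed only if both one-sided nulls are rejected,
# i.e. if the larger of the two p-values is at most alpha.
equivalent = max(p_lower, p_upper) <= alpha
print(p_lower, p_upper, equivalent)
```

Reporting the two p-values separately keeps each directional claim tied to its own one-sided hypothesis.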

In continuing the last point, it should be noted that Westlake in his response makes a strong case for one-sided significance tests in pharmaceutical trials: "My interpretation of this is that the regulatory agency is attempting to ensure that if the drug is really the same as placebo there is only a low probability, .05, that it will be approved. Note, however, that since the drug would never be approved for being less efficacious than placebo, the test and its associated critical region should be one-sided."

He also makes a great point with regards to the perceived lower stringency of one-sided tests: "To a regulatory agency, for example, use of a 90% rather than a 95% confidence coefficient might appear to represent a relaxation of its standards whereas, as I have pointed out above, it is completely consonant with the practice of using a one-sided α-level of .05 in efficacy trials. With this background, it should be clear that my recommendation of 95% symmetrical confidence intervals can be viewed as an attempt to bridge the gap from a traditional 95% confidence interval to the 90% confidence interval that is really more appropriate."

6. Ricardo Murphy

I stumbled across this one very late in my research, but Murphy (2018) [12] is a paper that makes a strong point which largely coincides with the point I am trying to make with my project:

"Yet if a theory predicts the direction of an experimental outcome, or if for some practical (eg clinical) reason an outcome in that direction is the only one of interest, then it makes sense to use a one‐sided test. The use of a two‐sided test in these situations will lead to too many false negatives. Consequently treatment effects that corroborate a theory or that are of practical importance may be missed. This problem becomes particularly acute in the case of borderline results."

Unfortunately, he also bases his point on prediction, while my claim is that a prediction is not necessary at all. The only necessity for a one-sided statistical test is a directional claim one wants to provide an error probability statistic for.

Murphy does make some other good points and some points about which I have mild criticism which I prefer to keep for (maybe) another time, but he ends with "The decision whether to perform a one‐sided or two‐sided test should always be made on logical grounds, not statistical ones. In particular the question of statistical power should be recognized for what it is: irrelevant. Insisting that all tests for treatment effects be two‐sided is not only illogical but unethical, because in a placebo‐controlled drug trial it means reducing the power to detect beneficial effects for no good reason." with which I fully agree.
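The loss of power to detect beneficial effects that Murphy refers to can be quantified with a short calculation. The sketch below assumes a z test and a hypothetical true effect of 2.5 standard errors in the beneficial direction; the function name and the numbers are mine and are not taken from Murphy's paper.

```python
from statistics import NormalDist


def power_beneficial_direction(effect_in_se, alpha, two_sided):
    """Probability of a significant result in the beneficial direction.

    effect_in_se is the true effect divided by the standard error of its
    estimate; the z statistic is assumed normal with unit variance.
    """
    std_normal = NormalDist()
    # A two-sided test spends only alpha / 2 in the beneficial tail.
    tail_alpha = alpha / 2 if two_sided else alpha
    z_crit = std_normal.inv_cdf(1 - tail_alpha)
    return 1 - std_normal.cdf(z_crit - effect_in_se)


alpha = 0.05
effect = 2.5  # hypothetical true effect, in standard-error units
power_one_sided = power_beneficial_direction(effect, alpha, two_sided=False)
power_two_sided = power_beneficial_direction(effect, alpha, two_sided=True)
print(power_one_sided, power_two_sided)
```

Under these assumptions the one-sided test detects the beneficial effect in roughly 80% of trials and the two-sided test in roughly 70%, which is the reduction in power the quote describes.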

7. Overall

Overall (1990) [13] states in the abstract of his paper: "The p value associated with a test of significance is supposed to represent the probability of observed results given that the null hypothesis is actually true. In evaluating the efficacy of a new drug against placebo, regulatory considerations focus on superiority of the new drug over placebo. The disapproval consequence of placebo superiority is no different from the disapproval consequence of equivalence between drug and placebo. It is furthermore inconceivable that a drug company will submit a New Drug Application claiming superior efficacy for placebo. Thus, the only probability of concern is the probability that apparent superiority of drug over placebo is a chance finding, and that is the probability associated with a one-tailed test. Where multiple studies must be evaluated with regard to the regulatory decision, meta-analytic considerations further support the relevance of one-sided p values."

Clear as day, but unfortunately I was not able to get my hands on the full paper, so I have to refrain from comments on how it argues the point.

Ending on that high note, let us see if any agencies followed his advice.

II. Endorsement of one-sided tests of significance in regulatory guidelines and technical recommendations

1. FDA

Unfortunately, the FDA is yet to update its guidelines on equivalence testing, and TOST is not mentioned anywhere in the document ("Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests" [14]). TOST is nevertheless seeing significant adoption, as evidenced by the number of published papers on bioequivalence trials.

2. EPA

The US Environmental Protection Agency (EPA), on the other hand, has a very sound explanation and advice on when to use one-sided tests in their "Data Quality Assessment: Statistical Methods for Practitioners" [15]. Their statistical tables are also calculated for the one-sided scenario, for which they get further kudos (read "Is the widespread usage of two-sided tests a result of a usability/presentation issue" on why that is so crucial). They still end up recommending a two-tailed test where a one-tailed one is the obvious choice, as discussed elsewhere, but this does not negate their support for using one-sided tests.

3. EMA

The European Medicines Agency (EMA) is overtly for two-sided tests in its "Statistical Principles for Clinical Trials" [16] and makes one-sided tests pass through unnecessary hoops. Still, there is one occasion in which they exclusively prescribe the usage of one-sided confidence intervals: "For non-inferiority trials a one-sided interval should be used. The confidence interval approach has a one-sided hypothesis test counterpart for testing the null hypothesis that the treatment difference (investigational product minus control) is equal to the lower equivalence margin versus the alternative that the treatment difference is greater than the lower equivalence margin. The choice of type I error should be a consideration separate from the use of a one-sided or two-sided procedure."

4. EFSA

Finally, while examining the "Guidance on Statistical Reporting" issued by the European Food Safety Authority [17] we see no recommendations for or against one-sided tests. The only requirement is to clearly report which type of calculation was used.

This is by no means an exhaustive review. Such an endeavor would be work for an entire article.

III. Positive portrayal in online university resources

1. Online Course – PennState Eberly College of Science

The explanations in this course on statistics are not that great or at least I found them slightly hard to follow and a bit confusing. Also, the null in a one-sided case is equated to the nil, which is incorrect as it should be one-sided as well. Still, one-sided tests are listed as valid and applicable alongside two-sided tests and there are no precautions against using them.

IV. Positive press for one-sided tests in popular online resources

Contextual popularity was assessed by high rankings for relevant searches in Google (the Google ranking system has a significant "popularity/citation" component to it). It should be noted that negative press dominated the results at the time of extraction although I have not made an attempt at numerical estimation in terms of % of results. There are only a couple of positive sources I was able to uncover.

1. Statistician Daniel Lakens, PhD

In his "One-sided tests: Efficient and Underused" he explains why one-sided tests should be used and concludes with "We can now answer the question when we should use one-sided tests. To prevent wasting tax money, one-sided tests should be performed whenever:

1) a hypothesis involves a directional prediction

2) a p-value is calculated.

I believe there are many studies that meet these two requirements."

Although I do not think the word "prediction" is chosen wisely, I do think it is a good read.

2. Cardiff University Centre for Trials Research and the department of Primary Care and Public Health

A moderate endorsement of one-sided tests can also be found at the SignificantlyStatistical blog, which reports on a club discussion at Cardiff University, one of the outcomes of which was that: "Something that we definitely dispelled was the impression that one-sided tests are in some way unsavoury or not good practice (with pre-specification)." Pre-specification is needed here no differently than it is for two-tailed tests. Also, this should in no way preclude you from performing the reverse one-sided test.

3. Georgi Georgiev

In "One-tailed vs Two-tailed Tests of Significance in A/B Testing" I lay out my first coherent defense of one-sided tests and condemn the usage of two-sided tests in the context of A/B testing, although in effect the argument applies to most, if not all, practical scenarios. My argument has improved since, as is evident from this website, but it might still be an interesting read.

References

[1] Mayo D.G., Spanos A. (2006) "Severe testing as a basic concept in a Neyman–Pearson philosophy of induction", The British Journal for the Philosophy of Science, Volume 57(2):323–357; https://doi.org/10.1093/bjps/axl003

[2] Mayo D.G., Spanos A. (2010) "Error statistics", in P. S. Bandyopadhyay & M. R. Forster (Eds.), Philosophy of Statistics, (7, 152–198). Handbook of the Philosophy of Science. The Netherlands: Elsevier; ISBN: 9780444518620

[3] Peace K.E. (1988) "Some thoughts on one-tailed tests", Biometrics 44(3):911-912

[4] Peace K.E. (1989) "The alternative hypothesis: one-sided or two-sided", Journal of Clinical Epidemiology 42(5):473-476

[5] Peace K.E. (1991) "One-sided or two-sided p values: which most appropriately address the question of drug efficacy", Journal of Biopharmaceutical Statistics 1(1):133-138

[6] Kaiser H.F. (1960) "Directional statistical decisions", Psychological Review 67:160-170

[7] Boissel J.P. (1988) "Some thoughts on two-tailed tests (and two-sided designs)", Controlled Clinical Trials 9(4):385-386

[8] Wolterbeek R. (1994) "One and two sided tests of significance", British Medical Journal (Clinical Research Edition) 309(6958):873-874

[9] Enkin M.W. (1994) "One sided tests should be used more often", British Medical Journal (Clinical Research Edition) 309(6958):873-874

[10] Freedman L.S. (2008) "An analysis of the controversy over classical one-sided tests", Clinical Trials (London, England) 5(6):635-640; https://doi.org/10.1177/1740774508098590

[11] Cho H.C., Abe S. (2013) "Is two-tailed testing for directional research hypotheses tests legitimate?", Journal of Business Research 66:1261-1266; https://doi.org/10.1016/j.jbusres.2012.02.023

[12] Murphy R. (2018) "On the use of one‐sided statistical tests in biomedical research", Clinical and experimental pharmacology & physiology 45(1):109-114; https://doi.org/10.1111/1440-1681.12754

[13] Overall J.E. (1990) "Tests of one-sided versus two-sided hypotheses in placebo-controlled clinical trials", Neuropsychopharmacology 3(4):233-235.

[14] US Food and Drug Administration (FDA): "Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests", drafted in 2003, issued on March 13, 2007.

[15] US Environmental Protection Agency (EPA) "Data quality assessment: statistical methods for practitioners", EPA QA/G-9S, issued Feb 2006

[16] European Medicines Agency (EMA): "Statistical Principles for Clinical Trials", drafted 1997, issued Mar 1998.

[17] European Food Safety Authority (EFSA) (2014) "Guidance on Statistical Reporting", EFSA Journal 12(12):3908

Cite this article:

If you'd like to cite this online article you can use the following citation:
Georgiev G.Z., "Proponents of one-sided statistical tests", [online] Available at: https://www.onesided.org/articles/proponents-one-sided-statistical-tests.php [Accessed Date: 11 Dec, 2018].