Examples of improper use of two-sided hypotheses

Author: Georgi Z. Georgiev, Published: Aug 6, 2018

A sample of papers from any scientific journal, whether in physics, economics, psychology, biology, or medicine, will inevitably reveal an obvious truth: people care about the direction of the effect of whatever intervention or treatment is being studied.

In most published papers and articles we see claims of "positive effect", "efficacy", "reduces negative effects", "results in reduction of risk", "results in substantial increase in…", "decrease in…", "lowers expenditure", "demonstrate improved…", "shows impaired…". Yet next to such claims we find statements like "All statistical tests were two-sided.", p-values that were clearly calculated under a two-sided alternative (point null), and two-sided confidence intervals instead of one-sided ones.

This is quite peculiar since a two-sided test is a non-directional one and so are two-sided confidence intervals. If we are interested in making claims about the direction of the effect, a one-sided test is clearly the test that answers the questions we are asking, and it is the test which we should adopt to support a directional claim. It is also the uniformly most-powerful test for the directional hypothesis and should be preferred for minimizing type II errors. The nominal type I error associated with the claim is equal to the actual one, unlike when using a two-tailed test, in which case the nominal uncertainty is larger than the actual (more on this in "Directional claims require directional (statistical) hypotheses").
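
The power point in particular is easy to demonstrate numerically. Below is a minimal sketch in Python (using scipy, which is not part of the original article); the effect size, sample size, and α are arbitrary values chosen purely for illustration, not taken from any of the papers discussed, and the setup is a simple one-sample z-test of a mean with known unit variance:

```python
# A minimal sketch (illustrative values only): power of a one-sided vs a
# two-sided z-test when the true effect lies in the hypothesized direction.
from scipy.stats import norm

alpha = 0.05      # significance threshold
delta = 0.3       # assumed true standardized effect (positive direction)
n = 50            # sample size
ncp = delta * n ** 0.5   # expected z statistic under the assumed effect

# One-sided test: reject when z > z_(1-alpha)
power_one_sided = 1 - norm.cdf(norm.ppf(1 - alpha) - ncp)

# Two-sided test: reject when |z| > z_(1-alpha/2); the lower-tail term adds
# essentially nothing when the true effect is positive
z_crit = norm.ppf(1 - alpha / 2)
power_two_sided = (1 - norm.cdf(z_crit - ncp)) + norm.cdf(-z_crit - ncp)

print(f"one-sided power: {power_one_sided:.3f}")   # ~0.68
print(f"two-sided power: {power_two_sided:.3f}")   # ~0.56
```

The directional (one-sided) test detects the same true effect noticeably more often, at the same actual type I error for the directional claim.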

This practice leads to results that appear weaker than they are, as well as to an increased number of false "negative" findings, due to the often-committed mistake of interpreting a high p-value as supporting the null hypothesis without regard for statistical power. Why, then, are practitioners reporting two-tailed probabilities, and why do the statisticians who consult them favor such practices?

I believe it is due to a misunderstanding of what a one-sided p-value is, combined with the stigma attached to using one by fellow researchers, some journal guidelines, and regulatory requirements. Whatever the reason, one-sided p-values are not only admissible, they are the only proper probability to cite when talking about the direction of an effect.

That said, it is useful to see how this plays out in practice. What do directional claims look like and how does an improper use of a two-sided probability statement make its way into a paper?

Selection of examples from published papers

The following are examples from mostly recent papers across several different fields of study. Some are highly cited, while others are barely cited at all. The primary criterion for inclusion was that they contain enough statistical information to determine whether one- or two-tailed probabilities were used or, alternatively, that the type of test used was specified explicitly.

For example, this report by Starikova E., Shubina N. (2018) "Clinical and laboratory features of Cryptosporidium-associated diarrhea in children under 5 years old", published in the Journal of Bacteriology and Parasitology, was not included despite containing an otherwise excellent example:

"The antibiotics were more frequently prescribed for cryptosporidiosis-positive children compared with cryptosporidium-free patients (p=0.026)"

The use of "more frequently" clearly specifies a one-sided alternative hypothesis; however, it is impossible to tell whether the p-value was calculated under the corresponding one-sided null hypothesis or versus a point null of no difference.

Before I proceed, I should say that these are obviously not attempts to pick apart the papers as a whole; they only serve as examples of a particular approach to statistical hypothesis testing and the presentation of results.

Improper use of two-tailed tests in clinical and pharmacological trials

First, I present an example from a large breast-cancer trial (ALTTO): "Adjuvant Lapatinib and Trastuzumab for Early Human Epidermal Growth Factor Receptor 2–Positive Breast Cancer: Results From the Randomized Phase III Adjuvant Lapatinib and/or Trastuzumab Treatment Optimization Trial" by Piccart-Gebhart et al. (2016) [1]. The goal of the trial, involving 8,381 patients, was to test the promising adjuvant lapatinib (L) in combination with standard trastuzumab (T) therapy for its effect on improving outcomes in early human epidermal growth factor receptor 2-positive breast cancer.

Eyebrows will certainly be raised when one reads that in planning the study "Sample size calculations focused on the two-sided superiority comparison between the L+T arm and the T arm". I thought "two-sided superiority" might have been a typo, since it makes little sense, but the authors have yet to respond to my inquiry about their paper. If it was not a typo, then I cannot fathom how a non-directional two-sided alternative can at the same time be a superiority (directional) one. This is far from the major issue, though.

The authors report that "In the ITT population, a 16% reduction in the hazard of a DFS event was observed with L+T compared with T, but this effect was modest, not statistically significant at .025, and of little clinical significance in consideration of the additional toxicity" and similarly conclude that "Adjuvant treatment that includes L did not significantly improve DFS compared with T alone and added toxicity. One year of adjuvant T remains standard of care."

However, despite making claims about a reduction in the hazard, the paper only presents two-sided confidence intervals at 97.5% (due to a correction for the three initial arms), while what should have been presented is a one-sided interval. Such an interval does exclude an HR of 1 (upper bound of 0.98), making the result statistically significant. The reported p-value of 0.048 was also two-sided, while a one-sided p-value of 0.024 would have just passed the requirement of p < 0.025 (I actually get 0.017 using the same numbers, not sure why). Therefore the claim of a reduction in the hazard is in fact statistically significant, contrary to what the paper states.
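
The arithmetic behind this conversion is straightforward. The short sketch below is not part of the trial's actual analysis; the only figure taken from the paper is the reported two-sided p = 0.048, and the interval relationship is the generic one for the usual normal-approximation intervals:

```python
# A minimal sketch of the two-sided -> one-sided conversions discussed above.
# Only the reported two-sided p-value (0.048) is taken from the ALTTO paper;
# the rest illustrates generic relationships, not the trial's own analysis.

p_two_sided = 0.048   # reported; the observed effect (HR < 1) is in the claimed direction

# One-sided p-value for the directional claim (H0: HR >= 1 vs HA: HR < 1):
# half the two-sided value when the observed effect lies in that direction.
p_one_sided = p_two_sided / 2
print(f"one-sided p = {p_one_sided:.3f}")   # 0.024, below the 0.025 threshold

# Confidence bounds follow the same logic: a two-sided interval with coverage
# 1 - 2a shares its upper endpoint with a one-sided bound of coverage 1 - a,
# so the one-sided 97.5% upper bound for the HR is the upper endpoint of the
# two-sided 95% interval -- not of the 97.5% interval reported in the paper.
a = 0.025
print(f"equivalent two-sided coverage: {1 - 2 * a:.0%}")   # 95%
```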

Whether a reduction in the hazard of up to ~45% (the lower bound of a one-sided 97.5% interval) is of substantive significance, and whether it offsets the observed side effects, is a different story, but the fact is that the L+T arm was statistically significantly better than the T arm at the risk level adopted by the researchers, and it should have been reported as such. Decision makers such as regulatory bodies, physicians, and patients can then make up their own minds based on their particular circumstances and risk tolerance. This can hardly happen, however, when the reported error estimate is not related to the claim and conclusion of the paper.

Second, another example from a large clinical trial. In a study of the effects of losartan by Brenner et al. (2001) "Effects of losartan on renal and cardiovascular outcomes in patients with type 2 diabetes and nephropathy" [2], we read that "All statistical tests were two-sided." However, in the paper we see statements like: "Treatment with losartan resulted in a 16 percent reduction in the risk of the primary composite end point (P=0.02). […] The risk of the combined end point of end-stage renal disease or death was 20 percent lower in the losartan group than in the placebo group (P=0.01)"

The study conclusion is "Our study establishes that losartan, along with conventional antihypertensive treatment as needed, confers strong renal protection in patients with type 2 diabetes and nephropathy. […] In particular, the risk of end-stage renal disease was reduced by 28 percent with losartan during an average follow-up of 3.4 years."

The words "lower", "reduction", and "reduced" clearly refer to answers to questions corresponding to one-sided statistical hypotheses. "Protection" is a bit more elusive, but it nonetheless refers to a directional result, the opposite direction being "harm".

The same paper also illustrates how using two-tailed p-values can obscure findings which the experiment had enough power to detect. We read the following statement about one of the secondary end-points:

"…there was a difference between the number of myocardial infarctions in the losartan group (50 patients [6.7 percent]) and the number in the placebo group (68 patients [8.9 percent]; risk reduction, 28 percent), but this difference was not statistically significant (P=0.08).". Note that the paper states that the difference was not statistically significant at P=0.08, however the risk reduction is statistically significant at P=0.04!

This is a clear example of a missed opportunity to detect a positive result, one that could have been avoided simply by using the most sensitive test appropriate for the question at hand: does losartan reduce the risk of myocardial infarction? Given the small number of events and the relatively large effect size, such a result is likely to be strong evidence of a true positive effect of meaningful magnitude.

Finally, in a replication trial by Wood et al. (2015) [3] one reads: "The threshold for statistical significance was p < 0.05 (two-tailed)". However, in the conclusions we see the claim: "The results of Study One failed to replicate the previous findings of Brunet et al. (2008), which had suggested that traumatic memory reactivation plus propranolol reduces physiological responding during subsequent traumatic mental imagery."

Obviously, given the one-sided alternative hypothesis, all calculations should have been one-sided and in a pre-specified direction. If the results were significantly negative, as was the case in this study, the one-sided results for the opposite direction should have been presented.

Improper use of two-tailed tests in psychiatry papers

In a psychiatry paper by Vancampfort et al. (2013) [4] we read that "Except for handgrip strength (p = 0.07), patients with schizophrenia demonstrated an impaired Eurofit test performance compared with age-, gender- and BMI-matched healthy controls". However, the calculation is not against the null of "not impaired" but against the null of "zero effect". A proper calculation against the null of "does not impair" results in a p-value of 0.039, making the result statistically significant at the 0.05 level adopted by the authors.

Meta-analyses are not immune to this issue. Take, for example, Shi et al. (2014) [5], where we see: "First, meta-analysis results from sham-controlled trials indicate that rTMS is effective in treating negative symptoms in schizophrenia, with a moderate effect size." and also "These results indicated that active rTMS, compared with sham rTMS, induced a significant and moderate improvement in negative symptoms."

Clearly, making such a claim presupposes a one-sided statistical hypothesis, but when we look at the p-values and their corresponding Z-scores in Figures 2 & 3, we easily see that the p-values are for a point null / two-sided alternative.

Two-tailed instead of one-tailed significance reported in economics papers

An example from economics can be seen in the calculations in Manning et al. (1987) [6], all of which appear to be two-sided. Take, for example, this claim: "Total expenditures on this plan are significantly less than the free plan (t = -2.34, p < .02)." The p-value is obviously from a t-distribution with high degrees of freedom, so we can treat the t statistic as a z score and convert it to the actual one-sided result, which is p = 0.009642, or p < .01 if you prefer that notation.
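
For reference, the conversion can be done directly rather than with an online calculator. A one-line check (not part of the original article, written here in Python with scipy) that treats the t statistic as a z score, as the article does:

```python
# A minimal sketch of the conversion described above: with high degrees of
# freedom the t statistic is treated as a z score, and the one-sided p-value
# for the "less than" claim is the lower-tail probability.
from scipy.stats import norm

t = -2.34
p_one_sided = norm.cdf(t)                    # lower tail: P(Z <= -2.34)
print(f"one-sided p = {p_one_sided:.6f}")    # ~0.009642, i.e. p < .01
```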

A study which falls somewhere between medicine and economics, by Jones et al. (2015) [7], examined the costs of medical care for patients enrolled in breast cancer trials, prompted by concerns from clinics and practitioners that enrolling patients in trials results in higher expenditures: "A widely held perception is that costs of care for clinical trial (CT) patients are higher than standard of care (SOC)." It should have been very easy to select CT costs being greater than SOC costs as the alternative hypothesis, with a null hypothesis of CT costs being equal to or less than SOC costs.

However, the language of the paper stays very cautiously away from stating that the cost of CT is higher than that of SOC, or vice versa. All it talks about are differences, e.g. "By excluding in-kind pharmaceuticals from the analyses, the mean cost differential between CT and SOC patients was reduced to $2,227 (95% CI: -$711 - $5,166, p=0.14,…". The trick is, of course, that we still see the positive difference. It is not framed in terms of absolute value, but as a positive dollar value. So, given all of the above, one can only assume that the p-values calculated would correspond to a proper one-sided alternative stating that the difference is greater than $0 (a smarter design would have set some reasonable superiority margin to gain power).

However, the 0.14 value is calculated under a two-sided hypothesis (a point null of no difference) and hence corresponds to a different claim than the one cited as the motivation for the study. The proper p-value for a one-sided alternative of a positive difference is 0.067515, making the outcome considerably more unlikely under the null. Similarly, for the sub-groups the p-values for an increase in the cost of CT patients should be adjusted, resulting in one more group, Pharmacy, becoming significant at the 0.05 level (from ~0.08 to ~0.04).
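
The one-sided value can be recovered approximately from the summary figures quoted above. The sketch below (Python with scipy) reconstructs the standard error from the reported two-sided 95% CI and uses a normal approximation in place of the exact t distribution, so it lands at roughly 0.069 rather than the ~0.0675 obtained from the full data:

```python
# A minimal sketch of recovering an approximate one-sided p-value from the
# reported summary figures (mean difference $2,227; two-sided 95% CI from
# -$711 to $5,166). A normal approximation replaces the exact t distribution,
# so the result (~0.069) only approximates the ~0.0675 value cited in the text.
from scipy.stats import norm

diff = 2227.0
ci_low, ci_high = -711.0, 5166.0

# Standard error implied by a symmetric two-sided 95% interval
se = (ci_high - ci_low) / (2 * norm.ppf(0.975))

# One-sided test of H0: difference <= 0 vs HA: difference > 0
z = diff / se
p_one_sided = 1 - norm.cdf(z)
print(f"z = {z:.2f}, one-sided p = {p_one_sided:.3f}")   # z ~ 1.49, p ~ 0.069
```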

Of course, the astute observer will note the horrendous lack of power in the above study, and power is what matters when you claim a lack of effect (i.e., accept the null), but this is a topic for another time.

From this last example we see that it is not necessary to specifically say that the effect is positive or negative, beneficial or harmful. Even saying that the difference between two variables, γ = µ1 - µ0, is, say, 20 is already a directional statement! Why? Because there is a difference between saying γ = 20 and |γ| = 20. Most people say the former while basing the claim on a statistical test for the latter. One should not forget that 20 is in fact +20, with the plus sign omitted for convenience.

US Environmental Protection Agency statistical guideline

The US Environmental Protection Agency (EPA) actually has a very sound explanation and advice on when to use one-sided tests in its "Data Quality Assessment: Statistical Methods for Practitioners" [8]. However, when they try to come up with an example of when a two-sided test is called for, they end up recommending a two-tailed t-test where a one-tailed one is the obvious choice:

"Box 3-22: An Example of a Two-Sample t-Test (Equal Variances). At a hazardous waste site, an area cleaned using an in-situ methodology (area 1) was compared with a similar, but relatively uncontaminated reference area (area 2). If the in-situ methodology worked, then the average contaminant levels at the two sites should be approximately equal. If the methodology did not work, then area 1 should have a higher average than the reference area."

Even though the problem is stated directionally as a research question, it is inappropriately translated into the statistical hypotheses H0: μ1 − μ2 = 0 and HA: μ1 − μ2 > 0. This is a clear violation of the rule that the null and alternative hypotheses should exhaust the whole parameter space, as there is a region of possible values covered by neither hypothesis. What is worse, the null and alternative are reversed!

The statistical null should correspond to the claim we do not want to reject unless the evidence at hand is highly unlikely to have been generated if it were true. In this case that is the research claim that "the in-situ methodology does not work", or more precisely that "the contaminant levels in area 1 exceed the levels in area 2 by at least the non-inferiority margin ɛ". ɛ should be decided on based on external knowledge. If the tolerance ɛ is precisely zero, we have a classical one-sided ("greater than or equal to zero") null.

The statistical null hypothesis should thus be stated as H0: μ1 − μ2 ≥ ɛ. The alternative hypothesis then becomes HA: μ1 − μ2 < ɛ, exhausting the remainder of the parameter space.
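
To show how the corrected hypotheses translate into an actual test, here is a minimal sketch in Python (scipy) of a one-sided two-sample t-test against the margin ɛ. The contaminant measurements below are simulated placeholder data, not values from the EPA document:

```python
# A minimal sketch of testing the corrected hypotheses
#   H0: mu1 - mu2 >= eps   vs   HA: mu1 - mu2 < eps
# with a one-sided two-sample t-test (requires scipy >= 1.6 for `alternative`).
# The measurements are simulated placeholders, not values from the EPA document.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
area1 = rng.normal(loc=10.5, scale=2.0, size=30)   # cleaned area
area2 = rng.normal(loc=10.0, scale=2.0, size=30)   # reference area
eps = 1.0   # tolerance decided on the basis of external knowledge

# Shifting area 1 down by eps turns the test into H0: mean(area1 - eps) >= mean(area2);
# it is rejected only when area 1 is convincingly below the reference level plus eps.
res = ttest_ind(area1 - eps, area2, equal_var=True, alternative="less")
print(f"t = {res.statistic:.2f}, one-sided p = {res.pvalue:.3f}")
# A small p-value supports the claim that the in-situ methodology worked.
```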

Even though this was a hypothetical example, I think it was worth including as an extreme illustration of the issues that inevitably accompany attempts to justify using the result of a two-tailed test as the primary statistic of interest.

Closing remarks

I believe the above examples are informative and useful as a teaching tool on how not to apply two-sided statistical hypotheses in research work and experimentation. I also believe both researchers writing papers and readers of said papers will be greatly assisted in their understanding of the data if each p-value is accompanied by a statement of the null hypothesis it was calculated for. For example, instead of writing "the observed proportion in the treatment group (PT) was 0.1 while the proportion in the control group (PC) was 0.12 (p<0.05)", or with "(p=0.046)" at the end, state it this way: "(p=0.046; H0: PT ≥ PC)". Even better would be to report the effect size and a one-sided confidence interval bound, e.g. "(p=0.046; H0: PT ≥ PC; δ = PC − PT = 0.02; one-sided 95% CI bound: δ ≥ 0.005)".

Does it make for a slightly lengthier expression? Sure. Is it correct and easier to interpret? Sure. Does it protect against some of the common misinterpretations? Maybe.

Article updated on Aug 27, 2018 to include the Lapatinib and Trastuzumab study.

References

[1] Piccart-Gebhart et al. (2016) "Adjuvant Lapatinib and Trastuzumab for Early Human Epidermal Growth Factor Receptor 2–Positive Breast Cancer: Results From the Randomized Phase III Adjuvant Lapatinib and/or Trastuzumab Treatment Optimization Trial", Journal of Clinical Oncology 34(10):1034-42; https://doi.org/10.1200/JCO.2015.62.1797

[2] Brenner B.M. et al. (2001) "Effects of losartan on renal and cardiovascular outcomes in patients with type 2 diabetes and nephropathy.", The New England Journal of Medicine 345(12):861-869; https://doi.org/10.1056/NEJMoa011161

[3] Wood N.E. et al. (2015) "Pharmacological blockade of memory reconsolidation in posttraumatic stress disorder: three negative psychophysiological studies.", Psychiatry Research 225(1-2):31-39; https://doi.org/10.1016/j.psychres.2014.09.005

[4] Vancampfort D. et al. (2013) "Relationships between physical fitness, physical activity, smoking and metabolic and mental health parameters in people with schizophrenia", Psychiatry Research 207(1-2):25-32; https://doi.org/10.1016/j.psychres.2012.09.026

[5] Shi C. et al. (2014) "Revisiting the therapeutic effect of rTMS on negative symptoms in schizophrenia: a meta-analysis.", Psychiatry Research 215(3):505-513; https://doi.org/10.1016/j.psychres.2013.12.019

[6] Manning W.G. et al. (1987) "Health Insurance and the Demand for Medical Care: Evidence from a Randomized Experiment", The American Economic Review 77(3):251-277

[7] Jones et al. (2015) "A Comparison of Incremental Costs of Breast Cancer Clinical Trials to Standard of Care", Journal of Clinical Trials 5:216; https://doi.org/10.4172/2167-0870.1000216

[8] US Environmental Protection Agency (EPA) "Data quality assessment: statistical methods for practitioners", EPA QA/G-9S, issued Feb 2006

