# Reasons for misunderstanding and misapplication of one-sided tests

Author: Georgi Z. Georgiev, Published: Aug 6, 2018

One-sided tests of significance and one-sided intervals are often misunderstood and incorrectly applied in published scientific papers, regulatory guidelines, journal submission guidelines, etc. There is a kind of stigma around one-sided tests which results in two-sided tests being applied in their place, leading to the reporting of probabilities that do not correspond to the claims they are supposed to support.

The issue is significant since for symmetrical probability distributions the replacement of a two-sided test with a proper one-sided one can lead to the reduction of the nominal reported probability by 50%, drastically changing the inferences and decisions made, which is especially true for values near the adopted decision boundaries. Here I offer a brief account of the leading causes I suspect have led to that situation. Based on my reading of the relevant literature, learning materials and arguments put forth against one-sided values, I believe the three most significant are:

## Reason 1: Poor design of the statistical tables in the early 20th century

Statistical tables were the hammer and nails of statisticians in the era before advanced electronic calculators and, later, computers became available. Some of the most famous and most widely used were those of R. A. Fisher, one of the fathers of modern statistical methods. They were published as part of his 1925 book "Statistical Methods for Research Workers" [1]. According to some accounts (I cannot remember and cannot re-discover the source), people used to rip the pages with the tables from the book and plaster their walls and desks with them for easy reference.

They had one significant flaw: for reasons unknown, they contained two-tailed cumulative probabilities while making it appear as if those probabilities referred to one tail of the distribution of the statistic. For example, the t table listed key positive values of t and, next to them, probabilities which referred to the absolute value of t. In effect, the tables reflected the probability of "< -t or > t", which was not stated anywhere in the tables themselves. Only the chi-square table stated that; none of the rest had that disclaimer. The only guidance on how to get a one-tailed value from the t-tables was in the text of the book, some 40 pages before the table itself.

It is then no wonder that many researchers with only cursory statistical knowledge would not even be aware that the values they are reporting are two-tailed and would not even consider if this is the correct value for them to report. Even experienced statisticians would be more likely to get confused given such a poor presentation.
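To make the lookup problem concrete, here is a minimal sketch of a table laid out in the two-tailed manner described above. The numbers are standard critical values of t for 10 degrees of freedom, and `critical_t` is a hypothetical helper written only for this illustration. A researcher with a one-sided hypothesis has to know to read the column labelled with twice their significance level.

```python
# Critical values of t for 10 degrees of freedom, keyed the way a
# two-tailed table presents them: each key is P(|T| > t), the
# probability in BOTH tails combined.
TWO_TAILED_T_DF10 = {0.10: 1.812, 0.05: 2.228, 0.02: 2.764, 0.01: 3.169}


def critical_t(alpha, one_sided):
    """Return the critical t for significance level alpha (hypothetical helper).

    For a one-sided test the whole of alpha sits in a single tail, so the
    matching column of a two-tailed table is the one labelled 2 * alpha.
    """
    column = 2 * alpha if one_sided else alpha
    return TWO_TAILED_T_DF10[round(column, 2)]


print(critical_t(0.05, one_sided=False))  # column 0.05 -> 2.228
print(critical_t(0.05, one_sided=True))   # column 0.10 -> 1.812
```

Reading the 0.05 column for a one-sided claim applies a critical value of 2.228 where 1.812 is the appropriate one, which is exactly the kind of silent mismatch the table layout invites.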

At present, such bias towards two-sided tests of significance and confidence intervals is propagated by software vendors who select two-sided ones as the default or do not even allow the user to specify a one-sided case (the former being significantly more common than the latter). For example, in R, an open-source project widely popular amongst practicing and academic statisticians, most functions that calculate p-values or confidence intervals have "two-sided" selected as the default. One can simply omit the argument for that parameter of the function and it will happily calculate a two-sided output, encouraging little consideration for the alternative.
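The effect of such a default can be sketched in a few lines of Python. This is not the code of R or of any other package: `z_test_p_value` is a hypothetical function, and it assumes a test statistic that follows the standard normal distribution under the null hypothesis. It only shows how omitting one argument silently changes the reported probability.

```python
from statistics import NormalDist


def z_test_p_value(z, alternative="two-sided"):
    """P-value for a z statistic (hypothetical function, for illustration).

    The default mirrors the common software convention discussed above:
    if the caller says nothing, a two-sided p-value is returned.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "two-sided":
        return 2 * (1 - std_normal.cdf(abs(z)))
    if alternative == "greater":
        return 1 - std_normal.cdf(z)
    if alternative == "less":
        return std_normal.cdf(z)
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")


z = 1.80  # illustrative value close to a 0.05 decision boundary
print(round(z_test_p_value(z), 3))             # default, two-sided: 0.072
print(round(z_test_p_value(z, "greater"), 3))  # one-sided: 0.036
```

With a threshold of 0.05, the same data would be declared non-significant under the default and significant under the one-sided alternative, so a user who never looks at the parameter never learns that a choice was made for them.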

The result of this is the widespread usage of two-sided statistical hypotheses, even when the research hypothesis is stated in one-sided terms and such use is not called for. I suspect that at least in some instances, people who have correctly stated both their research and statistical hypotheses might still end up using a critical region or reporting a probability that is for the two-sided case.

I have gone into significant detail on this in "Is the widespread usage of two-sided tests a result of a usability/presentation issue?" and for those who think this is just some ex-post-facto explanation I’ve proposed an experimental way to estimate the extent to which such a presentation would be confusing to users of the tables. My 15+ years of work in designing and managing websites used by hundreds of thousands of people and my experience in the web usability and conversion optimization industry make me willing to bet good money that the outcome of a proper test will favor my informed opinion.

## Reason 2: Mistaking the null hypothesis for the nil hypothesis

One of the hardest things for an autodidact in statistics such as myself (I have not received any kind of training in statistics) is to get used to the specialized jargon, naming conventions and notations. Part of the difficulty comes from the fact that many terms re-use words that have a different meaning in everyday language (e.g. "statistical significance", "probability"), and sometimes terms are not specific enough (e.g. "hypothesis" might refer to "research hypothesis" or "statistical hypothesis", "significance" to "statistical significance" or to "practical significance"). Different and often incomplete or outright wrong definitions of the same term in different texts did not help, either. I guess that’s a period every young science must go through.

The term "null hypothesis" was among the most confusing ones and it took me several years to grasp its full and correct meaning. I had both the time and the motivation to do so: two things which are lacking, I am afraid, in a lot of practitioners and even some statisticians. Given that many examples in courses and textbooks on statistics contain a null hypothesis which is also a nil hypothesis (µ = 0) and rarely any other kind, it is easy for me to see how many would be inclined to equate the term null hypothesis with the nil hypothesis. Some may recognize that theoretically a null could be different from zero (no effect, lack of effect) but still in practice only consider a null of µ = 0.

Once someone equates the null with nil for all intents and purposes, it is no wonder that they will not even consider learning about different null hypotheses, especially if everyone else around them supports this view or at least does not oppose it.
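A small sketch may help separate the two terms. It assumes a normally distributed sample mean with a known standard deviation, and both the function `one_sided_p_value` and the numbers are hypothetical, chosen only for illustration. The same data are tested against a nil null (µ ≤ 0) and against a null that is not nil (µ ≤ 2).

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(sample_mean, null_bound, sigma, n):
    """P-value for H0: mu <= null_bound against H1: mu > null_bound.

    Hypothetical helper; assumes a normally distributed sample mean
    and a known standard deviation sigma.
    """
    standard_error = sigma / sqrt(n)
    z = (sample_mean - null_bound) / standard_error
    return 1 - NormalDist().cdf(z)


# Illustrative numbers only: observed mean 2.5, sigma 5, n = 100.
p_nil = one_sided_p_value(2.5, null_bound=0.0, sigma=5.0, n=100)   # nil null
p_null = one_sided_p_value(2.5, null_bound=2.0, sigma=5.0, n=100)  # non-nil null

print(f"H0: mu <= 0 -> p = {p_nil:.2e}")
print(f"H0: mu <= 2 -> p = {p_null:.4f}")
```

Under these assumptions the data overwhelmingly reject "no positive effect", yet the p-value against "an effect of at most 2" is roughly 0.16. Both are legitimate null hypotheses; only the first one is a nil hypothesis.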

## Reason 3: Believing the p-value is a probability statement about a research hypothesis or of something happening "by chance"

All kinds of misconceptions and misinterpretations of p-values are present due to their widespread use by researchers with poor statistical education. I will not go into the reasons why or into how widespread this is; many others have done so before and will do so after me, for as long as p-values are in use.

The first issue, thinking that the p-value expresses a probability related to our research hypothesis, reflects a failure to understand how a research hypothesis relates to a statistical one. It expresses a high level of misunderstanding of what can be achieved by inductive inference in a scientific, objective way. It inevitably leads to calculating "the p-value" from the data using whatever software or formula is at hand, and then reporting it as if it relates to the research hypothesis. In some cases it will correspond to the research claims, but often it will not.

The second issue stems from believing that a p-value reflects "chance" or "randomness" or "random error" or "probability of error", without additional qualifiers. Once the phrase "under the null hypothesis" is added, one is forced to think about what this null hypothesis is, why we should assume it to be true in order to calculate the p-value or confidence interval, and so on.

This process will lead, most of the time, to the choice of a one-sided hypothesis of some kind. When it is not present, though, one just calculates a p-value and proceeds onward, blissfully unaware of what the number really means.

I believe the above three reasons are the major causes of one-sided tests of significance and confidence intervals being used much less often than they should be, of the stigma and discrimination against their use, and of all the negative outcomes that follow. I dare not attempt to rank them, nor would I say these are the only reasons. I am sure there are many other incentives and conscious and unconscious biases working against the usage of one-sided tests.

#### Reference

[1] Fisher R.A. (1925) "Statistical Methods for Research Workers". Oliver & Boyd, Edinburgh