Is the widespread usage of two-sided tests a result of a usability/presentation issue?

Author: Georgi Z. Georgiev, Published: Aug 6, 2018

Having been born in an age of easy access to abundant computing power, until recently I did not fully grasp how much statistical practice, both applied and scientific, depended on pre-calculated tables of critical boundaries and probabilities. When I realized the extent to which this was true in a previous era, it dawned on me that this might be a plausible factor in the widespread adoption of two-sided calculations, despite their being unsuited to supporting most claims made by both applied and scientific researchers.

In short, I believe that a significant reason for the preference for two-sided tests over one-sided ones is how the famous Fisher tables of the T-distribution, Z-distribution and χ² distribution were tabulated and presented.

In order to support this claim I will first demonstrate the importance and ubiquitous use of Fisher’s tables, presented for the first time in his 1925 book "Statistical Methods for Research Workers" [1]. Then I will explain why I believe there is a major flaw in their presentation and how it discouraged the usage of one-sided tests in the early days of statistics. This then carried on to the present day, when computing power makes the usage of such tables redundant.

The Importance of Fisher’s Tables

The importance of the tables of critical values and probabilities that Fisher included in his book is rarely debated, but I believe this point cannot be stressed enough, so I will lay out some facts and citations from reputable sources to make it more vivid.

Writing about Fisher’s "Statistical Methods for Research Workers" Vyas & Desai (2015) state [2]: "Here, Fisher included tables that gave the value of the random variable for specially selected values of p that were much more compact than Pearson’s detailed tables. An unexpected reason for a new way of tabling is suggested by Jack Good: ‘Kendall mentioned that Fisher produced the tables of significance levels to save space and to avoid copyright problems with Karl Pearson, whom he disliked’. […] According to Conniffe, this work ‘went through many editions and motivated and influenced the practical use of statistics in many fields of study’. Today it is considered the gold standard of applied statistics for scientists in many fields. It seems that Fisher did as much to popularize the use of statistics in other disciplines as he did to contribute to its own development."

How important statistical tables were at the time is apparent from a report in Kendall’s "Ronald Aylmer Fisher, 1890-1962" (1963) [3] on the dispute between K. Pearson and Fisher over the reproduction of Pearson’s chi-squared tables that Fisher requested to be included in his book. "This was perhaps not simply a personal matter because the hard struggle which Pearson had for long experienced in obtaining funds for printing and publishing statistical tables had made him most unwilling to grant anyone permission to reproduce. He was afraid of the effect on sales of his Tables for Statisticians and Biometricians on which he relied to secure money for further table publications. It seems, however, to have been this refusal which first directed Fisher’s thoughts towards the alternative form of tabulation with quantiles as argument, a form he subsequently adopted for all his tables and which has become common practice."

To get a sense of how much work was involved, note that according to Box (1981) [4] it took several years and the use of two mechanical calculators to compute the tables with a satisfactory level of precision. For a full appreciation of the monumental effort that went into tabulating the values for the t distribution and others, I recommend reading Box (1981) "Gosset, Fisher, and the t Distribution".

Yet another testament to that can be found in Krishnan (1997) [5], "Fisher's Contributions to Statistics": "These tables, together with those by Pearson and Hartley, were essential tools of a statistician's trade in those days when a statistical laboratory consisted of manually or electrically operated calculating machines and even in the days of electronic desk calculators." (emphasis mine).

How the tables are presented in the book

Again, according to Box, the tables published by Fisher were not only widely applicable but also a significant improvement over existing ones in terms of presentation: tabulating by percentiles made the tables not only shorter, but also much more readable and easier for practitioners to use.

Yet, likely in order to save space, the tables presented only two-tailed probabilities for the corresponding values of the x, t and z statistics.

Starting from Table I: "The deviation in the normal distribution in terms of the standard deviation", we see that the values of x are given as positive numbers. So, when observing an x of ~1.64, one adds the column header (.10) to the row header (.00) and finds P = 0.1. When observing an x of ~-1.64, one first has to take the absolute value, locate it in the table, and then read off the same P. Only in the explanatory text below the table is it mentioned that "x is the deviation such that the probability of an observation falling outside the range from -x to +x is P". This corresponds to a test of a point null against a two-sided alternative hypothesis.

Scans of the table are available online. Here is a partial reconstruction of the table to illustrate what Table I: "Table of x" looked like:

        .01      .02      ...    .08      .09      .10
.00     2.575    2.326    ...    1.750    1.695    1.644
.10     1.598    1.554    ...    1.340    1.310    1.281
...     ...      ...      ...    ...      ...      ...
.70     .371     .358     ...    .279     .266     .253
.80     .240     .227     ...    .150     .138     .125
.90     .113     .100     ...    .025     .012     .0

The explanatory text below the table in full: "The value of P for each entry is found by adding the column heading to the value in the left-hand margin. The corresponding value of x is the deviation such that the probability of an observation falling outside the range from -x to +x is P. For example, P = .03 for x = 2.170090; so that 3 per cent of normally distributed values will have positive or negative deviations exceeding the standard deviation in the ratio 2.170090 at least."

There is no explicit guidance on how one is to treat x if they are interested in the probability of an observation falling in the range from 0 to +x or 0 to -x (the two most commonly used one-sided nulls) or any other range, for that matter.
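To make the convention of Table I concrete, here is a minimal sketch in Python, using only the standard library, of what the table tabulates and what it leaves out. The function names are my own illustrative choices and do not come from Fisher's book; the only assumption is that x is a deviate from the standard normal distribution.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_p(x: float) -> float:
    """Probability of falling outside the range -|x| to +|x| (what Table I lists)."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(x)))


def table_x(p: float) -> float:
    """Deviation x whose two-sided probability is p (one entry of Table I)."""
    return STD_NORMAL.inv_cdf(1.0 - p / 2.0)


def upper_tail_p(x: float) -> float:
    """Probability of a deviation greater than x (one-sided, not tabulated)."""
    return 1.0 - STD_NORMAL.cdf(x)


if __name__ == "__main__":
    print(round(table_x(0.03), 6))            # entry for Fisher's example, P = .03
    print(round(two_sided_p(1.644854), 4))    # tabulated P for a positive x
    print(round(two_sided_p(-1.644854), 4))   # same P once the sign is dropped
    print(round(upper_tail_p(1.644854), 4))   # one-sided P for the positive x
    print(round(upper_tail_p(-1.644854), 4))  # one-sided P for the negative x
```

The last two lines are the point: the table returns the same P for both signs of x, while the one-sided probabilities for the two signs are very different.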

Then we have Table IV: "Table of t", wherein we are again given only the positive values of t, with no indication that the probabilities listed are those for a two-sided test, nor guidance on how to convert them for at least the simple one-sided case.

Such guidance is offered in a paragraph inside the book, some 40 pages before the table, and it states: "If it is proposed to consider the chance of exceeding the given values of t, in a positive (or negative) direction only, then the values of P should be halved." This is a clear statement on how to handle one-sided questions, but it is, unfortunately, buried many pages into a lengthy and complicated book.
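The halving rule quoted above can be sketched in a few lines of Python. This is my own illustration, not something found in the book: the function names and the "greater"/"less" labels are hypothetical, and the check uses the n = 1 row of Table IV only because Student's t with one degree of freedom has a simple closed-form distribution that needs nothing beyond the standard library.

```python
import math


def one_sided_p(two_sided_p: float, t_observed: float, direction: str) -> float:
    """Convert a tabulated two-sided P into a one-sided P.

    direction is "greater" for the chance of exceeding t in the positive
    direction, or "less" for the negative direction.
    """
    if direction not in ("greater", "less"):
        raise ValueError("direction must be 'greater' or 'less'")
    half = two_sided_p / 2.0
    in_stated_direction = (t_observed > 0) == (direction == "greater")
    # Fisher's halving rule applies when t falls on the side being asked about;
    # otherwise the one-sided probability is the complement of the halved value.
    return half if in_stated_direction else 1.0 - half


def two_sided_p_df1(t: float) -> float:
    """Two-sided P for Student's t with n = 1, using its closed-form CDF."""
    return 1.0 - 2.0 * math.atan(abs(t)) / math.pi


if __name__ == "__main__":
    p_table = two_sided_p_df1(12.706)  # the n = 1 entry under P = .05
    print(round(p_table, 3))
    print(round(one_sided_p(p_table, 12.706, "greater"), 3))
    print(round(one_sided_p(p_table, -12.706, "greater"), 3))
```

Note that simple halving is only correct when the observed t lies in the direction of the claim. With a negative t and a claim in the positive direction, the table reader who halves P without looking at the sign gets a badly wrong answer.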

Scans of the table are available online. Here is a partial reconstruction of the table to illustrate what Table IV: "Table of t" looked like:

n      .9      .8      ...    .05        .02        .01
1      .158    .325    ...    12.706     31.821     63.657
2      .142    .289    ...    4.303      6.959      9.925
...    ...     ...     ...    ...        ...        ...
29     .127    .256    ...    2.045      2.462      2.756
30     .127    .256    ...    2.042      2.457      2.750
∞      .127    .256    ...    1.95996    2.32634    2.57582

Similar issues are present in tables V.A and V.B (correlation coefficients) and Table VI on the values of the z distribution.

Given that pages with the tables were often torn out of the book and kept within arm's reach for reference, it becomes painfully clear how easy it is to misuse them by defaulting to two-sided significance calculations even when these are not warranted or wanted.

How the presentation favors two-sided significance calculations

It should already be easy for a person with experience in graphic design and user experience to spot why the tables as they are presented would favor two-sided calculations, even if said person has minimal statistical knowledge. Let us make this explicit.

1. Aside from Table I, there is no relevant information on the page on which the table is printed showing that the P-values are calculated for a null of zero difference. This makes it more likely for one to remain unaware of that fact, and makes it easier for those who do know it to forget it.

2. The explanation of how to treat the one-sided case is buried deep in the book and only references one of the several tables. It is therefore likely that many users of the tables would be unaware of it and, if in need of such guidance, would have to consult other sources (books, papers, colleagues, etc.).

Imagine for a moment that you are a research worker with deep expertise in your field, but only a cursory understanding of a statistical procedure or two which are relevant to your daily tasks. You, unlike the likely reader of this article, have not had the joy of familiarizing yourself with the deep mathematical and philosophical roots of the procedures you apply. How easy is it then to interpret the probabilities as applicable to any null hypothesis you have at hand?

As demonstrated elsewhere ("Examples of improper use of two-sided hypotheses"), many researchers do not understand the need to define a statistical null hypothesis that corresponds to their research hypothesis, and treat any statistical answer as if it applies to their research hypothesis. As seen there, making a directional claim can be quite subtle at times, and given the practice of simply attaching a p-value to a number without explicitly specifying the null hypothesis under which it was calculated, it is no wonder that we see so many p-values detached from the conclusions they are meant to support.

The above can be shrugged off as conjecture or an ex-post-facto explanation, but I believe simple empirical tests can and will confirm it. One way to test this would be to introduce the tables to students who are familiar with what a t or z distribution is, and then ask them to apply the tables to a couple of different problems requiring the use of one-tailed tests, to see if they extract the correct p-value. To make the test more sensitive, it would be advisable to include at least one experiment in which the observed value of the test statistic is negative, e.g. z=-1.65.

A control group can use different, one-sided tables. A great example is the set of tables provided in the US Environmental Protection Agency guide "Data Quality Assessment: Statistical Methods for Practitioners" [6]. For example, the Z table (A-1) is split into two sections, one for negative values of z and another for positive. The p-values provided are one-sided. Above each section is a graphical representation of the rejection region. To do a two-sided calculation, you need to add up the p-values from the two sections, for -|z| and for |z|.
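The lookup logic of such a one-sided table can be sketched as follows. This is an illustration under my own assumptions rather than a reproduction of the EPA layout: I assume the section for negative z lists the lower-tail probability and the section for positive z lists the upper-tail probability, and the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def lower_tail_p(z: float) -> float:
    """One-sided p-value P(Z <= z), assumed to be the entry for a negative z."""
    return STD_NORMAL.cdf(z)


def upper_tail_p(z: float) -> float:
    """One-sided p-value P(Z >= z), assumed to be the entry for a positive z."""
    return 1.0 - STD_NORMAL.cdf(z)


def two_sided_p(z: float) -> float:
    """Add the two one-sided entries, one for -|z| and one for |z|."""
    return lower_tail_p(-abs(z)) + upper_tail_p(abs(z))


if __name__ == "__main__":
    for z in (-1.65, 1.65):
        one_sided = lower_tail_p(z) if z < 0 else upper_tail_p(z)
        print(f"z = {z:+.2f}  one-sided p = {one_sided:.4f}  "
              f"two-sided p = {two_sided_p(z):.4f}")
```

With this design the one-sided answer is the default lookup, and the two-sided answer requires a deliberate extra step, which is the reverse of the situation with Fisher's tables.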

Current situation and conclusions

Given the usage of one-tailed probabilities by Fisher in many of his examples of applications of statistical methods, I do not believe that the reporting of two-tailed probabilities in the tables was any kind of deliberate attempt to discourage usage of one-tailed probabilities. It was most likely a convenient or economical decision, or just a failure to foresee the potential consequences.

Similarly, I do not think that major statistical software vendors nowadays are deliberately making it harder to get one-tailed probabilities out of their tools by adopting two-tailed probabilities as defaults. I think it is part custom/convention and part poor UI that does little to help the researcher decide on the best statistical test for their research hypothesis or question at hand. The unintended consequence is bias against the usage of one-sided tests and the proliferation of suboptimal matching between research and statistical hypotheses, resulting in incorrect error probabilities being reported.

In conclusion, I find strong reason to believe that the poor user experience of early statisticians and research workers in using the tables provided by Fisher in his "Statistical Methods for Research Workers" book is at least partly responsible for present-day misconceptions about one-sided tests, as well as for the usage of two-sided tests to provide error probabilities for answers to questions that require one-sided tests. Newer software tools continue to suffer from similar issues, propagating the confusion further into the future.

P.S. In writing the above, I also realized that in a time when these tables were the main tool of the statistician, it was not so difficult to forget that there are null hypotheses other than the nil hypothesis. The step from that to equating the two is small, indeed. Similar, but less pronounced, effects are likely present due to the default choices of present-day statistical software.

References

[1] Fisher R.A. (1925) "Statistical methods for research workers". Oliver & Boyd, Edinburgh

[2] Vyas S.A., Desai S.P. (2015) "The Professor and the Student, Sir Ronald Aylmer Fisher (1890-1962) and William Sealy Gosset (1876-1937): Careers of two giants in mathematical statistics.", Journal of medical biography 23(2):98-107; https://doi.org/10.1177/0967772013479482

[3] Kendall M.G. (1963) "Ronald Aylmer Fisher, 1890–1962", Biometrika 50(1-2):1-15; https://doi.org/10.1093/biomet/50.1-2.1

[4] Box J.F. (1981) "Gosset, Fisher, and the t Distribution", The American Statistician 35(2):61-66; http://dx.doi.org/10.1080/00031305.1981.10479309

[5] Krishnan T. (1997) "Fisher's Contributions to Statistics", Resonance Journal of Science Education 2(9):32-37

[6] US Environmental Protection Agency (EPA) "Data quality assessment: statistical methods for practitioners", EPA QA/G-9S, issued Feb 2006

Cite this article:

If you'd like to cite this online article you can use the following citation:
Georgiev G.Z., "Is the widespread usage of two-sided tests a result of a usability/presentation issue?", [online] Available at: https://www.onesided.org/articles/widespread-usage-of-two-sided-tests-result-of-usability-issue.php [Accessed Date: 21 Nov, 2018].