# The hidden costs of bad statistics in clinical research

*Author: Georgi Z. Georgiev, Published: Aug 29, 2018*

What if I told you that irrelevant statistics are routinely used to estimate risks of tested treatments and pharmaceutical formulas in many clinical trials? That we fail time and time again to correctly identify good treatments or harmful effects of drugs due to a single practice that most researchers apply without realizing its inadequacy? What if I added that this poor practice continues unquestioned as it is enshrined in countless research papers, textbooks and courses on statistical methods, and to an extent perpetuated and encouraged in regulatory guidelines?

Here I will lay out the issue in as simple terms as possible, but I will provide references to more detailed and technical explanations as I go along.

## How we do clinical trials

When a new drug or medical intervention is proposed, it has to be tested before being recommended for general use. We try to establish both its efficacy and any potential harmful effect by subjecting it to a rigorous experiment.

Usually, patients with the condition to be treated are randomly assigned to a control group and one or more treatment groups. The control group receives standard care while the treatment group receives the new treatment or standard care plus the new treatment, depending on the case at hand.

Such an experiment allows us to statistically model the
effects of unknown factors and isolate a **causal link between the tested
treatment and patient outcomes**.

Since any scientific measurement is prone to errors, a very important quality of clinical trials is that they allow us to estimate error probabilities for what we measure. For example, they allow us to say that “had the treatment had no true positive effect, we would rarely see such an extreme improvement in recovery rate after treatment X”.

Researchers and regulatory bodies agree on a certain level of acceptable risk before conducting the trial, ideally trying to balance the risk of falsely accepting a treatment that has little to no beneficial effect against the risk of falsely rejecting a beneficial treatment simply because the trial didn’t have the sensitivity to demonstrate the effect. There is an inherent **trade-off**, since requiring lower risk of false acceptance leads to higher risk of false rejection or, alternatively, to longer trial times (longer time to market / general use) and experimenting on more patients, which has both ethical and economic disadvantages.
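The trade-off described above can be made concrete with a rough sample size calculation. The following is a minimal sketch, assuming a two-arm trial analysed with a one-sided z-test on a difference in means with a known common standard deviation; the function name `sample_size_per_group` and all the numbers are hypothetical and chosen purely for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(alpha, power, effect, sd, sides=1):
    """Approximate patients per arm for a two-arm z-test on a difference in means.

    alpha  : acceptable risk of falsely accepting an ineffective treatment
    power  : 1 minus the risk of falsely rejecting a beneficial treatment
    effect : smallest true difference we want to be able to detect (assumed)
    sd     : common standard deviation of the outcome (assumed known)
    sides  : 1 for a one-sided test, 2 for a two-sided test
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / sides)
    z_beta = std_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / effect) ** 2)


# Illustrative numbers only: stricter thresholds require more patients per arm.
for alpha in (0.05, 0.025, 0.01):
    n = sample_size_per_group(alpha, power=0.80, effect=0.5, sd=1.0)
    print(f"alpha = {alpha}: about {n} patients per arm")
```

Under these assumptions, lowering the accepted risk of false acceptance while keeping the same sensitivity can only be paid for with more patients, which is the ethical and economic cost mentioned above.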

While the process is good overall, it has some issues and the one I will focus on here is the widespread use of two-sided statistical tests instead of the correct one-sided ones.

## Failure to match research claims with risk estimates

As mentioned, before any given trial, the researchers and relevant regulatory bodies decide on a **threshold of acceptable risk** of falsely declaring a treatment effective, for example, “we would not want to approve this treatment unless the measurable risk of it being ineffective compared to current standard care is 5% or less”. So far, so good.

However, what happens in most clinical trials is that the measurement error is reported **not based** on the risk threshold as defined above but based on the risk of “the treatment effect being exactly zero”. So, the researcher might claim “treatment improves outcomes with error probability equal to 1%”, but in fact the 1% probability they report is for the claim “treatment either improves or harms outcomes”, not for “treatment improves outcomes”. In most cases the error probability that should be reported is half of the reported one, in this case 0.5% instead of 1% (half the measured risk!).
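The halving can be checked numerically. Below is a minimal sketch, assuming the test statistic is a z-score with a symmetric (standard normal) distribution under the null and that the observed effect points in the direction of the claim; the function name `p_values` and the example z-score are hypothetical.

```python
from statistics import NormalDist


def p_values(z_score):
    """Return (two_sided, one_sided) p-values for an observed z statistic.

    two_sided : computed under the null "the effect is exactly zero"
    one_sided : computed under the null "the treatment does not improve
                outcomes" (effect <= 0), which matches the claim
                "treatment improves outcomes"
    """
    std_normal = NormalDist()
    one_sided = 1 - std_normal.cdf(z_score)
    two_sided = 2 * (1 - std_normal.cdf(abs(z_score)))
    return two_sided, one_sided


# A z-score of about 2.576 corresponds to a two-sided p-value of roughly 1%.
two_sided, one_sided = p_values(2.576)
print(f"two-sided p = {two_sided:.4f}, one-sided p = {one_sided:.4f}")
```

Note that the relationship only holds when the observed effect is in the claimed direction: for a negative z-score the one-sided p-value for "treatment improves outcomes" is above 50%, not half of the two-sided value.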

Researchers **fail to use the appropriate statistical test**, so the statistical hypothesis does not match their research hypothesis. In statistical terms, researchers report **two-sided** p-values and two-sided confidence intervals instead of **one-sided** p-values and one-sided confidence intervals, which would actually correspond to their claims.
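The same mismatch shows up in confidence intervals. Here is a minimal sketch, assuming a normally distributed effect estimate with a known standard error; the function name `confidence_bounds` and the numbers are hypothetical, not taken from any trial.

```python
from statistics import NormalDist


def confidence_bounds(estimate, std_error, confidence=0.95):
    """Return ((lower, upper), one_sided_lower) for a normal estimate.

    The two-sided interval splits the allowed error between both tails.
    The one-sided lower bound spends all of it on the tail that matters
    for the claim "the effect is greater than some value".
    """
    std_normal = NormalDist()
    z_two = std_normal.inv_cdf(1 - (1 - confidence) / 2)
    z_one = std_normal.inv_cdf(confidence)
    two_sided = (estimate - z_two * std_error, estimate + z_two * std_error)
    one_sided_lower = estimate - z_one * std_error
    return two_sided, one_sided_lower


# Illustrative: an observed improvement of 0.09 with a standard error of 0.05.
two_sided, lower = confidence_bounds(0.09, 0.05)
print(f"two-sided 95% interval: [{two_sided[0]:.4f}, {two_sided[1]:.4f}]")
print(f"one-sided 95% lower bound: {lower:.4f}")
```

With these illustrative numbers the two-sided interval includes zero while the one-sided lower bound lies above it, so the two reports would support different conclusions about the same claim of improvement.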

This confusion is not limited to medicine and clinical trials, but is present in many behavioral sciences like psychology, psychiatry, economics, business risk management and possibly many others. However, I’ll keep to examples from medical trials for the sake of brevity.

## The profound effects of this simple error

You might be thinking: what is the big deal? After all, we are exposed to less risk, not more, so where is the harm? However, the cost is very real, and it is expressed in several ways.

Firstly, we see **beneficial treatments being rejected since the apparent risk does not meet the requirement**. For example, the observed risk using an irrelevant (two-sided) estimate is 6%, against a 5% requirement. However, using the correct (one-sided) risk assessment we can see that the actual risk is 3%, which passes the regulatory requirement for demonstrating effectiveness.
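The arithmetic of this example can be written out as a short check against the threshold. This is only a sketch, assuming a symmetric test statistic; the helper `one_sided_from_two_sided` is a hypothetical name, and the flag for the direction of the observed effect is something the analyst has to supply.

```python
def one_sided_from_two_sided(two_sided_p, effect_in_claimed_direction=True):
    """Convert a reported two-sided p-value to the one-sided p-value for a
    directional claim, assuming a symmetric test statistic."""
    half = two_sided_p / 2
    # If the observed effect points the other way, the claim is not supported.
    return half if effect_in_claimed_direction else 1 - half


threshold = 0.05    # agreed acceptable risk
reported_p = 0.06   # two-sided estimate from the example above

relevant_p = one_sided_from_two_sided(reported_p)
print(f"two-sided p = {reported_p:.2f}, meets threshold: {reported_p <= threshold}")
print(f"one-sided p = {relevant_p:.2f}, meets threshold: {relevant_p <= threshold}")
```

The same data lead to opposite decisions depending on which estimate is compared to the 5% requirement.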

Many similar examples can be found in scientific research, including a big Phase III breast cancer trial (8381 patients recruited) which demonstrated a probable effect of up to 45% reduction of the hazard ratio. However, the treatment was declared ineffective, at least partly due to the application of an irrelevant risk estimate. Had the correct risk estimate been applied, the treatment would have been accepted as standard practice, provided the side-effects (of which there was a noted increase) were deemed acceptable.

I’ve discussed this and several other examples, including from other research areas in my article “Examples of improper use of two-sided hypotheses”.

Secondly, we have **underappreciation of risk for harmful side-effects**.
Like measurements of beneficial effects, measurements of harm are also prone to
error, and a drug or intervention will not be declared harmful unless the risk
of such an error is deemed low enough. After all, we do not want to incorrectly
reject a beneficial treatment due to what can be attributed to expected
measurement errors.

However, if we use an incorrect error estimate we will fail to take note of harmful effects that meet the regulatory risk standard, and which should have stopped the drug or intervention from being approved. Using a two-sided statistic, we might believe that the observed harm is merely a measurement artefact while the proper one-sided statistic will show us that it exceeds the acceptable risk threshold and should be considered seriously.

Finally, reporting irrelevant risk estimates **robs us of the ability to correctly appraise risk when making decisions about therapeutic interventions**. Not only are researchers and regulatory bodies led to wrong conclusions, but your physician and you are being provided with inflated risk estimates which may preclude you from making an informed choice about the treatment route which is most suitable for your condition and risk tolerance.

The last point is especially painful for me, since I’m a firm proponent of making personal calculations for risk versus potential harm, in medicine and beyond. No two people are the same, no two personal situations are the same and where one sees unacceptable risk another sees a good chance to improve their situation. Being provided with doubled error probabilities can have a profound effect on any such calculation.

## How is this possible and why does it happen?

This is a question which I found fascinating, since I’m an autodidact in statistics and even for me it was apparent that when you make a claim of a positive or a negative effect, the relevant statistic should be one-sided. I was especially stumped after discovering that the fathers of modern statistics, R. Fisher, J. Neyman and E. Pearson, all embraced and used one-sided tests to a significant extent.

So, somewhere along the road a part of the research community and, apparently, some statisticians, became convinced that one-sided tests somehow amount to cheating, to making unwarranted predictions and assumptions, to being less reliable or less accurate, and so on. As a result, two-sided tests are recommended for most if not all situations, contrary to sound logic.

I have reasons to believe this is partly due to the apparent paradox of one-sided vs. two-sided tests, which is indeed a hard one to wrap your head around. Another possible issue is one I traced to the graphical presentation of statistical tables from the early 20th century, which manifests in a different form in today's statistical software. Finally, mistakes in teaching statistics, which lead to mistaking the “null hypothesis” for the “nil hypothesis” or to interpreting p-values as probability statements about research hypotheses, are surely taking their toll as well.

These reasons are too complex to cover in depth here, but I have done so in “Reasons for misunderstanding and misapplication of one-sided tests”, if you fancy a deeper dive into the matter.

Whatever the reason, it is a fact that currently one-sided tests are incorrectly portrayed in books, textbooks and university courses on statistical methods. The bad press follows them in Wikipedia and multiple blogs and other online statistical resources. Given the large-scale negative portrayal of one-sided tests, some of which I have documented here, it is no wonder that researchers do not use them. In fact, I am pretty sure some have not even heard of the possibility of constructing a one-sided confidence interval.

Another reason is the **unclear regulatory guidelines**, some of which (e.g. those of the U.S. Food and Drug Administration and the European Medicines Agency) are either not explicit in their requirements or specifically include language suggesting one-sided statistics are controversial. Some guidelines ask for justification of their use, which is not something requested for two-sided statistics.

This naturally leads most researchers to take what appears to be the safe road, and so they end up reporting two-sided risk estimates, perhaps sometimes against their own judgement and understanding. Peer pressure and seeing two-sided p-values and confidence intervals in most published research in their field of study probably take care of any remaining doubt about the practice.

## How to improve this situation

My personal attempt to combat this costly error is to educate researchers and statisticians by starting Onesided.org. It is a simple site with articles where I explain one-sided tests of significance and confidence intervals as best as I can, correcting misconceptions, explaining paradoxes, and so on. It also contains some simple simulations and references to literature on the topic as I am by no means the first one to tackle the problem.

My major proposal is to adopt a **standard for scientific reporting** in which p-values are always accompanied by the null hypothesis under which they were computed. This will both help ensure that the hypothesis corresponds to the claim and deal with several other issues of misinterpretation of error probabilities.
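As an illustration of what such a reporting convention could look like in practice, here is a minimal sketch. The function name `report_p_value` and the exact output format are assumptions made for this example, not an established standard.

```python
def report_p_value(p_value, null_hypothesis):
    """Format a p-value together with the null hypothesis it was computed under."""
    if not 0 <= p_value <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    return f"p = {p_value:.3f} (H0: {null_hypothesis})"


# The same trial result reported under two different null hypotheses.
print(report_p_value(0.06, "effect = 0"))
print(report_p_value(0.03, "effect <= 0"))
```

Stating the null next to the number makes it immediately visible whether the reported error probability refers to the directional claim being made or to a claim of "any difference".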

Of course, it would be great if regulatory bodies could improve their guidelines; my brief proposal for this can be found here. However, this is usually a slow and involved process and it mostly reflects what is already happening in practice.

## In conclusion

I think the important point is to make use of error probabilities to measure risk where possible and to use the right risk measurement for the task. Failure to do so while under the delusion that we are in fact doing things correctly costs us lives, health, and wealth, as briefly demonstrated above. Whether it is a government-sponsored study or a privately-sponsored one, I know that in the end the money is being deducted from the wealth we acquire with blood, sweat and tears and I see no reason not to get the best value for it that we can.

Furthermore, this poor statistical practice denies us the ability to correctly apply our own judgement to data, thus hindering our personal decision-making and that of any expert we may choose to recruit.

I’m optimistic that bringing light to the issue will have a positive effect on educating researchers and statisticians about it. I have no doubt most of them will be quick to improve their practice, whatever the reason for the error may have been.


#### Cite this article:

If you'd like to cite this online article you can use the following citation:

Georgiev G.Z., *"The hidden costs of bad statistics in clinical research"*, [online] Available at: https://www.onesided.org/articles/the-hidden-cost-of-bad-statistics-in-clinical-research.php URL [Accessed Date: 25 May, 2019].