Tuesday, April 2, 2019

Retire Statistical Significance? Really?



Last week, the article "Retire Statistical Significance" in Nature went viral, bringing a pointed criticism of statistical dogmatism. In this text, I will discuss two opposite sides of the same coin: on the one hand, the value of the authors' point of view; on the other, the unintended consequences of retiring the concept of statistical significance. The first relates to overestimation bias, while the second relates to positivism bias.

The concept of statistical significance is dichotomous: it categorizes an analysis as positive or negative. Categorization brings pragmatism, but it is also an arbitrary reductionism, and we must treat categories as something of lesser value than the view of the whole. The paradox of categorization occurs when we come to value categorical information more than continuous information. Continuous information accepts shades of gray, intermediate thinking, doubt, while the categorical lends a (pseudo) deterministic tone to a statement.

Statistics is the exercise of recognizing uncertainty, doubt, and chance. The definition of statistical significance was originally created to hinder claims arising from chance, and the confidence interval was created to acknowledge the imprecision of our statements. Statistics is an exercise in the scientist's integrity and humility.

However, the paradox of categorization carries a certain dogmatism. In the Nature article, Amrhein et al. first point to the over-interpretation of negative results. A negative study is not one that proves non-existence, which would be philosophically impossible; simply put, it is a study that has failed to prove existence. Strictly speaking, "absence of evidence is not evidence of absence," as Carl Sagan said (a very good quote, usually kidnapped by believers). That is, "the study proved that there is no difference (P > 0.05)" is not a good way to put it. It is better to say "the study did not prove any difference."

We should not confuse this statement with the idea that a negative study means nothing. It has value and impact: a negative study (P > 0.05) reduces the likelihood that the phenomenon exists. As good studies repeatedly fail to prove a phenomenon, its probability falls to a point where it is no longer worth trying to prove it, and we take the null hypothesis as the most probable.
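To make this updating concrete, here is a minimal sketch in Python; the prior, power, and alpha values are illustrative assumptions of mine, not numbers from the article:

```python
# Minimal sketch (illustrative numbers): each well-conducted negative
# study lowers the probability that the phenomenon exists, via Bayes' rule.

def update_after_negative(prior, power=0.80, alpha=0.05):
    """Posterior P(effect is real) after one negative (P > alpha) study.

    P(negative | real effect) = 1 - power  (a type II error)
    P(negative | no effect)   = 1 - alpha  (a true negative)
    """
    p_neg_if_real = (1 - power) * prior
    p_neg_if_null = (1 - alpha) * (1 - prior)
    return p_neg_if_real / (p_neg_if_real + p_neg_if_null)

prob = 0.50  # assumed pre-test probability that the effect is real
for study in (1, 2, 3):
    prob = update_after_negative(prob)
    print(f"after negative study {study}: P(effect) = {prob:.2f}")
# -> 0.17, 0.04, 0.01: at some point it is no longer worth testing.
```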

In addition, a negative study does not necessarily contradict a positive study. The estimates of the two may be the same; one simply failed to reject the null hypothesis while the other was able to reject it. One could not see what the other could. In such cases, it is a mistake to assume that only one of the two studies is correct.
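A small simulation can illustrate this; the effect size and trial sizes below are assumptions chosen for the example:

```python
# Illustrative sketch (assumed effect size and sample sizes): two trials
# estimate the same true effect, but only the larger one tends to reach
# statistical significance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.2  # the same underlying effect, in SD units, in both trials

for n_per_arm in (50, 400):  # a small trial and a large trial
    control = rng.normal(0.0, 1.0, n_per_arm)
    treated = rng.normal(true_effect, 1.0, n_per_arm)
    t_stat, p_value = stats.ttest_ind(treated, control)
    diff = treated.mean() - control.mean()
    print(f"n={n_per_arm} per arm: observed diff={diff:+.2f}, P={p_value:.3f}")

# Typically the small trial is "negative" (P > 0.05) and the large one
# "positive", even though both sampled exactly the same reality.
```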

Finally, the paradox of categorization makes us believe any statistically significant result, although most are false positives (Ioannidis in PLOS Medicine). P < 0.05 is not irrefutable evidence: underpowered studies, multiplicity of analyses, and biases can all produce false statistical significance.

In fact, the predictive value (positive or negative) of a study does not lie in statistical significance alone. It depends on the quality of the study, the appropriateness of the analysis, the scientific ecosystem, and the pre-test probability of the idea.
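A hedged back-of-the-envelope version of this claim, with illustrative numbers only: the positive predictive value of a significant result can be written from power, alpha, and the pre-test probability.

```python
# Hedged sketch (illustrative numbers): the predictive value of P < 0.05
# depends on statistical power and on the pre-test probability of the idea.

def ppv(prior, power, alpha=0.05):
    """P(effect is real | significant result), ignoring bias and multiplicity."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

print(ppv(prior=0.50, power=0.80))  # ~0.94: well-powered test of a plausible idea
print(ppv(prior=0.10, power=0.20))  # ~0.31: underpowered test of a long shot,
                                    # so most "significant" results are false
```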

Therefore, Amrhein et al. are correct to criticize the deterministic view of statistical significance.

But should we really retire statistical significance, as suggested by the title of the article?

I do not think so. We would be retiring an invention that was historically responsible for a great evolution in scientific integrity. The problem with the P value is that every good thing tends to be hijacked for alternative purposes. The artists of false positivation hijacked the P value (created to hinder type I error) and used it to prove false claims.

If, on the one hand, retiring statistical significance would avoid the paradox of categorization, on the other hand it would leave open space for positivism bias: our tropism for creating or absorbing positive information, regardless of its true value.

The critique of statistical significance has become fashionable, but a better idea has not been proposed. Indeed, in certain passages Amrhein et al. do not propose a total abandonment of the notion of statistical significance. In my view, the title does not match the true content of the article; there should have been a question mark at the end: "Retire Statistical Significance?"

Discussion about scientific integrity has become more frequent recently. Because we address this issue with more emphasis than in the past, it may seem that integrity is worsening these days. That is not the case. We have experienced some evolution in scientific integrity: concepts of multiplicity are discussed more than in the past, clinical trials must have their designs published a priori, CONSORT standards of publication are required by journals, and there is more talk of scientific transparency, open science, and slow science. And the very first step of this new era of concern with scientific integrity was the creation of statistical significance, in the first half of the last century, by Ronald Fisher.

My friend Bob Kaplan published in PLOS ONE a landmark study that analyzed the results of clinical trials funded by the NIH. Prior to the year 2000, when there was no obligation to pre-publish protocols, the frequency of positive studies was 57%; after the prior-publication rule, it fell to 8%. Before, authors positivated their studies through multiple post hoc analyses; today, this is restrained by the obligation to publish the protocol a priori. We are far from ideal, but we should recognize that the scientific field has somehow evolved toward more integrity (or less untruthfulness).

It has become fashionable to criticize the P value, which in my view is a betrayal of something of great historical importance that has not yet found a better substitute. It is not P's fault that it was abducted by malicious researchers. It is the researchers' fault.

Therefore, I propose to keep the P value and adopt the following measures:

  • Describe the P value only if the study has an adequate sample size for hypothesis testing. Otherwise, studies should assume an exploratory, descriptive nature, with no use of associations to prove concepts. This would avoid false-positive results from the all-too-common "small studies". In fact, the median statistical power of studies in biomedicine is only around 20%.
  • Do not describe P values for analyses of secondary endpoints.
  • For subgroup analyses (exploratory), use only the P for interaction (more conservative and less prone to spurious significance), avoiding the P value obtained by comparison within a subgroup (see the sketch after this list).
  • Include in CONSORT the obligation for authors to make explicit in the title of a substudy that it is an exploratory, secondary analysis of a previously published negative study.
  • Abandon the term "statistical significance", replacing it with "statistical validity". Statistics is used to differentiate true causal associations from chance-mediated relationships; therefore, a value of P < 0.05 connotes veracity. Whether the association is significant (that is, relevant) depends on the description of the numerical difference or of the association measures for categorical outcomes. Using "statistical validity" would prevent the confusion between statistical significance and clinical significance.
  • Finally, I propose a researcher's integrity index.
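Before detailing the integrity index, here is the subgroup sketch promised above: a minimal illustration, with hypothetical data and variable names, of testing the treatment-by-subgroup interaction instead of comparing arms inside a single subgroup.

```python
# Hypothetical illustration: report the interaction P value, not the
# within-subgroup P value. Data, effect sizes, and names are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "subgroup": rng.integers(0, 2, n),   # e.g., 0 = younger, 1 = older
})
# Simulated truth: a uniform treatment effect, no real interaction.
df["outcome"] = 0.3 * df["treated"] + rng.normal(0, 1, n)

model = smf.ols("outcome ~ treated * subgroup", data=df).fit()
print(model.pvalues["treated:subgroup"])  # the P for interaction: report this one
# A t-test run inside only one subgroup answers a different question
# and is far easier to "positivate" by chance.
```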

The integrity index would be calculated as the ratio of the number of negative studies to the number of positive studies. An integrity index < 1 indicates questionable integrity. The index is based on the premise that the probability of a good hypothesis being true is less than 50%; therefore, there should be more negative studies than positive ones. This is usually not what we observe, due to techniques for positivating studies (small samples, multiplicity, bias, spin in conclusions) and to publication bias, which hides negative studies. An author with integrity is one who does not use these practices, and who would therefore have several negative and few positive studies, resulting in an integrity index well above 1.
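As a toy check of the premise behind the index, with all numbers being illustrative assumptions, one can compute the negative-to-positive ratio an honest author should expect:

```python
# Toy sketch of the premise behind the index (illustrative numbers):
# if fewer than half of good hypotheses are true, honest testing should
# produce more negative studies than positive ones.

def expected_integrity_index(p_true=0.30, power=0.80, alpha=0.05):
    """Expected ratio of negative to positive studies for an honest author."""
    p_positive = p_true * power + (1 - p_true) * alpha
    return (1 - p_positive) / p_positive

print(expected_integrity_index())            # ~2.6: well above 1
print(expected_integrity_index(power=0.20))  # ~9.5: even more negatives
```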

The viral Nature article was useful for discussing the pros and cons of statistical significance, but it went too far with the suggestion to "retire statistical significance". On the contrary: statistical significance should remain active and progressively evolve in the way it is used.

Finally, let us learn to value P > 0.05 as well. After all, the unpredictability of life is represented by this symbol; much of our destiny is mediated by chance.
