Unboxing Evidence Based Medicine

Monday, October 14, 2019

Vitamin C for Sepsis: a philosophical-scientific point of view

The CITRIS-ALI trial was a negative trial recently published in JAMA. It includes a figure whose look and numbers show a striking 45% relative reduction in mortality in patients with sepsis or ARDS treated with vitamin C. The study had 55,000 accesses and an Altmetric score of 494, generating excitement and positive tweets.

In this text, I will not waste readers' time discussing the methodological flaws of a “positive” secondary analysis of a primarily negative trial: in CITRIS-ALI the primary outcomes were surrogates of clinical outcomes, and mortality was just one of the 46 secondary endpoints tested; the difference would not be statistically significant if properly adjusted for multiple comparisons (P = 0.03 × 46 comparisons). But this post does not dwell on the obvious. These things do not make the CITRIS-ALI study unusual. In fact, it is just “one of those”.
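For readers who want to see the multiple-comparison arithmetic, here is a minimal sketch. It assumes a simple Bonferroni correction and a conventional threshold of 0.05; neither is stated in the trial report, and the function names are illustrative only.

```python
def bonferroni_adjusted_p(p_value: float, n_comparisons: int) -> float:
    """Bonferroni-adjusted P value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_comparisons)


def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_comparisons


p_mortality = 0.03   # P value for mortality quoted in the text
n_secondary = 46     # number of secondary endpoints quoted in the text
alpha = 0.05         # conventional threshold (assumption)

adjusted = bonferroni_adjusted_p(p_mortality, n_secondary)
threshold = bonferroni_threshold(alpha, n_secondary)

print(f"Raw product 0.03 x 46 = {p_mortality * n_secondary:.2f}")
print(f"Adjusted P value (capped at 1) = {adjusted:.2f}")
print(f"Threshold each endpoint would need to beat = {threshold:.5f}")
print(f"Mortality significant after adjustment? {p_mortality < threshold}")
```

Under these assumptions the mortality finding is nowhere near the adjusted threshold. Bonferroni is conservative, but less strict corrections would not rescue a P value of 0.03 among 46 endpoints either.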

In this text, I intend to discuss non-intuitive concepts that require careful explanation. After all, scientific thinking is not exactly intuitive.

Why should a surrogate study not have death as a secondary outcome?

First, we need to review the real scientific purpose of secondary outcomes: refining knowledge about the primary outcome (positive or negative). Thus, the secondary outcome is, in its origin, subordinated to the primary outcome.

Secondary outcomes, if properly applied, explain the primary outcome. In this process, we start from primary results (most important) and then look at the secondary (less important). Let's go to examples.

If death is a positive primary outcome, it is interesting to know the mechanisms of mortality reduction. In this case, the secondary analysis of cause-specific death gains importance. Death is actually a net outcome, combined from multiple types. So it is interesting to know which type of death contributed most to the final result. Similarly, when there is no mortality reduction, it is important to understand whether this occurred because the treatment did not affect anything, or because it reduced one type of death and increased another (a complication of treatment).

The more relevant an outcome is, the closer to the final pathway it is. Therefore, intermediate ways of explanation tend to be less important outcomes. So the nature of an explanatory outcome (secondary) is to be less important than the primary outcome.

When a study defines a final pathway outcome (death) as a secondary endpoint, it inverts the logic and creates bias: the secondary outcome tends to usurp the leading role of the primary outcome. Death in the secondary position does not serve the explanatory purpose, but tends to "steal the scene" if positive.

The phenomenon of outcome interpretation bias takes place when a positive secondary outcome softens the negativity of the study. In the case of death, the effect goes beyond softening: it nudges our thinking towards positivity. We can’t help it. And it gets worse when we realize the nudge is based on a flawed analysis of low positive predictive value, because of low power and multiple comparisons.

Therefore, death should not be listed as a secondary endpoint in studies of surrogate or intermediate outcomes. If it is, we run the risk of finding a probably false but very exciting result. So exciting that it justified building a graph and spreading it in tweets, generating visibility, enthusiasm and citations.

Why should pretest probability not be influenced by the logic of potential mechanisms?

Based on potential mechanisms of beneficial effect, some have considered the CITRIS-ALI hypothesis worth testing. Well, it is not. And I will explain.

It has been known since the dawn of medical-scientific thinking that biological plausibility is not the same as probability of truth. In fact, it is one of the basic principles of evidence-based medicine. Null treatments often have several plausible mechanisms of action stated in the elegant introductions of clinical trials. Therefore, it is not the existence of possible mechanisms that makes a phenomenon more or less likely to be true. Daniel Kahneman described “confidence by coherence”, an illusion that mistakes coherence for truth. It is at the core of the belief that a fancy theoretical mechanism justifies a study.

So what is the point of coming up with mechanisms? Actually, it serves to generate the idea. But between thinking about the idea and deciding to do a study, one needs to move to a real estimation of pretest probability.

How to estimate pretest probability?

Here I am not talking about Bayesian statistical analysis. I am talking about Bayesian reasoning, a natural way of human thinking.

Estimating the pretest probability of a hypothesis looks like clinical reasoning. When faced with a clinical picture, it is not the representativeness heuristic that best estimates pretest probability. In fact, it is best to use epidemiological data that do not concern the clinical picture itself. That is, the prevalence of the disease in a given circumstance is the true pretest probability, not the clinical similarity (“confidence by coherence”).

The same is true when we think of a study. The pretest probability is not how logical the idea seems. Probability arises from epidemiology. In the field of vitamins, studies are consistently negative, even when high expectations had been placed on vitamins preventing cancer or cardiovascular disease. This is the epidemiology of vitamins as potential treatments. It indicates a low probability of a true effect.

We must recognize: a vitamin reducing the mortality of a serious disease like sepsis by half is just “too good to be true”.

There is also a second component of pretest probability: previous studies that tested the same specific hypothesis. The so-called exploratory studies.

But why are “exploratory studies” trivialized?

The trivialization arises from the confusion between a bad study and an exploratory study. Often a study with a high risk of bias or random error shows an interesting result. Hence, enthusiasts call it exploratory or hypothesis-generating. But studies of low predictive value generate nothing but enthusiasm. They do not generate hypotheses!

Exploratory studies should be made of good empirical observations, with risk of bias or chance low enough to shape the likelihood of a hypothesis being true; but insufficient to confirm the hypothesis.

Therefore, this study of vitamin C does not generate any hypothesis of mortality reduction to be tested by future studies. Its result generated enthusiasm, but it is not capable of shaping the likelihood of the effect being true. It is just a reinforcement of the belief that will justify future studies to test what did not need to be tested in the first place.

But why doesn't it need to be tested by future studies?

Because future studies would waste our (mental, time and financial) resources. If the future study were negative, it would tell us what we already knew; if positive, it would only raise the probability from very low to low. For confirmation, it would take several positive studies to prove an unlikely hypothesis.
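The Bayesian reasoning in this answer can be sketched in a few lines. All numbers below are assumptions chosen for illustration (a pretest probability of 1%, studies with 80% power and a 5% false-positive rate, and no bias at all); they are not estimates for vitamin C.

```python
def update_after_positive(prior: float, power: float, alpha: float) -> float:
    """Probability the hypothesis is true after one positive study (Bayes' rule)."""
    true_positive = prior * power
    false_positive = (1.0 - prior) * alpha
    return true_positive / (true_positive + false_positive)


def positive_studies_needed(prior: float, power: float, alpha: float,
                            target: float) -> int:
    """Count consecutive positive studies needed to reach the target probability."""
    if not 0.0 < prior < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("prior and target must lie strictly between 0 and 1")
    if power <= alpha:
        raise ValueError("a positive result carries no information if power <= alpha")
    probability, count = prior, 0
    while probability < target:
        probability = update_after_positive(probability, power, alpha)
        count += 1
    return count


prior = 0.01    # hypothetical pretest probability ("very low")
power = 0.80    # hypothetical chance a study detects a true effect
alpha = 0.05    # hypothetical chance a study is falsely positive

after_one = update_after_positive(prior, power, alpha)
print(f"After one positive study: {after_one:.3f}")
print("Positive studies needed to reach 0.95:",
      positive_studies_needed(prior, power, alpha, 0.95))
```

With these assumed inputs, a single positive study moves the probability from 0.01 to about 0.14, which is "from very low to low". Studies with bias or multiple comparisons have a higher effective false-positive rate, so each positive result would move the probability even less.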

Yes, it is true that there are discoveries that emerge as black swans, unpredictable events that change the world. But these, unpredictable as they are, are not hypotheses previously fabricated and tested by poor quality studies. They arise at random, such as the discovery of penicillin.

Administering vitamin C may not cause much harm to the patient, but it does harm the collective cognition when belief replaces rationality.

Faith is not bad; it is actually innate to the sapiens species. But the value of faith lies in religious, spiritual or personal situations. We should not let the personal value of beliefs invade the scientific sphere, nor the professional one. Medical professionalism lies in respecting the order of things, based on evidence.

This text is not about vitamin C. Vitamin C was just a hook to spark a much more important discussion than its "effectiveness" in sepsis.

This text is about a scientific ecosystem of low probability hypotheses, tested by studies with severe methodological limitations, which use spin, outcome reporting bias and publication bias to generate positive studies of low predictive value to promote guideline recommendations, which will suffer “medical reversal” years later.

The question is whether we want to continue this fantasy or take a truly scientific professional attitude. Science is not looking for news; it is the humble attempt to identify how the world works and find real solutions to our problems.

As the late Douglas Altman once said, "we need less research, better research and research done for the right reasons."

Tuesday, September 10, 2019

NNT Invisibility to the Clinical Eye

Doctors often claim eloquently and pretentiously to be “having great results” with their treatment choices. However, they do not realize that in many circumstances it is impossible to perceive effectiveness from clinical experience. Experience is blind to prognostic effectiveness.

The theoretical framework for this observation comes from Daniel Kahneman’s discussion about prediction under uncertainty. Previously, Meehl demonstrated that statistical predictions beat expert opinion in 60% of professional situations. Kahneman proposed that expert opinion is mediated by heuristics and is therefore vulnerable to cognitive bias.

In contrast, the work by psychologist Gary Klein with firefighters recognized the value of experience in building accurate expert opinion. To reconcile both views, Kahneman and Klein worked together and proposed that in situations of large, consistent and immediate effects, professionals are able to accumulate experience to predict the consequences of their choices.

This is the case for pain control with morphine. There is a large, consistent and immediate effect, easily perceived. Reproducibility and time proximity between cause and effect make it easier to link treatment and improvement. It is how players learn the game. A wrong move in chess almost always results in a quick defeat.

In contrast, prognostic effectiveness is a typical situation of uncertainty. First, it refers to the future, so the result is not immediate. Second, the result of a prognostic treatment is probabilistic, not deterministic. Thus, it is inconsistent and of modest effect size. To understand it we should review the philosophical concept of number needed to treat and the sense of humility it provides.

The Impossible to Perceive NNT

The argument based on “clinical experience” is often used in cases of great uncertainty and questionable scientific evidence. And it is precisely in those situations that it is impossible to notice the benefit (or harm) from clinical observation.

The blindness of clinical perception occurs when an intervention happens in the present in order to reduce risk in the future. First, the effect is not immediate. Second, when it comes to the future, the benefit is much more uncertain. These are probabilistic situations that suffer from the uncertainty of the “number needed to treat”.

In these circumstances, practice does not increase our decision-making ability. On the contrary, trying to create experience-based concepts of effectiveness is a good example of confirmation bias. It amounts to using experience as a way of unlearning, on the basis of a practical illusion.

We must understand the difference between the “prevalent NNT” and the “incident NNT”. In prevalent situations (symptom improvement), the NNT tends to approach 1. In incident situations (future improvement), the NNT moves away from 1 towards infinity. The incident NNT is the one that is hard to perceive.

This happens for two reasons. First, in prevalent situations all patients are in need of treatment, whereas in preventing future events only a small portion of patients really need treatment: those who would suffer the future outcome. Since we do not know who they are, we treat them all, and many who did not need treatment end up being treated, which increases the NNT required to obtain one benefit. Second, relieving a symptom is usually easier than preventing an outcome, so the effect size of treatments for present symptoms is greater than that of treatments preventing future events.
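The two reasons can be put into numbers with a small sketch. The NNT is the reciprocal of the absolute benefit, which here is modeled as the share of patients who could benefit at all multiplied by the relative effect of treatment among them. The figures are hypothetical and chosen only to contrast the two situations.

```python
def number_needed_to_treat(share_who_can_benefit: float,
                           relative_effect: float) -> float:
    """NNT = 1 / absolute benefit, where absolute benefit is the share of
    patients who could benefit times the relative effect among them."""
    absolute_benefit = share_who_can_benefit * relative_effect
    if absolute_benefit <= 0:
        raise ValueError("no absolute benefit: the NNT is undefined (infinite)")
    return 1.0 / absolute_benefit


# "Prevalent" situation (hypothetical): every patient has the symptom,
# and the treatment relieves it in 90% of them.
prevalent_nnt = number_needed_to_treat(share_who_can_benefit=1.0,
                                       relative_effect=0.90)

# "Incident" situation (hypothetical): only 10% of patients would suffer the
# future event, and the treatment prevents 20% of those events.
incident_nnt = number_needed_to_treat(share_who_can_benefit=0.10,
                                      relative_effect=0.20)

print(f"Prevalent NNT: {prevalent_nnt:.1f}")
print(f"Incident NNT:  {incident_nnt:.1f}")
```

Both factors push in the same direction: a smaller share of patients who can benefit and a smaller relative effect each inflate the incident NNT.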

A great current example is the (increasingly common) claim by cardiologists that their patients with chronic heart failure have benefited from the new medication called "Entresto." This clinical argument is usually brought up when one questions the validity of the PARADIGM-HF trial because of its asymmetric control group.

That study concluded that a new drug (sacubitril) was effective on the basis of a comparison with a severe asymmetry in adjunctive therapy between the drug and placebo groups. The study showed a reduction in the combined outcome of death or hospitalization due to heart failure, two future outcomes.

The methodological flaw of the study generates uncertainty. To compensate for this uncertainty, faithful cardiologists resort to eloquence and to the statement, "In my experience, I've had a great response to Entresto."

Mathematical Analysis of the Statement

The NNT derived from the PARADIGM-HF trial is 21 to prevent one event of the combined outcome of death or hospitalization, a benefit that would be of great magnitude. But how could a doctor perceive an NNT of 21 in clinical practice?

Suppose she had 21 patients using Entresto and 21 patients not using Entresto. In 20 patients of each group the course would be the same; a difference would appear only in the twenty-first patient. It is simply imperceptible in everyday life.

This is the “fallacy of the prognostic clinical impression”. It is impossible to perceive 1 in 21 “with the naked eye” or “with the clinical eye”.

If we think of 100 patients treated in each group, the difference between the groups would be only 4 or 5 patients (100 / 21 ≈ 4.8). Fewer than 5 out of 100 patients: how to perceive the phenomenon depicted in the figure below?

So let's stop using this argument, which borders on ridiculousness.
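The arithmetic behind the 21-patient and 100-patient examples can be sketched as follows. The only input taken from the text is the NNT of 21; the function name is illustrative.

```python
def expected_extra_beneficiaries(patients_treated: int, nnt: float) -> float:
    """Expected number of patients whose outcome differs because of treatment."""
    if nnt <= 0:
        raise ValueError("NNT must be positive")
    return patients_treated / nnt


NNT_PARADIGM_HF = 21  # value quoted in the text

for cohort_size in (21, 100):
    extra = expected_extra_beneficiaries(cohort_size, NNT_PARADIGM_HF)
    unchanged = cohort_size - extra
    print(f"{cohort_size} patients treated: about {extra:.1f} benefit, "
          f"about {unchanged:.1f} have the same course either way")
```

This is an expected value only. In a real practice the counts would also fluctuate by chance from one group of patients to another, which makes a difference of this size even harder to notice.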

One statistician once told the story of his two adopted children. The daughter was a child who was adopted in China. The son was American. One day the girl said, “Girls come from China, boys come from the United States.” The innocence of the child shows a trace of the human mind: to conclude from small samples.

Does it seem cartoonish for the girl to have concluded this? But that's exactly what doctors do when, after three consecutive successful experiences, they conclude that something works.

Behind this is confirmation bias. Because clinical practice is not an experimental scientific environment, any choice is based on the belief in benefit. If I prescribe something, it is because I believe in it. In fact, it would be unethical to prescribe something I do not believe in. Therefore, clinical practice is a naturally believing environment, predisposed to confirmation bias. Starting from belief and looking at the world around us, we will fall into the cognitive trap of seeking evidence for what we believe. We will record in our memory the patients who evolved according to our belief and validate our conduct without symmetrically counting the patients who contradicted our belief.

Complicating matters, clinical practice is fraught with performance bias. The tendency of the concerned physician who changes the patient's treatment to Entresto is to make further improvements in his conduct, to adjust the diuretic, to better guide the diet. Therefore, even if it were possible to perceive the result, it would be impossible to know what caused that result.

It is different in the scientific environment in which we start from skepticism and reject the null hypothesis only when the evidence is far beyond chance and bias-mediated effects.

Conclusion

Experience is essential for the individual application of a scientific concept, in the perception of patient values, in shared decision-making, in the generation of a diagnostic hypothesis. But we cannot trivialize and undermine the value of clinical experience by its caricature and inappropriate use.

Clinical experience is blind to the effectiveness of prognostic management. We are blind to the prognostic NNT, but even blinder to the limitation of our own experience.

The story of how Entresto got approved for the Brazilian Public Health System

In one of the most skillful hijackings of evidence-based medicine, Entresto was approved to be financed by the government for use in patients with heart failure in the Brazilian Unified Health System.

The hijacking of a concept occurs when it is transformed into its own caricature, to the point of breaking the barrier of rationality, supporting paradoxical application of the original concept itself.

Evidence-based medicine proposes that reliable scientific concepts be used in the reasoning process of medical decisions. In the process of “hijacking”, the evidence dominates rationality. The recommendation becomes hostage to the evidence. And once it has mastered the process, the evidence is no longer questioned as to its reliability.

We will explain in this post how the hijacking occurred in the case of Entresto and our public health system. Entresto is the trade name of the drug that contains sacubitril. Sacubitril inhibits the enzyme neprilysin, which is responsible for the breakdown of good molecules such as natriuretic peptide and bradykinin. Thus, by inhibiting neprilysin, sacubitril increases the concentration of these good molecules, which have vasodilating and natriuretic action.

Therefore, there is biological plausibility for benefit, which makes it justifiable to test for clinical efficacy. So along came the PARADIGM-HF randomized clinical trial, published in 2014 in the New England Journal of Medicine.

PARADIGM-HF, a disingenuous asymmetry of the control group

Following the successful demonstration of efficacy of angiotensin inhibitors, aldosterone antagonists and beta-blockers in heart failure, molecules with different neurohumoral effects failed to demonstrate efficacy over the past decade, leading to successive negative clinical trials. Amid the perception that we could have reached a pharmacological "ceiling" in heart failure, an idea seemed to have sprung up in the minds of "researchers": inventing a new "scientific" method.

The original scientific method was proposed by Ronald Fisher and consists of an experimental innovation: the existence of a control group. This method was translated into biomedicine by statistician Bradford Hill. In his seminal Lancet article, Hill wrote:

“The essence of the method lies in the determination that we are really comparing like with like. We must endeavor to equalize the groups we compare in every possibly influential respect, except the one factor at issue.”

Following this logic, clinical trials compare a new drug versus placebo with an equal treatment background in both groups. This allows testing of the intrinsic efficacy of the new molecule, as nothing other than the main treatment is different between the two groups.

The design of the PARADIGM-HF study violates the very definition of control group, as if it created a “new” scientific method. And this method brings the advent of a double molecule: the new drug is combined with a proven effective molecule, resulting in a name that connotes pharmacological ingenuity. LCZ696 is born, the fusion of sacubitril with a proven vasodilator at full dose (valsartan 320 mg daily). In this new method, instead of testing the new molecule, we are testing the combination of a proven effective molecule with a molecule of unknown benefit. And now comes the greatest idea: the control group consisted of another vasodilator, enalapril at a more modest dose.

We need to understand the difference between what was done and the traditional scientific method: sacubitril was not compared with placebo while keeping the rest of the treatment equal between groups, as Bradford Hill suggested (like with like). In reality, sacubitril was compared with placebo while the sacubitril group received the better adjuvant treatment. And to ensure that the adjuvant was indeed better, the valsartan molecule at maximum dose was glued to sacubitril (LCZ696) and the enalapril dose in the control group was frozen at 20 mg daily.

If sacubitril-valsartan had been compared with placebo-valsartan, there would have been no bias.

So why didn't they do this, and instead prefer the freakish idea of setting sacubitril-valsartan at maximum dose against enalapril at half dose?

The violation of the scientific method of the PARADIGM-HF case is so gross that it is not on any clinical trial bias risk checklist. For this reason, if we apply any critical appraisal tool to the PARADIGM-HF study, this bias will not be detected. Thus, a pseudo low risk of bias study was created.

The commotion

A commotion increases people's predisposition to believe in the unreal.

Following successive reductions in morbidity and mortality in heart failure, achieved with vasodilation and blockade of the angiotensin-aldosterone system in the 1980s and 1990s, beta-adrenergic blockers in the 1990s, and resynchronizers and implantable defibrillators in the 2000s, cardiologists expressed concern and frustration about the absence of novelty in recent years. This is the background of what I call “commotion-induced belief”.

Even better if the commotion is also triggered in the patients. The best scenario is a synergism between the physician's enthusiasm for prescribing a medication and the patient's desire to receive that prescription. Therefore, there was a need to convince patients of the importance of fighting with new weapons for heart failure.

Suddenly, Novartis, maker of Entresto, developed a desire to educate the public about the problem of heart failure. Leaflets were prepared and placed in office waiting rooms, making the public aware of the seriousness of the problem. The strategy of causing fear was used, a feeling that makes any kind of solution desirable.

A caricature of the strategy was the event called Weak Heart, sponsored by Novartis and organized by the nation's main newspaper, Folha de São Paulo. The event was open to the general public. In that event, heart failure was characterized as an "epidemic" by some of the experts. It is strange to use the word epidemic, which means "transient disease that simultaneously attacks large numbers of individuals in a particular locality." It sounds sensationalist.

In fact, heart failure is a serious problem, but epidemiologically the morbidity and mortality of heart failure have been decreasing over the last decades in Brazil and worldwide. The question remains: why wake up to this problem right now? Doesn’t the temporal coincidence with the public consultation on Entresto seem unusual? Or is it no coincidence?

The Recommendation

CONITEC is the technical body that evaluates requests for drug coverage by our universal health system. Novartis requested the approval of Entresto and CONITEC recommended not to implement Entresto.

But CONITEC based the recommendation on the cost of the drug, while recognizing the efficacy "demonstrated" by PARADIGM-HF. Failing to offer something very good for lack of money is different from not offering something of dubious benefit.

The Public Consultation

After the technical rejection of the incorporation of Entresto, there is nothing like a public consultation to democratize the discussion. Consultation with the public is in the protocol of CONITEC. In summary, out of 185 technical-scientific manifestations, in which health professionals predominate, there was no record of an opinion against Entresto. Almost all opinions were positive and a minority were neutral. But the best part is the 1,956 contributions based on “experience or opinion”: 1,797 in favor. Regardless of a possible selection bias, the public consultation was clearly in favor of Entresto.

However, a question remains: does democracy apply to drug incorporation?

The Big Turn

After the public consultation, the contrary opinion of CONITEC becomes a favorable opinion.

Apparently, based on the interaction between a technical opinion that recognizes its efficacy (although not recommending it) and a democratic manifestation in favor of the drug, CONITEC's position was reversed.

To mitigate, this decision came with a restriction against use in patients over 75 years of age, simulating a judicious decision.

It is noteworthy that this restriction is not based on a correct analysis of the evidence, since in the subgroup analysis there was no interaction of age and drug efficacy. The scientific concept that “interaction is a rare phenomenon” and that subgroups should not be analyzed for the significance of interaction was not considered. This restriction promotes a scientific injustice. If it is to be released for use by a 74-year-old, it should be released for the 76-year-old as well.

Epilogue

A hijacking of evidence-based medicine. A paradigm that proposes evidence as a guiding concept is used to promote the influence of pseudo-evidence on recommendation of a new drug. A clinical trial that violates the essence of the scientific method (control group) promotes the approval of a high-cost drug by the Brazilian Unified Health System.

It should be emphasized that our system universally serves more than 200 million people in a poor country. We need to be proud of this system, try to preserve it and increase its efficiency.

But what I am discussing here is not just about the monetary cost of this drug. There is a greater cost, which is the transformation of medicine into a profession guided by studies of questionable validity. It is one thing to say that we need evidence, but another is to have evidence fabricated and published based on conflicts of interest.

We cannot say that it is proved that Entresto is not superior to traditional treatment. But we cannot say it is either. It was just not tested.

The paradigm is hijacked… patients have their drugs switched on a questionable basis… an optimism bias predominates, followed by the confirmation bias that makes us realize “how well the drug is working”, although an NNT of 21 should be imperceptible to the clinical eye. And so we live in an immediate, fantasy world that neglects a careful analysis of the evidence as the basis of scientific thinking. The damage to the scientific culture is greater than the monetary waste.

We need nonconformity with tribal positions, where prevalence of opinion is confused with evidence. Rationality must prevail over the opinion of the masses who stand up for modest-sized results in late-breaking clinical trial sessions, turning medical specialties into religious cults.

We need to realize the difference between invention and innovation.

Saturday, August 3, 2019

Is acupuncture better than stenting for stable angina?

As a cardiologist, it was not easy to accept the negative result of the ORBITA trial two years ago. Stenting an obstructed coronary does not control angina? Because of my bias towards scientific skepticism and my love affair with the null hypothesis, it was easier for me. So, at that time, I wrote a quite philosophical post analyzing the value of ORBITA. For the effect size assumed in that trial, angioplasty is not effective and we must recognize the role of placebo within the efficacy of any therapy.

As a cardiologist, I have just been presented by JAMA Internal Medicine with a Chinese randomized clinical trial that supposedly demonstrated the efficacy of acupuncture in controlling stable angina, as compared with sham.

Now, it is too much for me. Do I have to accept that stenting is not such a great treatment for angina while accepting acupuncture as a valid therapy?

I could not control my bias against this Chinese trial. Yes, I was too biased to read the trial with impartiality. But after some meditation (is there evidence for that?), I became a little more impartial and able to read the article technically.

To my surprise, I found a well written article, reported according to CONSORT standards, which fulfills the basic criteria for low risk of bias and random error. It is an adequately sized trial, with correct assumptions in the sample size calculation, providing enough power to test the hypothesis and precise confidence intervals. The methodology followed the previous definition in ClinicalTrials.gov, with no change in protocol. The conclusion was based on a previously defined primary endpoint. Central and concealed randomization, two sham control groups (non-meridian acupuncture and simulated acupuncture), intention-to-treat analysis, no loss to follow-up. Therefore, at first glance, it seemed to be at low risk of a false positive result.

Really? What is the positive predictive value of this particular trial?

While the methodology of a trial should be evaluated in itself, its predictive value should be evaluated in the context of the pretest probability of a true hypothesis. Pretest probability depends on (1) plausibility and (2) how much previous data support the hypothesis.

First, one must consider very plausible that opening of an obstructed coronary diminishes symptoms from myocardial ischemia. Although not equivalent to the effectiveness of parachutes, improving symptoms by coronary stenting is almost obvious. On the other hand, improving symptoms from myocardial ischemia by introducing needles in a remote part of the body is less obvious.

I learned that the "meridians" are based on the trajectory of afferent nerves to be stimulated by the procedure. It gives a basic logic to acupuncture, so it is not the same as homeopathy. But nerve trajectory alone is not enough for plausibility regarding clinical efficacy. We have to go further on mechanisms.

So I asked two acupuncturist friends (a hospitalist and an anaesthesiologist, respectively) what acupuncture's mechanism of action is. The first gave me several different mechanisms and recognized he was not sure; the other promised to “talk to me later” … I am still waiting for the answer.

It confirms my epistemological impression that the efficacy of acupuncture for treating angina has a low level of biological plausibility. In biology, mechanisms of disease are complex and multifactorial. On the other hand, the true mechanism of an effective treatment is related to one pathway. It seems strange, almost a miracle, that one intervention has so many beneficial pathways (immunological, anti-inflammatory, improved blood flow, muscle relaxation, improved muscle mobility and more).

Regarding empirical evidence, I was surprised to find in the BMJ a systematic review of randomized clinical trials testing acupuncture for stable angina, which showed a consistent (no heterogeneity) positive effect of this therapy in controlling angina. However, all trials were classified as high risk of bias due to lack of blinding. Research at high risk of bias should not increase the pretest probability of a hypothesis being true.

Finally, people commonly use the “millennia-old therapy” argument in favor of acupuncture. Well, I do not know of any “millennia-old criterion” to be taken into account for the probability of things being true. In fact, several myths are millennia old.

Thus, we should conservatively consider the efficacy of acupuncture for controlling angina as having a low pretest probability of being true. I am not saying it is false, just improbable.

When we find a very good piece of evidence in favor of an improbable hypothesis, the final probability will not be high. Maybe the good evidence raises the probability to moderate, but it still needs further confirmation.

But is it really good evidence? I decided to compare the methodology of the ORBITA trial with that of the Chinese trial. Clearly, ORBITA has a greater respect for the null hypothesis. Two issues stand out, neither of them evaluated by standard checklists for the appraisal of evidence.

Subjectivity of primary outcome: while ORBITA chose an objective criteria (exercise time in stress testing), Chinese chose a self-reported subjective criteria (number of angina events per week).

Effectiveness of blinding: while ORBITA reported blinding indexes, the Chinese trial did not bother. In this case, we must consider that the acupuncturist was not blinded and the patient was fully conscious. How blind was it really?

In the end, I should recognize that stenting is definitely overrated in its role for stable coronary disease. But there is no basis for considering acupuncture the future of coronary intervention.

Tuesday, April 2, 2019

Retire Statistical Significance? Really?

Last week, the article "Retire Statistical Significance" in Nature went viral. It criticized statistical dogmatism. In this text, I will discuss two opposite sides of the same coin. On the one hand, the value of the authors’ point of view; on the other hand, the unintended consequences of retiring the concept of statistical significance. The first point of view relates to overestimation bias, while the second is related to positivism bias.

The concept of statistical significance is dichotomous, that is, it categorizes the analysis as positive or negative. Categorization brings pragmatism, but it is an arbitrary reductionism. We must understand categories as something of lesser value than the view of the whole. The paradox of categorization occurs when we come to value categorical information more than continuous information. Continuous information accepts shades of gray, intermediate thinking, doubt, while the categorical brings a (pseudo) deterministic tone to the statement.

Statistics is the exercise of recognizing uncertainty, doubt, chance. The definition of statistical significance was originally created to hinder claims arising from chance. The confidence interval was created to recognize imprecision of statements. Statistics is the exercise of integrity and humility of the scientist.

However, the paradox of categorization carries a certain dogmatism. In the Nature article, Amrhein et al. first point to the overestimation of negative results. A negative study is not one that proves non-existence, which would be philosophically impossible; simply put, it is a study that has not proved existence. So, strictly speaking, "absence of evidence is not evidence of absence," as Carl Sagan said (a very good quote, usually hijacked by believers). That is, "the study proved that there is no difference (P > 0.05)" is not a good way to put it. It is better to say "the study did not prove any difference".

We should not confuse this statement with the idea that a negative study does not mean anything. It has value and impact. The impact of a negative study (P > 0.05) lies in reducing the probability that the phenomenon exists. As good studies repeatedly fail to prove it, the probability of the phenomenon falls to a point where it is no longer worth trying to prove it, so we take the null hypothesis as the most probable.

In addition, a negative study does not necessarily contradict a positive study. It may be that the results of the two are the same, and only one failed to reject the null hypothesis while the other was able to reject it. One could not see and the other could. In fact, most of the time only one of the two studies is correct.

Finally, the paradox of categorization makes us believe in any statistical significance, although most are false positives (Ioannidis in PLOS Medicine). P < 0.05 is not irrefutable evidence. Underpowered studies, multiplicity of analyses and biases can produce false statistical significance.

In fact, the predictive value (negative or positive) of studies does not lie only in statistical significance. It depends on the quality of the study, how appropriate the analysis is, the scientific ecosystem, and the pre-test probability of the idea.
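A minimal sketch of this dependence, under simplifying assumptions: it considers only the pre-test probability, the statistical power and the significance threshold, and deliberately ignores bias and multiplicity, which would lower the predictive value further. All numbers are hypothetical.

```python
def positive_predictive_value(prior: float, power: float, alpha: float = 0.05) -> float:
    """Probability that a statistically significant result reflects a true effect.

    Simplified model: no bias, a single pre-specified analysis.
    """
    true_positives = power * prior            # true hypotheses correctly detected
    false_positives = alpha * (1.0 - prior)   # false hypotheses significant by chance
    return true_positives / (true_positives + false_positives)


# Hypothetical scenario: 10% of the tested ideas are actually true.
prior = 0.10
ppv_underpowered = positive_predictive_value(prior, power=0.20)
ppv_well_powered = positive_predictive_value(prior, power=0.80)

print(f"PPV with 20% power: {ppv_underpowered:.2f}")  # about 0.31
print(f"PPV with 80% power: {ppv_well_powered:.2f}")  # 0.64
```

With the same P < 0.05 threshold, the positive result of the underpowered study is more likely false than true, while the well-powered one is more likely true than false.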

Therefore, Amrhein et al. are correct to criticize the deterministic view of statistical significance.

But should we really retire statistical significance, as suggested by the title of the article?

I do not think so. We would be retiring an advance that historically was responsible for a great evolution of scientific integrity. The problem with the P value is that every good thing tends to be hijacked for alternative purposes. Artists of false positivity hijacked the P value (made to hinder type I error) to prove false claims.

If, on the one hand, the retirement of statistical significance would avoid the paradox of categorization, on the other hand it would leave room for positivism bias, meaning our tropism for creating or absorbing positive information, regardless of its true value.

The critique of statistical significance has become fashionable, but a better idea has not been proposed. Indeed, in certain passages Amrhein et al. do not propose a total abandonment of the notion of statistical significance. In my view, the title does not match the true content of the article. I think there should have been a question mark at the end of the title: "Retire Statistical Significance?"

Discussion about scientific integrity has become more frequent recently. Because this issue is addressed with more emphasis than in the past, it may seem that integrity is worsening these days. That is not the case. We have experienced some evolution of scientific integrity: multiplicity concepts are more discussed than in the past, clinical trials must have their designs published a priori, CONSORT standards of publication are required by journals, and there is more talk about scientific transparency, open science, slow science. And the very first step of this new era of concern regarding scientific integrity was the creation of statistical significance, in the first half of the last century, by Ronald Fisher.

My friend Bob Kaplan published in PLOS One a landmark study which analyzed the results of clinical trials funded by the NIH. Prior to the year 2000, when there was no obligation to pre-publish protocols, the frequency of positive studies was 57%; it fell to 8% after the prior publication rule. Before, authors made their studies positive through multiple post hoc analyses. Today, this has been improved by the obligation to publish the protocol a priori. We are far from ideal, but we should recognize that the scientific field has somehow evolved towards more integrity (or less untruthfulness).

It has become fashionable to criticize the P value, which in my view is a betrayal of something of great historical importance for which no better substitute has been found so far. It is not P's fault to have been hijacked by malicious researchers. It's the researchers' fault.

Therefore, I propose to keep the P value and adopt the following measures:

Describe the P value only if the study has an adequate sample size for hypothesis testing. Otherwise, studies should take on an exploratory, descriptive nature, with no use of associations to prove concepts. This would avoid false-positive results from the all too common "small studies". Actually, the median statistical power of studies in biomedicine is only 20%.
Do not describe the P value of secondary end-point analyses.
For subgroup analyses (exploratory), use only the P for interaction (more conservative and less likely to yield a significant result), avoiding P values obtained by comparison within a subgroup.
Include in CONSORT the obligation for authors to state explicitly in the title of substudies that it is an exploratory and secondary analysis of a previously published negative study.
Abandon the term “statistical significance”, replacing it by “statistical validity”. Statistics is used to differentiate between true causal associations and chance-mediated relationships. Therefore, a value of P < 0.05 connotes veracity. Whether the association is significant (relevant) depends on the description of the numerical difference or of the association measures for categorical outcomes. The use of this term will prevent the confusion between statistical significance and clinical significance.
Finally, I propose a researcher’s index of integrity.

This index will be calculated as the ratio of the number of negative studies to the number of positive studies. An integrity index < 1 indicates questionable integrity. This index is based on the premise that the probability of a good hypothesis being true is less than 50%. Therefore, there should be more negative studies than positive studies. This is usually not observed, due to techniques for making studies positive (small samples, multiplicities, bias, spin in conclusions) and publication bias that hides negative studies. An author with integrity would be one who does not use these practices, so he or she would have several negative and few positive studies, resulting in an integrity index well above 1.
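As a minimal sketch, the index could be computed as below. The function name, the study counts and the handling of an author with no positive studies are my assumptions for illustration, not part of any established metric.

```python
def integrity_index(negative_studies: int, positive_studies: int) -> float:
    """Ratio of negative to positive studies published by a researcher."""
    if negative_studies < 0 or positive_studies < 0:
        raise ValueError("study counts cannot be negative")
    if positive_studies == 0:
        # Assumption: with no positive studies the ratio is treated as unbounded.
        return float("inf")
    return negative_studies / positive_studies


# Hypothetical researchers, for illustration only.
careful_author = integrity_index(negative_studies=12, positive_studies=4)
questionable_author = integrity_index(negative_studies=3, positive_studies=9)

print(f"Careful author: {careful_author:.2f}")            # 3.00, well above 1
print(f"Questionable author: {questionable_author:.2f}")  # 0.33, below 1
```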

The viral Nature article was useful to discuss the pros and cons of statistical significance. But it went too far with the suggestion to "retire statistical significance". On the contrary, statistical significance should remain active and progressively evolve in the way it is used.

Finally, let us learn to value P > 0.05 too. After all, the unpredictability of life is represented by this symbology: much of our destiny is mediated by chance.