Saturday, August 3, 2019

Is acupuncture better than stenting for stable angina?

As a cardiologist, it was not easy to accept the negative result of the ORBITA trial two years ago. Stenting an obstructed coronary artery does not control angina? My bias towards scientific skepticism and my love affair with the null hypothesis made it easier for me. So, at that time, I wrote a rather philosophical post analyzing the value of ORBITA. For the effect size assumed in that trial, angioplasty is not effective, and we must recognize the role of placebo within the efficacy of any therapy.

As a cardiologist, I have just been introduced by JAMA Internal Medicine to a Chinese randomized clinical trial that supposedly demonstrates the efficacy of acupuncture in controlling stable angina, as compared with sham.

Now, this is too much for me. Do I have to accept that stenting is not much of a treatment for angina while accepting acupuncture as a valid therapy?

I could not control my bias against this Chinese trial. Yes, I was too biased to read the trial impartially. But after some meditation (is there evidence for that?), I became a little more impartial and able to read the article technically.

To my surprise, I found a well-written article, reported according to CONSORT standards, which fulfills the basic criteria for low risk of bias and random error. It is an adequately sized trial, with correct assumptions in the sample size calculation, providing enough power to test the hypothesis and precise confidence intervals. The methodology followed what had been previously defined, with no change in protocol. The conclusion is based on a previously defined primary end-point. It had central and concealed randomization, two sham control groups (non-meridian acupuncture and simulated acupuncture), intention-to-treat analysis, and no loss to follow-up. Therefore, at first glance, it seemed to be at low risk of a false positive result.

Really? What is the positive predictive value of this particular trial?

While the methodology of a trial should be evaluated in itself, its predictive value should be evaluated in the context of the pretest probability of the hypothesis being true. Pretest probability depends on (1) plausibility and (2) how much previous data support the hypothesis.

First, it is very plausible that opening an obstructed coronary artery diminishes symptoms of myocardial ischemia. Although not equivalent to the effectiveness of parachutes, improving symptoms by coronary stenting is almost obvious. On the other hand, improving symptoms of myocardial ischemia by introducing needles in a remote part of the body is less obvious.

I learned that the "meridians" are based on the trajectory of afferent nerves to be stimulated by the procedure. This gives a basic logic to acupuncture, so it is not the same as homeopathy. But nerve trajectory alone is not enough for plausibility regarding clinical efficacy. We have to go further into mechanisms.

So I asked two acupuncturist friends (a hospitalist and an anaesthesiologist, respectively) what acupuncture's mechanism of action is. The first gave me several different mechanisms and recognized he was not sure; the other promised to “talk to me later” … I am still waiting for the answer.

This confirms my epistemological impression that the efficacy of acupuncture for treating angina has a low level of biological plausibility. In biology, mechanisms of disease are complex and multifactorial. On the other hand, the true mechanism of an effective treatment is related to one pathway. It seems strange, almost a miracle, that one intervention has so many beneficial pathways (immunological, anti-inflammatory, improves blood flow, relaxes muscle, improves muscle mobility and more).

Regarding empirical evidence, I was surprised to find in the BMJ a systematic review of randomized clinical trials testing acupuncture for stable angina, which showed a consistent (no heterogeneity) positive effect of this therapy in controlling angina. However, all trials were classified as high risk of bias due to lack of blinding. Research at high risk of bias should not increase the pretest probability of a hypothesis being true.

Finally, people commonly use the “millennia-old therapy” argument in favor of acupuncture. Well, I do not know of any “millennial criterion” to be taken into account for the probability of things being true. In fact, several myths are millennia old.

Thus, we should conservatively consider the efficacy of acupuncture for controlling angina as having a low pretest probability of being true. I am not saying it is false, just improbable.

When we find a very good piece of evidence in favor of an improbable hypothesis, the final probability will not be high. Maybe the good evidence raises the probability to moderate, but it still needs further confirmation.
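
The reasoning above can be put into numbers with a minimal sketch. It assumes the simplest Bayesian reading of a significant trial: the post-test probability is the share of true positives among all positive results. The pretest probabilities, the 80% power and the 5% alpha below are illustrative assumptions, not figures taken from the acupuncture trial.

```python
def positive_predictive_value(prior, power=0.80, alpha=0.05):
    """Probability that a hypothesis is true given a significant result.

    prior: pretest probability that the hypothesis is true (assumed).
    power: chance of a significant result when the hypothesis is true (assumed).
    alpha: chance of a significant result when the hypothesis is false (assumed).
    """
    true_positives = prior * power
    false_positives = (1.0 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Illustrative pretest probabilities: improbable, somewhat improbable, toss-up.
for prior in (0.05, 0.10, 0.50):
    ppv = positive_predictive_value(prior)
    print(f"pretest probability {prior:.0%} -> post-test probability {ppv:.0%}")
```

Under these assumptions, a well-conducted positive trial moves an improbable hypothesis to a moderate probability rather than to near certainty, and the calculation ignores any residual bias, which would lower the result further.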

But is it really good evidence? I decided to compare the methodology of the ORBITA trial with that of the Chinese trial. Clearly, ORBITA has a greater respect for the null hypothesis. Two issues, not evaluated by standard checklists for appraisal of evidence, stand out:

  • Subjectivity of primary outcome: while ORBITA chose an objective criterion (exercise time in stress testing), the Chinese trial chose a self-reported subjective criterion (number of angina events per week).
  • Effectiveness of blinding: while ORBITA reported blinding indexes, the Chinese trial did not bother. In this case, we must consider that the acupuncturist was not blinded and the patient was fully conscious. How blind was it really?

In the end, I should recognize that stenting is definitely overrated in its role for stable coronary disease. But there is no basis for considering acupuncture the future of coronary intervention.

Tuesday, April 2, 2019

Retire Statistical Significance? Really?

Last week, the article "Retire Statistical Significance" in Nature went viral. It criticized statistical dogmatism. In this text, I will discuss two opposite sides of the same coin. On the one hand, the value of the authors’ point of view; on the other hand, the unintended consequences of retiring the concept of statistical significance. The first point of view relates to overestimation bias, while the second is related to positivism bias.

The concept of statistical significance is dichotomous, that is, it categorizes the analysis into positive or negative. Categorization brings pragmatism, but categorization is an arbitrary reductionism. We must understand categories as something of lesser value than the view of the whole. The paradox of categorization occurs when we come to value categorical information more than continuous information. Continuous information accepts shades of gray, intermediate thinking, doubt, while the categorical brings a (pseudo) deterministic tone to the statement.

Statistics is the exercise of recognizing uncertainty, doubt, chance. The definition of statistical significance was originally created to hinder claims arising from chance. The confidence interval was created to recognize imprecision of statements. Statistics is the exercise of integrity and humility of the scientist.

However, the paradox of categorization carries a certain dogmatism. In the Nature article, Amrhein et al. first point to the overestimation of negative results. A negative study is not one that proves non-existence, which would be philosophically impossible; simply put, it is a study that has not proved existence. So, strictly speaking, "absence of evidence is not evidence of absence," as Carl Sagan said (a very good quote, usually kidnapped by believers). That is, "the study proved that there is no difference (P > 0.05)" is not a good way to put it. It is better to say "the study did not prove any difference".

We should not confuse this statement with the idea that a negative study does not mean anything. It has value and impact. The impact of a negative study (P > 0.05) is to reduce the probability that the phenomenon exists. As good studies repeatedly fail to prove it, the probability of the phenomenon falls to a point where it is no longer worth trying to prove it, so we take the null hypothesis as the most probable.
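
A minimal sketch of this erosion of probability, assuming each negative study is an independent, unbiased test with 80% power and 5% alpha, and assuming a starting probability of 50%. All of these numbers are illustrative assumptions.

```python
def update_after_negative_study(prior, power=0.80, alpha=0.05):
    """Probability that the phenomenon exists after one negative study.

    A negative result is either a false negative (phenomenon exists, study
    missed it) or a true negative (phenomenon does not exist).
    """
    false_negatives = prior * (1.0 - power)
    true_negatives = (1.0 - prior) * (1.0 - alpha)
    return false_negatives / (false_negatives + true_negatives)


probability = 0.50  # assumed starting probability, for illustration only
for study in range(1, 4):
    probability = update_after_negative_study(probability)
    print(f"after negative study {study}: probability {probability:.1%}")
```

No single negative study proves absence, but under these assumptions each one lowers the probability, which is the sense in which absence of evidence accumulates into a reason to stop trying.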

In addition, a negative study does not necessarily contradict a positive study. It may be that the results of the two are the same, only one failed to reject the null hypothesis and the other was able to reject it. One could not see and the other could. In fact, most of the time only one of the two studies is correct.

Finally, the paradox of categorization makes us believe in any statistical significance, although most are false positives (Ioannidis in PLOS Medicine). P < 0.05 is not irrefutable evidence. Underpowered studies, multiplicity of analyses and biases can produce false statistical significance.

In fact, the predictive value (negative or positive) of studies does not lie only in statistical significance. It depends on the quality of the study, how appropriate the analysis is, the scientific ecosystem, and the pre-test probability of the idea.

Therefore, Amrhein et al. are correct to criticize the deterministic view of statistical significance.

But should we really retire statistical significance, as suggested by the title of the article?

I do not think so. We would be retiring an innovation that historically was responsible for a great evolution of scientific integrity. The problem with the P value is that every good thing tends to be hijacked for alternative purposes. Those skilled at making studies falsely positive hijacked the P value (created to hinder type I error) to support false claims.

If on one hand the retirement of statistical significance would avoid the paradox of categorization, on the other hand it would leave open space for positivism bias, meaning our tropism for creating or absorbing positive information, regardless of its true value.

The critique of statistical significance has become fashionable, but a better idea has not been proposed. Indeed, in certain passages Amrhein et al. do not propose a total abandonment of the notion of statistical significance. In my view, the title does not match the true content of the article. I think there should have been a question mark at the end of the title: "Retire Statistical Significance?"

Discussion about scientific integrity has become more frequent recently. Because this issue is addressed with more emphasis than in the past, it may seem that integrity is worsening these days. That is not the case. We have experienced some evolution of scientific integrity: multiplicity concepts are more discussed than in the past, clinical trials must have their designs published a priori, CONSORT standards of publication are required by journals, and there is more talk about scientific transparency, open science and slow science. And the very first step of this new era of concern regarding scientific integrity was the creation of statistical significance, in the first half of the last century, by Ronald Fisher.

My friend Bob Kaplan published in PLOS One a landmark study which analyzed results of clinical trials funded by the NIH. Prior to the year 2000, when there was no obligation to pre-publish protocols, the frequency of positive studies was 57%, falling to 8% after the prior publication rule. Before, authors made their studies positive through multiple post hoc analyses. Today, this has been improved by the obligation to publish the protocol a priori. We are far from ideal, but we should recognize that the scientific field has somehow evolved towards more integrity (or less untruthfulness).

It has become fashionable to criticize the P value, which in my view is a betrayal of something of great historical importance that, until now, has not found a better substitute. It is not P's fault to have been abducted by malicious researchers. It's the researchers' fault.

Therefore, I propose to keep the P value and adopt the following measures:

  • Describe the P value only if the study has adequate sample size for hypothesis testing. Otherwise, studies should take on an exploratory, descriptive nature, with no use of associations to prove concepts. This would avoid false positive results from the all too common "small studies". Actually, the median statistical power of studies in biomedicine is only 20%.
  • Do not describe the P value of secondary end-point analyses.
  • For subgroup analyses (exploratory), use only P for interaction (more conservative and less likely to yield a significant result), avoiding P values obtained by comparison within a subgroup.
  • Include in CONSORT the obligation for authors to make explicit in the title of substudies that it is an exploratory and secondary analysis of a previously published negative study.
  • Abandon the term “statistical significance”, replacing it with “statistical veracity”. Statistics is used to differentiate between true causal associations and chance-mediated relationships. Therefore, a value of P < 0.05 connotes veracity. Whether the association is significant (relevant) depends on the size of the numerical difference or on association measures for categorical outcomes. The use of “statistical veracity” will prevent the confusion between statistical significance and clinical significance.
  • Finally, I propose a researcher’s index of integrity.

This index will be calculated as the ratio of the number of negative studies to the number of positive studies. An integrity index < 1 indicates questionable integrity. This index is based on the premise that the probability of a good hypothesis being true is less than 50%. Therefore, there should be more negative studies than positive studies. This is usually not observed due to techniques for making studies positive (small samples, multiplicities, bias, spin in conclusions) and publication bias that hides negative studies. An author with integrity would be one who does not use these practices, so he or she would have several negative and few positive studies, resulting in an integrity index well above 1.
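
As a minimal sketch, the index is a one-line ratio. The function name, the handling of an author with no positive studies, and the two example publication records are hypothetical choices made for illustration.

```python
def integrity_index(negative_studies, positive_studies):
    """Ratio of negative to positive studies published by a researcher."""
    if negative_studies < 0 or positive_studies < 0:
        raise ValueError("study counts cannot be negative")
    if positive_studies == 0:
        # Assumption: with no positive studies the ratio is unbounded.
        return float("inf")
    return negative_studies / positive_studies


# Hypothetical publication records, for illustration only.
for name, negatives, positives in [("author A", 12, 4), ("author B", 3, 9)]:
    index = integrity_index(negatives, positives)
    verdict = "questionable" if index < 1 else "as expected"
    print(f"{name}: integrity index {index:.2f} ({verdict})")
```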

The viral Nature article was useful to discuss the pros and cons of statistical significance. But it went too far with the suggestion to "retire statistical significance". On the contrary, statistical significance should remain active and progressively evolve in the way it is used.

Finally, let us learn to value P > 0.05 too. After all, the unpredictability of life is represented by this symbol: much of our destiny is mediated by chance.

Monday, March 25, 2019

The Egg Study and elephant in the room

The Egg Study published in the Journal of the American Medical Association this month has been highly publicised and criticised among evidence-based thinkers. While it has been adequately questioned as proof that eggs increase the risk of cardiovascular disease, the criticism has missed the elephant in the room: it was a negative study!

An elephant in the room is missed when our focus is shifted to a less important issue. In this case, the criticism has been mistakenly concentrated on the observational nature of the study. In this post, I will first explain why criticism is out of focus and second I will reveal the elephant in the room, explaining why it is a negative rather than a positive study. 

Observational Study for Harm

In the first half of the last century, 80% of the western population were smokers and smoking was not considered harmful. Gastroenterologist Richard Doll investigated smoking as a possible cause of peptic ulcer and found no association. Then, he looked beyond his specialty and investigated lung cancer, in collaboration with the famous statistician Austin Bradford Hill. This investigation led to the landmark article published in the British Medical Journal in 1950 demonstrating beyond a reasonable doubt that smoking leads to lung cancer. It was an observational study. And so far, of course, there is no randomised clinical trial of cigarette smoking versus placebo smoking to prove this causal relationship.

Would we criticise the idea that smoking causes cancer because it came from an observational study? So why do we criticise the observational nature of the Egg Study for testing the idea that eggs cause cardiovascular disease?

This criticism misses the difference between testing harm and testing beneficial effects, which relies on the burden of proof. In testing harm, a positive study will lead to the recommendation of “avoid” or “not to do”. In testing beneficial effect, a positive result will lead to the recommendation of “to do”. The negative consequence of an inappropriate recommendation of “to do” tends to be worse than a recommendation of “avoid”.

It is appropriate to criticise recommendations to eat or to take a medicine based on observational studies. Hormone replacement therapy was recommended for cardiovascular prevention based on observational data, and later randomised data indicated that this therapy increases cardiovascular events. Also, so many dietary myths have been created by observational data.

On the other hand, in testing for harm, observational studies should not be considered inadequate as a rule of thumb. If two conditions are satisfied, well-designed observational studies might be taken as evidence for causation: first, a high biological plausibility, leading to a high pre-test probability of the hypothesis; second, a very strong association: the hazard ratios for smoking and cancer and for alcohol and hepatic cirrhosis are both around 20, meaning a 1900% relative risk increase.
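
The conversion from a hazard ratio to a relative risk increase is simple arithmetic, sketched below. The function name is a hypothetical choice; the two hazard ratios are the ones quoted in this post (around 20 for smoking and cancer, 1.06 for the Egg Study).

```python
def relative_risk_increase_percent(hazard_ratio):
    """Relative increase in hazard, in percent, implied by a hazard ratio."""
    return (hazard_ratio - 1.0) * 100.0


for label, hazard_ratio in [("smoking and cancer", 20.0), ("Egg Study", 1.06)]:
    increase = relative_risk_increase_percent(hazard_ratio)
    print(f"{label}: hazard ratio {hazard_ratio} -> {increase:.0f}% relative increase")
```

Seen side by side, the two associations differ by more than two orders of magnitude, which is why the strength of association matters so much when judging causation from observational data.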

The hypothesis tested in the Egg Study was one of harm. So, instead of criticising the nature of the study, we must read it carefully in search of these two conditions.

Regarding the pre-test probability of this hypothesis, it is difficult to comprehend how half an egg per day would be enough to increase the risk of cardiovascular events, since eggs are just a small portion of dietary cholesterol, which is a weak determinant of plasma cholesterol. Second, the Egg Study shows a very weak association that does not fulfill our condition for causation: hazard ratio = 1.06, just a 6% relative increase.

Therefore, this observational study should not be considered confirmatory in the sense that egg consumption is a risk factor for cardiovascular disease. 

The Elephant in the Room

Along with egg consumption, the study evaluated total dietary cholesterol. The analysis of the direct effect of eggs and total dietary cholesterol, adjusted to each other, differentiates between the causal or non-causal nature of the relationship between eggs and cardiovascular events.

See how the analysis tells a story that makes sense:

Both eggs and total dietary cholesterol were associated with incident cardiovascular events during a median follow-up of 17.5 years. Each additional 300 mg of dietary cholesterol per day increased the hazard by 17% after adjustment for risk factors. Each additional half an egg per day increased the hazard by a tiny 6% after adjustment for risk factors.

Now the multivariate analysis: when eggs were adjusted for total dietary cholesterol, eggs totally lost statistical significance. This suggests egg consumption is just a marker of a diet rich in cholesterol. To confirm this thought, when total dietary cholesterol was adjusted for eggs, its hazard ratio remained the same, equally significant. Thus, eggs are not a major determinant of total dietary cholesterol in this sample.

The first analysis makes the study negative for an independent predictive value of eggs for cardiovascular events. The second analysis shows that the independent predictor is total dietary cholesterol, regardless of eggs.

In addition, there is another trick for differentiating causation from confounding: comparing specific with non-specific mortality.

Mortality depends on a chain of events subjected to confounding. So the analysis of cause-specific mortality provides insight by comparing the different natures of deaths. 

An effective way to differentiate causation from confounding is to test the association between the predictor and an “outside outcome”. Cardiovascular mortality is an “inside outcome” under the hypothesis that eggs cause cardiovascular disease. Non-cardiovascular mortality has nothing to do with this hypothesis, being an “outside outcome”, which can be subject to the same confounding as the “inside outcome”. If the candidate risk factor is equally associated with the inside outcome (cardiovascular mortality) and the outside outcome (non-cardiovascular mortality), the association has little to do with causation. The same confounders are mediating the two associations.

In this study, egg consumption is associated with cardiovascular mortality. That may make sense. But it was similarly associated with non-cardiovascular mortality, which does not make sense. It indicates a strong influence of confounding in this epidemiological ecosystem.

These two interpretations make the Egg Study strongly negative for the hypothesis that eggs cause cardiovascular disease.

I admit it is harder to read an observational study in comparison with a randomised clinical trial. In observational studies, interpretation of results should take into consideration the multivariate analysis, which contains clues to the true reality.

My Diet

I eat one egg per day, at breakfast. The average consumption in the US is half an egg per day. If the association demonstrated in the study were causal, my egg habit would increase my risk by 6%. As a 49-year-old male with no risk factors, I have a 5% risk of cardiovascular events. Eating my egg at breakfast would increase that risk from 5% to 5.3%. Therefore, I’d keep my eggs even if the result had come from a randomised clinical trial.
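
The arithmetic behind this can be sketched as follows. It assumes, as the paragraph above does, that the hazard ratio can be applied directly to the baseline risk as if it were a risk ratio, which is an approximation. The function name is hypothetical, and the number needed to harm is an extra illustration derived from the same two figures, not a result from the study.

```python
def risk_with_exposure(baseline_risk, hazard_ratio):
    """Approximate absolute risk under exposure (hazard ratio used as a risk ratio)."""
    return baseline_risk * hazard_ratio


baseline_risk = 0.05  # 5% risk of cardiovascular events, as stated in the post
hazard_ratio = 1.06   # 6% relative increase reported in the Egg Study

exposed_risk = risk_with_exposure(baseline_risk, hazard_ratio)
absolute_increase = exposed_risk - baseline_risk
number_needed_to_harm = 1.0 / absolute_increase

print(f"risk with the egg habit: {exposed_risk:.1%}")
print(f"absolute increase: {absolute_increase * 100:.1f} percentage points")
print(f"number needed to harm: about {number_needed_to_harm:.0f}")
```

A relative increase of 6% on a low baseline becomes a fraction of a percentage point in absolute terms, which is the number that matters for a personal decision.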

It makes me think. Normally, we first analyse if the evidence is true. Then, we ask if it is relevant. Maybe we should invert this order. We should ask first if the association makes a difference. If not, we should not care if it is true.

Tuesday, January 29, 2019

Was the Brumadinho disaster a black swan?

Nassim Taleb's concept of the black swan describes events that are (1) rare, (2) unpredictable and (3) highly impactful. It is black swans that dictate the course of humanity: the crash of 1929, Hitler's rise, the discovery of antibiotics, the iPhone, the internet, September 11. None of these rare events could be predicted, prevented or planned. The more unpredictable, the more impactful.

In science, the logic of the black swan makes scientists aware of the role of serendipity and chance for great discoveries. True scientists rely less on plausible theories, focus on experimentation, and recognise unusual results when they arise. They respect the black swan.

In the interpretation of social or clinical history, the lack of recognition of the concept of the black swan causes us to interpret events based on fallacious causal hypotheses, created by the phenomenon of "retrospective predictability": we tell the story backwards, inventing a meaning for the fact.

On the other hand, our mental elaboration must realize when the event is not a black swan. Thus, the differentiation between white and black swans must be at the heart of society's scientific literacy.

Three years ago, the worst environmental disaster in Brazil took place. A dam belonging to mining company Vale collapsed, killing 19 people and destroying the city of Mariana.

At that time, I thought: was it a black swan? Although analysts blamed Vale, I wondered if the impression that such a catastrophe was preventable would be a narrative fallacy. That was a rare and highly impactful event. As no one predicted it, it could be an unpredictable event. The three criteria of the black swan would be present.

Last week another such event took place in the same state of Minas Gerais, caused by a collapsed dam from the same company. It destroyed the region of Brumadinho and is supposed to have killed hundreds. 

Brumadinho resolved the dilemma: the collapses of the Vale dams are not black swans. When the same event occurs twice in a short period, it is no longer rare and unpredictable. Two chance events probably do not repeat themselves in such a short period of time. Scientifically, reproducibility reduces the likelihood of chance.

The perception that this was not by chance implies the possibility of prevention from the identification of causes. 

However, a concern arises ...

Since the event lost its unusual character, its potential impact on preventive attitudes may have decreased. According to the logic of the black swan, impact is proportional to how unusual the event is. Brumadinho was no longer unpredictable. Thus, after the trauma passes and the news naturally fades, the likelihood of a change in government behaviour towards a preventive mode may be lower than after Mariana's unusual disaster.

We should be aware: Brumadinho is not a black swan!

Sunday, December 16, 2018

The Parachute Trial: useful caricature or just a joke?

"Caricature studies" have been used successfully in the scientific field to make relevant methodological discussions more palatable. I like this approach and often use them as teaching tools, such as the strong correlation between chocolate consumption and Nobel Prizes as an example of confounding bias.

In 2003, a systematic review on efficacy of parachute use in patients who jumped from great heights was published in the British Medical Journal. The review indicated no randomized clinical trials for this intervention. It was a clever way of demonstrating that not everything needs experimental evidence. That article inspired us to create the terms "parachute paradigm" and "principle of extreme plausibility".

Yesterday, I received a plethora of enthusiastic messages about the latest clinical trial published in the British Medical Journal as part of the Christmas series: Parachute use to prevent death and major trauma when jumping from aircraft: randomized controlled trial.

In this trial, airplane passengers were invited to enter a study where they would jump from the plane to the ground, after being randomized to the use of parachute or non-parachute backpack as a control group. The primary outcome was death or severe trauma. Based on the premise that 99% of the control group would suffer the outcome, for a 99% power to detect a huge (and plausible) relative risk reduction of 95%, only 10 patients per group would be needed. This was done and, surprisingly, the study was negative: zero incidence of the primary outcome in both groups. However, only individuals who would jump from planes parked on the ground agreed to participate in the work.

Funny, but what is the implicit message of this study?

"Randomized trials might selectively enroll individuals with a lower perceived likelihood of benefit, thus diminishing the applicability of the results to clinical practice."

According to the authors, the new parachute study would be pointing to the problem that randomized clinical trials select samples less predisposed to the benefit of the intervention, a phenomenon that would promote false negative studies. The authors explain that it happens because patients who are more likely to benefit from therapy are less likely to agree to enter a study in which they may be randomized to non-treatment. This would make clinical trial samples less sensitive to benefit detection as a partial exclusion of patients with a greater chance of therapeutic success would take place.

Caricatures serve to accentuate true traits. However, clinical trial samples (ideal world) tend to be more predisposed to positive results than a real world target population. Therefore, this study is not a caricature of the real world clinical trial.

Thus, the present article should lose the caricature status and be considered just a funny joke, with no ability to anchor our mind towards a better scientific thinking.

As proofs of concept, clinical trials rely on the use of highly treatment-friendly samples, by applying restrictive inclusion and exclusion criteria. Differences between patients who agree and those who do not agree to enter the study are not sufficient to generate a sample less predisposed to treatment benefit than reality.

The "joke study" commits an unusual sample bias: it allows the inclusion of patients who do not need treatment. It would be as if a study aimed at testing thrombolysis allowed the inclusion of any chest pain, regardless of the electrocardiogram. Doctors who already believe in thrombolysis would see the electrocardiogram, thrombolyze ST-elevation patients, and release those who do not need thrombolysis to be randomized to drug or placebo. A joke without scientific value.

Caricature studies are useful when they anchor the mind of the community to a sharper criticism of the results of studies. However, in this case, the anchoring occurred in the opposite direction.

First, when we think of the scientific ecosystem, the biggest problem is false positive studies, mediated by several phenomena: confounding bias in observational studies, outcome reporting bias, conclusions skewed towards positive findings (spin) and, finally, citation bias that favors positive studies. Behind all this lies the innate predilection of the human mind for false affirmations, to the detriment of true negations.

Secondly, there is the problem of efficacy (ideal world) versus effectiveness (real world). Clinical trials aim to evaluate efficacy, which could be interpreted as the intrinsic potential of the intervention to offer clinical benefit: "Does the treatment have a beneficial property?" Therefore clinical trials represent the ideal condition for the treatment to work. In the face of a positive clinical trial, we must always reflect on whether this positivity will be reproduced in the real world, which constitutes effectiveness.

Of course there is the problem of false negative studies and it should also be a concern. But the bias suggested by the funny parachute study does not represent an important false-negative mechanism. The most prevalent mechanisms leading to false negatives are reduced statistical power, excessive crossover in the intention-to-treat analysis and inadequate intervention applicability.

My concern is that a reader of this funny study would take the following message home: if a promising study is negative, consider that clinical trials tend to include patients less likely to benefit from the intervention. This message is wrong, as clinical trials tend to select samples more predisposed to the benefit. Of course, there are exceptions, but if we are to anchor our mind, it should be in the direction of the most prevalent phenomenon.

My prediction is that this study will come to be cited by legions of believers not satisfied with negative results from well-designed studies, just as the seminal parachute article has been used inappropriately to justify many treatments that have nothing to do with the parachute paradigm, under the premise that "there is no evidence at all." A recent study by Vinay Prasad has shown that most interventions characterized as parachute paradigm by medical articles are not that; many have had clinical trials with negative results.

The great attention received by the parachute clinical trial is an example of how information sharing on social networks occurs. The main criterion for sharing is the interesting, unusual or amusing character of the information, to the detriment of its veracity or usefulness. In the appeal to novelty, fake news ends up getting more attention than true news, as was recently demonstrated by a paper published in Science. Although the article we are discussing should not be framed as fake news, it is not a good caricature of the real world either.

The work in question is not a caricature of the ecosystem of randomized clinical trials. It is a mere joke with the potential to bias our minds to the inadequate idea that the heterogeneity between clinical trial samples and the target population of the treatment reduces the sensitivity of these studies to detect positive effects. In fact, the samples enrolled in clinical trials usually have a greater chance to detect positive results (sensitivity) than if the entire target population were included.

When the learning of science is approached in a fun way, it arouses great interest in the biomedical community. But we should always ask ourselves: what is the implicit message of the caricature? Asking this is the first step toward the critical appraisal of such “thought experiments”.

Saturday, November 3, 2018

The Bright Side of the “Many Analysts, One Data Set” Paper

An elegant paper led by English researchers and recently published in the journal Advances in Methods and Practices in Psychological Science has heightened scientific skepticism about the reliability of statistical data analyses. Using exactly the same database, 29 independent research groups each provided an a priori data analysis plan to test the hypothesis that referees tend to give red cards more often to dark-skin-toned soccer players than to light-skin-toned players. The analyses performed by 20 groups statistically confirmed the hypothesis, while 9 groups found no statistically significant association.

Amid the wave of scientific concern ignited by this paper, I have to confess that this time my interpretation leaned towards optimism. Considering the complexity of the problem analyzed, the observational nature of the data and the large variability of statistical methods chosen by the researchers, I found the results presented by the different groups surprisingly similar.

The authors reported that the odds ratio for dark-skin-toned players getting red cards, relative to light-skin-toned players, varied between 0.89 and 2.93. Although this range appears to suggest high variation of results, by looking carefully at the forest plot depicted in the figure below, it becomes clear that most studies have similar odds ratios and confidence intervals. Actually, there were two outliers with odds ratios of 2.88 and 2.93 and extremely wide confidence intervals. Something in those statistical analyses made these two studies very imprecise. On the other hand, the rest of the studies had quite similar results.

Considering all 29 studies, we calculated an average odds ratio of 1.39, with 95% confidence interval between 1.22 and 1.55. If we exclude the two outliers, the average odds ratio is 1.28 (95% CI = 1.21 - 1.33, very precise). In reality, agreement among studies regarding both point estimate odds ratios and confidence intervals is quite good.
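To make the arithmetic behind this kind of summary concrete, here is a minimal Python sketch. It assumes a simple arithmetic mean with a normal-approximation confidence interval for the mean, which may differ from how the figures above were computed, and the odds ratios in it are made-up placeholders, not the values reported by the 29 groups. The function name `mean_with_ci` and the 2.5 cut-off for flagging outliers are also just illustrative choices.

```python
import math
import statistics


def mean_with_ci(values, z=1.96):
    """Return the arithmetic mean and a normal-approximation 95% CI of the mean."""
    m = statistics.mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return m, m - z * se, m + z * se


# Hypothetical odds ratios, NOT the values from the paper:
# a cluster of similar estimates plus one extreme outlier.
odds_ratios = [1.1, 1.2, 1.25, 1.3, 1.3, 1.35, 1.4, 2.9]

# Summary using every estimate
all_mean, all_lo, all_hi = mean_with_ci(odds_ratios)

# Summary after dropping estimates above an arbitrary cut-off (assumption)
trimmed = [x for x in odds_ratios if x < 2.5]
trim_mean, trim_lo, trim_hi = mean_with_ci(trimmed)

print(f"all:     {all_mean:.2f} ({all_lo:.2f} to {all_hi:.2f})")
print(f"trimmed: {trim_mean:.2f} ({trim_lo:.2f} to {trim_hi:.2f})")
```

In this toy example, removing the outlier lowers the average and narrows the interval, which is the same pattern described above for the real analyses.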

Furthermore, while 20 studies demonstrated a positive association between dark skin tone and the odds of getting a red card, no study suggested the opposite result. The remaining 9 analyses simply did not reject the null hypothesis.

Assuming the true result is the one presented by most studies, none of the 9 discordant studies made the most serious random error, that of asserting a false claim (type I error). All 9 studies would have made a type II error, that is, they simply failed to reject the null hypothesis. Considering that the association being explored is not strong (odds ratio < 2), it is only natural that some of the analyses lacked sufficient statistical power.
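The link between a weak association and type II errors can be illustrated with a small simulation. The sketch below is a deliberately simplified assumption: two equal-sized groups, a Wald test on the log odds ratio, and made-up values for the baseline rate and sample size. It does not reproduce the red card data set or any of the models used by the 29 groups; the function name `simulate_power` and all of its parameters are hypothetical.

```python
import math
import random


def simulate_power(odds_ratio, base_rate, n_per_group, n_sims=500, seed=0):
    """Estimate, by simulation, how often a two-group comparison rejects
    the null hypothesis (two-sided alpha = 0.05) for a given true odds ratio."""
    rng = random.Random(seed)
    # Convert the true odds ratio into an event probability for the exposed group
    odds_exposed = odds_ratio * base_rate / (1 - base_rate)
    p_exposed = odds_exposed / (1 + odds_exposed)
    rejections = 0
    for _ in range(n_sims):
        events_exposed = sum(rng.random() < p_exposed for _ in range(n_per_group))
        events_control = sum(rng.random() < base_rate for _ in range(n_per_group))
        # 2x2 table cells with a 0.5 continuity correction to avoid zero cells
        a = events_exposed + 0.5
        b = n_per_group - events_exposed + 0.5
        c = events_control + 0.5
        d = n_per_group - events_control + 0.5
        log_or = math.log((a * d) / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        if abs(log_or / se) > 1.96:
            rejections += 1
    return rejections / n_sims


# Illustrative values only: same sample size, weak versus strong association
weak = simulate_power(odds_ratio=1.3, base_rate=0.1, n_per_group=200)
strong = simulate_power(odds_ratio=3.0, base_rate=0.1, n_per_group=200)
print(f"power at OR 1.3: {weak:.2f}, power at OR 3.0: {strong:.2f}")
```

With the same sample size, the analysis of the weak association rejects the null hypothesis far less often than the analysis of the strong one. In other words, when the true effect is modest, failing to reject the null is a common outcome even when the effect is real.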

The problem presented to the researchers was quite complex. The observational nature of the data leads to potential confounding, along with concerns regarding the independence of observations. The statistical analysis had to address heterogeneity among players according to skin tone, referees' predisposition to give red cards, the relationship between players and referees, and different soccer leagues, among other things.

I could accept a “half empty glass” interpretation of the study: choices of statistical approach for complex epidemiological data vary substantially, and this variation leads to a certain level of disagreement among studies. On the other hand, I am more inclined to a “half full glass” interpretation: for a very complex problem, the odds ratio estimates were surprisingly reproducible, most studies rejected the null hypothesis in the same direction and no study suggested the opposite result. Moreover, if we consider less complex statistical circumstances, such as a typical well-designed large randomized controlled trial, the prospects may be quite good.
