Saturday, November 3, 2018

The Bright Side of the “Many Analysts, One Data Set” Paper



An elegant paper led by English researchers and recently published in the journal Advances in Methods and Practices in Psychological Science has heightened scientific skepticism about the reliability of statistical data analyses. Using exactly the same data set, 29 independent research teams each provided an a priori data analysis plan to test the hypothesis that referees give red cards more often to dark-skin-toned soccer players than to light-skin-toned players. The analyses performed by 20 teams statistically confirmed the hypothesis, while the analyses of the other 9 teams were not statistically significant.

Amid the wave of scientific concern ignited by this paper, I have to confess that this time my interpretation leaned toward optimism. Considering the complexity of the problem, the observational nature of the data and the wide variety of statistical methods chosen by the researchers, I found the results of the different teams surprisingly similar.

The authors reported that the odds ratio of dark-skin-toned players receiving red cards, relative to light-skin-toned players, ranged from 0.89 to 2.93. Although this range seems to suggest highly variable results, a careful look at the forest plot in the figure below shows that most studies have similar odds ratios and confidence intervals. In fact, there were two outliers, with odds ratios of 2.88 and 2.93 and extremely wide confidence intervals. Something in those statistical analyses made these two estimates very imprecise. The remaining studies, by contrast, had quite similar results.



Considering all 29 studies, we calculated an average odds ratio of 1.39, with a 95% confidence interval of 1.22 to 1.55. Excluding the two outliers, the average odds ratio is 1.28 (95% CI 1.21 to 1.33, very precise). In fact, agreement among the studies is quite good for both the point estimates and the confidence intervals.
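The post does not say how these averages were computed. One common approach, assumed here purely for illustration, is fixed-effect inverse-variance pooling on the log-odds scale, with each study's standard error recovered from its 95% confidence interval. The sketch uses made-up inputs (not the teams' actual estimates) and hypothetical function names; it shows why two very imprecise outliers barely move a precision-weighted average: their wide intervals give them tiny weights.

```python
import math

Z95 = 1.959963984540054  # two-sided 95% standard normal quantile


def se_from_ci(lower, upper, z=Z95):
    """Standard error of log(OR), assuming a symmetric Wald CI on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def pooled_odds_ratio(studies, z=Z95):
    """Fixed-effect inverse-variance pooling of (OR, lower, upper) triples.

    Returns (pooled OR, lower bound, upper bound) at the confidence level implied by z.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for odds_ratio, lower, upper in studies:
        se = se_from_ci(lower, upper, z)
        weight = 1.0 / se ** 2  # precise studies get large weights
        weighted_sum += weight * math.log(odds_ratio)
        total_weight += weight
    log_pooled = weighted_sum / total_weight
    se_pooled = math.sqrt(1.0 / total_weight)
    return (
        math.exp(log_pooled),
        math.exp(log_pooled - z * se_pooled),
        math.exp(log_pooled + z * se_pooled),
    )


if __name__ == "__main__":
    # Hypothetical inputs: one precise study and one imprecise outlier.
    studies = [(1.3, 1.1, 1.54), (2.9, 0.8, 10.5)]
    print(pooled_odds_ratio(studies))
```

A random-effects model (for example, DerSimonian-Laird) would add a between-study variance term and generally yield a wider interval; which model better suits 29 analyses of the same data is a judgment call.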

Furthermore, while 20 studies found a positive association between dark skin tone and the odds of receiving a red card, no study suggested the opposite. The remaining 9 analyses simply failed to reject the null hypothesis.

Assuming the true result is the one reported by most studies, none of the 9 discordant studies made the more serious error of asserting a false finding, which is what a type I error amounts to. All 9 would instead have made a type II error: they simply failed to reject a null hypothesis that, under this assumption, is false. Given that the association under study is modest (odds ratio < 2), it is only natural that some of the analyses lacked sufficient statistical power.
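To make the power argument concrete: for a two-sided Wald test of H0: OR = 1 at alpha = 0.05, approximate power depends only on the true log odds ratio divided by the standard error of its estimate. The sketch below assumes a hypothetical true OR of 1.3 (close to the averages reported in this post) and hypothetical standard errors; it is not a reconstruction of any team's analysis.

```python
import math

Z_CRIT = 1.959963984540054  # two-sided critical value for alpha = 0.05


def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def wald_power(true_or, se_log_or, z_crit=Z_CRIT):
    """Approximate power of a two-sided Wald test of H0: OR = 1.

    The test statistic log(OR_hat) / SE is assumed normal with mean
    log(true_or) / SE and unit variance.
    """
    shift = math.log(true_or) / se_log_or
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)


if __name__ == "__main__":
    # Hypothetical standard errors of log(OR) produced by different analytic choices.
    for se in (0.05, 0.10, 0.20):
        print(f"SE = {se:.2f}: power = {wald_power(1.3, se):.2f}")
```

With the same modest true effect, an analysis whose modeling choices double or quadruple the standard error can drop from near-certain rejection to a coin flip or worse, which is consistent with some teams failing to reject the null.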

The problem presented to the researchers was quite complex. The observational nature of the data raises the possibility of confounding, as well as concerns about the independence of observations. The analyses had to account for heterogeneity among players by skin tone, referees' predisposition to give red cards, the relationships between players and referees, and differences among soccer leagues, among other factors.
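One reason the independence concern matters: when observations are grouped (for example, repeated encounters within the same player-referee pair), treating them as independent understates the standard error. A simple way to gauge the size of this effect, assumed here as an illustration and not taken from any team's analysis, is Kish's design effect for equal-sized clusters, 1 + (m - 1)ρ, where m is the cluster size and ρ the intraclass correlation; the standard error is inflated by its square root. The cluster sizes and ICC values are hypothetical.

```python
import math


def design_effect(cluster_size, icc):
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be >= 1")
    return 1.0 + (cluster_size - 1) * icc


def inflated_se(naive_se, cluster_size, icc):
    """Standard error after accounting for within-cluster correlation."""
    return naive_se * math.sqrt(design_effect(cluster_size, icc))


if __name__ == "__main__":
    # Hypothetical: naive SE of log(OR) = 0.05, clusters of 11 observations.
    for icc in (0.0, 0.05, 0.10):
        print(f"ICC = {icc:.2f}: SE = {inflated_se(0.05, 11, icc):.3f}")
```

Whether and how a team modeled this clustering changes its standard error, and therefore its power and its significance verdict, even when the point estimate barely moves.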

I can understand a “glass half empty” interpretation of the study: choices of statistical approach for complex epidemiological data vary substantially, and this variation leads to some disagreement among studies. Still, I lean toward a “glass half full” interpretation: for a very complex problem, the odds ratio estimates were surprisingly reproducible, most studies rejected the null hypothesis in the same direction, and no study suggested the opposite result. Moreover, under less complex statistical circumstances, such as those of a typical well-designed large randomized controlled trial, the outlook may be quite good.
