This subject was brought up at a lab meeting that we had at the end of last year, and my oh my, did that talk cause a stir!
In essence, the debate that we had about p values revolves around the issue of reproducibility. There are many studies and exciting results, particularly in the field of Experimental Psychology (which much of the HRI community subscribes to), which the general research community has struggled to replicate. This is a serious cause for concern as it then becomes difficult to distinguish between results that are down to the “luck of the draw” and results that are caused by a real effect.
Recently there was an article published in Nature that sought to highlight the issues with p values. While providing an entertaining and insightful overview about the different attitudes toward p values and hypothesis testing, it seems that the main message of this article is to encourage researchers to think a little more critically about the statistical methods that they employ to test their hypotheses. Moreover, and perhaps more importantly, the article stresses that researchers should also cast a critical eye on the plausibility of the hypotheses that are actually being tested, as this can have an important impact on the statistical results that are obtained.
Prof Geoff Cumming made a very interesting YouTube video which outlines the problems with effect replication and p values. It’s well worth 10 minutes of your day (and spend another 30 minutes writing a MATLAB script to replicate his results; you might need the sanity check). In what he calls “The Dance of the P Values”, Prof Cumming demonstrates what happens when you repeatedly sample data from two normal distributions with different means and check whether the sample means are significantly different: the results regarding “significance” (a threshold we have rather arbitrarily set at p < 0.05) are inconsistent from one sample to the next. In fact, his simulations show that more than 40% of the time, your data sample will fail the significance test even though the underlying effect is real! Oh dear!
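If you would rather do the sanity check in Python, here is a minimal sketch of that kind of simulation. It is not Prof Cumming's own code, and the settings are my assumptions: 32 participants per group, a true difference of half a standard deviation, and a two-sample z-test with a known standard deviation (chosen so the script needs nothing beyond the standard library; a t-test would be the more usual choice). The function names are made up for this example.

```python
import random
from statistics import NormalDist, mean


def two_sample_z_p(a, b, sigma):
    """Two-sided p value for a difference in means, assuming a known sigma."""
    se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(b) - mean(a)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def dance_of_p_values(n_experiments=2000, n=32, effect=0.5, sigma=1.0, seed=1):
    """Repeat the same experiment many times and collect the p values.

    All defaults are illustrative assumptions, not values taken from the video.
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_experiments):
        # Every replication samples from the SAME two populations,
        # so the true effect never changes; only the samples do.
        control = [rng.gauss(0.0, sigma) for _ in range(n)]
        treated = [rng.gauss(effect, sigma) for _ in range(n)]
        p_values.append(two_sample_z_p(control, treated, sigma))
    return p_values


p_values = dance_of_p_values()
missed = sum(p >= 0.05 for p in p_values) / len(p_values)
print("first ten p values:", [round(p, 3) for p in p_values[:10]])
print(f"replications that missed p < 0.05: {missed:.0%}")
```

Under these assumed settings the test has roughly a coin-flip chance of reaching significance, so the first ten p values alone should already look like a dance.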
So what is to be done? Well, the first suggestion is to think critically about your hypothesis. Is it plausible in the first place? Secondly, it is suggested that you slightly alter the statistics that you report. Rather than standard deviations, report 95% confidence intervals, as these tell a broader story. Thirdly, when you interpret your significance tests, do not use them as the sole evidence on which you base your conclusions. Rather, use them as just another piece of evidence to help gain insight. Finally, you might also consider changing the actual test that you use. There are other approaches to data analysis (e.g. Bayesian methods) that might provide you with an equally strong piece of evidence. Perhaps even use both.
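For the confidence interval suggestion, a rough sketch of the calculation for a single sample mean looks like the following. The scores are invented purely for illustration, and the interval uses a normal approximation (an assumption on my part to keep it dependency-free); for a sample this small, a t-based critical value would give a somewhat wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    # Critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(sample) / sqrt(len(sample))
    m = mean(sample)
    return m - half_width, m + half_width


# Hypothetical questionnaire scores, made up for this example.
scores = [4.1, 5.0, 5.6, 4.8, 5.3, 4.9, 5.2, 5.1]
low, high = mean_ci(scores)
print(f"mean = {mean(scores):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Unlike a standard deviation, the interval shrinks as the sample grows, so it says something about how precisely the mean has been estimated rather than just how spread out the data are.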
I think the main message is clear: do not blindly follow the result that a single significance test gives you. There is a reasonable chance that you were just lucky with your sample. Just think of all those little experiments that you decided not to follow up or publish just because your pilot study didn’t bear any significant fruit…