Replication is Good…Right?
We have all heard by now that replication is the panacea for the ills plaguing social psychology. Whether we fear single-study papers containing tenuous effects reached through undisclosed analytical flexibility, post-hoc justifications of results that were entirely unanticipated, or outright data fabrication, replication appears to be the best way to distinguish true effects from those reached through fishing expeditions. The argument is simple: an effect that is replicated multiple times has a greater probability of being true, regardless of how it was obtained, than one demonstrated only once. A replicable effect appears less likely to be fished, HARKed (hypothesized after the results are known), or faked, because even such questionable tactics are difficult to repeat in an identical fashion.
Yet, as Ulrich Schimmack points out in a recent paper in Psychological Methods, a strong emphasis on replication may itself produce a new variant of false-positive findings. Schimmack discusses the notion of total power (i.e., the probability that every study in a given paper rejects a false null hypothesis) and demonstrates that the majority of multiple-study papers are woefully underpowered. For example, for a three-study paper to have total power of 0.8 to produce three significant results (p < .05) with a true population effect of medium size (e.g., d = 0.5), each study needs power of about .93 (the cube root of 0.8, assuming independent studies), so the researchers would need a total sample of 570 participants, or 190 per study! In our current research climate, in which “cell sizes” of 20 to 50 participants are normal for social psychology experiments, most multiple-study papers fall well short of this benchmark.
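The arithmetic behind that sample-size figure can be sketched in a few lines of Python. This is only a rough illustration, not Schimmack's own procedure: it assumes the studies are independent two-group comparisons and uses a normal approximation to the two-sided t-test, and the function names are made up for this example. Under those assumptions it lands close to the 190 per study quoted above; an exact t-based power analysis typically asks for slightly more participants.

```python
from math import ceil
from statistics import NormalDist


def per_study_power(total_power: float, n_studies: int) -> float:
    """Power each study needs so that all of them are significant
    with probability `total_power` (assumes independent studies)."""
    return total_power ** (1.0 / n_studies)


def n_per_group(d: float, power: float, alpha: float = 0.05) -> int:
    """Approximate participants per cell for a two-group comparison.

    Normal approximation to a two-sided test (an assumption of this
    sketch); an exact t-test calculation would differ slightly.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


power_each = per_study_power(0.8, 3)     # power required of each study
group_n = n_per_group(0.5, power_each)   # participants per cell
study_n = 2 * group_n                    # two cells per study
total_n = 3 * study_n                    # three studies in the paper

print(f"power per study: {power_each:.3f}")
print(f"n per study: {study_n}, total: {total_n}")
```

The key step is the first function: requiring three out of three significant results pushes the power needed in each individual study well above the conventional 0.8.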
The fallout from such a chronic lack of statistical power in social psychology research is a lack of credibility for multiple-study packages. Schimmack introduces the incredibility index, a statistic that represents the probability that a set of studies in a single paper would produce an observed proportion of significant p-values given the observed total power in the paper. For example, imagine a paper with 6 studies, each using a sample of 84 participants across two conditions, with a true population effect of medium size. Each of these six studies would have power of about 0.6 to reject a false null hypothesis, meaning that only 3 or 4 (3.6 to be precise) of the 6 studies in the paper would be expected to produce a p-value below .05. Notice a discrepancy? Such a six-study package seems like a reasonable representation of a JPSP-style paper (lots of replication, good cell sizes), but a paper in which only 3 or 4 out of 6 studies produce statistically significant results would never get accepted. How unlikely is it that these researchers would avoid observing any null effects? Based on power analysis alone, the probability that all six studies produce statistically significant results is .047 (0.6 raised to the sixth power). In other words, the beautiful multi-study JPSP package is highly improbable (p < .05) based on current standards with respect to sample size.
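The numbers in this example follow from treating each study as an independent coin flip that comes up "significant" with probability equal to its power. The sketch below works through that binomial arithmetic, taking the per-study power of 0.6 from the example as given rather than recomputing it; the helper name is hypothetical, and this is not Schimmack's full incredibility-index procedure.

```python
from math import comb


def prob_k_significant(k: int, n_studies: int, power: float) -> float:
    """Binomial probability that exactly k of n_studies independent
    studies reach significance, each with the same power."""
    return comb(n_studies, k) * power**k * (1 - power) ** (n_studies - k)


n_studies = 6
power = 0.6  # per-study power, taken from the example in the text

# Expected number of significant studies in the package.
expected_significant = n_studies * power

# Probability that every one of the six studies is significant.
p_all_significant = prob_k_significant(n_studies, n_studies, power)

# Full distribution over 0..6 significant results.
distribution = [
    prob_k_significant(k, n_studies, power) for k in range(n_studies + 1)
]

print(f"expected significant studies: {expected_significant:.1f}")
print(f"P(all {n_studies} significant): {p_all_significant:.3f}")
for k, p in enumerate(distribution):
    print(f"  {k} significant: {p:.3f}")
```

Looking at the full distribution rather than just the last entry makes the point concrete: under these assumptions, packages with three or four significant studies are by far the most likely outcomes, and the clean six-for-six result sits in the tail.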
Schimmack’s analysis suggests that researchers must be relying on fishing and HARKing techniques to obtain data that fit cleanly into multiple-study packages. Methods such as dropping conditions, collecting additional participants after observing a null effect, and reporting only one of many DVs would all increase the probability of obtaining a set of 3, 5, or 7 significant effects despite small samples and underpowered designs. According to the incredibility index, researchers are in dire need of such a boost. Given the renewed emphasis in the field on replication within papers, perhaps editors should become more tolerant of messy results, such as marginally significant effects or a single null finding amid a set of studies. At present, though, the multi-study packages that appear in our best journals are often highly improbable.