More on the Red/Pink Dress Effect: A Response to an Unpublished (but possibly soon-to-be-published) Critique
Jessica Tracy & Alec Beall
It has come to our attention that Andrew Gelman and Eric Loken are seeking to publish an article critiquing our work.* Here we report new analyses that address the issues raised in their critique, and we offer several broader responses regarding our research and the practices we followed in conducting it.
Although Gelman and Loken are using our work as an example of a broader problem that pervades the field–a problem we generally agree about–we are concerned that readers will take their speculations about our methods and analyses as factual claims about our scientific integrity. Furthermore, we are concerned that their paper will misrepresent aspects of our research, because Gelman previously wrote a blog post on our research, published in Slate, which contained a number of mischaracterizations (see here for his post, and here for our response, in which we clarified important aspects of our methods and findings, and explained that many of Gelman’s concerns were mitigated by specific actions we took in conducting our research).** For these reasons, we are posting here new information that we have also directly provided to Gelman and Loken, so that others interested in this issue can easily obtain this information, regardless of whether their published manuscript ends up including this information. This information has important implications for the replicability and robustness of the effects we previously documented, and thus should be included in any discussion about the reliability of these effects.
Following the publication of our paper, “Women are more likely to wear red or pink at peak fertility” (Beall & Tracy, 2013, Psych Science; see here), we conducted a new study seeking to replicate our findings. This study produced a null result, but led us to formulate new hypotheses about a potential moderator of our previously documented effect (see here for a detailed description of this failure to replicate and our subsequent hypotheses). We found preliminary support for these new hypotheses in re-analyses of our previously published data, and so moved on to conduct a new study (N = 209) to directly test our new theory. This study proved fruitful; a predicted interaction emerged in direct support of our hypotheses. All of these results can be found in “The impact of weather on women’s tendency to wear red or pink when at high risk for conception” (Tracy & Beall, 2014, PLoS ONE). Of note, this paper and the Psych Science paper together report ALL data we have collected on this issue; there are no missing file-drawer studies, at least from our lab. The fact that the PLoS ONE paper reports a study that was a failure to replicate will, we hope, lend credence to this claim.
Regarding the robustness of our main effect, we have now run new analyses testing for this effect across all these collected samples—the two samples we originally reported in our Psych Science paper, and the two new samples that comprise the two new studies reported in the PLoS ONE paper. Together these comprise a sample of N = 779. Although we expected the main effect to be considerably weaker across these samples than it was in our initial studies, due to major variance in the moderator variable that we have now found to influence this effect, we nonetheless found consistent support for that main effect.
Specifically, including all eligible participants across the 4 samples (i.e., women who could be confidently placed in either a high or low fertility group based on their self-reported confidence in their recollection of their last menses onset, experienced menstrual cycles typical in length and regularity, were not pregnant, and were not using any form of contraception), we found a significant effect of conception risk on red/pink shirts; 16% of women at high risk reported wearing red/pink, compared to 10% of women at low risk, chi squared (1, N = 633) = 4.58, p = .032 (Odds ratio = 1.67).
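For readers who wish to see how a test of this form is computed, here is a minimal sketch of a chi-squared test (1 degree of freedom) and odds ratio for a 2x2 table of conception risk by shirt color. The cell counts are hypothetical placeholders chosen for illustration, not the study data, and the function names are our own. The sketch uses the uncorrected Pearson statistic; whether a continuity correction was applied in the tests reported above is not stated here, so treat that as an assumption.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Uncorrected Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Rows: high risk, low risk. Columns: red/pink, other color.
    Returns (chi2, p), where p is the upper-tail probability with df = 1.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


def odds_ratio(a, b, c, d):
    """Odds of red/pink at high risk divided by odds of red/pink at low risk."""
    return (a * d) / (b * c)


# Hypothetical counts (NOT the study data):
# high risk: 20 wearing red/pink, 80 wearing other colors
# low risk:  10 wearing red/pink, 90 wearing other colors
chi2, p = chi_square_2x2(20, 80, 10, 90)
ratio = odds_ratio(20, 80, 10, 90)
print(f"chi squared (1, N = 200) = {chi2:.2f}, p = {p:.3f}, odds ratio = {ratio:.2f}")
```

The same two functions can be applied to the counts from any of the subsets discussed in this post, which makes it easy to compare results across inclusion rules.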
Gelman and Loken’s central concern is that our analyses could have been done differently – including or excluding different subsets of women, or using a different window of high conception risk. They imply that we likely analyzed our results in all kinds of different ways before selecting the one analysis that confirmed our hypothesis. We did not. Moreover, additional analyses show that our main effect is robust to the different analytic choices Gelman and Loken propose, as described below.
First, Gelman and Loken expressed concern about our exclusion of women who could not be placed in a high or low conception risk category with 100% certainty, based on women’s self-reported estimates of confidence in their reported menses dates. Although we believe this is a more stringent analysis than that typically performed (where women are categorized based on reported menses dates without regard to their self-reported confidence in the accuracy of these dates), we have now re-run our test of the main effect, including all women whose self-reports met our inclusion criteria, regardless of their confidence in those self-reports. Doing so with our full sample, the reported effect is essentially unchanged; 17% of women at high risk reported wearing red/pink, compared to 11% of women at low risk, chi squared (1, N = 779) = 5.11, p = .024 (Odds ratio = 1.62).
Second, studies on ovulation have at times included and at times excluded women who are pre-menstrual and/or currently menstruating; there are good theoretical reasons for both approaches, depending on the research question at hand. For the research described in our Psych Science paper, we requested that women who were currently menstruating or pre-menstrual not participate; however, some such women participated anyway. In our paper we report results both including and excluding those women (combining the two samples, reported effects held across both analyses). In this way, we followed the suggestion made by Simmons, Nelson, and Simonsohn (2011) that “If observations are eliminated, authors must also report what the statistical results are if those observations are included.” (p. 1363). More broadly, our view is that when an analytic decision is ambiguous, the best solution is to aim for openness: report results under both analytic strategies.
Following this approach, although the results described above for our large sample include menstruating and pre-menstrual women, we have also run the analysis excluding them (resulting in an N of 419), and again found that the main effect holds, with 16% of women at high risk wearing red/pink, compared to 9% at low risk, chi squared (1, N = 419) = 4.22, p = .040 (Odds ratio = 1.86). If we ignore the item assessing women’s confidence in their self-reported menses dates, this effect is again unchanged; 17% of women at high risk reported wearing red/pink, compared to 10% of women at low risk, chi squared (1, N = 564) = 5.54, p = .019 (Odds ratio = 1.80).
Gelman and Loken also raise concerns regarding the specific window we chose to represent high conception risk (days 6-14). They note that although our specified window is based on prior published work (e.g., Penton-Voak et al., 1999; Penton-Voak & Perrett, 2000; Little, Jones, & Burriss, 2007; Little & Jones, 2012; Little, Jones, & DeBruine, 2008; Little, Jones, Burt, & Perrett, 2007; Farrelly, 2011; Durante, Griskevicius, Hill, & Perilloux, 2011; DeBruine, Jones, & Perrett, 2005; Gueguen, 2009; Gangestad & Thornhill, 1998), other researchers have used a slightly different window. In fact, both of these windows undoubtedly capture a time frame of higher risk than the comparison time frame, so it doesn’t particularly matter which window researchers use, as long as they make an a priori decision about which to use and then run analyses for that window only. This is precisely what we did (and, of note, in all of our studies examining conception risk we have always used the same window, and only that window), but in the spirit of openness, we have now re-run the test for the main effect in our full sample using the Durante, Rae, and Griskevicius (2013) windows instead (i.e., no confidence estimates; High fertility = days 7-14; Low fertility = days 17-25). Again, the main effect holds, with 16% of women at high risk wearing red/pink, compared to 10% of women at low risk, chi squared (1, N = 465) = 3.97, p = .046 (Odds ratio = 1.76). This approach excludes pre-menstrual and menstruating women, but if these women are included (in the low-risk window), the effect holds, chi squared (1, N = 670) = 4.29, p = .038 (Odds ratio = 1.61).
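To make the window comparison concrete, the following is a rough sketch of how a woman's cycle day could be assigned to a risk group under a fixed, pre-specified window. The function name and its default arguments are illustrative assumptions. The defaults follow the Durante, Rae, and Griskevicius (2013) windows as described above (high = days 7-14, low = days 17-25), and the high window can be switched to days 6-14 to match the one we specified. The sketch deliberately leaves days outside both windows unclassified, since the handling of those days depends on the inclusion rules of a given analysis.

```python
def classify_cycle_day(day, high=(7, 14), low=(17, 25)):
    """Assign a cycle day to a conception-risk group.

    `high` and `low` are inclusive (first_day, last_day) windows. The defaults
    are the Durante et al. (2013) windows quoted in the text; any other window
    passed in is the caller's assumption. Days in neither window return None.
    """
    if high[0] <= day <= high[1]:
        return "high"
    if low[0] <= day <= low[1]:
        return "low"
    return None


# The window is fixed before looking at the outcome, then applied to every day.
for day in (6, 7, 14, 15, 20, 26):
    print(day, classify_cycle_day(day), classify_cycle_day(day, high=(6, 14)))
```

Fixing the window as a function argument before any outcome data are examined mirrors the a priori decision described above: the classification rule is chosen once and then applied unchanged.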
Given the repeated replication of this effect across all these samples, we believe that this is a fairly robust finding, albeit one that can vary quite substantially across a theoretically relevant moderator (see Tracy & Beall, 2014). We hope that these findings lend greater confidence to our original results, and ameliorate the practical implications of the concerns raised by Gelman and Loken.
More broadly, while we certainly agree with the issues Gelman and Loken are raising—which have been raised by a number of other researchers in the field—their insinuation that these issues present problems for our particular research finding simply does not hold water. More research on this issue is still needed—particularly within-subject analyses, which are seen by many as the gold standard in ovulation research—but all currently available data of which we are aware support our original hypothesis and/or our moderator hypothesis. We welcome additional studies that seek to further replicate our results (taking into account the weather moderator we have now documented), and, if contradictory data emerge, this can only serve to further advance our understanding of the phenomenon. Indeed, we hope other researchers will join us in seeking a more precise estimate of the main effect and of the weather moderator, and in discovering other variables that no doubt also moderate this effect.
*Despite repeated requests from us, Gelman and Loken are unwilling to provide us with any information about their paper’s publication status that would allow us to ensure that we are included in the review process (e.g., they have refused to inform us of the name of the editor handling the manuscript, the journal where the paper is under review, and even whether it is in fact currently under review).
** In our view, critiques work best when there is an open dialogue, in which researchers whose work is critiqued are encouraged to review said critiques before they go to press, so that editors and readers are given all the information they need to weigh the various perspectives, before drawing conclusions about individual authors or research programs. We believe this is the best way to ensure accurate representation of discussed findings, which, in the end, allows for the most fruitful discussion, focused on actual areas of disagreement.