Too Good Does Not Always Mean Not True


By Jessica Tracy & Alec Beall


While we agree with several of Andrew Gelman’s broad concerns about current research practices in social psychology (see “Too Good to Be True”), much of what he said about our article, “Women are more likely to wear red or pink at peak fertility”, recently published in Psychological Science, was incorrect. Unfortunately, Gelman did not contact us before posting his article. Had he done so, we could have clarified these issues, and he would not have had to make the numerous flawed assumptions that appeared in his article. Here, we take the opportunity to make these clarifications, and also to encourage those who read Gelman’s post to read our published article, available here, and the Online Supplement, available here.

***UPDATE: after (or before) reading this response, please also check out subsequent responses we’ve made, here and here, which include more information about new data we collected to address this issue, and analyses conducted across all data collected.

We want to begin with the issue that received the greatest attention, and which Gelman suggests (and we agree) is most potentially problematic: that of researcher degrees of freedom. Gelman makes several points on this issue; we respond to each in turn below.


a) Gelman suggests that we might have benefited from researcher degrees of freedom by asking participants to report the color of each item of clothing they wore, then choosing to report results for shirt color only. In fact, we did no such thing; we asked participants about the color of their shirts because we assumed that shirts would be the clothing item most likely to vary in color.


b) We categorized shirts that were red and pink together because pink is a shade of red; it is light red. The theory we were testing is based on the idea that red and shades of red (such as the pinkish swellings seen in ovulating chimpanzees, or the pinkish skin tone observed in attractive and healthy human faces) are associated with sexual interest and attractiveness (e.g., Coetzee et al., 2012; Deschner et al., 2004; Re, Whitehead, Xiao, & Perrett, 2011; Stephen, Coetzee, & Perrett, 2011; Stephen, Coetzee, Law Smith, & Perrett, 2009; Stephen et al., 2009; Stephen & McKeegan, 2010; Stephen, Oldham, Perrett, & Barton, 2012; Stephen, Scott et al., 2012; Whitehead, Ozakinci, & Perrett, 2012). Thus, our decision to combine red and pink in our analyses was a theoretical one.


c) We are confused by Gelman’s comment that, “other colors didn’t yield statistically significant differences, but the point here is that these differences could have been notable.” That these differences could have been notable is part of what makes the theory we were testing falsifiable. A large body of evidence suggests that red and pink are associated with attractiveness and health, and may function as a sexual signal at both a biological and cultural level (e.g., Burtin, Kaluza, Klingenberg, Straube, & Utecht, 2011; Coetzee et al., 2012; Elliot, Tracy, Pazda, & Beall, 2012; Elliot & Pazda, 2012; Guéguen, 2012a; Guéguen, 2012b; Guéguen, 2012c; Guéguen & Jacob, 2012; 2013a; 2013b; Jung, Kim, & Han, 2011a; Jung et al., 2011b; Meier et al., 2012; Oberzaucher, Katina, Schmehl, Holzleitner, & Mehu-Blantar, 2012; Pazda, Elliot, & Greitemeyer, 2012; 2013; Re, Whitehead, Xiao, & Perrett, 2011; Roberts, Owen, & Havlicek, 2010; Schwarz & Singer, 2013; Stephen, Coetzee, & Perrett, 2011; Stephen, Coetzee, Law Smith, & Perrett, 2009; Stephen et al., 2009; Stephen & McKeegan, 2010; Stephen, Oldham, Perrett, & Barton, 2012; Stephen, Scott et al., 2012). In order to test the specific prediction emerging from this literature, that fertility would affect women’s tendency to wear red/pink but not their tendency to wear other colors, we ran analyses comparing the frequency of women in high- and low-conception risk groups wearing a large number of different colored shirts. The results of these analyses are reported in detail in the Online Supplement to our article (which includes a figure showing all frequencies). If any of these analyses other than those of pink and red had produced significant differences, we would have failed to support our hypothesis.


Gelman’s concern here seems to be that we could have performed these tests prior to making any hypothesis, then come up with a hypothesis post-hoc that best fit the data. While this is a reasonable concern for studies testing hypotheses that are not well formulated, or not based on prior work, it simply does not make sense in the present case. We conducted these studies with the sole purpose of testing one specific hypothesis: that conception risk would increase women’s tendency to dress in red or pink. This hypothesis emerges quite clearly from the large body of work mentioned above, which includes a prior paper we co-authored (Elliot, Tracy, Pazda, & Beall, 2012). We came up with the hypothesis while working on that paper, and were in fact surprised that it hadn’t been tested previously, because it seemed to us like such an obvious possibility given the extant literature. The existence of this prior published article provides clear evidence that we set out to test a specific theory, not to conduct a fishing expedition. (See also Murayama, Pekrun, & Fiedler, in press, for more on the role of theory testing in reducing Type I errors).


d) Our choice of which days to include as low-risk and high-risk was based on prior research, and, importantly, was determined before we ran any analyses. Gelman is right that there is a good deal of debate about which days best reflect a high conception risk period, and this is a legitimate criticism of all research that assesses fertility without directly measuring hormone levels. Given this debate, we followed the standard practice in our field, which is to make this decision on the basis of what prior researchers have done. We adopted the Day 6-14 categorization period after finding that this is the categorization used by a large body of previously published, well-run studies on conception risk (e.g., Penton-Voak et al., 1999; Penton-Voak & Perrett, 2000; Little, Jones, & Burriss, 2007; Little & Jones, 2012; Little, Jones, & DeBruine, 2008; Little, Jones, Burt, & Perrett, 2007; Farrelly, 2011; Durante, Griskevicius, Hill, Perilloux, & Li, 2011; DeBruine, Jones, & Perrett, 2005; Guéguen, 2009; Gangestad & Thornhill, 1998). Although the exact timing of each of these windows is debatable, it is not debatable that Days 0-5 and 15-28 represent a window of lower conception risk than Days 6-14.


Furthermore, if our categorization did result in some women being mis-categorized as low-risk when in fact they were high risk, or vice-versa, this would increase error and decrease the size of any effects found. Most importantly, we did not decide to use this categorization after comparing various options and examining which produced significant effects. Rather, we adopted it a priori and used it and only it in analyzing our data; no researcher degrees of freedom came into play.


e) In any study that assesses conception risk using a self-report measure, certain women must be excluded to ensure that those for whom risk was not accurately captured do not erroneously influence results. All of the exclusions we made were based on those suggested by prior researchers studying the psychological effects of conception risk, such as excluding women with irregular cycles (as it is more difficult to accurately determine when they are likely to be at risk), excluding pregnant women and women taking hormonal birth control (as they do not regularly ovulate), and excluding women currently experiencing pre-menstrual or menstrual symptoms (to ensure that effects observed cannot be attributed to these symptoms; see Haselton & Gildersleeve, 2011; Little, Jones, & DeBruine, 2008). Although most of these exclusion criteria are necessary to accurately gauge fertility risk, several fall into a gray area (e.g., excluding women with atypical cycles). The decision of whether to exclude women on the basis of these gray-area criteria does lead to the possibility of researcher degrees of freedom. Because we were aware of this concern, we reported (in endnotes) results when these exclusions were not made. This is the solution recommended by Simmons, Nelson, and Simonsohn (2011), who write: “If observations are eliminated, authors must also report what the statistical results are if those observations are included” (p. 1363). Thus, while we did make a decision about the most appropriate way to analyze our data, we also made that decision clear, reported results as they would have emerged if we had made the alternate decision, and gave the article’s reviewers, editor, and readers the information they needed to judge this issue.


In addition to the degrees of freedom concern, Gelman also raises concerns about representativeness and measurement. We have addressed these issues in a longer version of this response, posted here, and we encourage those who are interested to read the longer version. In an effort to keep this response concise, however, we wish to close by mentioning a few broader issues relevant to Gelman’s piece.


First, like any published set of empirical studies, our article should not be viewed as the ultimate conclusion on the question of whether women are more likely to wear red or pink when at high risk for conception. We submitted our article for publication because we believed that the evidence from the two studies we conducted was strong enough to suggest that there is a real effect of women’s fertility on their clothing choices, at least under certain conditions, but not because we believe there is no need for additional studies. Indeed, many questions remain about this effect, such as its generalizability, its moderators, and its mediators. We look forward to seeing new research address these questions, both from our own lab (where follow-up and additional replication studies are already underway) and others.


Second, setting the ubiquitous need for additional research aside for the moment, Gelman’s claim that our two studies provide “essentially no evidence for the researchers’ hypotheses” is both inflammatory and unfair. For one thing, it is important to bear in mind that our research went through the standard peer review process—a process that is by no means quick or easy, especially at a top-tier journal like Psychological Science. This means that our methods and results have been closely scrutinized and given a stamp of approval by at least three leading experts in the areas of research relevant to our findings (in this case, social and evolutionary psychology). This does not mean that questions should not be raised; indeed, questioning and critiquing published work is an important part of the scientific process, and Gelman is correct that the review process often fails to take into account researcher degrees of freedom. But research critics—especially those who publish their critiques in widely dispersed forums like Slate blog posts—must ensure that they get the facts right, even if that means contacting an article’s authors for more information, or explicitly mentioning additional information that the authors provided in endnotes.


Indeed, a statistician like Gelman could go well beyond simply mentioning possible places where additional degrees of freedom might have come into play and then making assumptions about the validity of our findings on that basis. He could, and should, instead find out exactly where researcher degrees of freedom did come into play, then calculate the precise likelihood that they would have resulted in the two significant effects that emerged in our studies if these effects were not in fact true. In other words, additional researcher degrees of freedom increase the chance that we will find a significant effect where none exists. But by how much? The chance of obtaining the same significant effect by chance alone across two independent consecutive studies, each tested at the conventional .05 level, is .05 × .05 = .0025 (Murayama et al., in press). How many researcher degrees of freedom would it take for this to become a figure that would reasonably allow Gelman to suggest that our effect is most likely a false positive? This is a basic math problem, and one that Gelman could solve. Without such calculation, the conclusion that our findings provide no support for our hypothesis would never pass the standards of scientific peer review. Researchers do have certain responsibilities (such as avoiding, to whatever extent possible, taking advantage of researcher degrees of freedom, and being honest about it when they do), but critics of research have certain responsibilities too.
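
As a rough illustration of the arithmetic behind the .0025 figure, here is a minimal Python sketch. It assumes a true null effect, a .05 significance threshold in each study, and, as a deliberate simplification, that each researcher degree of freedom behaves like one additional independent test of that null. Real analytic choices are usually correlated, so the numbers are only indicative. The function names are hypothetical and do not come from the article or from Gelman’s post.

```python
def per_study_false_positive_rate(alpha: float, k: int) -> float:
    """Chance of at least one significant result among k tests of a true null.

    Assumption (simplifying): the k analytic options are independent tests.
    """
    return 1.0 - (1.0 - alpha) ** k


def joint_false_positive_rate(alpha: float, k: int, studies: int = 2) -> float:
    """Chance that every one of `studies` independent studies yields a false positive."""
    return per_study_false_positive_rate(alpha, k) ** studies


if __name__ == "__main__":
    # k = 1 corresponds to a single pre-specified analysis per study.
    for k in (1, 2, 5, 10, 20):
        rate = joint_false_positive_rate(alpha=0.05, k=k)
        print(f"options per study = {k:2d}  joint false-positive rate = {rate:.4f}")
```

With one pre-specified analysis per study (k = 1), the joint rate is the .0025 quoted above; under these assumptions it rises to roughly .01 with two options per study. The sketch says nothing about the prior plausibility of the hypothesis, so on its own it cannot establish whether an effect is “most likely” a false positive.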


This is particularly important because there is a very real possibility that most readers of posts such as these will assume that they are accurate without checking against the original research reports. Indeed, most Slate readers do not have access to academic journal articles, so must rely on media summaries to form an assessment of the research. Added to the viral power of the internet, this creates a very real burden on critics and others who discuss scientific research in popular media forums to make serious efforts to maintain accuracy.


The field of psychology—and social psychology in particular—is currently experiencing an intense period of self-reflection. On the whole, this is a very good thing: psychologists are interested in finding and reporting true effects, and increased scrutiny of problematic research practices will help us do so. At the same time, it would be unfortunate if one consequence of this self-reflection is that researchers become afraid to publish certain findings for fear of reputational damage. Research articles that follow good research practices should not become suspect simply because their findings are unexpected.





Baker, A.H., Denning, A.C., Kostin, I., & Schwartz, L. (1998). How accurate are women’s estimates of date of onset of next menses? Psychology and Health, 13, 897–908.

Burtin L, Kaluza A, Klingenberg M, Straube J, Utecht C. 2011. Red shirt, nice flirt! How red influences the perception of self-attractiveness. Empiriepraktikumskongress Proceedings. Jena, Germany.

Coetzee V, Faerber SJ, Greeff JM, Lefevre CE, Re DE, Perrett, DI. 2012. African perceptions of female attractiveness. PLoS ONE. 7: e48116

DeBruine, L. M., Jones, B. C., & Perrett, D. I. (2005). Women’s attractiveness judgments of self-resembling faces change across the menstrual cycle. Hormones and Behavior, 47(4), 379-383.

Deschner T., Heistermann M., Hodges K., & Boesch C. (2004). Female sexual swelling size, timing of ovulation, and male behavior in wild West African Chimpanzees. Hormones and Behavior 46, 204-215.

Durante, K. M., Griskevicius, V., Hill, S. E., Perilloux, C., & Li, N. P. (2011). Ovulation, female competition, and product choice: Hormonal influences on consumer behavior. Journal of Consumer Research, 37(6), 921-934.

Elliot, A. J., Tracy, J. L., Pazda, A. D., & Beall, A. T. (2012). Red enhances women’s attractiveness to men: First evidence suggesting universality. Journal of Experimental Social Psychology.

Elliot, A.J, & Pazda, A.D. (2012). Dressed for Sex: Red as a Female Sexual Signal in Humans. PLoS ONE 7(4)

Farrelly, D. (2011). Cooperation as a signal of genetic or phenotypic quality in female mate choice? Evidence from preferences across the menstrual cycle. British Journal of Psychology, 102(3), 406-430.

Gangestad, S. W., & Thornhill, R. (1998). Menstrual cycle variation in women’s preferences for the scent of symmetrical men. Proceedings of the Royal Society of London. Series B: Biological Sciences, 265(1399), 927-933.

Guéguen, N. (2009). Menstrual cycle phases and female receptivity to a courtship solicitation: an evaluation in a nightclub. Evolution and Human Behavior, 30(5), 351-355.

Guéguen N. 2012a. Color and women attractiveness: When red clothed women are perceived to have more intense sexual intent. J. Soc. Psychol. 152: 261-65

Guéguen N. 2012b. Color and women hitchhikers’ attractiveness: Gentlemen drivers prefer red. Color Res. Appl.

Guéguen N. 2012c. Does red lipstick really attract men? An evaluation in a bar. Int. J. Psychol. Stud. 4: 206-9

Guéguen N, Jacob C. 2012. Lipstick and tipping behavior: When red lipstick enhance waitresses tips. Int. J. Hosp. Manag. 31: 1333-35

Guéguen N, Jacob C. 2013a. Clothing color and tipping: Gentlemen patrons give more tips to waitresses with red clothes. J. Hosp. Tour. Res. In press

Guéguen N, Jacob C. 2013b. Color and cyber-attractiveness: Red enhances men’s attraction to women’s internet personal ads. Color Res. Appl. In press

Haselton, M.G., & Gildersleeve, K.A. (2011). Can men detect ovulation? Current Directions in Psychological Science. 20, 87-92.

Jung I, Kim M, Han K. 2011a. The influence of an attractive female model on male users’ product ratings. KHCI conference proceedings. Seoul, South Korea

Jung I, Kim M, Han K. 2011b. Red for romance, blue for memory. HCII International: Posters, Extended Abstracts, Communications in Computer and Information Science. 173: 284-88

Little, A. C., Jones, B. C., & DeBruine, L. M. (2008). Preferences for variation in masculinity in real male faces change across the menstrual cycle. Personality and Individual Differences, 45, 478–482.

Little, A. C., Jones, B. C., & Burriss, R. P. (2007). Preferences for masculinity in male bodies change across the menstrual cycle. Hormones and Behavior, 51(5), 633-639.

Little, A. C., Jones, B. C., Burt, D. M., & Perrett, D. I. (2007). Preferences for symmetry in faces change across the menstrual cycle. Biological Psychology, 76(3), 209-216.

Meier BP, D’Agostino PR, Elliot AJ, Maier MA, Wilkowski BM. 2012. Color in context: Psychological context moderates the influence of red on approach- and avoidance-motivated behavior. PLoS ONE. 7: e40333.

Murayama, K., Pekrun, R., & Fiedler, K. (in press). Research practices that can prevent an inflation of false-positive rates. Personality and Social Psychology Review.

Oberzaucher E, Katina S, Schmehl S, Holzleitner I, Mehu-Blantar I. 2012. The myth of hidden ovulation: Shape and texture changes in the face during the menstrual cycle. J. Evol. Psychol. 10: 163-175

Pazda AD, Elliot AJ, Greitemeyer T. 2012. Sexy red: Perceived sexual receptivity mediates the red-attraction relation in men viewing women. J. Exp. Soc. Psychol. 48: 787-90

Pazda AD, Elliot AJ, Greitemeyer T. 2013. Perceived sexual receptivity and fashionableness: Separate paths linking red and black to perceived attractiveness. Color Res. Appl. In press

Penton-Voak, I. S., & Perrett, D. I. (2000). Female preference for male faces changes cyclically: Further evidence. Evolution and Human Behavior, 21, 39-48.

Prokop, P., & Hromada, M. (2013). Women Use Red in Order to Attract Mates. Ethology.

Re DE, Whitehead RD, Xiao D, Perrett DI. 2011. Oxygenated-blood colour change thresholds for perceived facial redness, health, and attractiveness. PLoS ONE. 6: e17859

Roberts SC, Owen RC, Havlicek J. 2010. Distinguishing between perceiver and wearer effects in clothing color-associated attributions. Evol. Psychol. 8: 350-64

Schwarz S, Singer M. 2013. Romantic red revisited: Red enhances men’s attraction to young, but not menopausal women. J. Exp. Soc. Psychol. 49: 161-64

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.

Stephen ID, Coetzee V, Law Smith M, Perrett DI. 2009. Skin blood perfusion and oxygenization colour affect perceived human health. PLoS ONE. 4: e5083

Stephen ID, Coetzee V, Perrett DI. 2011. Carotenoid and melanin pigment coloration affect perceived human health. Evol. Hum. Behav. 32: 216-27

Stephen ID, Oldham FH, Perrett DI, Barton RA. 2012. Redness enhances perceived aggression, dominance and attractiveness in men’s faces. Evol. Psychol. 10: 562-72

Stephen ID, Scott IML, Coetzee V, Pound N, Perrett DI, Penton-Voak IS. 2012. Cross-cultural effects of color, but not morphological masculinity, on perceived attractiveness of men’s faces. Evol. Hum. Behav.

Stephen ID, Law Smith MJ, Stirrat MR, Perrett DI. 2009. Facial skin coloration affects perceived health of human faces. Int. J. Primatol. 30: 845-57.




Discussion (6 Comments)

  • Brad Stiritz

    Hi, I thought the following comment of yours was quite interesting & posted a query to Andrew about it on his blog (URL below). The replies I got from Andrew & another poster were purely qualitative. The replies seemed to suggest that you might have been speaking rhetorically & not literally asking for a numeric calculation to be performed.

    As an interested layperson, I would greatly appreciate any follow-up you might have on this, whether clarification / re-emphasis / calculation method & results. Thank you in advance.

    >How many researcher degrees of freedom would it take for this to become a figure that would reasonably allow Gelman to suggest that our effect is most likely a false positive? This is a basic math problem, and one that Gelman could solve. Without such calculation, the conclusion that our findings provide no support for our hypothesis would never pass the standards of scientific peer review.


    What was the rationale behind choosing a sample size of 24? How long did the sampling take and how much did it cost?


      We collected data in this first study throughout a semester. We ended the study when the semester ended. Although 24 is a small sample, it’s far more difficult than people assume to get these samples, because women who are taking hormonal birth control (i.e., the pill) must be excluded, and they make up a large proportion of non-pregnant women between 18 and 40, especially on college campuses.


    Why don’t you publish the raw data and let a real statistician do an analysis? That is the spirit of reproducibility.


    I can’t publish an SPSS file online, but we have already sent the raw data to several people who have requested it. We are happy to continue to do so; please email one of us if you would like the raw data.
