- Short’s article was not as controversial as it has been made out to be
- Short based his claims on two recent academic contributions – one statistical comment and one study on FIDE data – that suggested that gender gaps cannot be explained away by participation rates
- There are several technical issues with these papers that cast doubt on their conclusions, particularly for the majority of the population
- Overall, the scientific evidence does not suggest a biological basis for the gender gap for the vast majority of the population
- There is some evidence of a gap at the very highest professional levels; an open question is whether this is due to maternal reasons
I have gotten a lot of feedback on my recent post about GM Nigel Short’s NiC article about gender differences in chess. My previous post was meant to be informal and light on technical analysis, so as to put the gender debate more in the light of what is and isn’t important for us as a community to talk about. Still, the feedback I have received has indicated that there is at least some demand for an explanation of my views from a more scientific perspective. Also, given the amount of unsubstantiated rubbish reported in the media about Short’s article, there’s probably good cause for an objective academic review of his claims as well. What follows will be a little more technical than my regular posts, so for an easier read on the issue, refer back to my original response.
But let’s take a step back for a minute. In my opinion, the main trigger for the surfeit of angry reactions that this story has incited is the lack of a well-defined question. What’s really the issue being discussed here? Here are some of the different claims that various media reports have presented as Short’s main point:
- Women are worse at chess than men
- Women’s brains are naturally less suited to chess
- Women shouldn’t play chess
- Women aren’t smart enough for chess
…and some other things about driving.
Reading these reports, it became very clear to me that almost none of the journalists actually read Short’s article (you can see it in full here, if you like). Some journalists even went so far as to ascribe blatantly sexist but entirely fabricated quotes to Short in their reports – the most cardinal sin of Journalism 101.
In fact, Short says nothing about intelligence in his article, nor does he make any normative claim that women shouldn’t play chess. Indeed, his final sentence begins, “It would be wonderful to see more girls playing chess, and at a high level.” The two claims that do come out are:
- There is a gender gap in chess
- This gap cannot be explained by participation rates
To back his claims, Short quotes a recent academic study by Robert Howard, an Australian psychologist and leading expert on chess research. (Actually, Short quotes Howard’s synopsis in a ChessBase article, which differs in several important ways from the original published study – see below.) Up until Howard’s study, the gender debate had pretty much been put to bed in the academic community thanks to the widely accepted ‘participation hypothesis’: that the gap in performance disappears once one accounts for the fact that far fewer women than men play tournament chess. The definitive article is Bilalic et al., 2009, which found that 96 per cent of observed performance differences between men and women at the top levels of chess can be explained by participation rates.
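To make the participation hypothesis concrete, here is a minimal simulation sketch. It is not the method used by Bilalic et al.; it only illustrates the underlying statistical idea that the best member of a larger group tends to be stronger than the best member of a smaller group, even when both groups are drawn from exactly the same distribution. The function name `expected_top_gap`, the rating mean of 1500, the standard deviation of 300 and the group sizes of 1000 and 100 are all assumptions chosen for illustration, not figures from any of the studies.

```python
import random
import statistics


def expected_top_gap(n_large, n_small, trials=300, mean=1500.0, sd=300.0, seed=0):
    """Average of (best of large group - best of small group).

    Both groups are sampled from the SAME normal distribution, so any
    gap between their top players comes from group size alone.
    All parameter defaults are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    gaps = []
    for _ in range(trials):
        best_large = max(rng.gauss(mean, sd) for _ in range(n_large))
        best_small = max(rng.gauss(mean, sd) for _ in range(n_small))
        gaps.append(best_large - best_small)
    return statistics.mean(gaps)


# Hypothetical 10:1 participation ratio with identical underlying ability.
print("Unequal groups, average top-player gap:", round(expected_top_gap(1000, 100)))
# Control: equal group sizes should give a gap close to zero.
print("Equal groups, average top-player gap:", round(expected_top_gap(100, 100)))
```

The point of the control run is that the gap vanishes when the groups are the same size, which is all the participation hypothesis claims: a difference at the top does not by itself imply a difference in the underlying distributions.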
Short is highly critical of the participation hypothesis, elegantly lambasting the authors:
“Only a bunch of academics could come up with such a preposterous conclusion which flies in the face of observation, common sense and an enormous amount of empirical evidence too.”
While the vast majority of academic literature, and hence ‘observation’ and ‘empirical evidence’ on the topic, supports the participation hypothesis, there are two studies since 2009 that back up Short’s rebuttal: Howard’s 2014 work, and an academic comment by Michael Knapp that criticised the statistical reliability of Bilalic et al.’s study. Yes, only a bunch of academics! Semantics aside, though, it’s time to directly address these articles.
The Knapp comment, despite Short’s claims, can hardly be said to be backed by ‘common sense’. Instead of assuming that the population of chess players follows a normal distribution (as is the case for IQ, height and many other natural phenomena), Knapp adopts a ranking-preference approach that employs a negative hypergeometric distribution.
This doesn’t sound very logical to me. Knapp’s motivation is that the original Bilalic et al. model has a lot of problems forecasting ratings at the tails of the distribution; that is, the model’s accuracy at predicting the strength of the very best players is dubious. This is true, but the same can be said for almost any population where a normal distribution is assumed; IQ tests, for example, are notoriously unreliable for people at either extreme of the scale.
Moreover, it seems to me that a ranking mechanism, as Knapp suggests, faces similar problems at the tails. Ordinal ranking mechanisms don’t distinguish strength differences between ranking positions, so essentially a lot of information that we have via Elo ratings is being ignored. For example, the difference in strength between Carlsen (Elo 2870 as of April 24) and Caruana (2800) is assumed to be the same as between Caruana and the current world number 3, Nakamura (2799). I can’t say for sure that one approach is necessarily better than the other at the extremes – my knowledge of the negative hypergeometric distribution is a little shaky – but for investigating performance gaps for the average population, I don’t find Knapp’s rebuttal very convincing.
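The information lost by ranking can be shown with the three ratings quoted above. The sketch below is only an illustration of that point, not a reconstruction of Knapp’s model: it compares the gap in rank positions with the gap in rating points, and uses the standard Elo expected-score formula to translate a rating gap into an expected result. The helper names `expected_score` and `ranks` are my own.

```python
# Ratings as quoted in the text above.
ratings = {"Carlsen": 2870, "Caruana": 2800, "Nakamura": 2799}

# Ordinal ranks: 1 for the highest rating, 2 for the next, and so on.
ordered = sorted(ratings, key=ratings.get, reverse=True)
ranks = {name: position for position, name in enumerate(ordered, start=1)}


def expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


for stronger, weaker in [("Carlsen", "Caruana"), ("Caruana", "Nakamura")]:
    rank_gap = ranks[weaker] - ranks[stronger]
    rating_gap = ratings[stronger] - ratings[weaker]
    score = expected_score(ratings[stronger], ratings[weaker])
    print(f"{stronger} vs {weaker}: rank gap {rank_gap}, "
          f"rating gap {rating_gap}, expected score {score:.3f}")
```

Both pairs are one rank apart, yet the rating gaps are 70 points and 1 point, which correspond to noticeably different expected scores. A purely ordinal model treats the two pairs identically.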
It’s a similar story with the Howard study, which I found very interesting. The study itself makes only modest claims: At the very top levels, the difference in performance cannot be fully explained by the participation hypothesis. Short seems to exaggerate this result:
“Howard debunks [the participation hypothesis] by showing that in countries like Georgia, where female participation is substantially higher than average, the gender gap actually increases – which is, of course, the exact opposite of what one would expect were the participatory hypothesis true.”
I don’t know what Short is referring to here, because there is nothing in the Howard article that suggests this. Figure 1 of the study shows that the gender gap is, and has always been, lower in Georgia than in the rest of the world for the subsamples tested (top 10 and top 50). Short may be referring to Figure 2, which, to be fair, probably shouldn’t have been included in the final paper. It looks at the gender gap as the number of games increases, but on the previous page of the article, Howard himself acknowledges that accounting for number of games played supports the participation hypothesis at all levels except the very extreme (Chabris and Glickman, 2006). If anything, this figure seems to suggest that the often-quoted statistic of a gender gap of 250 Elo points is vastly inflated. (There is also a third figure in the paper, showing that the career progression of Judit Polgar was similar to that of Garry Kasparov. I have no idea what this is meant to demonstrate.)
There are a couple of issues with the Howard study. The first is that it uses only FIDE ratings data, which does not account for drop-out rates and is statistically biased towards the top of the distribution. The serious problems of using only FIDE data to make inferences about the population are highlighted in an excellent (but rather dry) paper by Vaci, Gula and Bilalic in 2014. In short, the bottom line is that analysis based on FIDE data conflates performance differences with the question “Which gender is more likely to drop out of a chess career?”, which introduces a whole new set of explanations.
The second issue I have is that Howard restricts his sample to players who have played at least 650 FIDE-rated games. That is a heck of a lot of games! Howard has good reasons for doing this from a statistical perspective (see above), but it casts some doubt on the representativeness of the sample. Once we move into this range, we are beginning to talk about gender differences between people who play chess as their profession, rather than just general ability differences among the broader population.
Judit Polgar, the most successful (and famous) female chess player in history, suffered uncharacteristic rating slumps in the period immediately following the birth of each of her children. Professional chess in the open category is a full-time commitment, requiring a rigorous and demanding training regime. Peak performance is usually registered in the age range of 30-40. Should we be terribly surprised that there is a small but persistent gender gap among this extreme subset of chess professionals? I wouldn’t expect anything less.
The final issue I have with the Howard study is in regard to the Georgian data. Howard’s claim is that the very high percentage of female players (around 30 per cent) gives us a good opportunity to test his theory. Unfortunately, as he himself mentions, the sample here is extremely small. There are only 12 Georgian women who met the criterion of 650+ games during the period, and so the statistical power of the comparisons is very low. (Note that the majority of these women are/were also professionals, and thus subject to the maternal pressures mentioned above.)
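To give a rough feel for why 12 players is so few, here is a back-of-the-envelope sketch of how the uncertainty in a group’s average rating shrinks with sample size. The spread of 100 rating points is a purely hypothetical assumption (it is not taken from Howard’s data), and the 1.96 multiplier is the usual normal approximation, which if anything understates the uncertainty for a sample as small as 12.

```python
import math


def standard_error(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


ASSUMED_SD = 100.0  # hypothetical spread of ratings within the group

for n in (12, 50, 500):
    se = standard_error(ASSUMED_SD, n)
    margin = 1.96 * se  # rough 95% margin under a normal approximation
    print(f"n = {n:>3}: standard error ~ {se:.1f}, rough 95% margin ~ +/- {margin:.1f}")
```

Under these assumptions, the margin of error on the average of a 12-player sample is several times wider than for a sample of a few hundred, so only a very large gap could be detected reliably.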
However, the data is useful for answering a different question: Given that the chess culture in Georgia has historically been much more supportive of female players than that of other countries, how does the gender gap compare to the rest of the world? One would assume that if there is a large social component to the gender performance gap, then the most successful country for producing professional female chess players should have less of a gap than the average. Figure 1 of Howard’s paper shows that this is indeed the case. This supports a nurture explanation of the gender gap, but again, the sample size is too small for anything definitive to be concluded.
In saying all of the above, let me finish by stating that I quite like the approach taken by Howard and Knapp in their analyses. I think that it is all too easy for people to approach gender issues from a resolute emotive or philosophical base, rather than being open to new scientific arguments. The participation hypothesis has certainly not been debunked, but neither can one say with certainty that neurological differences don’t play a role, particularly at the highest level. It seems to me on the basis of the current evidence that if we took a newborn boy and girl and asked the question, “Which is more likely to become world chess champion?”, the boy’s chances are slightly higher. But we are talking about minute differences to incredibly minute probabilities to begin with. As to the much more significant question of which would be more likely to beat the other in the future, cultural effects excluded, nothing to date has managed to convince me that there should be a difference at all.