April 24, 2015 / Chess

Men, Women and Nigel Short 2: An academic response

SUMMARY:

Short’s article was not as controversial as it has been made out to be
Short based his claims on two recent academic contributions – one statistical comment and one study on FIDE data – that suggested that gender gaps cannot be explained away by participation rates
There are several technical issues with these papers that cast doubt on their conclusions, particularly for the majority of the population
Overall, the scientific evidence does not suggest a biological basis for the gender gap for the vast majority of the population
There is some evidence of a gap at the very highest professional levels; an open question is whether this is due to maternal reasons

I have gotten a lot of feedback on my recent post about GM Nigel Short’s NiC article about gender differences in chess. My previous post was meant to be informal and light on technical analysis, so as to put the gender debate more in the light of what is and isn’t important for us as a community to talk about. Still, the feedback I have received has indicated that there is at least some demand for an explanation of my views from a more scientific perspective. Also, given the amount of unsubstantiated rubbish reported in the media about Short’s article, there’s probably good cause for an objective academic review of his claims as well. What follows will be a little more technical than my regular posts, so for an easier read on the issue, refer back to my original response.

But let’s take a step back for a minute. In my opinion, the main powder keg for the surfeit of angry reactions that this story has incited is the lack of a well-defined question. What’s really the issue being discussed here? Here are some of the different angles that various media reports have claimed to be Short’s main point:

Women are worse at chess than men
Women’s brains are naturally less suited to chess
Women shouldn’t play chess
Women aren’t smart enough for chess

…and some other things about driving.

Reading these reports, it became very clear to me that almost none of the journalists actually read Short’s article (you can see it in full here, if you like). Some journalists even went so far as to ascribe blatantly sexist but entirely fabricated quotes to Short in their reports – the most cardinal sin of Journalism 101.

In fact, Short says nothing about intelligence in his article, nor does he make any normative claim that women shouldn’t play chess. Indeed, his final sentence begins, “It would be wonderful to see more girls playing chess, and at a high level.” The two claims that do come out are:

There is a gender gap in chess
This gap cannot be explained by participation rates

To back his claims, Short quotes a recent academic study by Robert Howard, an Australian psychologist and leading expert on chess research. (Actually, Short quotes Howard’s synopsis in a ChessBase article, which differs in several important ways from the original published study – see below.) Up until Howard’s study, the gender debate had pretty much been put to bed in the academic community thanks to the widely accepted ‘participation hypothesis’: that the gap in performance disappears once one accounts for the fact that far fewer women than men play tournament chess. The definitive article is Bilalic et al., 2009, which found that 96 per cent of observed performance differences between men and women at the top levels of chess can be explained by participation rates.
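To make the participation hypothesis concrete, here is a minimal sketch of the statistical idea behind it. It is not the actual model of Bilalic et al.; it simply assumes that two groups are drawn from the same normal distribution and asks how strong the single best player in each group is likely to be. The mean, standard deviation and group sizes are illustrative assumptions, not figures from any study.

```python
from statistics import NormalDist


def median_top_rating(n_players, mean, sd):
    """Median rating of the best of n_players drawn from Normal(mean, sd).

    P(max <= x) = cdf(x) ** n, so the median of the maximum solves
    cdf(x) = 0.5 ** (1 / n).
    """
    quantile = 0.5 ** (1.0 / n_players)
    return NormalDist(mean, sd).inv_cdf(quantile)


# Illustrative assumptions: identical ability distribution, 10:1 participation.
MEAN, SD = 1800.0, 300.0
large_group = median_top_rating(100_000, MEAN, SD)
small_group = median_top_rating(10_000, MEAN, SD)

print(f"best of 100,000: {large_group:.0f}")
print(f"best of  10,000: {small_group:.0f}")
print(f"gap at the top from group size alone: {large_group - small_group:.0f}")
```

Under these assumptions the larger group produces a stronger top player even though nobody in it is 'naturally' better, which is the effect the participation hypothesis relies on.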

Short is highly critical of the participation hypothesis, elegantly lambasting the authors:

“Only a bunch of academics could come up with such a preposterous conclusion which flies in the face of observation, common sense and an enormous amount of empirical evidence too.”

While the vast majority of academic literature, and hence ‘observation’ and ‘empirical evidence’ on the topic, supports the participation hypothesis, there are two studies since 2009 that back up Short’s rebuttal: Howard’s 2014 work, and an academic comment by Michael Knapp that criticised the statistical reliability of Bilalic et al.’s study. Yes, only a bunch of academics! Semantics aside, though, it’s time to directly address these articles.

The Knapp comment, despite Short’s claims, can hardly be said to be backed by ‘common sense’. Instead of assuming that the population of chess players follows a normal distribution (as is the case for IQ, height and many other natural phenomena), Knapp adopts a ranking-preference approach that employs a negative hypergeometric distribution.

This doesn’t sound very logical to me. Knapp’s motivation is that the original Bilalic et al. model has a lot of problems forecasting ratings at the tails of the distribution; that is, the model’s accuracy at predicting the strength of the very best players is dubious. This is true, but the same can be said for almost any population where a normal distribution is assumed; IQ tests, for example, are notoriously unreliable for extremely smart (or vice-versa) people.

Moreover, it seems to me that a ranking mechanism, as Knapp suggests, faces similar problems at the tails. Ordinal ranking mechanisms don’t distinguish strength differences between ranking positions, so essentially a lot of information that we have via ELO ratings is being ignored. For example, the difference in strength between Carlsen (ELO 2870 as of April 24) and Caruana (2800) is assumed to be the same as between Caruana and the current world number 3, Nakamura (2799). I can’t say for sure that one approach is necessarily better than the other at the extremes – my knowledge of the negative hypergeometric distribution is a little shaky – but for investigating performance gaps for the average population, I don’t find Knapp’s rebuttal very convincing.
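The information lost by an ordinal ranking can be seen with the three ratings quoted above. The sketch below uses the standard Elo expected-score formula; the function name is just a label chosen for this illustration.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the text (as of April 24, 2015).
ratings = [("Carlsen", 2870), ("Caruana", 2800), ("Nakamura", 2799)]

gaps = []
for rank, ((name_hi, r_hi), (name_lo, r_lo)) in enumerate(zip(ratings, ratings[1:]), start=1):
    gap = r_hi - r_lo
    gaps.append(gap)
    print(f"rank {rank} vs rank {rank + 1}: rank gap 1, rating gap {gap}, "
          f"expected score for {name_hi} {expected_score(r_hi, r_lo):.3f}")
```

Both pairs are one ranking place apart, yet the rating gaps (70 points versus 1 point) imply quite different expected results, and a purely ordinal model cannot see that difference.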

It’s a similar story with the Howard study, which I found very interesting. The study itself makes only modest claims: At the very top levels, the difference in performance cannot be fully explained by the participation hypothesis. Short seems to exaggerate this result:

“Howard debunks [the participation hypothesis] by showing that in countries like Georgia, where female participation is substantially higher than average, the gender gap actually increases – which is, of course, the exact opposite of what one would expect were the participatory hypothesis true.”

I don’t know what Short is referring to here, because there is nothing in the Howard article that suggests this. Figure 1 of the study shows that the gender gap is, and has always been, lower in Georgia than in the rest of the world for the subsamples tested (top 10 and top 50). Short may be referring to Figure 2, which, to be fair, probably shouldn’t have been included in the final paper. It looks at the gender gap as the number of games increases, but on the previous page of the article, Howard himself acknowledges that accounting for number of games played supports the participation hypothesis at all levels except the very extreme (Chabris and Glickman, 2006). If anything, this figure seems to suggest that the often-quoted statistic of a gender gap of 250 ELO points is vastly inflated. (There is also a third figure in the paper, showing that the career progression of Judit Polgar was similar to that of Garry Kasparov. I have no idea what this is meant to demonstrate.)

There are a couple of issues with the Howard study. The first is that it uses only FIDE ratings data, which does not account for drop-out rates and is statistically biased towards the top of the distribution. The serious problems of using only FIDE data to make inferences about the population are highlighted in an excellent (but rather dry) paper by Vaci, Gula and Bilalic in 2014. In short, the bottom line is that analysis based on FIDE data conflates performance differences with the question “Which gender is more likely to drop out of a chess career?”, which introduces a whole new set of explanations.
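A toy calculation shows how unaccounted drop-outs can distort a rating list. Everything here is a hypothetical assumption made for illustration: the group sizes, the drop-out shares, and the idea that players who quit are frozen at a lower rating than the one reached by those who keep playing. It is not the method used by Vaci, Gula and Bilalic.

```python
def mean_listed_rating(n_players, dropout_share, active_rating, frozen_rating, keep_inactive):
    """Mean rating of a list that either keeps or removes players who quit.

    Assumption (hypothetical): players who quit are frozen at `frozen_rating`,
    while those who stay reach `active_rating`.
    """
    n_dropped = round(n_players * dropout_share)
    n_active = n_players - n_dropped
    total = n_active * active_rating
    count = n_active
    if keep_inactive:  # the rating just pauses and stays on the list
        total += n_dropped * frozen_rating
        count += n_dropped
    return total / count


# Hypothetical numbers: two groups of identical strength, different drop-out rates.
for label, share in [("low drop-out", 0.2), ("high drop-out", 0.5)]:
    paused = mean_listed_rating(1000, share, 1900, 1500, keep_inactive=True)
    removed = mean_listed_rating(1000, share, 1900, 1500, keep_inactive=False)
    print(f"{label}: list keeping paused ratings {paused:.0f}, list removing them {removed:.0f}")
```

In this sketch the two groups look identical on a list that removes lapsed players, but the group with more drop-outs looks weaker on a list that keeps paused ratings, even though active players in both groups are equally strong.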

The second issue I have is that Howard restricts his sample to players who have played at least 650 FIDE-rated games. That is a heck of a lot of games! Howard has good reasons for doing this from a statistical perspective (see above), but it casts some doubt on the representativeness of the sample. Once we move into this range, we are beginning to talk about gender differences between people who play chess as their profession, rather than just general ability differences among the broader population.

Judit Polgar, the most successful (and famous) female chess player in history, suffered uncharacteristic rating slumps in the period immediately following the birth of each of her children. Professional chess in the open category is a full-time commitment, requiring a rigorous and demanding training regime. Peak performance is usually registered in the age range of 30-40. Should we be terribly surprised that there is a small but persistent gender gap among this extreme subset of chess professionals? I wouldn’t expect anything less.

The final issue I have with the Howard study is in regard to the Georgian data. Howard’s claim is that the very high percentage of female players (around 30 per cent) gives us a good opportunity to test his theory. Unfortunately, as he himself mentions, the sample here is extremely small. There are only 12 Georgian women that met the criteria of 650+ games during the period, and so the power of the comparisons is very weak. (Note that the majority of these women are/were also professionals, and thus subject to the maternal pressures mentioned above.)
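To give a feel for how little a sample of 12 can tell us, here is a rough sketch of the standard error of a mean rating at different sample sizes. The rating spread of 100 points is a hypothetical value chosen for illustration, not a figure from Howard's data, and the 95% margin uses a normal approximation that is itself optimistic for a sample this small.

```python
import math


def standard_error(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


SD = 100.0  # hypothetical rating spread within the group; not from Howard's data
for n in (12, 120, 1200):
    se = standard_error(SD, n)
    # Rough 95% interval half-width under a normal approximation.
    print(f"n = {n:4d}: standard error {se:5.1f}, 95% margin about +/- {1.96 * se:5.1f}")
```

With only 12 players, the uncertainty around the group mean is of the same order as many of the gaps being debated, which is why comparisons based on this subsample have so little power.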

However, the data is useful to answer a different question: Given that the chess culture in Georgia has historically been much more supportive of female players than other countries, how does the gender gap compare to the rest of the world? One would assume that if there is a large social component to the gender performance gap, then the most successful country for producing professional female chess players should have less of a gap than the average. Figure 1 of Howard’s paper shows that this is indeed the case. This supports a nurture argument to the gender gap, but again, the sample size is too small for anything definitive to be concluded.

In saying all of the above, let me finish by stating that I quite like the approach taken by Howard and Knapp in their analyses. I think that it is all too easy for people to approach gender issues from a resolute emotive or philosophical base, rather than being open to new scientific arguments. The participation hypothesis has certainly not been debunked, but neither can one say with certainty that neurological differences don’t play a role, particularly at the highest level. It seems to me on the basis of the current evidence that if we took a newborn boy and girl and asked the question, “Which is most likely to become world chess champion?”, the boy’s chances are slightly higher. But we are talking about minute differences to incredibly minute probabilities to begin with. As to the much more significant question of which would be more likely to beat the other in the future, cultural effects excluded, nothing to date has managed to convince me that there should be a difference at all.

26 Comments

Greg Waite says:

May 31, 2015 at 7:46 pm

I’m aged 60 now, played serious chess up to 20, only for fun since. I’ve occasionally pondered what pushed my contemporary Murray Chandler to the top. Reading about some of my era’s greats – Fischer, Tal, Spassky, Marshall – their lives intersected in so many ways with extraordinarily compelling people and cultures. And those special links I think are typically “male” cultures, less likely to assist or inspire girls who still today are socially conditioned pretty differently in all the families I’ve ever known.

I like statistics too, it’s great to see people taking gender analysis seriously, but technical analysis will never come close to capturing 100% of something so diverse and complex. I hope the talk helps to stoke discussion though on ways to encourage both sexes to enjoy chess 🙂
Pac says:

May 3, 2015 at 8:01 pm

looks to me like smerdon is being a smart ass while defending a misogynist

a whole article contemplating nigel short’s idiot comments…
David Smerdon says:

April 28, 2015 at 7:42 pm

Some excellent comments. Stringer Bell’s analysis does suggest that perhaps there isn’t a quirky natural bimodal distribution for females after all. (What a shame – that would be extremely interesting to research!) It definitely appears that using ELO data is a risky approach. On the other hand, I’m becoming more convinced that a gap persists at the top echelons. That doesn’t mean that I believe that women are less hard-wired for chess – I need much more evidence to persuade me about that – but certainly with regard to, say, the probability of becoming world champion, there seems to be a distinct gap.
Jim Ratliff says:

April 27, 2015 at 2:46 pm

Thanks for the thoughtful and carefully analyzed article, which clearly needed to be written!

There’s one other argument that I don’t think is subsumed in your discussion: I’ve often heard, though I don’t know whether this is factual, that for many traits the distribution of male values has higher variance than the distribution of female values. (If this were true for intelligence, for example, assuming equal means, you’d expect the proportion of male geniuses to be larger than the proportion of female geniuses but also the proportion of male morons to be larger than the proportion of female morons.)

If this were true of native chess ability, then one would expect that, even after equalizing all other factors, the World Champion would persistently be a man.

Are you aware of empirical research regarding whether that claim about higher male variance is true in general or about whether that’s a reasonable hypothesis for the relevant traits underlying chess ability?
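The arithmetic behind the variance argument can be sketched as follows. This only illustrates the hypothesis and is not evidence for it: both groups are assumed to be exactly normal with equal means, and the 10% difference in standard deviation is a hypothetical value.

```python
from statistics import NormalDist


def tail_shares(sd, mean=0.0, cutoff=3.0):
    """Share of a Normal(mean, sd) population above +cutoff and below -cutoff."""
    dist = NormalDist(mean, sd)
    upper = 1.0 - dist.cdf(cutoff)
    lower = dist.cdf(-cutoff)
    return upper, lower


# Hypothetical: equal means, one group with a 10% larger standard deviation.
narrow_upper, narrow_lower = tail_shares(sd=1.0)
wide_upper, wide_lower = tail_shares(sd=1.1)
print(f"upper-tail ratio (wide / narrow): {wide_upper / narrow_upper:.2f}")
print(f"lower-tail ratio (wide / narrow): {wide_lower / narrow_lower:.2f}")
```

Under these assumptions the higher-variance group is overrepresented in both tails by the same factor, which is exactly the pattern described in the comment.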
Clifford says:

April 27, 2015 at 1:41 pm

Perhaps the older men’s ratings are also inflated by the old 2205 floor for men, 200 points higher than the women’s floor (which was later reduced from 2000 to 1800 before floors became uniform).
Stringer Bell says:

April 26, 2015 at 10:10 pm

Sure.

Here is a box-whisker-plot to compare percentiles
[img]http://i.imgur.com/DvVp8dw.jpg[/img]
Obviously, a strong woman is better than a mediocre man, but at the same percentile men have higher ratings.

Histograms of the distribution of ratings
[img]http://i.imgur.com/0ZY6xY5.jpg[/img]
Both distributions look Gaussian with the mean of the men’s ratings being around 200 points higher.

And overlapping histograms with the relative frequencies of ratings
[img]http://i.imgur.com/XHjyseD.jpg[/img]
As mentioned before men are overrepresented at higher ratings, women are overrepresented at the lower echelons.
Najdork says:

April 26, 2015 at 12:45 pm

Ah, a 2000 elo floor, that is a good explanation… I do vaguely remember you needed to have high rating to get fide elo once, maybe 20 years ago…

Now it’s easier and easier to get a fide elo, soon almost everyone will have one.

Btw Stringer if you could upload graph pictures that would be great!
Stringer Bell says:

April 26, 2015 at 10:20 am

A weird spike in the frequencies of women with a rating between 2000 and 2100 could be explained by the fact that the Elo floor has been decreased too recently, and there still are too many women with a rating that mostly stems from an overperformance. E.g., a player with a playing strength of 1850 played three tournaments, one with a performance of 1650, one around 1850, and one around 2050. With an Elo floor of 2000 this player would have gotten a rating of 2050.
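The selection effect in this example can be written out in a few lines. The rule used here is a deliberate simplification and an assumption of this sketch, not FIDE's actual rating regulations: a tournament performance below the floor is simply not counted, and the published rating is the mean of whatever remains.

```python
def published_rating(performances, floor):
    """Average of the tournament performances that clear the rating floor.

    Simplifying assumption: a performance below the floor is simply not
    counted, and the published rating is the mean of the rest.
    """
    counted = [p for p in performances if p >= floor]
    return sum(counted) / len(counted) if counted else None


performances = [1650, 1850, 2050]  # the three tournaments from the example above
print(published_rating(performances, floor=2000))
print(published_rating(performances, floor=0))
```

With a floor of 2000 only the overperformance survives, so the player is listed 200 points above the 1850 average of all three events.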

Another reason to use different data than Elos. Alas, the distribution of ratings in Germany looks Gaussian for women, too.
David Smerdon says:

April 26, 2015 at 3:45 am

The weird bimodal distribution of women’s ratings is totally uncanny. For me, this is the most interesting thing to come out of this debate by far!
Stringer Bell says:

April 25, 2015 at 11:06 pm

One problem with the Elo ratings is that we only observe above-average players. The sample is biased. As already mentioned, in Germany every active player gets a rating (DWZ = Deutsche Wertungs-Zahl). Following Najdork’s approach and wondering if there still would be a difference in the average rating in an unbiased sample, I downloaded the German ratings as a csv-file from http://www.schachbund.de/download.html.

Here is some R code:

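data <- read.csv("spieler.csv")
# Keep plausible birth years and positive ratings. The original line also put an
# upper bound on Geburtsjahr, which did not survive; it is left out here.
dwz <- subset(data, Geburtsjahr > 1900 & DWZ > 0)
# Keep only the columns used below (positions depend on the layout of spieler.csv)
dwz <- dwz[, c(4, 5, 9:12)]
male <- subset(dwz, Geschlecht == "M")
female <- subset(dwz, Geschlecht == "W")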

Let's compare the means and medians:

round(tapply(dwz$DWZ, dwz$Geschlecht, mean))
boxplot(DWZ~Geschlecht, data= dwz)

On average men's ratings are almost 200 points higher than women's! Though it is already pretty obvious we can humour ourselves with a test for significance:

t.test(male$DWZ, female$DWZ)

A p-value < 2.2e-16 is as clear as it gets. The null hypothesis that on average men and women have the same rating has to be rejected.
You can play around and try to create nicer graphics

library(scales)
hist(male$DWZ, col=alpha(5, 0.5), xlab="DWZ", main="Gender Gap", xlim=c(0,3000), freq=F, ylab="", breaks=10, yaxt="n")
hist(female$DWZ, col=alpha('yellow3', 0.5), xlab="DWZ", xlim=c(0,3000), freq=F, breaks=10, add=T)
legend("topleft",legend=c("Men","Women"), col=c(5, "yellow3"), lwd=8)

This one shows that women are overrepresented in the lower rating echelons while men are overrepresented in the higher spheres.

Conclusion: I don't see how the difference can be discussed away unless you don't want to let the truth get in the way of political correctness. Nigel is right.
Barry says:

April 25, 2015 at 11:14 am

Hi David,

Nice article although I get the feeling you are being more generous with participation side of the debate than perhaps is warranted on what I’ve seen (which is admittedly limited). The assumption of normal distribution is I think a serious problem with Bilalic’s study when you are trying to infer something about the general participation by analysing the tails.

Anyway two main comments are

(i) What do you make of the link between spatial analysis in IQ tests and ability in chess (and STEM disciplines)? There are studies which show no gender differences until age 15 and then a small advantage for males thereafter.

(ii) I would avoid characterising any innate gender difference as neurological. Short and others who use the term “hardwired” obviously imply something like this but at present none of the research is close to showing a neurological difference – as far as I have seen, all they are attempting to prove is an innate difference. This could result from, say, hormonal difference with no significant contribution from the brain or CNS.

Minor comment is that I would have found a list of references at the bottom of your post helpful.

Keep up the good work.
Najdork says:

April 25, 2015 at 6:51 am

Looks like many women (but also men) are satisfied when they reach 2k. Lol reminds me of myself, as a 1900 once I reach 2k I’ll retire 😉
Najdork says:

April 25, 2015 at 6:43 am

Thanks for your replies David.

Here are pics I made of the distributions of the fide ratings in case anyone is interested

male: http://i.imgur.com/LI6qtnn.png
female: http://i.imgur.com/jcsYnbA.png

The women’s one is weird! I thought I made a mistake but if Tattarisu gets the same…
David Smerdon says:

April 25, 2015 at 6:22 am

Dear Martin,

I think you’re right with the reference. However, as you guess, this cross-country comparison wasn’t included in the published paper. I’m always a bit nervous when I see an academic extend the findings of a study in less formal writings, as it usually means there was something at least a little dubious about these extra results – for example, they may not be statistically significant. However, without seeing the data, I can’t really judge.
David Smerdon says:

April 25, 2015 at 6:19 am

@Jean-Michel:

Yes, Nigel has read my posts. On IM Alex Wohl’s facebook page, Nigel wrote some very kind words: “That is a superb piece, David. Thank you for advancing my understanding.” It’s hard for me to criticise academic cherry-picking in this case seeing as I wasn’t even aware of the Howard study until I read Nigel’s article – so in that respect as well as others, both of us have benefited from the debate!
David Smerdon says:

April 25, 2015 at 6:15 am

@Najdork:

With regard to (1), my point is that a lot more needs to be said regarding the statistics before one can claim a significant correlation, let alone causation. In this respect Tattarisuo’s comment regarding a Gaussian distribution for men and a multimodal distribution for women is very interesting (and unexpected, at least to me). But I have to concede that my ‘extreme top’ suggestion doesn’t seem to have any support in your data.

With regard to (2), as has been shown before, most national ratings data (e.g. Germany) keeps track of drop-outs (e.g. by lapsing memberships), while in the case of FIDE, their rating just pauses, so to speak. So given that women have a higher drop-out rate, FIDE data biases the ratings distribution downwards for this gender (and no, I don’t think you can find FIDE drop-out data anywhere).

As to qualitative selection arguments: yes, you’re right; without further rigorous analysis they can be used to support any side. I wonder whether it’s more likely that more of the ‘better females’ keep playing than the ‘better males’. Hard to say. At least there is some anecdotal evidence for my maternal-selection claim. The only tangible evidence we have regarding selection, I believe, is the higher drop-out rate for women.

I think we can probably conclude that the debate could certainly use more quantitative analysis, from all sides!
Najdork says:

April 25, 2015 at 4:47 am

As for qualitative selection/participation arguments I might as well argue that since chess is “female unfriendly” only the better females actually play tournaments and get a fide rating.
Najdork says:

April 25, 2015 at 4:28 am

1) I provided the standard deviation which is actually smaller for men. So the “average is pulled up by the extreme top” theory, as vague as it is, doesn’t sound consistent with this.

Actually I just checked out the Median values (1721 for women and 1893 for men) and it appears that the opposite is actually occurring (the average for men is lower than the median more significantly than for women)

2) This is a possible theory, but I don’t know how this dropout data is defined or where it is published, I do know however that women play on average more games than men considering the whole database, so IF there is a greater dropout rate it is actually more than compensated by women playing more games on average.

So yeah if you want to claim something you can always imagine hidden, unaccountable, unavailable variables that skew the available data.
Jean-Michel says:

April 25, 2015 at 4:21 am

Thanks David! Finally some actual facts and science to go with the overblown opinions we got everywhere else. This was very enlightening. Has Nigel read your post, and has he found it convincing? Because it certainly seems that he, like most everyone, picks and chooses from the scientific literature those studies that back his prior assumptions.
Ole Petter Pedersen says:

April 25, 2015 at 3:21 am

Very interesting and easy to read article. As a complete amateur used to being easily beaten by talented young Norwegian Chess girls, I could use my anecdotal experience as ‘evidence’, which is how most people form an opinion. Just getting to the point of having FIDE rating requires so much time spent on Chess, that many never get there in the first place. And an average difference of 150 Elo points is not really massive, considering the scale is from 0 to 2900 (almost). Though not a linear scale, 150 points is approx. 5 percent difference on this scale. Which is insignificant to normal people.

Hopefully, Hou Yifan will challenge for the world championship in a few years time, and then that anecdotal ‘evidence’ will put this silly discussion to rest.
Tattarisuo says:

April 25, 2015 at 3:20 am

I did the same as Najdork but instead of calculating mean and std I plotted normalized histograms of the data. The result was quite interesting. While the male curve was a skewed (towards higher rating) gaussian distribution peaking around 2100, the female curve looked like two overlaid distributions one peaking at 1600 and the other at 2100, similar to men. My hypothesis is that among the female players there are two groups, a casual group that plays but doesn’t study and an ambitious group that studies.
Martin Bennedik says:

April 25, 2015 at 3:17 am

Dear David,

thank you for the interesting discussion.

You ask:

“I don’t know what Short is referring to here, because there is nothing in the Howard article that suggests this.”

I think Short is referring to the following sentence from the Chessbase article:

“If the participation rate hypothesis is correct, the rating sex difference should decline as the percentage of females in a group of nations increases. But numerically we find just the opposite trend, as the figure shows.”

If I understand you correctly, this is not in the published paper?
David Smerdon says:

April 25, 2015 at 2:31 am

It’s hard to respond to this comment, because there is just so much wrong with the link from these computed averages to the conclusion “men are better than women at chess”. I can only assume that you wrote your comment without reading either of my posts at all. Let’s just highlight three issues for starters:

– Without distributional data, it’s impossible to tell whether the gap is being largely driven by the extreme top (as discussed in the post)
– Using FIDE data biases the analysis due to unaccounted drop-outs (the rate for which is statistically higher for women)
– Without accounting for total games played, the participation hypothesis predicts that the female rating distribution will be more ‘bottom-heavy’, which is wholly consistent with your calculations.

Nobody disputes these statistics, which have in fact been reported in academic studies in the past. One can find a correlation, but it would take more than your average mental gymnastics to turn that into a useful gender inference.
Najdork says:

April 25, 2015 at 1:14 am

I always wondered what the AVERAGE rating of male and female chess players was but I never found this data on the internet.

In the end I finally took the time and effort to do what very few of these “theoretical” articles bothered to do: download the latest raw fide data, parse it in a database and analyse it.

Based on January 2015 data (standard rating, not blitz or rapid) I found that

There are 178969 male rated players and 18600 female players

Male average rating is 1866 (standard deviation 288)

Female average rating is 1715 (standard deviation 305)

The average number of games per month for women was 0.96
The average number of games per month for men was 0.91

(In absence of better data average number of games per month could be used as a proxy for the average “effort” made by a player to improve)

It will take some convoluted mental gymnastics to argue against “men are better than women at chess”
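As a rough guide to how much the two rating lists overlap, the summary statistics above can be turned into an effect size. This is only a sketch: it assumes both rating distributions are independent and normal, which may not hold for these lists, and it describes the rated lists only, not underlying ability.

```python
import math
from statistics import NormalDist

# Summary statistics quoted above (FIDE standard ratings, January 2015).
MEN = {"mean": 1866.0, "sd": 288.0}
WOMEN = {"mean": 1715.0, "sd": 305.0}


def prob_first_higher(a, b):
    """P(random draw from a > random draw from b), assuming independent normals."""
    diff_mean = a["mean"] - b["mean"]
    diff_sd = math.sqrt(a["sd"] ** 2 + b["sd"] ** 2)
    return NormalDist().cdf(diff_mean / diff_sd)


def cohens_d(a, b):
    """Standardised mean difference using the simple average of the two variances."""
    pooled_sd = math.sqrt((a["sd"] ** 2 + b["sd"] ** 2) / 2.0)
    return (a["mean"] - b["mean"]) / pooled_sd


p = prob_first_higher(MEN, WOMEN)
d = cohens_d(MEN, WOMEN)
print(f"P(random rated man outrates random rated woman) = {p:.2f}")
print(f"standardised difference (Cohen's d) = {d:.2f}")
```

Under the normality assumption, a 151-point difference in means corresponds to a randomly chosen rated man outrating a randomly chosen rated woman a little under two times in three, so the two lists overlap substantially.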
