A team of researchers, led from Emory University, recently announced that they had used artificial intelligence to predict patients’ self-reported racial identity from medical images. It is an unexpected, unsettling result.
Neither expert radiologists nor the computer scientists who trained the algorithms can work out what in the images the algorithms – they compared three different deep neural network architectures – are using as the basis for the classification. The result is also astonishingly accurate and weirdly robust. Using the area under the ROC curve, a metric that ranges from 0.5 for totally random to 1.0 for absolutely perfect, the algorithms scored between 0.95 and 0.99 on the classification of subjects as Black, white or Asian when trained using chest X-rays, and between 0.80 and 0.96 using mammograms, CT scans and spinal X-rays.
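For readers unfamiliar with the metric, here is a minimal sketch of how such a score is computed, using scikit-learn; the labels and scores are invented for illustration and have nothing to do with the paper’s data.

```python
# Toy AUC computation: 0.5 would be random guessing, 1.0 a perfect classifier.
# The labels and scores below are invented for illustration only.
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # 1 = Black, 0 = white (toy labels)
y_score = [0.9, 0.2, 0.8, 0.7, 0.3, 0.4, 0.35, 0.1]  # model's predicted probabilities

print(roc_auc_score(y_true, y_score))                # 0.9375 on this toy data
```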
Results as good as these are often a sign that something has gone wrong, that the method is flawed in some way. ‘Reading Race’, however, is an exceptionally careful and thorough piece of work. The experimenters used data from different hospitals, and carried out a wide range of experiments, such as training the algorithms on one dataset and testing them on a completely different one, to ensure that their results were valid. They looked to see if the algorithms were picking up diseases that Black people are more likely to have, or if age, sex, bone density or BMI was giving away the subjects’ racial identity. None of these had an impact strong enough to explain the effect.
The researchers tried removing parts of the images, blurring them and reducing the resolution. The worse the data fed into the algorithms, the worse the performance, but even with images so degraded as to be unrecognisable as X-rays, some information about race was still being picked up. Almost unbelievably, on a set of chest X-rays reduced to a scale of four pixels by four, the algorithm scores 0.63: only a little better than chance, but still better than chance.
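To make the degradation concrete, here is a minimal sketch, using the Pillow imaging library, of what reducing an X-ray to four pixels by four involves; the file path is a placeholder, and the paper’s own preprocessing may well differ.

```python
# Shrink a chest X-ray to a 4x4 grid (16 pixels in total), then scale it back
# up to inspect it. The input path is hypothetical, not from the paper.
from PIL import Image

xray = Image.open("chest_xray.png").convert("L")             # greyscale
tiny = xray.resize((4, 4), Image.Resampling.BILINEAR)        # down to 16 pixels
restored = tiny.resize(xray.size, Image.Resampling.NEAREST)  # upscale to view
restored.save("chest_xray_4x4.png")
```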
I thought for some time about that ‘almost’. It stretches credulity that an algorithm could pick up even a hint of a socially defined construct from just a few bytes of physical data. The paper is available on a preprint server in advance of peer review, and other readers may well spot flaws that aren’t obvious to me. It is possible there is something unusual about the mix of patients at the university hospitals where much of the data comes from, or that patients whose race is recorded in a way that can be associated with their medical images are atypical, and the result is in some way an artefact of the data.
The authors are clear that the result doesn’t mean there is some fundamental difference between races. They cite a 1986 paper summarising the reasons that race is not a biologically useful concept. Geographic variation in gene frequency is gradual and doesn’t fall into a natural set of categories. Individuals who share one trait will differ in another. There is no evidence for a package of genes that differentiates between races, and no biological reason to focus on the traits that are central to the assignment of race. Genetic differences within racial groups are orders of magnitude greater than those between racial groups.
Yet race is nevertheless an important variable in medicine. It is strongly predictive of poor outcomes across a range of diseases. This is not simply because in countries such as the US and the UK it correlates with lower socioeconomic status or less education: ‘Race is not confounded by these other variables, it is antecedent to them.’ The urgent medical question is how best to respond to the impact that race, as a social and political construct, has on health outcomes.
Crucial treatment decisions for patients recovering from Covid-19, for example, are made on the strength of measurements of lung function, assessed using devices known as spirometers. Spirometry software has a built-in adjustment for race, which assumes that the lung capacity of Black people is on average 10 to 15 per cent smaller, and that of Asians 4 to 6 per cent smaller, than that of white people. This means you could have two patients, one Black and one white, with identical measured lung function; because the Black patient’s reading is compared with a lower expected baseline it appears closer to normal, and the spirometry report would indicate that only the white patient required treatment.
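A worked sketch of the arithmetic shows how the adjustment can push identical readings to opposite sides of a treatment threshold. The numbers here are illustrative assumptions, not values taken from any spirometer’s software.

```python
# Illustrative only: how a built-in race 'correction' can split two identical
# readings. None of these constants comes from real spirometry software.

PREDICTED_NORMAL = 4.0   # assumed predicted lung capacity (litres) for a white patient
RACE_ADJUSTMENT = 0.85   # assumes a ~15% lower 'normal' for Black patients
THRESHOLD = 0.80         # assumed rule: treat below 80% of predicted normal

measured = 3.1           # both patients record the same volume, in litres

for label, predicted in [("white", PREDICTED_NORMAL),
                         ("Black", PREDICTED_NORMAL * RACE_ADJUSTMENT)]:
    pct = measured / predicted
    print(f"{label}: {pct:.0%} of predicted -> "
          f"{'treat' if pct < THRESHOLD else 'no treatment'}")
# white: 78% of predicted -> treat
# Black: 91% of predicted -> no treatment
```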
The idea that Black people have smaller lungs can be traced back to the racist ideology of the American South, but the values used in adjustments today are taken from a 1999 survey. There is no known genetic basis for the difference, and many doctors argue that it can be explained by socioeconomic factors, body proportions and occupational hazards, and that these are the factors we should be adjusting for.
The American Heart Association guidelines, meanwhile, categorise Black patients as at lower risk of death from heart failure, which may make them less likely to be allocated to more intensive forms of care. The guidelines give no rationale for this adjustment. And the algorithms used to estimate kidney function from measurements of creatinine levels in the blood are routinely corrected for race because Black people, on average, have higher creatinine levels, but the reasons for this are not understood and it is unclear whether the adjustment is appropriate.
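To make the kidney example concrete: one widely used formula is the MDRD equation for estimated glomerular filtration rate (eGFR), which multiplies its result by a fixed factor for Black patients. A minimal sketch follows, with coefficients as commonly published; they should be checked against a clinical reference before being relied on.

```python
# Sketch of the MDRD eGFR equation with its race multiplier. Coefficients are
# as commonly published; verify against a clinical reference before use.

def egfr_mdrd(creatinine_mg_dl: float, age: int, female: bool, black: bool) -> float:
    egfr = 175 * creatinine_mg_dl ** -1.154 * age ** -0.203
    if female:
        egfr *= 0.742
    if black:
        egfr *= 1.212   # the race 'correction': same blood test, higher estimate
    return egfr

# Identical measurements, different estimates of kidney function:
print(egfr_mdrd(1.4, 60, female=False, black=False))  # ~52
print(egfr_mdrd(1.4, 60, female=False, black=True))   # ~63
```

Because a higher estimate suggests better kidney function, the same blood test can leave a Black patient looking further from the threshold at which referral to specialist care is triggered.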
In these cases it could be argued that including race as a variable in the calculation is at worst a relic of racist ideology and at best an inadequate proxy for variables we should be measuring directly. But the social significance of race is so great that, in other cases, ignoring it will exacerbate rather than remove inequalities.
Take for instance an algorithm used to assess candidates for a university programme. If it is blinded to the applicants’ race, the consequences of race on candidates’ scores for other variables, such as educational attainment, will still have an impact on the outcome. A fairer algorithm could be created by including race explicitly in a causal model of the relationships between these variables. This would be more complicated and less transparent, however, and it is easy to see why simply removing any explicit reference to race might appear the pragmatic solution. Yet the conclusion of ‘Reading Race’ is that, at least for machine learning algorithms applied to medical imaging, this just won’t work. If AI is as good as it seems to be at working out for itself who is Black and who isn’t, we can’t hope to overcome racial bias by blinding an algorithm to a patient’s race.
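The point can be demonstrated with a toy simulation: even when race is withheld from the model, a proxy it shapes – educational attainment, in this invented setup – carries the signal through, and the ‘blinded’ model still scores the two groups differently. Everything below is synthetic; it sketches the mechanism, not any real admissions system.

```python
# A toy demonstration that 'blinding' a model to race does not remove its
# effect: race shapes a proxy (attainment), so the blinded model's scores
# still differ by group. All data below are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
race = rng.integers(0, 2, n)                  # group label; never shown to the model
ability = rng.normal(0.0, 1.0, n)             # what we would like to select on
attainment = ability - 0.8 * race + rng.normal(0.0, 0.5, n)  # proxy depressed for group 1
admitted = (ability > 0.5).astype(int)        # ground truth depends on ability alone

X = attainment.reshape(-1, 1)                 # the model sees only the proxy
model = LogisticRegression().fit(X, admitted)
scores = model.predict_proba(X)[:, 1]

# Equal ability distributions, unequal scores: the bias travels via the proxy.
print("group 0 mean score:", scores[race == 0].mean())   # noticeably higher
print("group 1 mean score:", scores[race == 1].mean())   # penalised via attainment
```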