As chest X-rays of Covid-19 patients began to appear in radiology journals, AI researchers assembled an online database of the images and started experimenting with algorithms that could distinguish them from other X-rays. Early results were astonishingly successful, but disappointment soon followed. The algorithms were responding not to signs of the disease but to minor technical differences between the two sets of images, which had been sourced from different hospitals: the way the images were labelled, for instance, or how the patient was positioned in the scanner.
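This failure mode, often called shortcut learning, is easy to reproduce in miniature. The sketch below is a hypothetical illustration, not the code of any published study: it builds a toy dataset in which a technical artefact (think of a hospital’s labelling style) happens to coincide perfectly with the diagnosis in the training hospitals, while the genuine pathological signal is only weakly informative.

```python
import random

random.seed(0)

def make_scans(n, marker_tracks_label):
    """Each toy 'scan' is a tuple (signal, marker, label).

    signal: the genuine pathology, weakly informative (right 70% of the time).
    marker: a technical artefact that, in the training hospitals, happens to
    coincide perfectly with the diagnosis; elsewhere it is just noise.
    """
    scans = []
    for _ in range(n):
        label = random.randint(0, 1)
        signal = label if random.random() < 0.7 else 1 - label
        marker = label if marker_tracks_label else random.randint(0, 1)
        scans.append((signal, marker, label))
    return scans

def accuracy(scans, feature):
    """Accuracy of predicting the label directly from one feature."""
    return sum(s[feature] == s[2] for s in scans) / len(scans)

train = make_scans(2000, marker_tracks_label=True)   # scans from source hospitals
test = make_scans(2000, marker_tracks_label=False)   # scans from new hospitals

# A lazy 'learner' keeps whichever single feature best fits the training data.
# It picks the artefact (feature 1), which is perfectly correlated there.
best_feature = max((0, 1), key=lambda f: accuracy(train, f))
```

On the training set the artefact scores perfectly and the genuine signal only about 70 per cent, so the learner latches onto the artefact; on scans from new hospitals the artefact is uninformative and accuracy collapses to around chance, while the humbler signal would have generalised.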
It’s a common problem in AI. We often refer to ‘deep’ machine learning because we think of the calculations as being organised in layers and we now use many more layers than we used to, but what is learned is nevertheless superficial. If a group announces that their algorithm can infer sexual orientation from photographs of faces, it may well be responding to differences between the ways people present themselves on dating sites. If a start-up boasts that it can identify criminality from photographs, it’s worth asking if it’s merely sorting police mug shots from Instagram selfies, or, worse, telling us that people with certain skin tones are more likely to get convicted.
Awareness of these problems, and of their social consequences, has been growing. In 2019, an algorithm used to allocate healthcare resources in the US was found to be less likely to recommend preventative measures if the patient was black, because the algorithm was optimised to save costs and less money is spent treating black patients. Around the same time, Timnit Gebru, a leader of Google’s ‘ethical AI’ team and one of the few black women in a prominent role in the industry, demonstrated that commercially available face recognition algorithms are less accurate for women, black people and, especially, black women, because these groups are underrepresented in the data the algorithms are trained on.
From the perspective of an individual researcher, the solution to these problems may be to try harder: to use data that is inclusive and metrics that aren’t discriminatory, and to make sure that you understand, as best you can, what the algorithm is learning and that it isn’t amplifying an existing injustice. To some extent, though, the problems are structural. One reason we don’t pay enough attention to the consequences our algorithms have for women or ethnic minorities is that so few women and so few people of colour work in tech. Groups like Black in AI, co-founded by Gebru, have been set up to try to improve the situation, but the barriers are significant.
AI’s problem with fairness is structural in another way. Much of the work done by small teams builds on huge datasets created by large collaborations or corporations. ImageNet, part-funded by Google, contains the URLs of 14 million images allocated, by anonymous online workers, to more than 20,000 categories. Training algorithms to replicate this classification has been a key challenge in AI, and has done much to transform the field. Many algorithms developed for more specialist tasks take generic networks already trained on ImageNet as their starting point. Most of this research ignores the 2833 categories that deal with people, and it’s easy to see why: the four most populated categories are ‘gal’, ‘grandfather’, ‘dad’ and ‘chief executive officer’; a 2020 audit by Abeba Birhane and Vinay Prabhu concluded that 1593 categories used ‘potentially offensive’ labels. Birhane and Prabhu also report finding pornographic and non-consensual images in the collection. ‘Feeding AI systems on the world’s beauty, ugliness and cruelty,’ they write, ‘but expecting it to reflect only the beauty, is a fantasy.’
Perhaps more troubling than ImageNet is the development of large-scale language models, such as GPT-3, trained on vast quantities of text harvested from the web. The scale of the model is incredible and its capacities are bewildering: one short video shows how to use it to create a kind of virtual accountant, a tool that, given half a dozen sentences describing a business, will generate a working spreadsheet for its transactions.
Last year Gebru helped write a paper, ‘On the Dangers of Stochastic Parrots’, which argues, among other criticisms, that much of the text mined to build models like GPT-3 comes from forums where the voices of women, older people and marginalised groups are under-represented, and that such models will inevitably encode biases that affect the decisions of the systems built on top of them.
The authors of ‘On the Dangers of Stochastic Parrots’ advocate ‘value sensitive design’: researchers should involve stakeholders early in the process and work with carefully curated, and smaller, datasets. The paper argues that the dominant paradigm in AI is fundamentally broken, but its prescription is not state regulation or better algorithms: it is, in effect, a more ethically grounded way of working. It is hard to see how this can be brought about while the field is dominated by large, ruthless corporations, and recent events give little grounds for optimism.
Google had already circulated a memo calling on its researchers to ‘strike a positive tone’ in discussions of the company’s technology. A pre-publication check of the stochastic parrots paper seems to have alarmed the management. They suggested changes, but Gebru stood her ground and, according to Google, resigned in December. By Gebru’s account, she was sacked. Shortly afterwards her colleague Margaret Mitchell was suspended, allegedly for running scripts to search emails for evidence of discriminatory treatment of Gebru. One of the authors on the published preprint of the paper is named as ‘Shmargaret Shmitchell’, and an acknowledgement notes that some of the authors were required by their employers to remove their names.
There has been a response. More than a thousand Google employees have signed an open letter calling on the company to explain its treatment of Gebru. Workers have formed a trade union, at least partly in response to these events. A director and a software developer have resigned.