Research into the generation and interpretation of natural language, the field computer scientists call natural language processing, has made extraordinary progress over the last ten years, and powerful systems now have an astonishing capacity to emulate written thought. I decided to ask the new AI chatbot, ChatGPT, some of the exam questions I’d written for a course on using digital technology in healthcare. Here’s part of the answer to a question about how to evaluate the use of video consultations in primary care:

To design this study, we would need to recruit a large sample of patients who are eligible for video consultations and randomly assign them to either receive care via video consultations or in-person consultations. We would then need to follow these patients over time to assess their health outcomes, such as their blood pressure, cholesterol levels and overall health status.

The key metrics to be measured in this study would include the percentage of patients who achieve target levels for key health indicators, such as blood pressure and cholesterol levels, as well as patient satisfaction with the care they received. We would also need to measure the cost-effectiveness of video consultations, including any savings in terms of travel time and costs for patients who use this service.

There aren’t any extraordinary insights here – it’s not the sort of question that seeks a distinctive response – but it’s coherent, comprehensive and sticks to the point, something students often fail to do.

Google transformed the field of natural language processing in 2013 with ‘word2vec’, a program that computed ‘word embeddings’, statistical constructions that allow the meaning of a word to be represented mathematically. Imagine we analysed a corpus of millions of sentences containing, say, 10,000 different words and used the cells of a spreadsheet to count how often words appeared together. There might be a row for ‘red’ and a column for ‘car’: the cell at their intersection would show how often they co-occurred. Since a spreadsheet is in effect a matrix, and a matrix of this kind can be closely approximated by the product of two much smaller matrices, it is possible to find a more compact representation of the data, one with a row for each word but only, say, a hundred columns, each containing a number that summarises something about the way the word is used in conjunction with other words (for instance, the likelihood of ‘car’ occurring with ‘red’, ‘blue’ or any other colour). If each number is taken to be a position on a dimension in space, we can use the sequence of numbers as a set of co-ordinates determining the position of the word and, therefore, a representation of its meaning. The dimensions define a conceptual realm within which words with similar meanings will be found close together and words with dissimilar meanings further apart.
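The spreadsheet picture can be made concrete with a toy sketch. It is not word2vec: the six sentences are invented, the helper names are hypothetical, and the compression step that reduces each row to a hundred or so columns is left out, so each word is represented by its raw row of counts. Even so, words used in similar company end up pointing in the same direction.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Toy corpus, invented for illustration only.
SENTENCES = [
    "the red car stopped",
    "the blue car stopped",
    "the red bus stopped",
    "the blue bus stopped",
    "she ate a ripe apple",
    "she ate a ripe pear",
]


def cooccurrence_counts(sentences):
    """Count how often each pair of distinct words shares a sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def row_vector(word, vocabulary, counts):
    """The spreadsheet row for `word`: one count per word in the vocabulary."""
    return [counts[(word, other)] for other in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


VOCABULARY = sorted({word for s in SENTENCES for word in s.split()})
COUNTS = cooccurrence_counts(SENTENCES)


def similarity(a, b):
    """Compare two words by the direction of their co-occurrence rows."""
    return cosine(row_vector(a, VOCABULARY, COUNTS),
                  row_vector(b, VOCABULARY, COUNTS))


print(similarity("car", "bus"))    # close to 1: same neighbouring words
print(similarity("car", "apple"))  # 0.0: no neighbouring words in common
```

In this corpus ‘car’ and ‘bus’ never appear together, yet they sit in the same place because they keep the same company, which is the sense in which position stands in for meaning.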

The second big breakthrough came in 2017 when a team at Google Brain presented a conference paper called ‘Attention Is All You Need’ describing the way word embeddings could be used in a neural network to get a representation of the meaning of sentences and longer passages of text. Machine learning is particularly suited to the task of predicting the next word in a sequence – a subject of special interest to Google because it makes it easier for users to complete search queries. A simple approach is to ignore any meaning the sentence might have and to treat it simply as a sequence of words. We have data on how often each word occurs in a sentence with the possible candidates for the next word in the sequence, and this allows us to estimate the most likely candidates, perhaps giving a greater weight to those that occur with the more recent words in the sequence. Addressing this problem with machine learning essentially involves feeding word embeddings into an algorithm and working out the weighting of associated words by training the algorithm on examples. The difficulty, it turns out, is that machine learning programs such as neural networks struggle to calculate the appropriate weights for more distant words in long sequences. This problem can be addressed by using an ‘attention mechanism’, a layer in a neural network that learns which parts should be focused on and adjusts the weights accordingly.
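The simple approach described above, counting which words tend to come later in sentences containing a given word and giving more weight to the most recent words, might be sketched as follows. The training sentences, the function names and the `decay` parameter are all assumptions made for illustration; nothing here is drawn from a real system.

```python
from collections import Counter, defaultdict

# Toy training sentences, invented for illustration only.
TRAINING = [
    "nurses measure blood pressure",
    "doctors measure blood sugar",
    "nurses check blood pressure",
    "drivers check tyre pressure",
]


def follow_counts(sentences):
    """For each word, count the words that appear later in the same sentence."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, earlier in enumerate(words):
            for later in words[i + 1:]:
                counts[earlier][later] += 1
    return counts


def predict_next(context, counts, decay=0.5):
    """Rank candidate next words, halving the weight of each step back in time."""
    scores = Counter()
    for distance, word in enumerate(reversed(context.split())):
        weight = decay ** distance  # most recent word has weight 1
        for candidate, n in counts.get(word, {}).items():
            scores[candidate] += weight * n
    return scores.most_common()


COUNTS = follow_counts(TRAINING)
print(predict_next("nurses check blood", COUNTS)[0])    # ('pressure', 3.5)
print(predict_next("doctors measure blood", COUNTS)[0])  # ('pressure', 2.5)
```

The second prediction shows the weakness. The only training sentence that begins with ‘doctors’ ends in ‘sugar’, but ‘doctors’ is the most distant word in the context and its evidence is discounted, so the nearer words win.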

The revelation in the conference paper was that a network that contained only attention layers outperformed all existing networks for processing language. These networks, known as transformers, capture information about the way the association between a word, or rather its embedding, and the target in a given task, for example, a candidate to be the next word, is altered by the words around it, including those some distance away. When a transformer is trained to predict missing words in millions and millions of sentences, the network acquires a representation not just of the meanings of the individual words but of larger semantic structures.
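A single attention step can be sketched in a few lines. This is a stripped-down illustration rather than the architecture in the paper: the two-number embeddings are invented, and the learned weight matrices that a real transformer applies to each embedding before comparing them are omitted. Each word scores every word in the sentence by the similarity of their embeddings, the scores are turned into weights that sum to one, and the word’s new representation is the weighted blend.

```python
from math import exp, sqrt

# Invented two-dimensional embeddings, for illustration only.
WORDS = ["bank", "of", "the", "river"]
EMBEDDINGS = [
    [1.0, 1.0],  # bank
    [0.1, 0.0],  # of
    [0.0, 0.1],  # the
    [0.0, 2.0],  # river
]


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product attention with no learned projections (a simplification)."""
    dim = len(embeddings[0])
    all_weights, outputs = [], []
    for query in embeddings:
        # Similarity of this word to every word in the sentence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / sqrt(dim)
                  for key in embeddings]
        weights = softmax(scores)
        # New representation: a blend of all embeddings, weighted by attention.
        blended = [sum(w * vec[i] for w, vec in zip(weights, embeddings))
                   for i in range(dim)]
        all_weights.append(weights)
        outputs.append(blended)
    return all_weights, outputs


weights, outputs = self_attention(EMBEDDINGS)
for word, w in zip(WORDS, weights[0]):
    print(f"bank attends to {word}: {w:.2f}")
```

With these made-up numbers ‘bank’ gives far more weight to ‘river’, three words away, than to the two words next to it. Distance in the sentence plays no part in the calculation, which is why attention copes with long sequences.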

OpenAI, a company co-founded by Elon Musk and now part-owned by Microsoft, started using transformers to develop Large Language Models in 2018. The most recent, GPT-3, released in May 2020, was trained on 45 terabytes of text data and has 175 billion parameters. The journalists and scientists who were given access to it were amazed at the fluency of the text it generated in response to simple requests or queries. The most exciting thing, for the team developing it, was that GPT-3 could tackle tasks it hadn’t been trained to do. A standard approach in machine learning is to train a network on a simple task for which a vast amount of data is available – GPT-3 was trained on the task of predicting the next word in a passage of text – and then use that trained network as the starting point for a more specialised task, fine-tuning it on a smaller amount of data. The unexpected thing about GPT-3 was that no fine-tuning was necessary. Like a human being, it could learn tasks when it had seen only one or two examples.

OpenAI recruited volunteers to play with GPT-3 and rate its responses to questions. This allowed it to train a chatbot using reinforcement learning, the class of algorithms used to develop strategies and the method employed by DeepMind to master games such as Go and chess. ChatGPT is so good at generating convincing answers it is easy to forget that it is a model of language and not a source of wisdom. When I asked it to explain why nurses shouldn’t go on strike, it lucidly stated the ways in which a nurses’ strike might lead to public harm. I then asked it to argue the opposite position, but it managed the dialogue by again suggesting that, in fact, nurses shouldn’t strike. The point isn’t that it has a right-wing bias, although it might, it’s that it only has access to a synthesis of things that have been written, and is trying to have a dialogue that previous users would have rated as successful. I logged in again later and asked if nurses should go on strike: ‘It is not for me to say whether nurses are right or wrong to go on strike. The decision to go on strike is a complex one that depends on a variety of factors, including the specific circumstances and reasons for the strike.’ When I asked if there were circumstances in which nurses should go on strike, it cited issues of patient safety.

ChatGPT is good at providing succinct, articulate responses to clearly framed questions on matters about which there is a reasonable amount of published material. That’s why it can answer the kinds of question you might find on an exam paper. Even though my university has returned almost completely to face-to-face teaching, exams have so far stayed online. My students have just sat the exam with the questions I got ChatGPT to answer, but I’ve no way of knowing whether they also used it. When I mark it, I might be able to spot a resemblance between the answers ChatGPT generates and a student’s submission, but the algorithm responds differently to each interaction, so they won’t get the same answer I did. Plagiarism checkers are no use. Next year we will have to set a different kind of exam or bring the students into an exam hall and deprive them of internet access. I suppose we will have to think differently about written assignments too.

Most people don’t have to worry just yet. ChatGPT can write something that reads like a newspaper article, draft a vanilla press release or make a plausible attempt at a legal agreement, but although the content is sensible enough, it isn’t based on a detailed knowledge of events, individuals or their circumstances. Don’t get too comfortable though: GPT-4 is due to be released later this year.
