As the UK moved into lockdown in March, Gavin Williamson, the education secretary, announced that this summer’s GCSE and A level exams would be cancelled. The exams regulator, Ofqual, was instructed to put in place an alternative system to allow students to move on to further study or employment while ensuring that they would be neither advantaged nor disadvantaged compared to those taking exams in previous years.
In the absence of formal exams, the best available indicator of a student’s attainment is their teachers’ assessment of their progress. From Ofqual’s point of view, however, there were two problems with simply asking teachers to estimate the grades students would have been awarded. Teachers are only imperfectly able to judge what a child is capable of, and since in any given year a percentage of children will fail to achieve their potential, teachers’ assessments consistently overestimate performance. Treating these assessments as gospel would therefore create the kind of anomaly the secretary of state was determined to avoid: this year’s students would end up with better results, overall, than those in other years. The second difficulty was that Ofqual had to get the assessments in quickly: although they sent out clear guidance on how schools were to use evidence of past performance in generating ‘Centre Assessed Grades’, they anticipated that schools would approach the task in different ways, so some moderating of results would be required.
The solution that Ofqual came up with is described in Awarding GCSE, AS, A Level, Advanced Extension Awards and Extended Project Qualifications in Summer 2020: Interim Report. It’s long but, for the most part, accessible. Teachers were required not only to submit predicted grades for each student but to rank the students within each grade: who would get the best A, who the second best, and so on. Ofqual compared 11 possible algorithms for ascribing grades that would use this information together with historical data on each school’s performance, national data on how students generally progress in each subject, and data on the prior performance of this year’s cohort in national tests.
The algorithm they chose was fairly straightforward. For A level, national data on how students generally progress between GCSE and A level was used to calculate, from each school’s GCSE results, how its students would have fared in this year’s A levels if the school had followed the national pattern. The same formula, again based on GCSE results, was applied to students who had taken A levels at the school over the previous three years. The difference between these estimates was taken as a measure of how much better or worse this year’s students were than their predecessors, and used to adjust an average of the grades obtained by the school in each subject over the last three years, generating an expected distribution of grades for this year’s cohort.
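The value-added step can be sketched in a few lines of code. Everything here is invented for illustration – the grade bands, the transition probabilities and the cohorts are made up, and Ofqual's actual model involved many more refinements – but the arithmetic is the one described above: predict both cohorts from their GCSE results using the national pattern, and shift the school's historical grade distribution by the difference.

```python
# A schematic sketch of Ofqual's value-added adjustment, with invented
# numbers. NATIONAL stands in for the national pattern of progression
# from GCSE prior attainment to A level grades.
NATIONAL = {
    "high":   {"A": 0.50, "B": 0.30, "C": 0.20},
    "middle": {"A": 0.20, "B": 0.40, "C": 0.40},
    "low":    {"A": 0.05, "B": 0.25, "C": 0.70},
}
GRADES = ["A", "B", "C"]

def predicted_distribution(prior_bands):
    """Expected grade shares if these students followed the national pattern."""
    n = len(prior_bands)
    return {g: sum(NATIONAL[b][g] for b in prior_bands) / n for g in GRADES}

# GCSE prior attainment of this year's cohort and of the previous cohorts
current_cohort = ["high", "high", "middle", "low"]
previous_cohorts = ["middle", "middle", "low", "low"]

pred_now = predicted_distribution(current_cohort)
pred_then = predicted_distribution(previous_cohorts)

# Grade shares the school actually achieved over the last three years
historical_actual = {"A": 0.25, "B": 0.40, "C": 0.35}

# This year's expected distribution: historical results, shifted by how
# much stronger (or weaker) this cohort looks than its predecessors
expected = {g: historical_actual[g] + (pred_now[g] - pred_then[g])
            for g in GRADES}
```

Because this year's invented cohort has stronger GCSE results than its predecessors, the expected share of As rises above the school's historical share; a weaker cohort would push it down.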
The predicted grades and rankings submitted by each school were then revised to fit this distribution, moving the lowest ranked children down if a downward shift was required, and the highest ranked children up if an upward shift was needed. There were more complex tweaks to accommodate cases where prior data wasn’t available, and an additional step to ensure that the overall distribution of grades nationally would be in line with previous years, but it was a simple enough approach, not involving any arcane mathematics and clearly designed to be transparent. It wasn’t, however, obviously fair. The modelling of the way prior attainment influences final qualifications was done at a national level, causing schools that do better than the national average to lose out. Data from one school, published by the Fischer Family Trust, show that even when no students had failed in the last three years, it was still possible for the percentage expected to fail to be above zero, and that even a percentage that equated to less than one pupil would require someone to be downgraded to a fail.
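The fitting step – walking down the teacher's ranking and handing out grades until each grade's quota in the expected distribution is exhausted – can be sketched as follows. The students and the quotas are invented; the real process worked with cumulative percentages and included further special cases.

```python
# Schematic sketch of fitting a teacher's ranking to an expected
# distribution. Names and quotas are invented for illustration.
def assign_grades(ranked_students, expected_counts):
    """Walk down the ranking (best student first), handing out each
    grade until its quota in the expected distribution is used up."""
    awards = {}
    it = iter(ranked_students)
    for grade, count in expected_counts:   # best grade first
        for _ in range(count):
            awards[next(it)] = grade
    return awards

# Teacher's ranking, best student first; all five were predicted at
# least a B, but the expected distribution allows only one A
ranking = ["Asha", "Ben", "Chloe", "Dan", "Ella"]
quota = [("A", 1), ("B", 2), ("C", 2)]

awards = assign_grades(ranking, quota)
# Asha keeps her A; Ben, ranked second, is moved down to a B, and the
# two lowest-ranked students absorb the Cs
```

This is why the ranking mattered so much: a student's final grade depended less on what their teacher predicted than on where they sat in the queue when the quotas ran out.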
The various algorithms that Ofqual considered were compared by testing how accurate they would have been if applied to students who took exams in 2019. It wasn’t possible to test the process completely because there were no teacher-predicted rankings for previous years. So Ofqual generated a ranking for 2019’s exam takers by placing them in order according to the marks they got in their actual exams. This made the test rather weak: given that the real and the estimated grades were ultimately based on the same marks, a pretty high level of agreement would be expected, certainly higher than would be achieved in practice, when the ranking would be based on teachers’ predictions. The report shows, however, that the accuracy of the algorithm selected to calculate A level results varied wildly between subjects even when actual marks were used to generate the rankings. At its best it predicted the exact grade 68 per cent of the time; at its worst, only 27 per cent of the time.
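The shape of that back-test can be sketched briefly. The marks, grades and quota below are invented; the point is only the procedure – rank 2019's students by their real marks, re-award grades from a target distribution, and measure exact agreement with the grades they actually received.

```python
# Sketch of the back-test on 2019 data, with invented numbers.
marks  = [92, 81, 77, 60, 41]        # real 2019 marks, best first
actual = ["A", "A", "B", "C", "C"]   # grades those marks actually earned

# Suppose the algorithm's expected distribution allows 1 A, 2 Bs, 2 Cs
quota = [("A", 1), ("B", 2), ("C", 2)]
awarded = [g for g, n in quota for _ in range(n)]

# Exact agreement: the share of students given the same grade both ways
exact_agreement = sum(a == b for a, b in zip(actual, awarded)) / len(actual)
# here 4 of the 5 grades match, so agreement is 80 per cent
```

Even with perfect rankings taken straight from the marks, agreement falls whenever the expected distribution differs from the real one – which is why the reported figures of 68 per cent at best, and 27 per cent at worst, should have rung alarm bells.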
The statisticians must have known how hard it would be to predict something as volatile as individual educational success, and it’s hard to believe that Ofqual didn’t look at these numbers with concern. At this point, a competent minister would have begun to consider alternatives or at least put in place a process for dealing with what would inevitably be a large number of appeals. But Williamson seems not to have registered that anything was wrong until after the results were released. Ofqual’s way of dealing with public opinion was to hire a PR firm run by former associates of Michael Gove and Dominic Cummings, in a contract awarded without a process of competitive tendering.
When the A level results were published on 13 August, 39 per cent of teacher-predicted grades had been revised downwards. The resulting furore – including a crowd of defiant teenagers in Westminster chanting ‘Fuck the algorithm!’ – led Williamson to perform a U-turn: on 17 August he announced that the adjusted results would be ignored and awards instead based on Centre Assessed Grades – returning, in other words, to teachers’ predictions. As a result, university admissions departments have been left to deal with a ‘shitshow at the fuck factory’, to quote one despairing tweet. Of the 160,000 students whose results were upgraded, 100,000 had already been accepted by their first-choice university, and 45,000 had settled for places elsewhere. That left 15,000 students now eligible for places at universities that no longer had room for them. Fifteen thousand isn’t that big a number, given the size of the sector, but 90 per cent of these students were holding offers for the most selective courses and the most demanding universities. By 19 August, the admissions department of University College London, for example, had received more than 8,000 inquiries from anxious students and their parents; at the time of writing, it is dealing with 1,700 appeals.
The situation with medicine is especially problematic. At the start of the crisis, medical schools were funded to provide 7,500 places in England, and 9,500 in the UK as a whole. The current estimate is that 2,000 extra students are now entitled to places, with capacity for only around 600 of them. The universities minister wrote to vice chancellors on 20 August promising funding for more medical school places but acknowledged that they could only be provided if the Department of Health, whose civil servants are now dealing not only with a pandemic but with the wholesale restructuring of the public health system, was able to guarantee the extra students clinical placements during their university courses and training places once they graduate.
Algorithms of one sort or another have been used in university admissions for years. In 1979, St George’s Hospital Medical School began using a computer algorithm to screen applications to its undergraduate programme. Unlike Ofqual’s algorithm it was remarkably accurate, agreeing with the judgments of a human selection panel 90 to 95 per cent of the time. It achieved this accuracy partly by mimicking the panel’s prejudices. Candidates with ‘non-European-sounding’ names were marked down and as many as sixty applicants a year were excluded on that basis. The Commission for Racial Equality was alerted in 1986 and the algorithm was scrapped. An article in the BMJ called it ‘a blot on the profession’. The algorithm was nakedly racist, though it’s worth noting that while it was in use the proportion of non-European candidates accepted at St George’s was higher than at other medical schools.
Ofqual tested this year’s algorithm for ethnic and other biases and found that it was broadly speaking fair, in the sense that it accurately reflected the injustices inherent in the system. Black candidates did less well than white candidates, but the difference was comparable to previous years – that, it’s implied, is as it should be. The algorithm also seems to have been more reliable in better-performing schools, meaning that some children in less advantaged schools were unfairly penalised. The most widely reported bias followed from the judgment that any form of statistical adjustment would be methodologically unsound in classes with fewer than five students and only a slight adjustment should be allowed in classes with between five and fifteen students. Since smaller classes are much more common at private schools than at state schools and further education colleges, private schools’ results were much less likely to be downgraded.
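The class-size rule can be sketched as a simple threshold function. The cut-offs of five and fifteen come from the report; the linear taper between them is a made-up illustration of how a ‘slight adjustment’ might be weighted, not Ofqual's actual formula.

```python
# Sketch of the small-class exemption. Thresholds (5 and 15) are from
# the report; the linear taper between them is invented for illustration.
def weight_on_teacher_grades(class_size):
    """Share of the final grade taken from the Centre Assessed Grades
    rather than the statistical model."""
    if class_size < 5:
        return 1.0                     # teacher's grades used as-is
    if class_size <= 15:
        return (15 - class_size) / 10  # taper: less weight as class grows
    return 0.0                         # full statistical adjustment
```

A class of three keeps its teacher-assessed grades untouched; a class of thirty is wholly at the mercy of the model – which is the mechanism by which small private-school classes escaped downgrading.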
There is now clearly an attempt to shift blame for the fiasco onto Ofqual and the algorithm. The chief regulator of Ofqual has resigned and the permanent secretary at the Department for Education has been removed. Ofqual is not an independent agency; it is a government department and acted on the instructions of the minister. The problems with the algorithm aren’t technical but a consequence of the political decisions made at the outset.