Kept Alive for Thirty Days
- The Tyranny of Metrics by Jerry Z. Muller
Princeton, 220 pp, £19.95, February 2018, ISBN 978 0 691 17495 2
- The Metric Tide by James Wilsdon et al
Sage, 168 pp, £19.99, February 2016, ISBN 978 1 4739 7306 0
How much does your spouse or partner love you? Is it more or less than other people love their partners? To find out, we would need to measure the available evidence. Suppose that having identified the relevant population, we count how many times a week partners are brought tea in bed, how many times they are cuddled, how many times they are told ‘I love you,’ and so on. In this way, we establish benchmarks against which individual performance can be calibrated, with the intention of driving up the quality of loving among failing partners. To further incentivise this group, we maintain lists of those who are now unattached but who exceeded the benchmark when they had a partner; the threat of competition from these alternative love-providers should stimulate out-performance. Studies show that three months after the introduction of the benchmarks, the average increase in frequency of tea in bed is 36 per cent.
The requisite record keeping can now easily be outsourced. An app called LovStat is used to record evidence of loving from your partner at all times, and the LovBand on your wrist displays real-time statistics indicating where you currently stand on the Global Lovedness Index. In order to identify the value added, and not just the gross incidence of expressions of loving, an algorithm automatically adds a premium for those from historically under-loved backgrounds. (Despite the government’s commitment to a level loving-field, studies show that last year 58 per cent of all loving was received by just 11 per cent of the population.)
At first, the metrics may be a little inexact, but they can be improved: for example, instead of simply recording the number of instances of ‘I love you,’ LovStatPlus will distinguish when it is said sincerely from when it has been prompted or said as part of an attempt to have sex. However, it transpires that people are gaming the system: one partner recorded exceptionally high scores by bringing the same cup of tea into the bedroom several times in the space of a few minutes. What’s more, critics say there is evidence to suggest that partners are increasingly neglecting any form of loving behaviour not recognised by the categories in the program, a pattern known as ‘loving to the test’. And as scores rise across the board, some are beginning to wonder whether this system of measurement really does allow meaningful discriminations to be made. Where loving is concerned, we seem to be reaching a situation in which, as so often with self-estimation, the entire population considers itself ‘above average’.
If this all seems fanciful, then consider the following example. In 2001 the US government introduced measures designed to improve educational outcomes in underperforming schools – popularly known as ‘No Child Left Behind’. As the American historian Jerry Muller reports in his hair-raising analysis of this and similar schemes: ‘Under NCLB, scores on standardised tests are the numerical metric by which success and failure are judged. And the stakes are high for teachers and principals, whose raises in salary and whose very jobs sometimes depend on this performance indicator.’ The measures had various unintended but predictable effects. For example, teachers diverted time and energy away from all those aspects of education that didn’t show up in the metrics (which were based on scores in maths and English), so that the education the children received declined even as the scores rose in metrics that were supposed to be a proxy for improved standards in education. But that wasn’t all:
High-stakes testing leads to other dysfunctions as well, such as creaming: studies of schools in Texas and in Florida showed that average achievement levels were increased by reclassifying weaker students as disabled, thus removing them from the assessment pool. Or out and out cheating, as teachers alter student answers, or toss out tests by students likely to be low scorers – phenomena well documented in Atlanta, Chicago, Cleveland, Dallas, Houston, Washington DC and other cities. Or mayors and governors moving the goalposts by diminishing the difficulty of tests or lowering the grades required to pass them, in order to raise the pass rate and thus demonstrate the success of their educational reforms.
As a way of improving school education, the scheme was not an unqualified success.
Here’s another example. In 2011 the American bank Wells Fargo introduced a scheme to encourage the ‘cross-selling’ of its products. ‘It set quotas for its employees,’ Muller writes, ‘to sign up customers who were interested in one of its products (say, a deposit account) for additional services, such as overdraft coverage or credit cards, which were more lucrative for the bank. Failure to reach the quota meant working additional hours without pay and the threat of termination.’ But the level at which the quotas were set was unrealistically high: most employees couldn’t hope to meet them by legitimate means, so ‘thousands of Wells Fargo bankers resorted to low-level fraud, creating PIN numbers to enrol customers in online accounts or debit cards, for example – without informing the customer.’ When the scandal broke in 2016, the bank fired some 5300 employees, but it was still fined $100 million by the federal Consumer Financial Protection Bureau, $50 million by the Los Angeles City Attorney and $35 million by the Office of the Comptroller of the Currency. As a way of increasing the bank’s profits, the scheme was not an unqualified success.
And another. ‘Nowhere are metrics in greater vogue than in the field of medicine,’ according to Muller. Forms of performance-related pay can be particularly pernicious in this field. Hospitals and clinics tend to ‘treat to the test’. They concentrate on boosting their metrics by focusing on the categories of patient or illness measured by the metrics, neglecting other, possibly needier, cases. It is especially difficult to ‘provide reliable criteria of measurement for the treatment of many patients, such as the frail elderly, who suffer from multiple, chronic conditions’, so they are among the main casualties of the reliance on performance metrics. There are similar problems in surgery:
In New York State, for example, the report cards for surgeons report on post-operative mortality rates for coronary bypass surgery, that is, what percentage of the patients operated upon remain alive thirty days after the procedure. After the metrics were instituted, the mortality rates did indeed decline – which seems like a positive development. But only those patients operated on were included in the metric. The patients who the surgeons declined to operate on because they were more high-risk – and hence would bring down the surgeon’s score – were not included in the metrics.
In addition, patients whose operations had not been successful were ‘kept alive for the requisite thirty days to improve their hospital’s mortality data, a prolongation that is both costly and inhumane’.
Muller’s evidence is drawn from the UK as well as the US. When in the early 2000s the Department of Health introduced penalties for hospitals with A&E waiting times exceeding four hours, ‘some hospitals responded by keeping incoming patients in queues of ambulances, beyond the doors of the hospital’, only starting the clock when they were actually admitted. And, of course, so much of what affects medical statistics is beyond medicine’s reach: metrics for doctors and hospitals in less privileged neighbourhoods are invariably lower, but not because they provide a lower quality of care. ‘As in the case of schools punished for the poor performance of their students on standardised tests, by penalising the least successful hospitals, performance metrics may end up exacerbating inequalities in the distribution of resources.’ As ways of improving medical outcomes, these schemes were not unqualified successes.
Muller provides equally chilling examples from other areas. One way to improve crime statistics is to record fewer incidents as crimes in the first place. Another, memorably represented in the TV series The Wire, is to increase the number of arrests, even if those arrested are street-level small fry (who will instantly be replaced anyway) rather than the people actually orchestrating the crimes. When careers hang on the figures, then ‘juking the stats’ just becomes one of the things you learn on the job. (In Britain, according to Muller, ‘this process of directing police resources at easier-to-solve crimes in order to boost detection rates is known as “skewing”.’)
Overseas aid programmes supply plenty of examples of how the reliance on metrics exerts constant pressure to replace activities that contribute to sustainable long-term improvement with policies designed to produce measurable outcomes in the present. Muller quotes a former official with long experience in international development saying that those who work in this field ‘become infected with a very bad case of Obsessive Measurement Disorder, an intellectual dysfunction rooted in the notion that counting everything in government programmes will produce better policy choices and improved management’. As a USAID official acknowledged, ‘no one has come up with a valid way to quantify the effectiveness of capacity building activities … So instead of focusing on effectiveness in reporting, USAID focuses on what can be measured, such as the number of workshops held or the number of people who have participated in training.’ The illusion of efficiency created by such numerically precise but substantively empty record keeping is now so pervasive that we are in danger of thinking such practices both normal and desirable.
Muller provides a helpful checklist of the main failings involved in the inappropriate use of metrics. Misdescription of purpose is fundamental: in the attempt to find outcomes that are measurable, complex characterisations of purpose are replaced by quantifiable results. ‘Goal displacement’ is also a major problem: where a metric is used to judge performance, energy will be diverted to trying to improve the scores at the expense of the activities for which the metric is supposed to be a proxy. As well as the likelihood of diminishing utility in terms of real improvement, the sheer cost of the exercise, in time and resources, also has to be taken into account. But there are less obvious effects too, such as discouraging risk-taking, undervaluing co-operation and common purpose, and the degradation of the experience of work. Observing the stagnant productivity of economies where such measures are most prevalent, such as the US and the UK, Muller asks ‘to what extent the culture of metrics – with its costs in employee time, morale and initiative, and its promotion of short-termism – has itself contributed to economic stagnation?’
There is also the phenomenon that Muller calls ‘rule cascades’: ‘In an attempt to staunch the flow of faulty metrics through gaming, cheating and goal diversion, organisations institute a cascade of rules. Complying with them further slows down the institution’s functioning and diminishes its efficiency.’ The research assessment exercises to which UK universities are subject are a striking illustration of this. Once you embark on the attempt to produce a single number to represent the quality of the research of an entire university department, you end up having to elaborate a baroque set of requirements, prohibitions and clarifications to ensure the ‘fair’ working of the exercise. The guidelines setting out the criteria and procedures to be followed in the 2014 Research Excellence Framework ran to 789 numbered paragraphs, plus 23 pages of annexes.
Muller makes clear that he isn’t launching some wrong-headed crusade against measurement in general. He is well aware that quantification can be a legitimate and beneficial way of extending understanding:
There are things that can be measured. There are things that are worth measuring. But what can be measured is not always what is worth measuring; what gets measured may have no relationship to what we really want to know. The costs of measuring may be greater than the benefits. The things that get measured may draw effort away from the things we really care about. And measurement may provide us with distorted knowledge – knowledge that seems solid but is actually deceptive.
This credo may suggest something more wide-ranging than the book actually delivers. Muller’s real topic is what he calls the ‘metric fixation’, summarised as follows:
1) the belief that it is possible and desirable to replace judgment, acquired by personal experience and talent, with numerical indicators of comparative performance based upon standardised data (metrics); 2) the belief that making such metrics public (transparent) assures that institutions are actually carrying out their purposes (accountability); 3) the belief that the best way to motivate people within these organisations is by attaching rewards and penalties to their measured performance, rewards that are either monetary (pay-for-performance) or reputational (rankings). Metric fixation is the persistence of these beliefs despite their unintended negative consequences when they are put into practice.
To his credit, Muller isn’t interested only in documenting the ways in which the metric fixation produces unintended consequences. Beyond that, he wants, first, to work out what causes this high level of dysfunction, and second, to identify ways in which metrics might be used more productively. The effects of numerical targets on individual behaviour are at the heart of the problem. Metrics are frequently used in an attempt to replace intrinsic motivation with extrinsic motivation, that is, in trying to create a situation in which people are responding to a uniform system of rewards and penalties, usually financial, rather than being driven largely by the satisfactions that come with the exercise of a skill, the enjoyment of respect, the achieving of shared purpose and so on. The more a task is simple, repetitive and wholly quantifiable, and the more those who perform it are already motivated by extrinsic rather than intrinsic motivation, the more effective a system of financial rewards based on metrics can be. It may still underestimate or misrepresent the elements of skill and judgment involved in even the simplest tasks, but it may be effective in incentivising effort and identifying underperforming individuals or units.
Where a set of tasks is unique, complex, and requires a high level of intrinsic motivation for its successful accomplishment, then metrics will only be beneficial if they strengthen the existing professional judgment of those involved. The more that metrics are experienced as irrelevant, misleading or harmful, the more they will be subject to gaming and other perversions. As an example of the beneficial use of metrics, Muller cites a project in the US that set out to provide comparative data on the rate of infections introduced by ‘central lines’, the catheter tubes inserted into a large vein through the neck or chest as a conduit for medicines, nutrients and fluids. The introduction of a checklist of five procedures to be followed when using such equipment was shown drastically to reduce the rates of infection. Publicising the results enabled other hospitals to refine their procedures and improve their statistics. The project succeeded, Muller observes, because it worked with the grain of doctors’ professional ethos and desire for peer approval rather than threatening them with punitive consequences if they failed to meet an externally imposed target. Muller further argues that attempts to set performance targets are more likely to misfire when the consequences are ‘high stakes’. If an institution stands to lose a large part of its funding, or individuals their jobs, when demanding targets aren’t met by legitimate means, the system itself starts to provide incentives for corrupt behaviour: better to fiddle the figures, with only a moderate chance of being caught, than to let the unfiddled data destroy one’s livelihood.
Muller’s book is a brief introduction to the topic for the general reader and so it does not engage with the extensive technical literature on the subject that has grown up in recent decades. It is fair to say that many of those who do spend their lives analysing the operation of metrics would be likely to give a much more positive account, and such specialists may feel that Muller’s examples are used to grab the reader’s attention rather than explore how undesirable consequences might be avoided by adopting more careful procedures in the first place. Beyond this, there is a more fundamental way in which Muller’s focus on the eye-catching absurdities may limit the value of his book. Metrics are a means: the important questions concern the ends for which they are used, and this comes down to some version of Lenin’s famous question: ‘Who, whom?’ In other words, the use of metrics is very often not the result of a neutral or benign impulse to see how ingeniously we can replace the messiness of existence with the apparent clarity of numbers, but part of more systematic attempts by one group of people to control the behaviour of others.
‘Accountability’ is the fig-leaf that covers up this systematic bullying. If I can say that my taxes help to pay your salary, then, according to contemporary wisdom, I can claim the right to monitor how well you are doing your job. Of course, it is in practice never that simple, because – as an individual taxpayer – I may be in no position to understand what your possibly very complex job involves, and anyway I don’t have the time or resources to monitor it directly. So an elaborate ecology of intermediaries appears, to ensure ‘accountability’ in specific areas on behalf of the taxpayer. The result is that whole populations of people doing difficult and socially useful tasks can find themselves at the beck and call of a smaller group whose expertise consists chiefly in setting measurable targets and enforcing a system of penalties and rewards: managers tell doctors what they should be doing, managers tell teachers what they should be doing, managers tell academics what they should be doing, and so on. In each case, metrics are the blunt instruments used by managers who seek to control activities they do not fully understand.
Although a lot of guff is generated in both private and public organisations about using performance indicators to ‘promote excellence’, the immediate practical intention is principally to catch slackers – or, in approved management-speak, to identify ‘underperforming’ individuals. ‘Underperforming’ is a revealing term in itself, since it implies there is a bar or norm below which a person or unit falls when measured by ‘results’. Someone can, in objective terms, be doing their job well enough, but the use of comparative data shows them to be ‘underperforming’, although presumably once they are dismissed the person or unit with the next lowest scores will come to be seen as underperforming in their turn. As a result, the effect of the punitive use of performance indicators can be to take an organisation which may have contained a few members who were coasting, but fulfilled its purposes pretty well on the whole, and replace it with an organisation in which everyone is subject to intrusive surveillance and, as a consequence, possibly less productive in ways that really matter, but in which no one is ‘getting away’ with anything. Persisting in the gathering and publication of performance data even when this is demonstrably not improving the activity being measured is, as Muller drily notes, a ‘form of virtue signalling’, an announcement that no one is being allowed an easy ride.
The increased reliance on performance metrics is part of the wider spread of the business school ideal of the manager equipped with a set of procedural, analytical and motivational skills that are transferable across all types of organisation and require no first-hand familiarity with the defining activities of a given organisation. The assumption is that the right structure of incentives and penalties will ultimately improve the bottom line of any business. Organisations whose rationale is not the maximisation of profit, such as schools, hospitals, universities, museums and so on, are a challenge to this idea because their ‘product’ does not take financial form. So some equivalent has to be found – the numbers passing certain exams or being treated within certain times – on the basis of which quantitative targets can be set and performance rewarded or punished accordingly.
That’s the unsympathetic description, but what would a more positive account look like? There are now, as I mentioned, specialists who use highly sophisticated techniques to improve the operation of metrics in specific fields. In 2015 the Department for Business and the Higher Education Funding Council jointly commissioned an ‘independent review of the role of metrics in research assessment and management’ from a team chaired by James Wilsdon, professor of research policy at the University of Sheffield. Keeping tightly to its brief, Wilsdon et al’s report, published as The Metric Tide, is an extremely thorough and careful survey of the large literature on the topic. Where Muller’s book is enjoyably readable and broad-brush, The Metric Tide is austerely technical and committee-cautious. It offers a balanced assessment, emphasising the benefits of ‘responsible metrics’ while being alert to the damage that can be done by misconceived or misapplied schemes (it seems the report may have helped to nudge government policy away from some of the more mechanical forms of research assessment). Yet even its cool, passionless prose can occasionally prompt an uneasy feeling.
Take, for example, its discussion of that central statistical term, ‘an indicator’. This is defined as ‘a measurable quantity that “stands in” or substitutes for something less readily measurable and is presumed to associate with it without directly measuring it. For example, citation counts could be used as indicators for the scientific impact of journal articles even though scientific impacts can occur in ways that do not generate citations.’ This brings out the fact that the indicator, unlike whatever it is that we really want to know about, possesses the key property of being ‘measurable’. But as the cited illustration suggests, if the indicator does not exactly and comprehensively match what it is standing in for (which, by its nature, it cannot do), then there is a risk that by focusing on the indicator we shall misdescribe the thing itself. We can easily come to think that what ‘scientific impact’ means is ‘citation counts’, to treat ways in which impact might not show up in citation counts as secondary or irrelevant, and to believe we have an exact and sufficient measure of scientific impact when we don’t.
Advocates of metrics are prone to respond by saying: ‘OK, it’s not exact, but at least it gives us a rough indication.’ But does it? Citation counts give us an indication of the number of times a piece of work has been ‘cited’ (to be defined) in a given body of ‘scientific literature’ (to be defined) when certain boundary conditions are in place. The figure may be an exact quantitative representation of that incidence and yet not be even a ‘rough’ indication of anything else. It may still have an intellectual or nerdish interest of its own, but when the number is used – as in a system of rewards and penalties – it is not because it is an exact indication of a carefully limited finding, but because we have all allowed ourselves to slip into taking for granted that it’s a rough indication of something important, which it may not be.
More disturbing still is the following passage: ‘Evidence of the performativity of quantitative data – their capacity to influence the activities they are supposed merely to indicate – suggests that the availability of metrics creates a demand for such information. Such information-generating functions carry authority even if their limitations are known.’ Despite its matter-of-fact tone, that last sentence ought to set all kinds of alarm bells ringing. It seems to say that so insatiable is the appetite for this intellectual equivalent of fast food that we go on eating it even when we know it’s off. One of the most striking examples of the continuing authority of acknowledgedly flawed metrics is provided by global rankings of universities. One might have expected Wilsdon and his colleagues to be fairly receptive to these rankings, which claim to be a distillation of a large number of more specific metrics, but, to the team’s credit, their scientific consciences compel them to object:
Close inspection reveals varying degrees of arbitrariness in the weighting of different components in different league tables. The aggregate scores suffer from the same problems of all composite indicators in that their meaning or value is not clear. Also, no effort is made to estimate errors and, with rare exceptions, there is no clear acknowledgment that they might exist. Ranking in fact magnifies differences beyond statistical significance. Rankings assume degrees of objectivity, authority and precision that are not yet possible to achieve in practice, and to date have not been properly justified by vendors.
Clearly, any adequate attempt to explain the collective irrationality of the indulgence of rankings would have to make allowance for human weakness. League tables are irresistible and addictive: junkies are unmoved by the most devastating methodological critiques, and with pushers on every street corner, there is little prospect of getting addicts to kick the habit. ‘OK, they’re flawed, but they do tell us something worth knowing, don’t they?’ Do they? For the most part, they give misleadingly tabular, pseudo-statistical form to a series of incommensurate indicators that themselves do not adequately represent the realities they are taken to stand for. Yet has a single university been willing to repudiate the whole farrago rather than trying to put the most positive spin it can on the figures? Should you ever find yourself gulled into taking such claims seriously, I suggest you try totting up how many universities are in the UK’s ‘top twenty’.
At a deeper level still, the whole audit culture rests on a superstition about numbers. As soon as numbers come into play, we are all liable to fall into what Oscar Wilde called ‘careless habits of accuracy’. A number holds out the promise of definiteness, exactness and objectivity. But a number is a signifier like any other, a way of representing something. We appeal to numbers as a way of replacing imprecise, subjective human judgment with precise, objective measurement, but in fact we are just swapping one language system for another. (And in fact not a whole ‘language’, just a limited vocabulary: almost any use of numbers, outside certain areas of mathematics and science, will be embedded in words that specify what the numbers are supposed to stand for.) The existence of any statistic is the outcome of a process of human judgment. The digital revolution has brought with it a huge increase in quantifiable information, the very existence of which provides a constant temptation to metric misbehaviour. If there are numbers to be had, we come to feel that we must have them, even though they may mislead us into thinking we have solid information about something important when in reality all we have is the precise and selective misrepresentation of something insignificant. Muller again has some wise words: ‘Measurement is not an alternative to judgment; measurement demands judgment: judgment about whether to measure, what to measure, how to evaluate the significance of what’s been measured, whether rewards and penalties will be attached to the results, and to whom to make the measurements available.’ Some people speak numbers better than others and, as always, knowledge is power.
It is commonly observed that the rise of metrics is an expression of, and a response to, a decline in trust. That there has been a marked decline in some of the traditional forms of social trust over the past couple of generations is undeniable, but the dominant contemporary version of ‘accountability’ rests on something more deep-seated and visceral than even that. It has become the vehicle for a suspicion of, and hostility towards, professions on the part of those who suspect that members of the professions have a nicer life than they do. It may be nicer because there are thought to be more intrinsic rewards to the work such professionals do; it may be because they have more autonomy in their working lives; it may be because they were historically accorded more respect; and it may be that their salaries and benefits were greater or at least more secure. But just as Nietzsche analysed the imposition of morality on the strong by the weak as an expression of ressentiment or a form of revenge, so metrics – the moral code of a sourly reductive managerial culture – are the means to make sure that professionals’ working conditions should more and more correspond to the alienated, insecure, hollowed-out working conditions of so many other members of society.
There was a time when the authorities had to deploy squadrons of mounted dragoons to quell the unruly mob. Now they just set them quarterly sales targets. Spreadsheet capitalism is much more effective than old-style ruling class repression, not least because it pulls off the conjuring trick of seeming to give priority to individual agency while in fact subordinating everyone to supposedly impersonal market forces. At bottom, performance metrics operate through a culture of fear, but one in which the arbitrary whim of a lord or master has been replaced with the terrifying implacability of a row of figures. ‘I’m sorry, John, your numbers aren’t good enough, we’ll have to let you go.’ The metric fixation is an attempt to extend that mechanism to activities that cannot be reduced to the equivalent of sales figures. No matter that you may have been a deeply loving partner in all the ways that are humanly important, that’s not ‘what the figures show’: ‘I’m sorry, Jane, your numbers aren’t good enough, I’ll have to let you go.’
Still, it is hard to see what might lead to the overthrow of the ‘tyranny of metrics’. The dominant managerial culture is not about to surrender its most cherished weapons, and somehow I don’t see the great uncounted storming the Bastille of performance data to demand the release of human activities imprisoned in graphs and tables. Critique may do something to weaken the superstitious faith in the omnipotence of numbers, and we may even come to accept that not everything that matters to us can be measured. But the desperate yearning for unobtainable forms of precision and objectivity may be too strong even for those modest triumphs. ‘How do I love thee? Let me count the ways.’ That’s how it starts.