Measurement: A Judgment Call?

A review of Kahneman, D., Sibony, O., & Sunstein, C. R. (2021). Noise: A flaw in human judgment. New York: Little, Brown. 454 pp. $32.00

V. Krishna Kumar, PhD

Many modern societies rely on test scores and personal judgments (based on submitted documents, observations, interviews, portfolios, hearings) for screening candidates, court sentencing, custody evaluations, diagnosis, and the like. Often, life-altering decisions (e.g., admission to a college) are handed down based on such judgments. The question, of course, is: How good are such judgments? Kahneman et al. (2021) note that “most judgments are made in the state of objective ignorance, because many things on which the future depends can simply be not known” (italics in original, p. 109).

The book Noise: A Flaw in Human Judgment, with six parts and 28 chapters, focuses on how the human mind, as the instrument of judgment, is flawed and how we can improve the process of judgment, both evaluative and predictive. The central culprit behind faulty human judgment is “noise,” as the title of the book suggests. Our mind, “a [noisy] measuring instrument” (p. 39), makes noisy judgments.

Noise, a random pattern of missing a target (random error), is distinguished from bias, a systematic deviation from a target. Psychometrics textbooks routinely describe Classical Test Theory in terms of Observed Score = True Score + Error Score, where the true score is estimated by computing the mean and error is estimated by computing deviation scores from the mean. Kahneman et al. go beyond Classical Test Theory by decomposing the error in a single measurement into Bias (average error) + Noise (a residual error that averages to zero). They differentiate error in a single measurement (i.e., Bias + Noise) from overall error, defined as Mean Squared Error (MSE), a concept based on the “method of least squares, invented by Carl Friedrich Gauss in 1795” when he was only 18 years old (p. 59)! MSE is calculated as the average of the squared individual errors of measurement (p. 59); thus, “Overall MSE = Bias² + Noise²” (p. 62).
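The decomposition can be checked with a few lines of Python on hypothetical numbers (the true value and the estimates below are invented purely for illustration):

```python
import statistics

# Hypothetical repair-cost estimates from five interchangeable adjusters
# for a job whose true cost is $1,000 (all figures invented).
true_value = 1000.0
judgments = [1150.0, 1280.0, 1090.0, 1220.0, 1160.0]

errors = [j - true_value for j in judgments]

bias = statistics.mean(errors)                 # average error
noise = statistics.pstdev(errors)              # scatter of errors around the bias
mse = statistics.mean(e ** 2 for e in errors)  # mean squared error

# Kahneman et al.'s identity: Overall MSE = Bias^2 + Noise^2
assert abs(mse - (bias ** 2 + noise ** 2)) < 1e-6
```

Here the bias is 180 (every adjuster overestimates) and the noise is about 64.8; the identity holds exactly because the population standard deviation is computed around the mean error.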

Kahneman et al. nicely illustrate the difference between bias and noise with the example of target shooting. If all shots by a team hit the bull’s eye or are closely packed around it, there is neither bias nor noise; if all shots are clustered systematically on the left or right side, there is only bias; if some shots are systematically off-target and others scattered, there is both bias and noise; and if the shots are scattered in a random pattern, there is only noise. But in judgments of human performance, we do not have a clear target. When we ask supervisors to rate their employees on work performance, we might use a variety of items such as communication ability and job engagement, all of which are difficult to define; consequently, such judgments are likely to vary widely across raters. Raters not only have different standards for evaluating performance, but they also differ in their interpretation of the anchor points used on a typical rating scale.

Kahneman et al. point out that bias can lead to serious problems of discrimination in personnel decisions, college admissions, granting of asylum, custody evaluations, and sentencing. Such biases can be unconscious and difficult to identify. However, bias may be easier to reduce than noise if institutions become aware that such biases exist and train decision-makers to raise their awareness, perhaps easier said than done.

Kahneman et al. aptly illustrate how to apply the concept of noise to a wide range of situations in different fields where decisions (evaluative or predictive) are made (e.g., hiring, medical and psychiatric diagnoses, estimating repair costs, wine tasting, restaurant ratings, forecasting, sentencing, granting asylum, custody, and patents). They state, “Wherever there is judgment, there is noise—and more of it than you think” (italics in original, p. 12). Noise that arises from variability in judgments is “unwanted,” but they make it clear that not all variability is unwanted (p. 27). They emphasize that it is erroneous to think that random errors cancel out in the long run and consequently do not matter, because “In noisy systems, errors do not cancel out. They add up” (p. 29).

“Noise is variability in judgments that should be identical” (p. 363). System noise occurs when an organization uses “interchangeable professionals,” or “respect-experts” (e.g., claims adjusters, physicians), and gets variability in their judgments. A car repair estimate by one professional should be identical to another professional’s estimate, but they often are not, as we all know from experience. Two physicians, given the same symptoms, may differ in their diagnosis and/or treatment. System noise has two components: level noise and pattern noise. Level noise occurs when a judge is lenient or harsh by disposition in rating candidates (“variability in the average level of judgments by different judges,” p. 78). Pattern noise, typically the larger of the two, occurs when a judge is idiosyncratically harsh or lenient toward a particular candidate (a judge-by-candidate interaction, p. 76, i.e., variability in a judge’s responses to a candidate). Kahneman et al. also discuss another, less frequently measured component of pattern noise, which they refer to as occasion noise: within-judge variation over repeated occasions of measurement.
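The two components can be recovered from a small judges-by-cases table. A minimal sketch in Python, with invented ratings, shows that the judge-to-judge variance (system noise squared), averaged over cases, splits into level noise squared plus pattern noise squared:

```python
import statistics

# Hypothetical noise-audit data: rows are judges, columns are cases;
# ratings[j][c] is judge j's rating of case c (all numbers invented).
ratings = [
    [4.0, 6.0, 5.0, 7.0],   # judge A
    [6.0, 9.0, 6.0, 8.0],   # judge B: harsher on average (level noise)
    [5.0, 5.0, 8.0, 6.0],   # judge C: idiosyncratic on case 3 (pattern noise)
]
n_judges, n_cases = len(ratings), len(ratings[0])

judge_means = [statistics.mean(row) for row in ratings]
case_means = [statistics.mean(ratings[j][c] for j in range(n_judges))
              for c in range(n_cases)]
grand_mean = statistics.mean(judge_means)

# Level noise: variability of the judges' average levels.
level_noise_sq = statistics.pvariance(judge_means)

# Pattern noise: judge-by-case residual after removing both main effects.
pattern_noise_sq = statistics.mean(
    (ratings[j][c] - judge_means[j] - case_means[c] + grand_mean) ** 2
    for j in range(n_judges) for c in range(n_cases)
)

# System noise: judge-to-judge variance per case, averaged over cases.
system_noise_sq = statistics.mean(
    statistics.pvariance(ratings[j][c] for j in range(n_judges))
    for c in range(n_cases)
)
assert abs(system_noise_sq - (level_noise_sq + pattern_noise_sq)) < 1e-9
```

In this toy table, judge B’s overall harshness feeds the level term, while judge C’s idiosyncratic reaction to the third case feeds the pattern term.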

Per Kahneman et al., an effort to reduce noise in measurement within a system requires a noise audit, a systematic investigation of both bias and noise. Bias may be detected in hindsight by asking why a decision went wrong, but noise must be evaluated statistically. They recommend a variety of decision hygiene strategies for reducing noise and bias, which include selecting the best (trained) judges, setting standards, structuring judgments, obtaining independent judgments, resisting premature intuitions (i.e., avoiding decisions based on irrelevant information and first impressions), aggregating judgments, using relative rather than absolute scales, and appointing observers to detect bias. The goal of ratings is accuracy, not individual expression, which is a source of noise. Judgment accuracy requires agreement, not disagreement. A radical solution is to use algorithms, but they ask whether humans would ever come to fully trust algorithms despite evidence that they “can outperform” human judgment (p. 336). Don’t we love our committees! Of course, as Kahneman et al. point out, an algorithm can be programmed or trained to be biased against certain groups. Professional readers will greatly benefit from the systematic discussion of bias and noise reduction strategies in the book (a Bias Observation Checklist and suggestions for correcting predictions are included in the Appendix).

Noise: A Flaw in Human Judgment is an insightful book about an old problem, and its influence is bound to be far-reaching. Although the problem of measurement error has long been well known, Kahneman et al. take us beyond what is commonly found in psychometrics textbooks by explicating the more general notion of error that includes both bias and noise, and by decomposing noise into level, pattern, and occasion components. Although the book is written for a diverse audience, some may find it highly technical in places and possibly difficult to follow. Professionals engaged in the business of measurement in such diverse fields as business, court sentencing, human resources, and medicine will find the book very useful for improving the process of judgment, although doing so may be quite a challenge.
