Calibration: What is it Good For?

Calibration meetings are increasingly part of performance management (PM) and reward processes in many organizations. In these meetings (usually at the end of the year), managers discuss their employees’ accomplishments in order to agree on final performance ratings. Similar meetings are part of other HR decision processes as well, including hiring, talent identification, and succession planning decisions. While the specific procedures and techniques vary across organizations, the stated goal in a PM context is to increase the accuracy of performance ratings by discussing and agreeing on the criteria and standards used for making final performance ratings. This happens in real time, with each manager publicly discussing and debating the accomplishments and performance of each of their employees. I have participated in dozens of these sessions, and I have designed and facilitated them. I am generally critical of their use in PM for several reasons.

The logic is suspect

The basic logic of these meetings seems flawed at the outset. We know from decades of research that human beings are flawed; they have all manner of biases, idiosyncrasies, and dubious motives that can interfere with the quality of the judgments they make.[1] Human beings have many talents, but observing, recording, and analyzing their own experiences, let alone others’ experiences, isn’t among them.[2] This explains why the quality of performance ratings is so low.[3] So the logic of getting more flawed people involved in the rating process makes little sense--if the goal is accuracy. Calibration sessions, by design, involve people in the rating process who may not have observed your employees perform during the year. And these meetings frequently involve more senior leaders (for reasons I discuss below). These leaders may have little or no exposure to a given employee during the year and yet may play an outsized role in that employee’s final rating. It is hard to imagine how this practice could improve the quality of final ratings.

Calibration meetings don’t necessarily improve rating accuracy

Surprisingly, there isn’t much research on this topic. The limited research available suggests calibration meetings don’t improve accuracy much, if at all, and may even make things worse. Andrew Speer, Andrew Tenbrink, and Michael Schwendeman recently summarized the current state of calibration nicely and conducted one of the few studies of calibration in a PM context.[4] For call center employees, they compared the correlations of PM ratings with objective measures of performance, using ratings made before calibration meetings and ratings made again after them. They found the correlations were higher using post-calibration ratings. Taken at face value, this suggests that post-calibration ratings more closely mirror real performance measures for these employees and are therefore more accurate. But the correlations were small (though larger for the composite objective performance measure), and the increases were relatively small (three of the five were statistically significant).

This was a helpful article, but there are some other possibilities I would want to rule out before rushing to implement calibration meetings. It is possible these increases were due to more variability in post-calibration ratings (these data weren’t reported), which frequently happens when there is more public accountability for ratings. It is also possible that calibration meetings included discussion of objective performance data for employees, which can make this information more salient to supervisors when making their post-calibration ratings. I also wonder whether increases in correlations between ratings and objective performance measures are truly an indicator of increased rating accuracy. Subjective measures of performance are typically not highly correlated with objective measures for many reasons.[5] So, the best-case scenario may be that calibration meetings improve the quality of performance ratings slightly, perhaps from very poor to poor, which is hardly a strong endorsement.
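The variability concern can be made concrete with a small simulation. The sketch below uses entirely simulated data, not data from the Speer et al. study, and every number in it (sample size, noise level, cutoffs) is an assumption chosen for illustration. It takes one fixed set of noisy supervisor judgments and records them two ways: on a compressed scale where nearly everyone gets a 4, and on the full 1-5 scale. The underlying judgments are identical in both cases, so any difference in correlation with the objective measure comes from how much of the scale is used, not from better judgment.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def variance(xs):
    """Population variance of a sequence."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def to_rating(judgment, cutoffs, lowest):
    """Rating = lowest scale point plus the number of cutoffs exceeded."""
    return lowest + sum(judgment > c for c in cutoffs)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000                # hypothetical number of employees

# Simulated objective performance and a supervisor's noisy judgment of it.
objective = [rng.gauss(0, 1) for _ in range(n)]
judgment = [o + rng.gauss(0, 2) for o in objective]

# Compressed scale: roughly the bottom tenth get a 3, everyone else a 4.
compressed = [to_rating(j, cutoffs=(-2.87,), lowest=3) for j in judgment]
# Spread scale: the same judgments recorded on the full 1-5 scale.
spread = [to_rating(j, cutoffs=(-3, -1, 1, 3), lowest=1) for j in judgment]

r_compressed = pearson(compressed, objective)
r_spread = pearson(spread, objective)

print(f"compressed: variance={variance(compressed):.2f}, r={r_compressed:.2f}")
print(f"spread:     variance={variance(spread):.2f}, r={r_spread:.2f}")
```

Under these assumptions the spread ratings correlate more strongly with the objective measure than the compressed ones, even though the supervisor's judgment is no more accurate. This is why rating variance before and after calibration would need to be reported before a rise in correlation could be read as a rise in accuracy.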

The only other empirical research I’ve seen in this area is a study by Jerry Palmer and James Loveland who showed calibration processes can actually make things worse. In an employment interview context, they found group discussion combined with raters’ preexisting impressions made ratings more extreme and less accurate.[6]

We clearly need more research on this increasingly common practice in organizations. But, after reading this research and reflecting on my own experience, I can’t feel confident that calibration sessions increase the quality of performance ratings enough to justify the effort.

Calibration meetings have little to do with improving accuracy

In my own experience, these meetings have less to do with rating quality and more to do with rating quotas. An important goal of these meetings is ensuring compliance with rating distribution guidelines. It isn’t a coincidence that these meetings emerged around the same time forced distributions were becoming popular in the mid-1990s. Higher-level managers and HR professionals are frequently present in these meetings precisely for this reason. Conventional wisdom holds that, left to their own devices, supervisors tend to give too many high ratings and too few low ratings (referred to as leniency bias), so senior leaders and HR professionals are there to “police” the distribution. There is also simply too much at stake to leave supervisors to their own devices: large reward budgets are based on these ratings, and these budgets are fixed and can’t be overspent.

In fact, rating distribution versus the target and reward spend versus budget are the most important metrics senior leaders review when evaluating the effectiveness of the year-end PM and reward processes. There is no question calibration meetings are very effective at achieving this objective. In all my years as a supervisor, I never gave out more than my quota of top ratings and I never exceeded my merit or bonus budget. It was always clear to me in these meetings when I needed to change my rating and when I’d been overruled by someone above me. The last time I studied this topic, 80% of supervisors reported having ratings for their employees overturned in calibration and review meetings. It isn’t exactly clear what this experience does to supervisors, but it certainly isn’t likely to make them work harder to make their ratings more accurate.

If your goal is to drive compliance with rating targets, calibration meetings will certainly help. If your goal is to motivate and facilitate the performance of your employees, I would steer clear of them. Anyone who has tried to explain to employees why there is a limit on top ratings, or why people who know little or nothing about their performance are involved in determining their rating, or who has had to “sell” an employee a lower rating than they deserved, will understand my reservations.

Calibration meetings are political exercises

We also can’t forget the politics present in these meetings. Supervisors have many agendas to work in these meetings. In many respects, supervisors play the role of advocate, trying to get the highest rating for their employees. Ratings have real consequences; they are tied to raises, bonuses, and other downstream decisions. Alternatively, supervisors may want to punish employees who have slighted them or others during the year, giving them a lower rating than they may deserve. Supervisors also develop alliances with other supervisors to help support their decisions. This means employee ratings and rewards depend on the political skill of the supervisor, and accuracy isn’t necessarily high on the list of supervisor motives in these meetings.

Calibration is part of an administrative exercise

My chief complaint with calibration meetings has more to do with the larger system of which they are a part. This is why tinkering with them may not help much in the long run. Over the past 50 years, PM (including practices like calibration meetings) has become an administrative exercise to efficiently distribute rewards budgets.[7] Conventional wisdom holds that rewards must be differentiated, which means the ratings that determine those rewards must be differentiated. And since rewards budgets are fixed, organizations must grade on a curve--not everyone can get top ratings. An algorithm linking rewards to ratings makes the process efficient and mitigates the risk of overspending these budgets, as does calibration. At the end of the process, what counts is hitting rating and budget targets.
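To illustrate how an algorithm linking rewards to ratings can keep spending inside a fixed budget, here is a minimal sketch of one possible rule: each employee's share of the pool is proportional to salary times a rating weight. The function name, the weights, the salaries, and the budget are all hypothetical; real organizations use many different schemes, and this is not a description of any particular one.

```python
def allocate_rewards(employees, budget, weights):
    """Split a fixed budget in proportion to salary * rating weight.

    employees maps name -> (salary, rating); weights maps rating -> multiplier.
    """
    shares = {name: salary * weights[rating]
              for name, (salary, rating) in employees.items()}
    total = sum(shares.values())
    if total == 0:
        # Nobody earned a nonzero weight, so nothing is paid out.
        return {name: 0.0 for name in employees}
    return {name: budget * share / total for name, share in shares.items()}


# Hypothetical rating weights, salaries, and reward budget.
WEIGHTS = {1: 0.0, 2: 0.5, 3: 1.0, 4: 1.5, 5: 2.0}
BUDGET = 9_000

team = {"A": (80_000, 3), "B": (80_000, 5), "C": (80_000, 3), "D": (80_000, 2)}
before = allocate_rewards(team, BUDGET, WEIGHTS)

# Raise D's rating from 2 to 5 and leave everyone else unchanged.
team_after = dict(team, D=(80_000, 5))
after = allocate_rewards(team_after, BUDGET, WEIGHTS)

print("before:", before)
print("after: ", after)
```

In this sketch the total paid out equals the budget no matter what ratings are entered, which is the efficiency the process is after. It also shows the curve at work: when D's rating rises, B's reward falls from 4,000 to 3,000 even though B's rating has not changed.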

Calibration meetings are part of a larger process that is flawed at the core. The goal for PM and rewards processes shouldn’t just be efficiency; it should be motivation and performance. There is little evidence that calibration meetings and other traditional PM and reward practices like feedback, ratings, differentiation, reliance on financial rewards, and pay for performance drive motivation and performance. And there is little evidence the principles (“conventional wisdom”) on which they’re based are relevant to today’s work, organizations, and workers. In fact, the evidence shows quite the opposite.

If you want to know more, go deeper, or read about more effective alternatives to traditional PM practices, check out more articles on this page or on my LinkedIn page. You can also check out my book “Next Generation Performance Management: The Triumph of Science Over Myth and Superstition.”

References and End Notes

[1] For a nice overview of these biases, see: Kahneman, D. (2012). Thinking, fast and slow. New York: Farrar, Straus and Giroux.

[2] Colquitt, A. L. (2017). Next generation performance management: The triumph of science over myth and superstition. Charlotte: Information Age Publishing.

[3] Murphy, K. (2020). Performance evaluation will not die, but it should. Human Resource Management Journal, 30, 13-31.

Adler, S., Campion, M., Colquitt, A., Grubb, A., Murphy, K. R., Ollander-Krane, R., & Pulakos, E. D. (2016). Getting rid of performance ratings: Genius or folly? A debate. Industrial and Organizational Psychology: Perspectives on Science and Practice, 9, 219-252.

[4] Speer, A. B., Tenbrink, A. P., & Schwendeman, M. G. (2019). Let’s talk it out: The effects of calibration meetings on performance ratings. Human Performance, in press.

[5] Heneman, R. L. (1986). The relationship between supervisory ratings and results-oriented measures of performance: A meta-analysis. Personnel Psychology, 39(4), 811–826.

Bommer, W. H., Johnson, J. L., Rich, G. A., Podsakoff, P. M., & MacKenzie, S. B. (1995). On the interchangeability of objective and subjective measures of employee performance: A meta-analysis. Personnel Psychology, 48, 587-605.

[6] Palmer, J. K., & Loveland, J. M. (2008). The influence of group discussion on performance judgments: Rating accuracy, contrast effects and halo. The Journal of Psychology, 142, 117-130.

[7] In a survey of HR and Compensation employees, reward distribution was the number one purpose of PM. See: WorldatWork (2017). Performance management and rewards. WorldatWork research report. This change in the purpose of PM seems to have been triggered by the rise of pay for performance in the 1980s. Since objective performance measures weren’t available for many jobs, performance ratings provided the performance measure used by these programs. As this occurred, PM became indelibly tied to reward distribution, and we haven’t seriously examined whether this linkage makes sense since then. It is now institutionalized in organizations; it is simply a given.