QA Calibration: Why Your Best Scorer and Your Worst Scorer Heard the Same Call

We ran a calibration exercise with a financial services contact center last year. Standard format: five QA analysts, ten randomly selected calls, blind scoring against the existing 28-criterion scorecard. The exercise was meant to be a quick check before a larger QA process refresh.

The result was uncomfortable. On the 10 calls, the gap between the highest scorer and the lowest scorer averaged 23 points out of 100. On three of the calls, the gap exceeded 30 points. Two analysts scored the same call as a 91 and a 58, working from identical criteria, listening to the same audio, in the same room.

Leadership’s first reaction was to question the methodology. Once we ran the test a second time and got similar results, the question shifted to something harder. If the QA team can’t agree on what a good call sounds like, what exactly is the QA program measuring?

The answer, uncomfortably, was opinion. The scorecard had been honed over years. The criteria were specific. The training was extensive. And the inter-rater reliability was sitting at roughly 67%, well below the 90%+ threshold that calibration researchers generally consider necessary for a measurement program to be operationally meaningful.

This isn’t an unusual situation. McKinsey research has put manual QA inter-rater reliability across the industry at roughly 70-80%, which sounds reasonable until you do the math on what 25% disagreement means for the agents being evaluated, the coaching decisions being made, and the compliance reports being filed based on those scores.

What Calibration Actually Is

Calibration in QA refers to the process of ensuring that different evaluators score the same interaction in the same way against the same criteria. The term covers several related practices:

Inter-rater calibration. Multiple QA analysts score identical interactions and discuss any scoring disagreements until a shared interpretation of each criterion emerges. The goal is to reduce variance in how the criteria are applied.

Intra-rater calibration. A single analyst scores consistently over time, so that the same call would earn the same score this month as last. The goal is to prevent score drift as the analyst’s understanding of the criteria evolves, hardens, or relaxes.

Calibration with stakeholders. QA analysts calibrate not just with each other but with operations leaders, training leads, and (in regulated industries) compliance officers. The goal is to ensure the QA program is measuring what the business actually cares about, not what the QA team thinks is important.

Calibration with the customer. This one is rarely formalized, but it’s the most important. QA scores correlate with customer satisfaction at only about 0.3-0.5 in most contact centers, according to industry benchmarking from organizations like COPC and SQM Group. A QA program that scores agents highly on calls customers rated poorly has a calibration problem with reality, not just with other analysts.

When calibration is missing or weak, every downstream artifact of the QA program is compromised: coaching priorities, performance management decisions, compliance reporting, and the trust agents place in the scores they receive.

Why Manual Calibration Fails Structurally

The traditional model, periodic calibration sessions where QA analysts gather to discuss disagreements on a small sample of calls, has structural limits.

Sample size. A monthly calibration session reviewing 5-10 calls cannot meaningfully address scoring patterns across the thousands of calls scored by the team in that month. Whatever consensus emerges from the calibration session applies cleanly only to the discussed cases.

Edge case bias. Calibration sessions naturally focus on disagreements, which tend to cluster in ambiguous edge cases. The clearly scoreable middle of the distribution, which represents 80%+ of actual calls, gets less attention because nobody disagrees about it.

Decay over time. Inter-rater reliability achieved at a calibration session decays measurably within 4-6 weeks as analysts return to scoring solo. Without continuous reinforcement, the team drifts back toward individual interpretations.

Hawthorne effect on calibration itself. When analysts know they’re being calibrated, they apply criteria more carefully. The reliability measured in the session is systematically higher than the reliability of day-to-day scoring. This isn’t dishonesty. It’s normal human behavior under observation.

The combined effect is that calibration programs typically demonstrate higher reliability than the QA operation actually achieves in routine scoring. Leaders make decisions believing they have an 85%-reliable measurement system when the actual figure is closer to 70%.

The AI Calibration Shift

AI-powered QA scoring changes the calibration problem in three structural ways, each of which matters.

Consistency by design. An AI scoring engine applies the same criteria the same way to every call. It doesn’t have a bad Monday. It doesn’t develop fatigue at hour seven of the shift. It doesn’t have an unconscious bias about which agents are usually strong performers. Once it’s trained correctly, its agreement with itself is functionally 100%. Two passes of the same engine on the same call produce the same score.

Bias becomes traceable. When AI scoring produces a result the human team disagrees with, the disagreement becomes investigable. You can trace which criterion produced the disputed score, which audio segment triggered it, and which training examples the model used as reference. Manual scoring disagreements often have no traceable source: analyst A and analyst B simply heard the call differently.

Calibration moves to the model, not the people. Instead of recurring calibration sessions for the QA team, the calibration question becomes: is the model scoring calls the way the business wants them scored? This is a different question, with different operational implications. It moves the calibration workload from the QA analysts to the data team, and it makes calibration a continuous process rather than a periodic one.

This isn’t an argument that AI scoring eliminates the need for human judgment. It doesn’t. Human QA analysts remain essential for edge cases, for compliance review, and for coaching context that requires human reading of human situations. The shift is in what the human analysts spend their time on: less on scoring routine calls, more on the work that genuinely requires human judgment.
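To make the traceability point concrete, here is a minimal sketch of what a traceable score record might look like. Everything in it is an assumption for illustration: the `CriterionScore` fields, the `disputed_criteria` helper, and the criterion and example IDs are hypothetical, not the schema of any particular scoring product.

```python
from dataclasses import dataclass, field


@dataclass
class CriterionScore:
    """One criterion's contribution to a call score, with its evidence."""
    criterion_id: str          # hypothetical scorecard identifier
    points_awarded: float
    points_possible: float
    audio_start_s: float       # start of the audio segment behind the score
    audio_end_s: float         # end of that segment
    reference_example_ids: list[str] = field(default_factory=list)


def disputed_criteria(ai_scores, human_points, tolerance=0.0):
    """Return AI criterion scores that differ from the human's points
    on the same criterion by more than `tolerance`."""
    return [
        c for c in ai_scores
        if c.criterion_id in human_points
        and abs(c.points_awarded - human_points[c.criterion_id]) > tolerance
    ]


# Hypothetical call: AI and human agree on the greeting, not the disclosure.
ai_scores = [
    CriterionScore("greeting", 5, 5, 0.0, 8.5, ["ex-101"]),
    CriterionScore("disclosure", 0, 10, 42.0, 61.0, ["ex-207", "ex-311"]),
]
human_points = {"greeting": 5, "disclosure": 10}

for c in disputed_criteria(ai_scores, human_points):
    print(c.criterion_id, c.audio_start_s, c.audio_end_s, c.reference_example_ids)
```

With a record like this, a disputed score points the reviewer to one criterion and one stretch of audio rather than to the whole call.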

The Compliance Calibration Problem

Calibration becomes a compliance issue specifically in regulated industries where QA scores feed into supervisory reports or internal control documentation.

If your QA program scores a sample of calls and reports an aggregate compliance rate to senior management or to a regulator, the reliability of that aggregate depends on the reliability of the underlying scoring. A 70% inter-rater reliability means your compliance rate has a confidence interval wide enough that two equally defensible calculations could reach different conclusions about whether your control framework is operating effectively.
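A small worked example shows how this can happen. The sketch below is an illustration with made-up data, not a formal confidence interval: two hypothetical analysts label the same 10 calls as compliant or not, agree on 70% of them, and land on opposite sides of an assumed 85% control threshold.

```python
def compliance_rate(labels):
    """Share of sampled calls marked compliant."""
    return sum(labels) / len(labels)


def percent_agreement(a, b):
    """Share of calls where two scorers gave the same pass/fail label."""
    if len(a) != len(b):
        raise ValueError("both scorers must label the same calls")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical labels for the same 10 calls (True = compliant).
analyst_a = [True, True, True, True, True, True, True, True, True, False]
analyst_b = [True, True, True, True, True, True, True, False, False, True]

THRESHOLD = 0.85  # assumed internal control threshold, for illustration only

agreement = percent_agreement(analyst_a, analyst_b)
rate_a = compliance_rate(analyst_a)
rate_b = compliance_rate(analyst_b)

print(f"agreement={agreement:.0%} rate_a={rate_a:.0%} rate_b={rate_b:.0%}")
# agreement=70% rate_a=90% rate_b=80%
print("A passes control:", rate_a >= THRESHOLD)  # True
print("B passes control:", rate_b >= THRESHOLD)  # False
```

Both reports are defensible on their own terms. Only one of them says the control is working.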

This matters more under regimes like Consumer Duty in the UK, where firms are required to demonstrate evidence-based assessment of customer outcomes, and under SEC/FINRA scrutiny in the US, where supervisory frameworks for communications increasingly require documented quality monitoring methodology.

A QA program with documented high reliability is defensible. A QAprogram with documented low reliability — or worse, with undocumentedreliability — produces an audit finding waiting to happen.

What Good Calibration Looks Like in 2026

Modern quality management programs that take calibration seriously typically share four characteristics.

Continuous reliability measurement. Inter-rater reliability is measured on an ongoing basis, not just in formal calibration sessions. A sample of calls each week is scored by multiple analysts (or by a human and an AI engine) and the reliability metric is tracked over time.

AI-human consensus scoring. High-stakes calls (compliance flags, escalations, complaints) are scored by both AI and human analysts. Disagreements trigger a review process. Agreement provides higher confidence than either source alone.

Documented calibration methodology. The exact process by which the QA program ensures scoring reliability is documented, version-controlled, and producible on demand. This becomes part of the compliance evidence package.

Calibration to customer outcome. The ultimate calibration question (do high QA scores correlate with positive customer outcomes?) is tracked explicitly, with quarterly review. If the correlation is weak, the scorecard is wrong, not the agents.
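The first two characteristics can be sketched in a few lines. This is a minimal illustration under stated assumptions: the 5-point `TOLERANCE`, the record layout, and the function names are all hypothetical, and "agreement within a tolerance" is just one simple way to define the reliability metric.

```python
from collections import defaultdict

TOLERANCE = 5  # assumed: largest point gap still counted as agreement


def weekly_agreement(records, tolerance=TOLERANCE):
    """Map each week to the share of double-scored calls within tolerance.

    records: list of (week, call_id, human_score, ai_score) tuples.
    """
    totals = defaultdict(int)
    agreed = defaultdict(int)
    for week, _call_id, human, ai in records:
        totals[week] += 1
        if abs(human - ai) <= tolerance:
            agreed[week] += 1
    return {week: agreed[week] / totals[week] for week in sorted(totals)}


def needs_review(records, tolerance=TOLERANCE):
    """Return call IDs where human and AI scores differ by more than tolerance."""
    return [call_id for _week, call_id, human, ai in records
            if abs(human - ai) > tolerance]


# Hypothetical double-scored calls.
records = [
    ("2026-W01", "c1", 88, 90),
    ("2026-W01", "c2", 72, 60),
    ("2026-W02", "c3", 95, 93),
    ("2026-W02", "c4", 81, 84),
]

print(weekly_agreement(records))  # {'2026-W01': 0.5, '2026-W02': 1.0}
print(needs_review(records))      # ['c2']
```

The weekly figures are what you would track over time. The review list is what goes to a human for a second look.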

Five Things You Can Do This Week

1. Run a blind calibration test. Pick 10 calls. Have your top three QA analysts score them independently. For each call, calculate the gap between the highest and lowest score. If the average gap exceeds 15 points out of 100, you have a measurement problem.

2. Audit your scorecard for ambiguous criteria. For each criterion, ask: “Could two reasonable people listen to the same call and reach different conclusions about this criterion?” Any criterion where the answer is yes is a calibration risk.

3. Cross-reference QA scores against CSAT. Pull last month’s data. For each agent, plot their QA score against the CSAT scores their calls received. If the correlation is below 0.5, your QA program is measuring something other than customer-facing quality.

4. Document your current calibration process. Write down exactly how reliability is currently assured. If the document is short or full of “we plan to” language, you have a documentation gap that becomes a compliance issue under audit.

5. Pilot AI-assisted calibration on a subset of calls. Even a small pilot, scoring 200 calls per month with both AI and human analysts, will give you data on where the disagreements concentrate and what they tell you about your scorecard design.
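Steps 1 and 3 are simple enough to run in a spreadsheet, but here is a rough Python sketch of the arithmetic for anyone who prefers a script. The data is invented for illustration (three calls instead of ten, four agents), and the function names are our own. The 15-point and 0.5 thresholds are the ones from the list above.

```python
from math import sqrt


def average_gap(scores_by_call):
    """Mean of (highest score - lowest score) across calls.

    scores_by_call: one list of analyst scores per call.
    """
    gaps = [max(scores) - min(scores) for scores in scores_by_call]
    return sum(gaps) / len(gaps)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / sqrt(var_x * var_y)


# Step 1: hypothetical scores from three analysts on three calls.
calibration = [[91, 58, 75], [80, 78, 85], [70, 88, 64]]
gap = average_gap(calibration)
print(f"average gap: {gap:.1f} points, problem: {gap > 15}")

# Step 3: hypothetical per-agent average QA score and average CSAT.
qa_scores = [90, 85, 70, 95]
csat_scores = [4.5, 3.0, 4.0, 3.5]
r = pearson(qa_scores, csat_scores)
print(f"QA vs CSAT correlation: {r:.2f}, weak: {r < 0.5}")
```

A handful of data points like this proves nothing on its own. The point is only that both checks are a few lines of arithmetic once the scores are exported.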

Calibration is the invisible foundation of every other claim a QA program makes. Without it, the scores are opinions, the coaching is guesswork, and the compliance reporting is a confidence trick the leadership team is playing on itself. The two analysts who heard the same call and disagreed by 30 points aren’t an aberration. They’re the system working exactly as it was designed, in the absence of the calibration discipline that would have made the disagreement visible enough to fix.
