
A multinational BPO we worked withoperated six contact centers across three continents serving the sameend-customer base. Each center ran its own QA program, used the same scorecard,and had internal calibration practices. Each center’s manager reported theirteam as “well-calibrated.” The end-customer was using the aggregate QA scoresto evaluate performance across centers.
We ran a calibration test across centers.Same scorecard, same set of calls, ten QA analysts from each of the six centersscoring blind. The results were uncomfortable. Inter-center variance wassubstantially larger than within-center variance. The same call could score 91in one center and 67 in another, with both centers’ analysts confident theirscoring was correct. Each center was internally consistent. Across centers,they were measuring different things using identical criteria.
The end-customer’s performance evaluationwas therefore comparing centers on a basis that wasn’t actually comparable. Thecenter scoring highest wasn’t necessarily performing best. It might just bescoring the same performance more leniently. The performance managementimplications — incentives, capacity allocation, contract renewals — were allbeing made on data that wasn’t comparable across the units it was comparing.
This is the structural problem withmulti-site QA calibration. Within-team calibration is well-understood.Cross-team and cross-center calibration is significantly harder, much lesscommonly addressed, and almost universally weaker than the program documentationimplies.
Within a single team, calibration can be maintained through regularsessions, shared coaching, and ongoing discussion of edge cases. The analystswork in the same building, talk to each other, build a shared interpretation ofthe scorecard through accumulated discussion.
Across centers, none of this happens organically. Each centerdevelops its own shared interpretation, which is internally consistent butdoesn’t align with what other centers have developed. The same criterion getsinterpreted differently because no cross-center discussion is forcingalignment.
Several specific factors amplify this.
Cultural and language differences. What“empathy” sounds like in one cultural context is different from what it soundslike in another. The scorecard says “demonstrates empathy”; what counts asdemonstration varies.
Manager influence. Each center’s QA leadshapes their team’s calibration. Different managers emphasize different aspectsof the scorecard, and the team’s scoring reflects that emphasis.
Local context. A center handling aspecific market segment develops scoring norms suited to that segment, whichmay not match scoring norms suited to other segments.
Tenure and training cohorts. Neweranalysts at one center calibrate to the established analysts there; the same inanother center. Both groups are calibrated locally and divergent globally.
The cost of weak inter-center calibration shows up in severalplaces.
Performance comparisons are not comparable. Center A’s 89 average isn’t comparable to Center B’s 84 average.The scoring difference may be entirely about calibration drift, not performancedifference.
Best-practice transfer is corrupted.Identifying the best-performing center and trying to replicate its practicesfails when the “best performance” was actually best scoring. The replicationdoesn’t produce the expected improvement.
Compensation and incentive systems become unfair. When centers compete for contract retention or capacity allocationbased on QA scores, the centers with more lenient scoring have a systematicadvantage that has nothing to do with the work being done.
Customer experience predictions are wrong. QA scores are often used to predict customer satisfaction outcomes.The prediction quality depends on calibration; weak inter-center calibrationproduces predictions that work locally but fail when aggregated.
Programs that genuinely calibrate across centers tend to share fourcharacteristics.
Regular cross-center calibration sessions. Not just within-center. Analysts from different centers score thesame calls, discuss disagreements, and converge on shared interpretations ofevery criterion that produces disagreement. This is operationally expensive butessential.
Shared training and rotation. New QAanalysts trained against a common standard, ideally with rotation acrosscenters. Cross-center exposure prevents the local interpretation drift thatproduces calibration divergence.
Centralized calibration anchoring. Acentral QA function — small, expert, not site-based — that owns the canonicalinterpretation of every criterion and resolves edge cases. Without an anchor,sites drift.
Continuous inter-center reliability measurement. Not just annual or quarterly checks. Ongoing measurement of howscoring varies across centers, with intervention when drift is detected.
This is heavier than most multi-site QA programs operate. It’s alsowhat’s required to make the aggregated QA data actually meaningful acrosssites.
AI-drivenQA changes inter-center calibration meaningfully. An AI scoring engineapplies the same criteria the same way regardless of which center the call camefrom. Inter-rater reliability across centers becomes functionally automaticbecause the rater is the same engine.
This doesn’t eliminate the need for human QA judgment — context,edge cases, and coaching depth still require humans. But it shifts thecalibration question from “are our analysts in different centers scoringconsistently” to “is our AI scoring engine calibrated to our actual qualitystandards.” The second question has a different answer profile, and it’s aquestion that can be answered systematically rather than through ongoinginter-center meetings.
Multi-site contact center operations are some of the highest-valueenvironments for AI-driven QA precisely because the calibration problem atscale is so structurally hard with humans alone.
1. Run a cross-center calibrationtest. Same calls, multiple centers, blind scoring.The variance will tell you the actual state of your inter-center calibration.
2. Compare scoring distributionsacross centers. If center A’s median score differsmaterially from center B’s, ask why. Performance difference and calibrationdifference are very different problems.
3. Identify the criteria with theworst inter-center alignment. Some criteriacalibrate naturally; others diverge. The divergent ones are your priority forshared training.
4. Audit your performance comparisons. If you compare centers on QA scores for compensation or contractpurposes, the comparison may be measuring calibration variance rather thanperformance variance.
5. Centralize calibration anchoring. Designate one function — not one of the operating sites — as theauthority on calibration questions. Sites can disagree with the anchor, but theanchor decides.
The “well-calibrated” QA programs in yourmulti-site operation are usually well-calibrated within their own walls andsubstantially divergent across centers. The aggregate scores you’re managing onaren’t comparable in the way the dashboard implies. The fix is heavier thanmost operations want — sustained cross-center calibration is real work — butit’s the difference between management based on data and management based ondata that just happens to be wrong in the same way across reports.