We ran a test last year that changed how we think about speech analytics. We took the same 500 call recordings from a banking client and processed them through three different speech-to-text engines. One was a general-purpose API from a major cloud provider. One was an open-source model that tops industry leaderboards. The third was our own ASR, built from scratch for contact center audio.
On the clean calls (good headsets, quiet rooms, single speaker), all three scored within 2 percentage points of each other. Above 93% accuracy. Practically identical.
Then we ran the noisy calls. Agent in a 200-seat open-plan floor. Customer on a mobile phone in a car. Background chatter bleeding into both channels. Compressed 8 kHz telephony codec. The general-purpose API dropped to 71%. The leaderboard champion hit 74%. Ours held at 88%.
That 14-17 point gap doesn’t just mean worse transcripts. It means your QA scores are wrong. Your compliance flags miss violations. Your sentiment analysis confuses frustration with satisfaction. And your coaching recommendations are based on conversations your system didn’t actually understand.
Here’s the number that should concern every CC leader investing in conversation analytics: Deepgram’s 2025 production metrics study found a 2.8x to 5.7x degradation factor between benchmark scores and real production accuracy. A model reporting 5% word error rate (WER) on clean benchmarks can hit 25% or higher in a live contact center.
The reasons are specific and measurable:
Narrowband telephony kills accuracy. Most contact center calls run at 8 kHz sampling rate with a 300 Hz to 3.4 kHz frequency range. That’s telephone-grade audio from the 1990s. Modern speech models were trained on 16-48 kHz wideband recordings. Voicegain’s 2025 benchmark tested four providers on real 8 kHz call center recordings: Google Video dropped to 68.38% accuracy. Even the best performer (AWS Transcribe) only managed 87.67%.
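If you want to know how much of your traffic is stuck at telephone bandwidth, a quick audit script will tell you. Here's a minimal sketch, assuming the librosa Python package and a hypothetical file name. Keep in mind that upsampling only makes the audio match the input a wideband model expects; it can't restore the frequencies the telephony codec already discarded.

```python
# Minimal sketch: flag telephony-grade narrowband recordings before they hit
# your ASR pipeline. Assumes the librosa package; the file name is hypothetical.
import librosa

def load_for_asr(path, target_sr=16000):
    # Load at the native sample rate first so we can see what we actually have.
    audio, native_sr = librosa.load(path, sr=None, mono=True)
    if native_sr <= 8000:
        # Narrowband telephony audio: upsampling makes it compatible with a
        # wideband model's expected input, but it cannot recover the frequency
        # content above ~3.4 kHz that the codec already threw away.
        audio = librosa.resample(audio, orig_sr=native_sr, target_sr=target_sr)
        return audio, target_sr, "narrowband (upsampled)"
    return audio, native_sr, "wideband"

audio, sr, kind = load_for_asr("call_0001.wav")  # hypothetical recording
print(sr, kind)
```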
Background noise compounds the problem. Every 5 dB decrease in signal-to-noise ratio roughly doubles the word error rate. A typical open-plan contact center runs at 10-15 dB SNR. At 10 dB, you’re looking at ~15% WER on average. At 5 dB (agent next to a loud colleague), WER jumps to ~35%.
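To make that rule of thumb concrete, here's a back-of-envelope extrapolation. The doubling-per-5 dB relationship and the ~15% anchor at 10 dB are the figures quoted above; the function itself is illustrative, not a calibrated model of any particular engine.

```python
# Back-of-envelope version of the rule of thumb above: WER roughly doubles for
# every 5 dB drop in SNR, anchored at ~15% WER around 10 dB. Illustrative only;
# real engines degrade at different rates (the field figure quoted at 5 dB,
# ~35%, is slightly worse than a pure doubling predicts).
def estimated_wer(snr_db, anchor_wer=0.15, anchor_snr=10.0):
    return anchor_wer * 2 ** ((anchor_snr - snr_db) / 5.0)

for snr_db in (20, 15, 10, 5):
    print(f"{snr_db:>2} dB SNR -> ~{estimated_wer(snr_db):.0%} WER")
# 20 dB -> ~4%, 15 dB -> ~8%, 10 dB -> ~15%, 5 dB -> ~30%
```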
Accents and non-native speakers add another layer. For contact centers serving international markets or running offshore operations, WER increases 2-3x compared to native adult speakers. We see this constantly across our European deployments. A model that performs well on American English often struggles with Indian English, South African English, or the heavily accented calls that make up the majority of many global CC operations.
Domain vocabulary is the quiet killer. Product names, internal codes, medical terms, financial jargon, regulatory references. General-purpose models weren’t trained on “HIPAA-compliant callback scheduling” or “KYC verification for tier-two accounts.” They hear something close but not right. And “close but not right” in compliance monitoring can mean the difference between catching a violation and missing it entirely.
This is the part most vendors don’t talk about. Speech-to-text isn’t the product. It’s the foundation. Every analytics feature your platform offers sits on top of transcription quality. When that foundation cracks, everything above it cracks too.
Sentiment analysis degrades 15-30% when transcription accuracy falls below acceptable thresholds, according to Deepgram’s research. Cresta documented a real example: a system misrecognized “this is ridiculous” as something entirely different, inverting the sentiment score. One call is an anecdote. Multiply that error across 50,000 calls per month, and your customer satisfaction dashboards are telling you a story that isn’t true.
Compliance monitoring becomes unreliable. Contact centers in banking and healthcare face regulatory requirements for specific disclosures, consent language, and prohibited phrases. When your contact center quality assurance system can’t accurately transcribe what was said, it can’t flag what was missed. At 70% accuracy, roughly 3 out of every 10 words are wrong. A compliance phrase like “this call is being recorded for quality purposes” might get garbled enough that the system marks it as delivered when it wasn’t. Or worse, flags a compliant call as non-compliant, wasting your compliance team’s time on false positives.
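One practical mitigation is to match required disclosures fuzzily rather than exactly, so a lightly garbled transcript still registers the phrase while a genuinely missing one does not. Here's a minimal sketch using Python's standard-library difflib; the phrase, threshold, and sample transcripts are illustrative, and in production you'd tune the threshold against human-verified calls.

```python
# Minimal sketch: fuzzy detection of a required disclosure in an imperfect
# transcript, standard library only. Phrase, threshold, and sample transcripts
# are illustrative.
from difflib import SequenceMatcher

DISCLOSURE = "this call is being recorded for quality purposes"

def contains_disclosure(transcript, phrase=DISCLOSURE, threshold=0.8):
    words = transcript.lower().split()
    span = len(phrase.split())
    best = 0.0
    # Slide a window of roughly the phrase length across the transcript.
    for i in range(max(1, len(words) - span + 1)):
        window = " ".join(words[i:i + span])
        best = max(best, SequenceMatcher(None, window, phrase).ratio())
    return best >= threshold, round(best, 2)

garbled = "thanks for calling this call is been recorder for quality purpose how can i help"
missing = "thanks for calling how can i help you today"
print(contains_disclosure(garbled))  # detected despite the ASR errors
print(contains_disclosure(missing))  # not detected: similarity stays low
```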
QA scoring becomes guesswork. If the transcript says the agent said “I understand your frustration, let me help you with that” when they actually said something less professional, your automated quality assurance scores that interaction higher than it deserves. Scale that across thousands of evaluations, and you’re coaching agents based on fiction. The data from Deepgram’s semantic error analysis is telling: systems with ~14% WER showed semantic error rates above 20%. Acceptable word accuracy was masking critical business-level failures.
Coaching recommendations go sideways. We wrote about how contact center coaching grounded in real call analysis cuts attrition. But coaching grounded in bad transcription does the opposite. It erodes agent trust. When an agent reviews their scored call and the transcript doesn’t match what they said, they stop trusting the system. And when agents don’t trust the QA system, you’ve lost the foundation for every performance improvement initiative.
Cresta’s analysis put a specific number on it: a 1% improvement in WER across 1 million minutes of audio eliminates 10,000 transcription errors. Each of those errors potentially affects a sentiment score, an intent classification, a call summary, or a compliance flag.
For a 500-seat contact center handling 50,000 calls per month, that’s roughly 25,000-40,000 hours of audio per year. At the accuracy levels we see from general-purpose ASR on real CC audio (15-25% WER), you’re looking at hundreds of thousands of word-level errors per month. Some are harmless (misrecognizing “uh-huh” as “uh”). Some are catastrophic (missing a cancellation request, misclassifying a complaint as a compliment, failing to detect a regulatory disclosure gap).
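For context, the hours figure falls straight out of call volume and average handle time. The 2.5-4 minute handle-time range in this quick calculation is an illustrative assumption, not a number from the studies cited here.

```python
# Quick arithmetic behind the hours figure. The handle-time range is an
# illustrative assumption, not a figure from the cited studies.
calls_per_month = 50_000
avg_handle_time_min = (2.5, 4.0)  # assumed average call length in minutes
hours_per_year = [calls_per_month * 12 * t / 60 for t in avg_handle_time_min]
print(hours_per_year)  # [25000.0, 40000.0] hours of audio per year
```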
CallTrackingMetrics reported that 40% of their transcripts were unusable before they switched ASR providers. After the switch, usable transcripts jumped above 90%. Five9 doubled their IVR authentication success rate by moving to a lower-WER speech engine. Another company saw a 16% jump in conversion rates after implementing customized speech-to-text for their call center.
These aren’t marginal improvements. They’re the difference between a speech analytics investment that actually works and one that generates dashboards full of confident-looking numbers built on bad data.
NVIDIA published a case study that cuts to the heart of this. They fine-tuned an ASR model on just 20 hours of challenging contact center conversations. The result: their customized model achieved 20.32% mean WER on US English CC audio, compared to 22.22% for the best third-party provider and 44.51% for the worst. On UK English, the gap was even wider: 20.99% vs 33.46%.
That’s the difference between purpose-built and off-the-shelf. And 20 hours of fine-tuning data is nothing. At Ender Turing, we’ve trained our ASR on thousands of hours of real contact center recordings across 30+ languages. Noisy calls. Overlapping speech. Agents mumbling wrap-up notes while the customer is still talking. The specific audio conditions that general-purpose models were never optimized for.
Cresta’s research showed that domain-specific fine-tuning dropped WER from 63.5% to 32.0%. That’s a 31.5 percentage point improvement from adapting to the domain. Keyword boosting alone delivers 5-15% WER improvement. Audio preprocessing adds another 5-10%. These aren’t expensive retraining projects. They’re configuration-level changes that dramatically improve downstream analytics.
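What a configuration-level change actually looks like varies by vendor, but the shape is similar everywhere: you pass the engine a list of domain terms with boost weights and tell it what kind of audio it's receiving. The payload below is a hypothetical, vendor-neutral example; the field names (keyword_boosts, boost, and so on) are illustrative, not any specific provider's API.

```python
import json

# Hypothetical, vendor-neutral transcription request illustrating
# configuration-level improvements: a telephony-tuned model, an honest
# sample rate, and boosted domain vocabulary. Field names are illustrative.
request = {
    "audio_url": "https://example.com/recordings/call_0001.wav",  # hypothetical
    "model": "telephony",    # prefer a narrowband/telephony model if the vendor offers one
    "sample_rate_hz": 8000,  # declare what the engine is actually getting
    "language": "en",
    "keyword_boosts": [
        # domain vocabulary that general-purpose models routinely garble
        {"phrase": "KYC verification", "boost": 10},
        {"phrase": "tier-two account", "boost": 8},
        {"phrase": "HIPAA", "boost": 10},
        {"phrase": "callback scheduling", "boost": 5},
    ],
}
print(json.dumps(request, indent=2))
```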
But here’s what most buyers miss: the accuracy gap between providers is small on clean audio and massive on degraded audio. On studio-quality recordings, everyone clusters within 1-2 percentage points. The real differentiation happens on the noisy, compressed, accented, overlapping-speaker calls that actually make up your contact center’s daily reality. And that’s exactly where generic models fail.
The conversation intelligence market hit $28.5 billion in 2025 and is growing at 13% annually. Speech analytics adoption in contact centers jumped from 28% in 2022 to 37.5% in 2023, with broader AI-driven conversation analytics projected to reach 78% of contact centers by 2026.
But adoption doesn’t equal effectiveness. Only 34% of CX leaders feel fully prepared to execute AI at scale. And 56% are failing to realize ROI on their AI investments, according to COPC. A significant chunk of that failure traces back to the accuracy problem. When your foundational layer, speech-to-text, is operating at 70-80% accuracy instead of 90%+, every analytics feature built on top of it underperforms. Your contact center ROI calculations look worse because the signals are murky.
The technology is getting better. OpenAI’s latest transcription model cut hallucinations by 90% compared to Whisper v2. Deepgram’s Nova-3 achieves 5.26% WER in batch mode. But these numbers are benchmarks on controlled audio. The question isn’t whether speech-to-text is improving. It’s whether your specific deployment, on your specific audio, with your specific vocabulary and accent mix, is actually accurate enough to trust.
1. Audit your current transcription accuracy on real calls. Don’t trust vendor benchmarks. Pull 50 calls that represent your actual audio conditions (noisiest floor, worst phone quality, highest accent diversity). Have a human transcribe 10 of them. Compare against your system’s output. Calculate your real WER; there’s a minimal calculation sketch after this checklist. If it’s above 15%, your downstream analytics are materially compromised.
2. Check whether your ASR is optimized for your audio. Are you running 8 kHz narrowband audio through a model trained on 16 kHz wideband? Is keyword boosting enabled for your domain vocabulary? Are your models adapted to the accents and languages in your operation? These configuration-level fixes can improve accuracy by 10-25% without changing vendors.
3. Test your compliance flags against ground truth. Pull 20 calls your system flagged as compliant. Listen to them. Were the required disclosures actually made? Then pull 20 calls flagged as non-compliant. Were they actually violations? If your false positive or false negative rate is above 10%, your compliance program has a transcription problem masquerading as a process problem.
4. Measure the downstream impact, not just WER. Word error rate tells you about transcription. But what you actually care about is: are the right calls getting flagged? Are sentiment scores reflecting reality? Are QA evaluations matching what humans hear? Build a small validation set. Score it both ways. The gap between automated and human-verified scores is the real measure of whether your speech analytics is working.
5. Ask your vendor about telephony-specific training. General-purpose ASR is fine for podcasts, meetings, and dictation. Contact center audio is a different beast. Purpose-built models trained on narrowband telephony, domain vocabulary, and accent-diverse datasets outperform general models by 10-30+ percentage points on real CC audio. If your vendor can’t tell you specifically how their model handles 8 kHz compressed telephony with background noise, that’s your answer.
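Back to step 1: word error rate is just substitutions plus deletions plus insertions, divided by the number of words in the human reference, so you don't need a vendor tool to compute it. Here's a minimal, dependency-free sketch; the two sample transcripts are made up, and open-source packages such as jiwer do the same calculation at scale.

```python
# Minimal word error rate (WER) calculation: word-level edit distance between
# a human reference transcript and the ASR output, divided by reference length.
# The sample transcripts are made up.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / max(1, len(ref))

human = "i understand your frustration let me help you with that today"
asr   = "i understand your frustration let me up you with that"
print(f"WER: {wer(human, asr):.1%}")  # 2 errors over 11 reference words = 18.2%
```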
The speech analytics market is booming. But the gap between what vendors promise on benchmark data and what actually works on your contact center floor is the most expensive blind spot in the industry. The companies that close that gap first are the ones whose call monitoring, compliance, coaching, and customer intelligence programs will actually deliver.