AI Voice Agents: The Quality Questions Nobody Is Asking Yet

A regional bank we worked with launched an AI voice agent in late 2025 to handle routine balance inquiries, payment confirmations, and basic account questions. The deployment was treated as a phone-channel extension of their existing IVR: same regulatory framework, same QA approach (essentially none), same launch criteria (the bot answered correctly in user acceptance testing). Six months later, the bot was handling about 14% of inbound calls.

We were brought in to look at a specific incident. A customer had received what they described as confusing and contradictory information from the AI agent about their loan. The customer ended the call frustrated, escalated to a human agent the following day, and lodged a formal complaint. The bank’s compliance team wanted to understand what happened.

We pulled the transcript. The AI agent had said three things in the course of the conversation that, taken together, were misleading about the customer’s payment options: not lies, exactly, but a combination of partial information and confident phrasing that would have caused a reasonable customer to misunderstand their position. Any of the bank’s human agents would have been QA-flagged for any one of those statements. The AI agent had said all three to this customer, and no one had reviewed the conversation until the customer complained.

This is the situation modern contact centers are walking into with AI voice agents, and almost no one has built the supervision infrastructure that the situation requires. The technology has outpaced the governance, and the governance gap is widening as deployment accelerates.

What Makes AI Voice Agents Different From IVR

The temptation is to treat AI voice agents as smarter IVR. They’re not. The differences matter substantively.

IVR is deterministic. It plays pre-recorded prompts in a tree structure. Every utterance has been reviewed and approved. The customer hears exactly what the company intended them to hear, in the order intended. Compliance review can be done once, at design time, and trusted to apply at runtime.

AI voice agents generate responses dynamically. The exact words spoken to a given customer have not been pre-approved. They’re produced by a model interpreting the conversation in real time and generating language that fits the context. This generation is influenced by training data, system prompts, conversation history, and the specific phrasing the customer used. The compliance posture that worked for IVR doesn’t extend to AI voice agents, because the assumption it depended on, pre-approved language, no longer holds.

This shift is structural, and most deployments haven’t recognized it yet. The result is AI voice agents operating in regulated industries with effectively no per-conversation oversight, on the assumption that they’re an evolution of IVR. They’re not. They need different governance.

The Five Questions That Should Govern Voice AI Deployment

We covered the broader version of this in Who’s QA-ing Your AI Agents. The specific voice-agent application sharpens the questions.

1. How do you know what the agent actually said? Logging the conversation isn’t enough. Someone or something has to be reviewing what was said against what should have been said. Without that review loop, your only signal that something went wrong is a customer complaint, which is too late and too narrow.

2. How do you detect compliance failures? A misstatement about fees, a misleading description of a product, a missed required disclosure: these need to be caught by systematic review, not by customer complaints. This requires running compliance criteria against the actual conversation content, not against the training documentation.

3. How do you handle conversations that go wrong mid-call? When the AI agent encounters a situation it isn’t equipped for, the failure mode matters. Does it gracefully transfer to a human with full context? Does it confidently invent an answer? Does it loop the customer through circular responses? Each failure mode has different downstream costs.

4. How do you measure customer outcome, not bot containment? The bot containment metric (the percentage of conversations completed without human transfer) measures the wrong thing. The right question is whether the customer’s issue was actually resolved, whether they had to follow up, and whether their experience would predict retention.

5. What’s your kill switch? When the AI agent is producing systematically poor outcomes (wrong information, regulatory violations, customer dissatisfaction), how quickly can you take it offline? In legacy IVR, this is straightforward. In AI voice deployments, the answer is often “we’d have to figure it out,” which means there’s no real kill switch.
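To make question 4 concrete, here is a minimal Python sketch of how containment and actual resolution can diverge on the same set of calls. The `CallOutcome` fields and the sample data are hypothetical, invented for illustration; a real deployment would populate them from call records and follow-up contact data.

```python
from dataclasses import dataclass


@dataclass
class CallOutcome:
    # All three fields are illustrative assumptions, not a real schema.
    transferred_to_human: bool  # did the bot hand off during the call?
    issue_resolved: bool        # was the customer's issue actually resolved?
    followed_up: bool           # did the customer contact the bank again about it?


def containment_rate(calls):
    """Share of calls completed without a human transfer."""
    if not calls:
        raise ValueError("no calls to measure")
    return sum(not c.transferred_to_human for c in calls) / len(calls)


def true_resolution_rate(calls):
    """Share of calls the bot resolved with no transfer and no follow-up."""
    if not calls:
        raise ValueError("no calls to measure")
    resolved = sum(
        (not c.transferred_to_human) and c.issue_resolved and (not c.followed_up)
        for c in calls
    )
    return resolved / len(calls)


calls = [
    CallOutcome(transferred_to_human=False, issue_resolved=True, followed_up=False),
    CallOutcome(transferred_to_human=False, issue_resolved=False, followed_up=True),
    CallOutcome(transferred_to_human=False, issue_resolved=True, followed_up=True),
    CallOutcome(transferred_to_human=True, issue_resolved=True, followed_up=False),
]

print(f"containment: {containment_rate(calls):.2f}")
print(f"true resolution: {true_resolution_rate(calls):.2f}")
```

In this made-up sample, three of four calls count as contained, but only one was resolved by the bot without the customer having to come back. The point is the gap between the two numbers, not the specific definition of resolution used here.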

The Regulatory Acceleration

The regulatory environment is moving faster on AI voice agents than most contact center operators have noticed.

The EU AI Act, in its high-risk classifications, brings AI systems used for employment-relevant or essential-service customer interactions under significant transparency, oversight, and outcome-monitoring requirements. AI voice agents handling financial services, healthcare, and government services qualify, and the enforcement provisions begin substantial application through 2026.

The CFPB has signaled increasing scrutiny of automated customer interaction in consumer finance, with explicit reference to the limitations of bot-based responses and the obligation to ensure substantive equivalence between human and automated channels.

State-level legislation in the US (California, New York, and others) is moving toward disclosure and accountability requirements for AI-driven customer interactions. The patchwork will be operationally complex for any contact center handling cross-state customers.

The pattern is consistent: AI voice agents are losing the regulatory grace period in which they were treated as enhanced IVR. They’re being recognized as substantive automated decision systems with the oversight obligations that come with that recognition.

What Realistic Voice AI Governance Looks Like

Contact centers serious about voice AI deployment are converging on a four-part governance approach.

Sampling-plus-comprehensive review. A subset of conversations gets deep human review. A larger subset, ideally all, gets automated review against compliance and quality criteria. The combination catches what either alone would miss.

Continuous evaluation against ground truth. Periodic blind tests, where the AI agent’s responses are evaluated against expert-defined correct answers. This catches drift between the model’s behavior and the policies it’s supposed to reflect.

Human-in-the-loop for high-stakes interactions. Conversations involving certain topics (account closure, financial advice, vulnerable-customer situations) get routed to humans or get human review of the AI’s response before it goes to the customer. This is operationally heavier, but it’s what regulatory expectations are moving toward.

Outcome measurement. The bot’s actual effect on customer experience and outcomes is measured continuously, not assumed from its design specifications. This data is what proves the deployment is working, or reveals that it isn’t.
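As a rough illustration of the automated-review and human-in-the-loop parts, the sketch below runs a transcript against a small set of criteria and decides whether the conversation should be routed to a human. Everything in it is an assumption made for the example: the pattern names, the regular expressions, and the `(speaker, text)` transcript shape are hypothetical. Keyword matching stands in for whatever review method a real deployment would use, and real criteria would come from the compliance team.

```python
import re
from dataclasses import dataclass, field

# Hypothetical criteria for illustration only.
PROHIBITED_PATTERNS = {
    "guarantee": re.compile(r"\bguarantee[ds]?\b", re.IGNORECASE),
    "no_fees_claim": re.compile(r"\bno fees?\b", re.IGNORECASE),
}
HIGH_STAKES_TOPICS = {
    "account_closure": re.compile(r"\bclos(e|ing) (my|the) account\b", re.IGNORECASE),
    "financial_advice": re.compile(r"\bshould i (invest|refinance)\b", re.IGNORECASE),
}


@dataclass
class Review:
    flags: list = field(default_factory=list)  # (turn index, rule name) pairs
    route_to_human: bool = False


def review_transcript(turns):
    """Review a list of (speaker, text) tuples; speaker is 'agent' or 'customer'."""
    review = Review()
    for index, (speaker, text) in enumerate(turns):
        if speaker == "agent":
            # Check what the agent actually said, not what it was trained to say.
            for name, pattern in PROHIBITED_PATTERNS.items():
                if pattern.search(text):
                    review.flags.append((index, name))
        elif speaker == "customer":
            # A high-stakes topic raised by the customer triggers human routing.
            if any(p.search(text) for p in HIGH_STAKES_TOPICS.values()):
                review.route_to_human = True
    return review


transcript = [
    ("customer", "Should I refinance my loan?"),
    ("agent", "There are no fees and approval is guaranteed."),
]
result = review_transcript(transcript)
print(result.flags, result.route_to_human)
```

A check this simple will miss misleading statements that use no flagged phrase, which is why the approach above pairs automated review with deep human review of a sample.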

Five Things You Can Do This Week

1. Pull 20 random conversations from your AI voice agent. Review them for accuracy, compliance, and customer outcome. The distribution of issues will be more informative than any vendor benchmark.

2. Identify your three highest-stakes conversation topics. What does the AI agent currently handle that, if it went wrong, would create real damage? Those are your priority for additional governance.

3. Audit your detection capability. When the AI agent makes a mistake, how do you find out? If the answer depends on customer complaints, your governance has a gap.

4. Map your AI deployment against current regulatory frameworks. EU AI Act, state-level US legislation, sector-specific rules. The gap between where you are and where compliance is heading is the work ahead.

5. Build an explicit kill switch. Documented process, tested at least once, with criteria for when it gets used. The absence of this is one of the most common gaps in current voice AI deployments.
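One way to read "explicit" in item 5 is that the criteria live in code or configuration rather than in someone’s head. The Python sketch below is a minimal illustration under that assumption. The thresholds, the `KillSwitchCriteria` and `VoiceAgentRouter` names, and the fallback to a human queue are all hypothetical; actual criteria and routing depend on the platform and on what compliance signs off on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KillSwitchCriteria:
    max_flag_rate: float = 0.05  # hypothetical: share of reviewed calls flagged
    min_reviewed: int = 50       # hypothetical: do not act on tiny samples


def should_disable(reviewed, flagged, criteria=KillSwitchCriteria()):
    """Return True when the documented criteria say the agent should go offline."""
    if reviewed < criteria.min_reviewed:
        return False
    return flagged / reviewed > criteria.max_flag_rate


class VoiceAgentRouter:
    """Toy router: sends calls to the AI agent unless the switch has been thrown."""

    def __init__(self):
        self.ai_enabled = True
        self.disabled_reason = None

    def disable_ai(self, reason):
        self.ai_enabled = False
        self.disabled_reason = reason

    def route(self, call_id):
        return "ai_agent" if self.ai_enabled else "human_queue"


router = VoiceAgentRouter()
if should_disable(reviewed=200, flagged=20):
    router.disable_ai("flag rate above documented threshold")

print(router.route("call-001"), router.disabled_reason)
```

Running a drill against something like this, even once, answers the question of how quickly the agent can be taken offline with a measured number instead of a guess.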

The AI voice agent that’s been deployed into your customer-facing operation isn’t enhanced IVR. It’s a dynamic generation system making consequential statements to customers in regulated contexts, with governance built for static IVR systems. The gap is widening as deployment accelerates and oversight doesn’t. The bank that didn’t know about its loan-misstatement incident until the customer complained isn’t an aberration. It’s the predictable result of the current state of voice AI governance, and most operators are about to learn the same lesson the same way.