The Promise and Reality of AI Voice Features
Testing Claude's Live Feature and What It Reveals About the Current State of Voice AI
Some thoughts about the gap between marketing promises and technical reality.
So, Claude's new live voice feature has just been released, and I have, of course, been busy testing it. While I'm definitely glad that Anthropic added this feature, I'm not blown away by it. The experience, however, got me thinking about the current state of voice AI across the board - not just Claude, but ChatGPT, Gemini, and the broader landscape of what we're calling "conversational AI."
The Architecture of Illusion
Here's the thing that struck me immediately: these "live" features aren't actually live at all, obviously. What we have is fundamentally a 4-step process that every major AI company seems to be implementing in roughly the same way:
1. I speak
2. Speech gets converted to text (speech-to-text)
3. AI reads and responds in text
4. Text gets converted back to speech (text-to-speech)
It's like having a conversation through multiple translators, each introducing their own potential for error and delay. And those conversion steps? They break everything the live feature is supposed to deliver, sadly.
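The 4-step round trip above can be sketched in a few lines. This is a toy illustration, not any vendor's actual API: the three stage functions are stand-ins, and the "mishearing" is hard-coded to show how an error introduced at step 2 survives untouched through the rest of the chain.

```python
def transcribe(audio: str) -> str:
    # Stand-in for speech-to-text. For illustration, audio is already a
    # string of "heard" words, with a toy mishearing baked in.
    return audio.replace("Gemini", "Germany")

def generate_reply(text: str) -> str:
    # Stand-in for the text-only language model in the middle.
    # It never sees the audio, only the (possibly wrong) transcript.
    return f"Here is what I know about: {text}"

def synthesize(text: str) -> str:
    # Stand-in for text-to-speech; returns a label instead of audio.
    return f"<spoken> {text}"

def live_turn(audio: str) -> str:
    # The full "live" round trip: speak -> transcribe -> reply -> speak.
    # Each stage adds latency, and an error in any stage propagates.
    return synthesize(generate_reply(transcribe(audio)))

print(live_turn("Tell me about Gemini"))
# → <spoken> Here is what I know about: Tell me about Germany
```

Note that the language model in the middle has no way to recover: it only ever sees the transcript, so the transcription error becomes its ground truth.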
The Danish Accent Problem (and What It Reveals)
When I say "Gemini" with my Danish accent, the system consistently hears "Germany." When I spell out "ChatGPT" letter by letter, it becomes "TBT." This happens both with the existing speech-to-text feature (which I still use a lot) and with the new live feature. Even when I pronounce Gemini very pointedly, it still translates it into Germany. I have messages where I'm apparently having conversations about Germany and Czech, even though I'm sure that I said Gemini and ChatGPT.
This isn't just a quirky anecdote about my accent - it reveals something fundamental about the limitations of current speech recognition technology. The systems seem to default to more common words (like country names) even when the acoustic input doesn't quite match, rather than considering less common but contextually relevant terms (like AI assistant names in a conversation about technology).
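That "default to common words" behavior is essentially how probabilistic speech decoding works: the recognizer combines how well the audio matches a word (the acoustic score) with how frequent that word is (the language-model prior), and a strong prior can outvote a better acoustic match. The sketch below is a deliberately simplified toy, with invented numbers, not a real ASR decoder:

```python
import math

# Invented scores for illustration only.
acoustic_score = {"Gemini": 0.7, "Germany": 0.5}   # P(audio | word): how well the sound matches
word_prior     = {"Gemini": 0.001, "Germany": 0.02}  # P(word): how common the word is overall

def decode(candidates):
    # Pick the candidate maximizing log P(audio|word) + log P(word),
    # the standard noisy-channel formulation of speech recognition.
    return max(
        candidates,
        key=lambda w: math.log(acoustic_score[w]) + math.log(word_prior[w]),
    )

print(decode(["Gemini", "Germany"]))  # → Germany
```

Even though "Gemini" fits the audio better (0.7 vs 0.5), "Germany" wins because its prior is 20 times larger. A context-aware decoder would boost the prior of "Gemini" in a conversation about AI assistants, which is precisely what these systems appear not to do.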
The irony is that these AI assistants can engage in sophisticated reasoning about complex topics, but they can't reliably distinguish between "Gemini" and "Germany" when spoken by someone with a non-American accent. It's a reminder that the most basic layer of the technology stack - speech recognition - often becomes the weakest link in the chain.
The Interruption Paradox
A curious aspect of the live feature is the ability to interrupt the AI mid-speech. This is marketed as making conversations feel more natural, and in some ways it does. The AI can detect when I'm trying to interrupt and will pause its response to let me speak.
But here's where it gets interesting: the system can't actually hear if the noise I'm making is me wanting to interrupt it, or if it's me coughing, sneezing, or telling someone who just walked into the room to wait a moment. This happens consistently across all the AI assistants I've tested this with.
So we have systems that are sophisticated enough to detect vocal interruptions but not sophisticated enough to understand the context of those interruptions. The AI doesn't know whether I'm clearing my throat or trying to correct it. It's responding to audio patterns, not to communicative intent.
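This is easy to see if you consider what a simple interruption detector has to work with. The toy detector below (thresholds invented for illustration; real systems use trained voice-activity-detection models, but the blind spot is similar) only asks whether the incoming audio is loud enough for long enough, so a cough and a genuine "wait, stop" are indistinguishable:

```python
def is_interruption(frames, threshold=0.3, min_frames=3):
    # frames: per-frame audio energy levels in the range 0.0-1.0.
    # Returns True if energy stays above the threshold for a sustained run,
    # regardless of whether the sound carries communicative intent.
    run = 0
    for energy in frames:
        run = run + 1 if energy > threshold else 0
        if run >= min_frames:
            return True
    return False

speech = [0.1, 0.5, 0.6, 0.55, 0.4]  # someone saying "wait, that's wrong"
cough  = [0.1, 0.7, 0.8, 0.6, 0.1]   # a cough of similar loudness

print(is_interruption(speech), is_interruption(cough))  # → True True
```

Both trigger the pause. Distinguishing them would require understanding the *content* of the audio, which is exactly the capability the cascaded architecture pushes downstream to a model that only ever sees text.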
The Uncanny Valley of Conversation
What emerges from all this is something I'd call the uncanny valley of AI conversation. We're getting the illusion of natural conversation while still dealing with all the friction of text-based interaction, plus new friction points introduced by the conversion processes.
The "live" features promise seamless communication but deliver something that feels almost-but-not-quite natural. In some ways, this might be worse than clearly artificial interactions, because the expectation is set for natural conversation, but the reality keeps falling short in jarring ways.
Use Cases and Context Dependency
Don't get me wrong - I love the idea. Speaking is often faster and more natural than typing, especially on phones. But the current implementation feels like we're in this awkward transitional period where the technology hints at what direct speech-to-speech could be, without actually delivering on it.
The value probably varies significantly by use case. Simple Q&A might work fine despite the limitations - if you're asking about the weather or basic facts, transcription errors and context gaps might not matter much. But for nuanced conversations where context, tone, and conversational flow matter? Those conversion barriers become real problems.
Consider the difference between asking "What's the capital of France?" versus having a detailed discussion about technical implementation details, or trying to brainstorm creative solutions to complex problems. The first works fine with current voice AI; the latter often breaks down precisely where the technology promises to add the most value.
The Broader Implications
This raises interesting questions about how we evaluate and adopt new AI features. The marketing around voice AI emphasizes the naturalness and convenience, but the reality is that we're still dealing with multiple layers of technological mediation, each with its own failure modes.
There's also something to be said about accent bias and linguistic inclusivity in these systems. If the speech recognition consistently fails for non-native speakers or people with non-standard accents, then the "convenience" of voice features becomes a privilege available primarily to native speakers of the training language.
What Would True Speech-to-Speech Look Like?
True speech-to-speech AI would eliminate many of these friction points. Instead of converting speech to text and back, the system would process audio directly, potentially understanding not just the words but the tone, emphasis, context, and even environmental sounds that provide communicative information.
Such a system could distinguish between an intentional interruption and a cough. It could pick up on sarcasm, uncertainty, or excitement in ways that text-based processing simply can't. It could understand when someone says "Gemini" even with a Danish accent, because it wouldn't be trying to match audio to a limited vocabulary of likely words.
But we're not there yet, and it's worth being honest about the gap between current capabilities and the marketing promises.
The User Experience Reality
I'm curious about other people's experiences with these voice features. Are you finding genuine value despite the limitations? Or are you, like me, waiting for true speech-to-speech technology to make the leap worthwhile?
My sense is that these voice AI features are useful in specific contexts - particularly for simple, transactional interactions - but fall short of the seamless conversational experience they promise. They're good enough to be occasionally helpful, but not good enough to replace text-based interaction for anything complex or nuanced.
Perhaps that's enough for now. But it's worth understanding what we're actually getting when we use these features, versus what the marketing suggests we're getting.
The technology will undoubtedly improve. Speech recognition will get better, especially for non-native speakers. The conversion processes will become faster and more accurate. Eventually, we might get true speech-to-speech AI that can engage in genuinely natural conversation.
But for now, we're in that transitional period where the promise exceeds the reality, and it's worth being clear-eyed about both the potential and the limitations.
Note: When I mention AI in this post, I'm specifically referring to Generative AI. I'm just too lazy to write "Gen AI" every time.


