Conversational Authenticity

Voice AI agents will transform how we interact with technology — but only if they solve the same turn-taking problems that plagued transatlantic phone calls for decades. Sometimes the simplest solution is the most authentic.

[Image: contrasting painful 1980s telephony with modern, authentic voice calls]

For conversational voice agents to transform our interaction with technology, they must feel authentic, which some say will only come with video understanding. Granted, body language and facial expression are key elements of human communication, but they are not essential. I seem to manage OK despite being blind, and when the voice quality of a phone call is first-rate, the conversation feels pretty authentic.

Nonetheless, there are some engineering challenges beyond the power and reliability of the large language models themselves. The evolution of long-distance telephony over the past 50 years provides some useful insights.

Readers of a certain age will remember when transatlantic phone calls were miserable. You burned enough cash to buy a Michelin-star dinner, but you burned it for the audio equivalent of something even your dog wouldn't eat.

There was a painful pause before the other party started talking, caused by long transmission delays known as latency.

And reflexively muttering "uh huh" or a similar sound of encouragement into the low-fidelity microphone could counter-productively cause horrible stop/start stuttering.

Here's why. Telephone lines are necessarily duplex, meaning they work in both directions—always useful for a dialogue, not a monologue. But to save infrastructure costs, long-distance links were traditionally only built half-duplex, meaning you had to take it in turns to acquire the voice channel.

The network detected sound at each end of the link to switch the channel automatically, like a pressure pad turning the lights green when a car crosses it. So your muttered encouragement would interrupt the other party mid-flow.
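The switching behaviour described above can be sketched in a few lines. This is an illustrative simulation, not real telephony code; the class name, threshold, and energy levels are my own assumptions:

```python
# Illustrative sketch (not real telephony code): a half-duplex link that
# grants the channel to whichever end is currently producing sound, the
# way voice-activated switching worked on long-distance circuits.

class HalfDuplexLink:
    def __init__(self, threshold=0.1):
        self.threshold = threshold  # energy level that counts as "speech"
        self.holder = None          # which end currently owns the channel

    def frame(self, end_a_level, end_b_level):
        """Decide who gets the channel for this audio frame."""
        a_talking = end_a_level > self.threshold
        b_talking = end_b_level > self.threshold
        if a_talking and not b_talking:
            self.holder = "A"
        elif b_talking and not a_talking:
            self.holder = "B"
        elif a_talking and b_talking:
            # Both ends make sound at once: the switch flaps between
            # them and the audio stutters.
            self.holder = "contended"
        return self.holder

link = HalfDuplexLink()
print(link.frame(0.8, 0.0))   # A speaking alone: A holds the channel
print(link.frame(0.8, 0.3))   # B mutters "uh huh": contention, stuttering
```

The second call is the "muttered encouragement" case: even a quiet noise at one end trips the detector and fights the speaker for the channel.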

Although half-duplex transmission has long since disappeared, excessive transmission delays still make badly behaving WhatsApp or Zoom conversations very painful. In some cases you think the other person hasn't heard you, so repeat yourself, only to find you are talking over what they had already started saying. Other times distortion or jitter just make the experience frustrating and very tiring.

Wrestling with the technology rather than having a free-flowing conversation is a real barrier—one of the reasons we're willing to spend so much time and money travelling for face-to-face meetings.

Having said that, after many, many billions spent on network upgrades, a landline conversation with a colleague in another continent is now often indistinguishable from one with a colleague next door.

Equally, transatlantic WhatsApp or Zoom can be marvellously pain-free if everyone participating has great WiFi and the connecting high-speed networks are all flowing well.

This brings us to conversational voice AI agents, which will never feel truly like a human assistant unless they make the ergonomics of turn-taking as natural as a face-to-face conversation. And right now, real-time conversational voice AI feels like 1980s telephony, because:

  • State-of-the-art models convert audio to text, think about it, generate output text, and then convert that back to audio, which makes the conversation once again fundamentally half-duplex.
  • Turn detection is either too sensitive or too slow.
  • Network latency is compounded by the speech transformation to and from text and the model's thinking time.
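The delays in the list above are serial, which is why they compound so painfully. Here is a back-of-envelope latency budget for that text-relay pipeline; every number is an illustrative assumption, not a measurement:

```python
# Back-of-envelope latency budget for the speech -> text -> LLM -> text
# -> speech pipeline. All figures are illustrative assumptions.

STAGES_MS = {
    "network (each way)": 60,       # one leg; the return leg is added below
    "end-of-turn detection": 500,   # VAD waits for silence before acting
    "speech-to-text": 300,
    "LLM first token": 400,
    "text-to-speech start": 200,
}

def response_delay_ms(stages):
    # From the user falling silent to the agent's audio starting,
    # every stage runs one after another.
    return sum(stages.values()) + stages["network (each way)"]  # return leg

print(f"Perceived pause: ~{response_delay_ms(STAGES_MS)} ms")
```

Even with generous assumptions the pause lands around a second and a half, and the single biggest line item is simply waiting for silence to decide the user has finished their turn.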

Keeping the whole flow in the audio domain is under active research, but that feels a long way away for more than basic conversations. A more promising angle is full-duplex processing where one part of the model continues to listen to the user, generating text inputs, which the main model can interpret and either use to adjust its thinking on the fly or simply take as confirmatory muttering, rather than an interruption.

But there is a really simple solution to turn detection, one that is not hit-or-miss, and which shaves a good half-second off the time delay we find so painfully inauthentic.

When walkie-talkies dispensed with the push-to-talk button in favour of speech auto-detection, this enabled hands-free operation—essential in some environments. But Button Computer, from the latest Y Combinator cohort, have intentionally combined a low-tech physical push-to-talk button with a high-tech AI model, and the result feels very authentic. The founders are both ex-Apple engineers with mechanical and AI skills, and as you'd expect, they have brought much of the Apple design philosophy to their product. I'm investing in the business.
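The gain from the button can be put in one comparison. With silence detection, the system must wait out a timeout to be sure you have finished; with a button, releasing it *is* the end of your turn. The timeout figure below is an illustrative assumption:

```python
# Sketch of why a physical push-to-talk button beats silence detection:
# the button makes the end of the user's turn an explicit event, so the
# "wait for silence" timeout disappears. Numbers are illustrative.

SILENCE_TIMEOUT_MS = 500  # typical wait before a VAD declares end-of-turn

def end_of_turn_delay(mode):
    if mode == "vad":
        return SILENCE_TIMEOUT_MS  # must wait to be sure you've finished
    if mode == "push_to_talk":
        return 0                   # button release *is* the end of turn
    raise ValueError(mode)

saving = end_of_turn_delay("vad") - end_of_turn_delay("push_to_talk")
print(f"Button saves ~{saving} ms per turn")
```

That half-second comes straight off the top of the response pause, and unlike a tuned silence threshold it can never fire early and cut the user off mid-sentence.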

If you're interested in trying out the button towards the end of this year and you have a US mailing address, you can pre-order it for $179.