Voice first does not mean voice only

Conversational voice is a brilliantly inclusive communication mode. But as illustrated here, we have limited bandwidth for processing a noisy soundscape, so sometimes we need bi-modal solutions.

[Image: a blind man tripping over luggage on a chaotic station platform]

I am super bullish about the assistive potential of the latest conversational voice models for vision-impaired users. Hobson has a ‘voice first’ paradigm, and conversational voice startups are hot funding prospects; A16Z and YC in particular have a strong voice thesis. But audio is not without its problems as a communication medium.

Anyone who has spent time recently at a British train station is likely to have experienced the irritation of excessive announcements. Yesterday, in the 5 minutes before my train arrived, I was advised to:

  • a) not be a complete pillock standing too close to the platform edge
  • b) buy a ticket to avoid a confrontation with a ‘revenue protection officer’
  • c) report anyone else who I thought might need sorting out  
  • d) be aware that optimising my journey time is doomed to failure because the doors lock automatically a minute before departure
  • e) remember my luggage
  • f) desist from skateboarding
  • g) go online to find out why I won’t be able to travel over the coming weekend 
  • h) be prepared for everything to go pear-shaped due to British weather, British workers or British rolling stock.

All of the above was interspersed with 3 repeated announcements of the 15 stations where my train would stop, and the two stations where it would not. The whole audio montage made me yearn for the days when there were no announcements at all and I just asked the driver where the train was going. I can actually recall a halcyon interlude when public address systems were ubiquitous but used judiciously and sparingly, before they became the current source of wall-to-wall superfluous, condescending or arse-covering pre-recorded messages on infinite loops. As someone who used to advocate fervently for audio signage, I find myself switching camps and wondering why we don’t have an Office for Peace and Quiet, entrusted with seeking out and destroying gratuitous pre-recorded public announcements. Or maybe AI could detect whether recent platform arrivals included a pillock, a scoundrel, an amnesiac, a skateboarder or a naïve optimist, before repeating the loop yet one more time.

The above self-indulgent rant is simply to say that converting visual information to an audio stream is extremely prone to overloading the listener. We really do not have much capacity to absorb sound, which creates a real challenge for audio-based assistive technology for blind people. 

We can simultaneously process many, many aspects of a visual image, including the relationships between objects in three dimensions, facial expressions, nuanced body language, colours, textures, perspectives, gradients, and the list goes on. In the real world, of course, there is also the time dimension, introducing trajectories, movement and the constant realignment between objects. In contrast, our other senses are much less rich. We can absolutely extract meaning from tone of voice to augment the spoken word, but an audio description of all but the simplest visual scene will be simultaneously very lengthy and very partial. It’s surprisingly tricky even to describe the contents of a plate of food to a blind person succinctly and functionally, let alone to convey the artistic way it is arranged. Hint: using a clock face metaphor is the best way I know; a sketch of the idea follows below.
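
For the curious, here is a minimal sketch of how that clock face metaphor could be automated. The function names, coordinate convention and example plate are all illustrative assumptions of mine, not taken from any existing tool: given each item’s offset from the centre of the plate, it emits a phrase like “peas at 3 o’clock”.

```python
import math

def clock_position(dx: float, dy: float) -> int:
    """Map an offset from the plate centre to an hour on the clock face.

    Convention (an assumption for this sketch): x grows to the diner's
    right, y grows away from the diner, so 12 o'clock is the far edge
    of the plate and hours increase clockwise.
    """
    angle = math.degrees(math.atan2(dx, dy)) % 360  # 0 deg = 12 o'clock, clockwise
    hour = round(angle / 30) % 12
    return 12 if hour == 0 else hour

def describe_plate(items: dict[str, tuple[float, float]]) -> str:
    """Turn {item: (dx, dy)} offsets into one succinct, audio-ready sentence."""
    parts = [f"{name} at {clock_position(dx, dy)} o'clock"
             for name, (dx, dy) in items.items()]
    return "; ".join(parts) + "."

print(describe_plate({
    "chicken": (0.0, -1.0),   # nearest the diner
    "peas": (1.0, 0.0),       # to the diner's right
    "potatoes": (-0.9, 0.5),  # upper left of the plate
}))
# -> chicken at 6 o'clock; peas at 3 o'clock; potatoes at 10 o'clock.
```

The point is the compression: three items, one short sentence, which is about as much audio as a listener can comfortably hold.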

Audio feedback is particularly tricky when navigating an unfamiliar space, where the timeliness of precise guidance is critical. Anyone who has guided a blind person has probably done it by linking arms or through some sort of physical contact. Similarly, guide dog users are physically connected to their dog. And maybe the white cane has persisted for so many years as the tool of choice for blind people because physical contact with the physical world is so profoundly visceral.

When developing or selecting assistive tools, there are sometimes choices to be made regarding modality: audio feedback or haptic feedback. But very often, when interacting with the physical world, bi-modal feedback works best; a rough sketch of the idea follows below. An upcoming post explores just such a navigation solution, which is now a reality, leveraging powerful, low-cost, AI-enabled robotics.
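
To make ‘bi-modal’ concrete, here is an illustrative sketch; the routing rule, thresholds and names are assumptions of mine, not the design of the solution in the upcoming post. Time-critical spatial cues go to haptics, where reaction time matters; richer context goes to speech; and low-priority information is queued so it never becomes a station-announcement loop.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    text: str           # spoken form, e.g. "luggage ahead"
    urgency: float      # 0.0 (background info) .. 1.0 (imminent hazard)
    bearing_deg: float  # direction relative to the user's heading

def route_cue(cue: Cue) -> str:
    """Decide which modality (or mix) delivers a navigation cue."""
    if cue.urgency >= 0.7:
        # Imminent hazards: a directional vibration is faster to act on
        # than a sentence, so haptics lead and speech stays terse.
        return f"haptic pulse at {cue.bearing_deg:+.0f} deg, then say '{cue.text}'"
    if cue.urgency >= 0.3:
        # Useful but not urgent: speech alone carries the nuance.
        return f"say '{cue.text}'"
    # Background information is held back until the user is idle,
    # precisely to avoid the announcement overload bemoaned above.
    return f"queue '{cue.text}' for later"

print(route_cue(Cue("luggage on the platform edge", 0.9, -20)))
print(route_cue(Cue("your train leaves from platform 4", 0.2, 0)))
```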