The Blind Man's Waltz

A reflection on possible paths for AI

Image of a blindfolded couple chaotically waltzing in a crowded restaurant

At last week’s Entrepreneurs First Global Retreat, 35 colleagues gamely donned blindfolds for some illuminating exercises, which were a bit of an eye-opener, shining a light on the challenges inherent in object recognition, communication and navigation. How many visual metaphors can you cram into a single contrived sentence?

Despite encouragement to the contrary, I did cancel the after-dinner blind waltzing and the blind conga race around the swimming pool, in deference to our esteemed COO, who owns the company risk register.

The title of this post is actually a reference to an enduring frustration throughout my life. That’s not my repeated failure to be selected for Strictly, but rather the one-step-forward, one-step-backward dance between inclusivity and technology.

Naively, you might think that, over time, technology would make the world more accessible to those with sensory impairments, but as next week’s post highlights further, the opposite is in fact very often true.

Ten days ago, Mary Meeker, the legendary tech investor and analyst, released a 340-page report on the state of AI and its accelerating adoption.

If you're not looking for something to occupy your entire weekend, Nate's Blog on Substack has a well-considered commentary on Meeker’s opus. Many other summaries and commentaries are also available.

Meeker is famous for her reports on internet trends, released every year for a quarter of a century from 1995. This new report is her first since 2019 and is focused exclusively on AI. It is a must-read for all AI developers, policy makers and investors, but unfortunately it’s a cannot-read for me, due to the sheer volume of graphical and tabular data supporting Meeker’s observations and predictions.

I was unsurprised to be excluded from reading the report, having spent the entire period since Meeker’s first 1995 report fighting with the inaccessibility of PDF documents. For a standard that was introduced before Windows 95, and before most people had a PC on their desk, let alone at home, it frankly stuns me that PDF document accessibility is still optional, requiring specific actions by the author to make it work. I’m not talking here about sophisticated image-descriptive tags, but just basic human rights, like identifiable spaces between words. Where is Greta Thunberg when you need her? OK, fair enough, she’s got bigger fish to fry.

Two recurring themes in Meeker's report are, firstly, the astonishing flow of capital and computing resources to the training of foundation models like ChatGPT, and secondly, the equally astonishing exponential reduction in the cost of responding to user questions, instructions or prompts (the model’s participation in these user interactions is technically called inference). This plummeting cost of inference ought to be music to the ears of blind people, who can benefit enormously from AI getting much, much better and cheaper at assisting them in the digital and physical worlds. Part of me is incredibly excited about AI's potential to include me in these predominantly visual worlds, not least through ultra-realistic conversational ‘voice-first’ interfaces.

But another part of me is genuinely terrified that the exponential power of AI may actually enable a new, even more visual ‘video-first’ user interface paradigm, which leapfrogs ‘voice-first’ interfaces in an acrobatic evolution of the mere waltz I was dancing until now.

Here’s a thought experiment. “A picture is worth a thousand words” and “every picture tells a story” are extremely prescient refrains, and are in many ways understated. Entire PhDs have been written on the Mona Lisa, and any competent driver can instantly assimilate a really complex and dynamic visual scene, registering signposts, avoiding potholes and still finding time to admire the ever-changing view.

Increasingly, YouTube Shorts or TikTok are the preferred methods for communicating and consuming tutorial content, as well as user-generated entertainment. The barrier to acceptable-quality video production is already low, but is about to get entirely blown away by AI-autogenerated video content. What if this becomes the normal way that all information is communicated? Instead of sending a text message with the time and location for our dinner date, what if your personal AI assistant sends me a beautiful, informative video short, including a clock face, an animated personalised map, some food imagery that doubtless looks much more appealing than it tastes, a clip of beautiful waiting staff flitting around the restaurant, and some background music, turning a simple dinner invite into a Hollywood blockbuster? Maybe very engaging if you can see, but infinitely less accessible for me than the current web experience. And just for context, the current web experience usually ranges between mildly frustrating and utterly impenetrable. If a ‘video-first’ paradigm really does become a thing, and if AI is given legal personhood, I may well be the first offender prosecuted under a new crime of grievous bodily harm against a digital personage.

More positively, if we do things right, AI could dynamically generate ultra-personalised content in the appropriate format for the target audience: audio for me, visual images for people with hearing impairments, and Hollywood blockbusters for anyone who cannot consume whole sentences after early-life exposure to the Cocomelon engagement algorithm.

But if we don’t have this thoughtful strategy of separating form (i.e. the representation of content) from function (i.e. the intrinsic meaning of the content), we will be continuously challenged by converting ever richer visualisations into audible text for blind access. Whether the imagery and video are necessary or gratuitous eye candy, they are extraordinarily difficult to convert into a succinct but sufficient audio stream.
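To make that separation concrete, here is a minimal sketch in Python, using the dinner-invitation example from earlier. The names (DinnerInvite, render_as_text, render_as_video_brief) are hypothetical and mine, not anyone’s actual API: the point is simply that the meaning lives in one structured object, and each audience gets its own rendering of that same object.

```python
from dataclasses import dataclass

@dataclass
class DinnerInvite:
    """The 'function': the intrinsic meaning of the invitation, independent of any presentation."""
    host: str
    venue: str
    time: str

def render_as_text(invite: DinnerInvite) -> str:
    """One 'form': a plain, screen-reader-friendly sentence."""
    return f"{invite.host} invites you to dinner at {invite.venue} at {invite.time}."

def render_as_video_brief(invite: DinnerInvite) -> dict:
    """Another 'form': a brief that a video generator could turn into an animated short."""
    return {
        "scenes": [
            {"visual": "clock face", "data": invite.time},
            {"visual": "animated personalised map", "data": invite.venue},
            {"visual": "host greeting", "data": invite.host},
        ],
        # Crucially, the same structured meaning stays available for an audio-only rendering.
        "narration": render_as_text(invite),
    }

invite = DinnerInvite(host="Alex", venue="La Trattoria", time="19:30")
print(render_as_text(invite))                      # audio/text rendering for me
print(render_as_video_brief(invite)["narration"])  # the video brief still carries the meaning
```

If the structured object is the source of truth, the blind-access problem becomes “choose the right renderer”, rather than “reverse-engineer a finished video back into words”.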

I’ll leave you with one final thought. The basic text that ChatGPT generated to describe the image in this post is:

“The image is a whimsical illustration of a chaotic yet humorous scene in a restaurant. A blindfolded couple is waltzing between tables. The man, wearing a blue suit and red bow tie, has accidentally stepped on the woman's foot—she's wincing in pain while still dancing in her red polka-dot dress.

Around them, the restaurant shows signs of disruption:

  • A chair has been knocked over in the foreground.
  • One table has a broken wineglass and red wine spilled across the white linen tablecloth, with more wine pooling on the floor.
  • Diners nearby are visibly shocked. A woman in a green dress has her hand over her open mouth in disbelief.
  • A man in a brown suit stands behind another diner who looks stunned, staring at the unfolding scene.

The overall style is colorful, textured, and playful, with exaggerated expressions and soft, rounded forms that add charm to the chaotic setting”.

That’s over 700 characters – more than five times the commonly recommended maximum for alt-text descriptions.
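For anyone generating image descriptions programmatically, the arithmetic looks something like the sketch below. It assumes the widely cited guideline of roughly 125 characters for alt text (a convention, not a formal limit in any standard), and the helper function and the truncated excerpt are mine, for illustration only.

```python
# A rough check of an AI-generated description against an assumed alt-text budget.
# 125 characters is a widely cited guideline, not a hard limit defined by any standard.
ALT_TEXT_BUDGET = 125

description = (
    "The image is a whimsical illustration of a chaotic yet humorous scene in a "
    "restaurant. A blindfolded couple is waltzing between tables..."
)

def fits_alt_text(text: str, budget: int = ALT_TEXT_BUDGET) -> bool:
    """Return True only if the description fits within the assumed budget."""
    return len(text) <= budget

# Even this two-sentence excerpt of ChatGPT's description blows the budget.
print(len(description), fits_alt_text(description))
```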