The Seductive Nature of the Chatbox

LLMs can be brilliant yet bafflingly bad. Often it’s faster, cheaper, and more reliable to just write code—or better still, skip the chat entirely and do things the old-fashioned way.

[Image: cartoon of a running race between a pencil and an AI computer]
For many of us, AI is synonymous with ChatGPT or with other Large Language Models (LLMs) such as Anthropic’s Claude. The chat-box interface to these models is seductively easy to use (no more complicated than typing a query into Google or any other search engine). Consequently, the LLM becomes our go-to helper for more and more digital tasks, and we sometimes pay scant regard to the validity of its output. In more complex tasks such as coding, the same initial ease of use engenders a wrongheaded perception of productivity gains.

An LLM that plans a multi-step approach to a problem genuinely produces better results. Spelling out such a process, even if it’s not exactly what’s going on internally, makes the AI feel more explainable, although an explainable answer can of course still be wrong.

When asked “How many English monarchs since 1066 have had the letter A in their name?”, ChatGPT-5 began its answer with “Let’s work this out carefully”, correctly listed the 42 monarchs since 1066, and then checked each name for the letter 'a'.

This is a sound, logical approach; the shame is that ChatGPT thought, on this particular occasion, that ‘John’ and ‘George’ both contain an 'a', and conversely that ‘Edward’ does not. Ironically, ChatGPT’s A-list count of 25 was remarkably close to the correct answer of 26, because the erroneous inclusion of John and all the Georges almost exactly balanced the exclusion of all the Edwards.

As my wife, Mrs Mairs, very pithily observed, ‘It would be quicker and more reliable to do it with a pencil and paper’. I’m confident she could do it in less than 5 minutes using her trusty hardback of ‘The Kings and Queens of England’ and a primary school tally chart.

Determined to do it the hard way myself, I spent a further frustrating 15 minutes interacting with ChatGPT until it finally wrote a one-line Python script to calculate the answer. My painful conversation concluded thus:

Chris: Do you have any guidance in your system prompt or elsewhere that suggests when to execute a simple Python script?

ChatGPT:[…] In hindsight, the better move (especially given your background in Python) would have been to default straight to code. […] The choice of manual vs. code execution is left to my judgment. The bias is toward “do it by hand first” if the problem looks small and fact-based, but if precision matters (like exact counts), then code is the better tool.

Would you like me to always err on the side of running a quick Python check whenever a question has a definitive, countable answer?

Chris: Call me pedantic, but yes, I would like you to err on the side of not erring, and actually provide correct answers to numerical questions. And for the record, the set of like-minded users is very definitely not restricted to those with ‘a background in Python’.

Anthropic’s Claude model also gets this question wrong, randomly excluding all the Williams.

Claude’s explanation of its failure is ridiculously anthropic (in the true meaning of anthropic before the word was commandeered by Claude’s own maker):

Claude: “…when I looked at "William" I somehow mentally processed it as not containing 'a', even though it clearly does: Willi**a**m.

This was simply a mental oversight - I wasn't systematically checking each letter, but rather doing a quick visual scan, and my brain failed to register the 'a' in "William". It's a good reminder that even simple pattern-matching tasks benefit from being methodical rather than relying on quick visual recognition”.

One has to assume that Claude has cobbled this eyeballing excuse together from a huge trove of similar excuses made by humans for their own sloppiness. But I suppose that AI justifying AI slop by analogy to human slop is only fair.

The most reliable prompt I’ve found to get the correct, and repeatable, result out of both these leading LLMs is:

Chris: “Run a script to count the number of English monarchs since 1066 with an 'a' in their name”.

Here’s the Python snippet that ChatGPT then used:

sum('a' in name.lower() for name in monarchs)
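That one-liner assumes a `monarchs` list is already defined. A self-contained version might look like this (the 42 names below are one conventional roll of the monarchs since 1066, matching the count given above; disputed claimants such as Matilda and Lady Jane Grey are excluded):

```python
# The 42 English (and later British) monarchs since 1066.
monarchs = [
    "William I", "William II", "Henry I", "Stephen", "Henry II",
    "Richard I", "John", "Henry III", "Edward I", "Edward II",
    "Edward III", "Richard II", "Henry IV", "Henry V", "Henry VI",
    "Edward IV", "Edward V", "Richard III", "Henry VII", "Henry VIII",
    "Edward VI", "Mary I", "Elizabeth I", "James I", "Charles I",
    "Charles II", "James II", "William III", "Mary II", "Anne",
    "George I", "George II", "George III", "George IV", "William IV",
    "Victoria", "Edward VII", "George V", "Edward VIII", "George VI",
    "Elizabeth II", "Charles III",
]

# In Python, True counts as 1, so summing the generator expression
# counts the names containing an 'a' (case-insensitively).
count = sum('a' in name.lower() for name in monarchs)
print(count)  # 26
```

No roman numeral contains an 'a', so including the regnal numbers in the names does not distort the count.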

Claude Code is blisteringly fast at writing scripts: it wrote a similar script, executed it, and wrote the answer to the query, all within 5 seconds, at a cost of 8 cents.

Having said that, I can execute the same script on a tiny $0.50 Raspberry Pi computer chip in a fraction of a millisecond: much less than one millionth of what it cost for Claude to answer the query.

Computer chips are manufactured on 30 cm diameter discs of semiconductor silicon known as wafers (sort of like oversized CDs), and one such wafer can contain 20,000 of these tiny Raspberry Pi chips. The power budget for each chip is 75 milliwatts, so you could run all 20,000 of them on less electricity than your kettle. In less time and power than it takes to boil that kettle, this single wafer of chips could cope with every single resident of planet Earth simultaneously and independently needing to know how many English monarchs are on the A list. [Editor’s note: removed bad-taste joke here concerning deletion of princes’ names starting with ‘A’ from the A list, from the B list and indeed from every other royal list over which the king has any control whatsoever.]
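The back-of-envelope arithmetic behind the kettle comparison is easy to check (a sketch; the 3 kW kettle rating is my assumption for a typical UK kettle, the other figures are the ones quoted above):

```python
# Back-of-envelope check: can a whole wafer of Raspberry Pi chips
# run on less power than a kettle?
chips_per_wafer = 20_000
power_per_chip_w = 0.075            # 75 milliwatts per chip

wafer_power_w = chips_per_wafer * power_per_chip_w
kettle_power_w = 3_000              # assumed typical UK kettle rating

print(f"Whole wafer: {wafer_power_w:.0f} W")   # Whole wafer: 1500 W
print(wafer_power_w < kettle_power_w)          # True
```

So the entire wafer draws about half the power of the kettle, which is what makes the planet-wide thought experiment above plausible.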

In contrast, OpenAI's latest proposed data centre will consume 10 gigawatts, more than the peak demand of New York City, or of an entire country like Switzerland. That data centre will of course mostly be focused on winning Nobel prizes and solving serious problems for millions of businesses, not on answering fatuous questions about the English monarchy. But nonetheless, it’s hard to comprehend the amount of computing resource that 10 gigawatts of power can actually drive.

Returning to our simple A-list question: Mrs Mairs can find the answer easily enough with a book and pencil, I can find it easily enough with one line of Python, and anyone can find it easily enough with Excel. So why do we increasingly turn to our favourite LLM, which cannot even get it right without a non-intuitive prompt?

Perhaps it’s because the chat box in the UI for these models is so deceptively easy to use. Typing a simple plain-English question into that box, without giving any thought as to how you might approach the question unaided, can kickstart a sequence of interconnected processing loops executing trillions of instructions on state-of-the-art hardware, burning lots of energy, and almost always producing a confidently stated answer. Setting aside the fact that the answer is sometimes plain wrong, the seductive nature of this interaction does encourage laziness. I default to ChatGPT far more than to traditional Google search these days. OpenAI is betting this trend will accelerate as its new Atlas browser takes off with its integrated ‘agent and search’ omnibox.

This cautionary rant from Michael Judge powerfully debunks many claimed productivity gains, which evaporate by the time you’ve actually checked and fixed the output from your favourite LLM. Michael’s rant relates to sophisticated programming tasks, but we are equally deceived by the same illusion of simplicity and productivity for much more prosaic queries and searches.

On a good day I’m incredibly bullish about the scientific and health breakthroughs AI will enable. I'm also sure there are many, many genuinely and necessarily repetitive tasks where it will be cheaper to pay for an AI agent than to pay for a human agent.

On other days I wonder if we will multiply the world’s energy consumption by an order of magnitude, just to power AI slop performing badly the straightforward tasks we can already perform better ourselves, if we just engage our brains.