Your monthly AI bill is a crap measure of your productivity, so why encourage developers to spend $25,000 PER DAY?

Token usage leaderboards reward developers for burning the most fuel, not building the best engine. But behind the bragging rights, agentic loops are reshaping how software gets built.

Image of author standing before enormous bonfire of tokens

Using Claude Code, Claude Cowork or any other agent with what's called an agentic loop is a great way to appreciate why the evolution of AI is not about more well-structured blog posts, more beautiful images or more comprehensive answers to your maths homework. The agentic loop is the way forward to more autonomous AI, but the costs for extreme use can be utterly eye-watering if you're paying the bill yourself and you're not already in the private jet set.

AI model charging mechanics

Skip this paragraph if you are familiar with LLM tokens and API credits. Tokens are a measure of how much work a Large Language Model is doing for you. The exact definition of a token depends on what you are feeding into the model (text, image, video, DNA sequence, etc) and what you are getting out. Text tokens in a chat interface broadly correlate to words. When text is tokenised (split into small fragments that LLMs can deal with most effectively) we end up with about 1.3 times as many tokens as words because punctuation counts as a token and longer words become multiple tokens. The more words you type in and the more words you get out, the more you spend. Most consumers either have a free plan or a flat-rate monthly subscription so they don't typically see their token usage until they start hitting the token threshold for their particular plan.
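The 1.3-tokens-per-word rule of thumb above is easy to turn into a back-of-envelope estimator. This is a sketch, not a real tokeniser — providers each split text their own way, so treat the result as a rough estimate only:

```python
def estimate_tokens(text: str) -> int:
    """Estimate LLM token count as roughly 1.3x the whitespace-separated word count."""
    words = len(text.split())
    return round(words * 1.3)

prompt = "Summarise the attached quarterly report in three bullet points."
print(estimate_tokens(prompt))  # 9 words -> about 12 tokens
```

For anything where the count actually matters (billing, context limits), use the provider's own tokeniser rather than this approximation.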

If you hit the monthly limit on your subscription or your free plan and are in a hurry to do more you have to upgrade to the next level (higher monthly price equals more token credits) or just top up your account by buying more token credits — think of these credits as the fuel for the gas-guzzling AI beast. The most capable models typically burn the most credit (i.e. the cost per output token for the latest state-of-the-art model is a lot more than for the earlier or more lightweight models). This reflects the resources spent building the data centre, energy consumed training the model and energy consumed generating the response to each user prompt.

Token costs vary substantially between Anthropic, Google and OpenAI, and depend heavily on whether you are using a lightweight or premium model. Claude Opus 4.6/4.7 are generally regarded as the most capable (albeit most expensive) coding models, at $5 per million input tokens and $25 per million output tokens. All models share this asymmetry between input and output costs, though the cheapest Google model, Gemini Flash, is an order of magnitude cheaper than Claude Opus across the board. It's reasonably intuitive that output tokens cost more than input ones: it's easier to digest a blog post than to write it, and easier to interpret a rich image than to create it.

A million tokens (around 750,000 words) sounds like a lot, and it is. Shakespeare's Macbeth, for example, is only about 18,000 words, so asking Gemini Flash to translate Macbeth into Swahili would cost around 10 cents if you were paying per token. I haven't tried this, but by splitting the text into a few chunks (Acts or Scenes) I'm pretty sure you could even do it on a free plan.
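The arithmetic behind that 10-cent figure is worth making explicit. The Claude Opus prices below come from the figures above; the Gemini Flash prices are assumed placeholders chosen to be roughly an order of magnitude cheaper, purely for illustration:

```python
PRICES_PER_MILLION = {              # (input $, output $) per million tokens
    "claude-opus": (5.00, 25.00),   # prices quoted above
    "gemini-flash": (0.50, 2.50),   # assumed for illustration
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """API cost for a single exchange, given per-million-token prices."""
    in_price, out_price = PRICES_PER_MILLION[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Macbeth is ~18,000 words, so roughly 18,000 * 1.3 = 23,400 tokens in,
# and assume a similar-length Swahili translation out.
macbeth_tokens = round(18_000 * 1.3)
print(f"${cost_usd('gemini-flash', macbeth_tokens, macbeth_tokens):.2f}")  # $0.07
```

A handful of cents either way depending on the real prices, but comfortably in "around 10 cents" territory.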

For single question/answer exchanges, it's almost impossible to blow the usage limits on Claude's $200/month Max plan. $200 does sound expensive compared to a Netflix subscription, but $200/month is utterly minuscule compared to the token budgets being burned by hardcore AI-native software engineers. That is because they all run sophisticated combinations of agentic loops.

An agentic loop is in principle very simple:

  1. Send the initial user prompt to the LLM
  2. Wait for an instruction from the LLM to do something with one of a set of tools (read a file, search the web, execute a snippet of code, etc)
  3. Send the output from the tool back to the LLM
  4. Loop back to step 2 and continue until the LLM says it is done.
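The four steps above fit in a few lines of Python. This is a minimal sketch: `call_llm` and the reply format are hypothetical stand-ins, not a real SDK — a real agent would call a provider API and parse its structured tool-use responses.

```python
def read_file(path: str) -> str:
    """One example tool from step 2: read a local file."""
    with open(path) as f:
        return f.read()

TOOLS = {"read_file": read_file}   # the "set of tools", trimmed to one

def call_llm(messages):
    """Hypothetical LLM call -- a real agent would hit a provider API here.
    Assumed to return {'tool': name, 'args': {...}} or {'done': answer}."""
    raise NotImplementedError

def agentic_loop(user_prompt: str, llm=call_llm) -> str:
    messages = [{"role": "user", "content": user_prompt}]      # step 1
    while True:
        reply = llm(messages)                                  # step 2
        if "done" in reply:                                    # told to stop
            return reply["done"]
        result = TOOLS[reply["tool"]](**reply["args"])         # run the tool
        messages.append({"role": "tool", "content": result})   # step 3
        # ...and round we go again                             # step 4
```

Everything interesting happens inside `llm`; the loop itself is deliberately dumb plumbing.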

For a complex task with many steps this loop can run for hours or even days.

Anthropic's coding-specific agent Claude Code is just such an agentic loop and not to be confused with Claude itself, Anthropic's Large Language Model. Claude Code runs the loop and executes tools on your local computer, at the behest of Claude who thinks, hallucinates and gobbles tokens in the cloud.

Anyone who has written a trivial Python script or Visual Basic macro will recognise that the agentic loop is not AI. It is not itself a Large Language Model; it is just a harness for invoking one. This distinction is important because people often conflate the agent with the mysterious probabilistic token generator (the Large Language Model) it drives.

I have over-simplified. Sophisticated agents do have lots of twiddly bits around the edges, but they are still just regular software behaving predictably except when they've been badly written, badly specified or badly tested. If you ask your gardener to use a trowel to dig out the weeds, but instead they use a pair of shears to lop off the tops, this inappropriate use of the wrong tool cannot be put down to probability — it's just wilful laziness. In contrast the job of a Large Language Model is to predict the next token (word) based on all that has gone before. That this prediction works so extraordinarily well almost all the time is genuinely mind-boggling. But it is still guesswork, which occasionally delivers the wrong answer, unhelpfully called hallucination rather than bad luck.

The duration of this loop directly drives token usage and cost: each pass typically resends the entire growing conversation as input, so the more times the agent goes around the loop before the next user interaction, the more tokens it burns and the bigger the bill.
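That resending is why long loops get so expensive: cumulative input tokens grow roughly with the square of the number of iterations, not linearly. A toy model makes this concrete — the 2,000 tokens added per iteration is an assumed figure for illustration:

```python
def cumulative_input_tokens(iterations: int, tokens_per_step: int = 2_000) -> int:
    """Total input tokens sent across a loop that resends its full,
    growing transcript on every pass (assumed 2,000 new tokens/step)."""
    context = 0
    total_input = 0
    for _ in range(iterations):
        context += tokens_per_step   # transcript grows each pass...
        total_input += context       # ...and is resent in full
    return total_input

for n in (10, 100, 1_000):
    print(n, cumulative_input_tokens(n))
```

Ten iterations costs about 110,000 input tokens; a thousand iterations costs about a billion. Real agents mitigate this with prompt caching and context compaction, but the quadratic pressure is always there.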

And these agentic workflows get way, way more expensive when one agent spawns other independent subagents, each with their own loop repeatedly invoking the LLM and of course invisibly burning their own bonfires of tokens.

Managing hundreds of concurrent long-running agents, the top individual developers at the tech giants can burn upwards of a billion tokens per day ($25,000 per developer per day). Pause to let that sink in.
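The headline number is just the Opus output price from earlier applied at scale:

```python
# Sanity check: a billion tokens per day at the Claude Opus output
# rate quoted earlier ($25 per million tokens).
tokens_per_day = 1_000_000_000
price_per_million = 25.00
daily_cost = tokens_per_day / 1_000_000 * price_per_million
print(f"${daily_cost:,.0f} per developer per day")  # $25,000 per developer per day
```

In practice the blend of input tokens, cached tokens and cheaper models pulls the real figure around, but the order of magnitude stands.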

A few companies maintain official or unofficial leaderboards, with 'tokenmaxxing' bragging rights for the developer who burns the most tokens. Stories circulate of individuals exceeding $5M of token costs in a single year, and that's presumably without even generating gratuitous high-fidelity videos of themselves embracing Jesus, raising the dead or doing other things contrary to their terms of employment.

Token usage is an extremely imperfect measure of productivity in the same way that timesheets are a poor measure of output and lines of code are a poor measure of software functionality. The very existence of token usage leaderboards encouraging tokenmaxxing feels like rewarding people for working 100+ hours a week without caring what they produce.

But there is an underlying truth here: the most talented and productive AI-native developers are finding novel ways to orchestrate whole armies of AI agents, producing working code in an extraordinarily short period of time.

Having said that, right now the code they produce is often flabby and sometimes downright unmaintainable, which we'll return to in a future post.

Software development is particularly well suited to armies of agents co-operating to write, test and review large swathes of code. But the general principle of semi-independent agents running for hours or days to get shit done applies to many other desk jobs. Anything that involves a sequence of steps such as research, combining, distilling, writing, reading, comparing or deducing is a great candidate for an agentic loop. Even if there are points where the process has to pause or consult other departments, that can be built into the loop as just another tool call (call the tool upstairs to get his approval).

If you're still a sceptic, try using Claude Cowork or even do a bit of vibe coding for a glimpse of what is now possible. Don't worry about being hit with a nasty bill. If you do run out of tokens on your plan you will simply get throttled back by the model.