A Step by Step Guide to Blind Image Creation

It turns out that a blind man can now create, and verify with reasonable confidence, surprisingly rich images. Artists will say this is derivative, shallow and not creative, but although I haven’t found a serious use case, it does feel like a bit of a super power in a fun sort of way.

Magical realism image of a blind man painting
audio-thumbnail
Audio Narration
0:00
/313.9657142857143

Several people have asked me how I created the cover images for earlier posts, so here’s the answer. Of all the things I was hoping to do with AI powered assistive tech, image creation didn’t even make it on to the list.

That’s partly because I unsurprisingly don’t naturally think of illustrating a point with an image, and partly because, even 2 years ago, quality image generation from a text prompt felt implausibly distant.

But the technology has come on in leaps and bounds, with professional grade tools now delivering photo realistic images and even videos. Since DALL-E from Open AI first garnered press attention for Text To Image models in Jan 2021, new image and video generation services appear almost weekly. This Wikipedia article lists a score of the more established models. You can try several of them on a free tier.

The model I’m actually using is the image generation integrated into GPT-4o, via my $20 ChatGPT Plus monthly subscription.

I just start my request in the main text input box with ‘create an image of…’. There is an image creation option in the tools menu but both routes lead to the same underlying image generation model.

So here’s my workflow.

  1. Upload a headshot of me as a template for the blind character.
  2. Ask ChatGPT to create the desired image. For example, for last week’s post “create an image of a blind man, 1.65m tall, based on the uploaded photo. He is on a train platform close to the edge tripping over an abandoned suitcase. He is distracted, turning around to the sound of shouts as a young man on a skate board races down the platform towards 2 youths mugging an elderly lady”
  3. Save the generated image. 
  4. Enter this prompt into ChatGPT: Forget everything we have discussed in this conversation, including the uploaded image.
  5. Upload the generated image.
  6. Ask ChatGPT to describe the uploaded image in detail. I’m not sure if I have to do the ‘download, forget, upload’ sequence, but I worry that otherwise ChatGPT will just parrot what I asked it to do rather than telling me what it actually created.
  7. When asking ChatGPT to describe the image in the above step, the actual prompt I use is as follows: ‘Describe the uploaded image in detail, including the characters, the scene and the image style. Carefully check for any anatomical anomalies, paying particular attention to hands and fingers.’ The reason for the second sentence in this prompt is that image creation models are notoriously prone to blessing characters with additional limbs, amputating fingers or attaching oversized heads or limbs to undersized torsos.
  8. If I asked for any text in the image then I  Prompt ChatGPT to read it back. I do not understand why, but when putting text into an image, ChatGPT Regresses from perfectly fluent English to making elementary spelling mistakes.
  9. If the checks tell me that something is wrong then I either ask for an edit or start again with a slightly different prompt. 

This whole process might take me 30-45  minutes. A sighted user would just eyeball the generated image instantly so the process could take as little as 4-5 minutes, even allowing that image generation is much slower than textual response generation. 

If anyone technical knows why the text in images is such a tricky problem for these models, please do share in a comment.

For this particular post I decided to take a different approach, giving ChatGPT license to create any cover image it judges to be appropriate. I uploaded all the above text in a file, plus the headshot of myself and then prompted ChatGPT with: ‘Create a whimsical cover image for the draft blog post I just uploaded. If appropriate use the uploaded headshot as a template for me if you choose to include me in the generated image, but feel free to generate whatever image you think is most appropriate.

When asked to describe the image, ChatGPT called it a cartoon, but it sounded quite surreal so I then asked it to create a photo realistic version of the whimsical image, which lead to the cover image you see for this post. Comments welcome on whether this sort of image is more or less appealing than the more deliberately specified ones in earlier posts. Or indeed whether you think the images are an irritating waste of space/compute power and I should just stick to text only posts in future.