I tested 13 LLMs to auto-generate alt-text for 9,000 images missing descriptions on my website. Not surprisingly, cloud models (GPT-4, Claude 3.5) performed best but weren’t perfect. For local options, Llama variants and MiniCPM-V worked reliably but missed some details. Local models align with my values, but cloud models would serve visually impaired users better. Should I prioritize principles or pragmatic accessibility?
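For context, a minimal sketch of the kind of batch loop involved, assuming a local multimodal model served through Ollama's /api/generate endpoint (the model tag, prompt, and paths are placeholders, not the author's actual pipeline):

```python
import base64
import json
from pathlib import Path

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "llava"  # placeholder; any locally pulled multimodal model tag works
PROMPT = "Write one concise sentence of alt text describing this image."

def alt_text_for(image_path: Path) -> str:
    """Ask the locally served model for a one-sentence description."""
    payload = {
        "model": MODEL,
        "prompt": PROMPT,
        "images": [base64.b64encode(image_path.read_bytes()).decode()],
        "stream": False,
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"].strip()

if __name__ == "__main__":
    results = {}
    for path in sorted(Path("images").glob("*.jpg")):  # hypothetical folder
        results[str(path)] = alt_text_for(path)
    Path("alt_text.json").write_text(json.dumps(results, indent=2))
```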
Neat work! Maybe use Segment Anything (or something similar) and do better classification on the detected objects? Captioning is tricky, and given that it’s basically a preprocessing step, I wouldn’t sweat using commercial models.
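Roughly what I mean, assuming the segment-anything package and a downloaded SAM checkpoint (paths and the top-5 cutoff are illustrative); each crop can then go to whatever classifier or captioner you prefer:

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Load a SAM checkpoint (downloaded separately); "vit_b" is the smallest variant.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# SAM expects an RGB uint8 array; OpenCV loads BGR, so convert.
image = cv2.cvtColor(cv2.imread("header.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

# Keep the largest regions and crop them out for downstream classification.
crops = []
for m in sorted(masks, key=lambda m: m["area"], reverse=True)[:5]:
    x, y, w, h = (int(v) for v in m["bbox"])  # bbox is in XYWH format
    crops.append(image[y : y + h, x : x + w])
# Each crop can now be labeled (e.g. with CLIP) to name the detected object.
```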
Also, please watch the self-promotion. Maybe post some other stuff.
From the post prior to this one, we have a header image and its generated description.

Is this an accurate description of the image’s subject? Sure, but does it add any real value for the reader beyond some eye candy to settle on before the post, and satisfying the og:image requirement for better social-media metrics? I don’t know that this or similar images help visually impaired folks in any way when read aloud. Without commenting too heavily on the larger trend of blog posts shoehorning in an AI-generated or Creative Commons image where one isn’t really needed: can the LLM understand when an image isn’t actually ‘relevant’ and instead mark it as <img src="…" alt="">, making it a decorative image? From the post: “These captions let screen reader users know that the illustration is just eye candy and they’re not missing important information.”
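One way to attempt that, assuming the captioning model follows instructions reasonably well (the prompt wording and the DECORATIVE sentinel are made up for illustration, not from the post): ask it to flag purely decorative images so the pipeline can emit an empty alt attribute instead of a caption.

```python
# Hypothetical post-processing step: ask the captioner to flag decorative
# images with a sentinel word, then emit alt="" for those.
ALT_PROMPT = (
    "Describe this image in one sentence for use as alt text. "
    "If the image is purely decorative and carries no information, "
    "reply with exactly the word DECORATIVE."
)

def to_img_tag(src: str, model_reply: str) -> str:
    """Turn the model's reply into an <img> tag, using alt='' when decorative."""
    if model_reply.strip().upper() == "DECORATIVE":
        return f'<img src="{src}" alt="">'
    return f'<img src="{src}" alt="{model_reply.strip()}">'
```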
Putting a caption like that in the alt text isn’t following the W3C guidelines, which say to null out the alt attribute (alt="") for decorative images.

I am curious: how long does it take to scan 9k photos like this?
You may want to try Moondream; it’s great, lightweight, and can run locally or via a cloud API. It comes in 2B and 0.5B variants: https://moondream.ai/
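For reference, a quick local sketch via Hugging Face; the encode_image/answer_question helpers are what the moondream2 remote code has exposed, though the exact method names may differ between revisions, so treat this as an assumption rather than the current API.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code pulls in Moondream's own model class from the Hub.
model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("header.jpg")  # placeholder path
enc = model.encode_image(image)
print(model.answer_question(enc, "Describe this image as concise alt text.", tokenizer))
```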