1.

I tested 13 LLMs to auto-generate alt-text for 9,000 images missing descriptions on my website. Not surprisingly, cloud models (GPT-4, Claude 3.5) performed best but weren’t perfect. For local options, Llama variants and MiniCPM-V worked reliably but missed some details. Local models align with my values, but cloud models would serve visually impaired users better. Should I prioritize principles or pragmatic accessibility?
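
For anyone who wants to reproduce the local half of this, a minimal sketch using the ollama Python client with a vision-capable model; the model name, prompt, and folder layout are illustrative assumptions, not the author’s actual pipeline:

```python
# Minimal sketch: draft alt text with a locally served vision model.
# Assumes Ollama is running with a vision model pulled (e.g. minicpm-v);
# the model name, prompt, and paths are illustrative, not the author's setup.
from pathlib import Path

import ollama

PROMPT = "Write concise, one-sentence alt text for this image."

def draft_alt_text(image_path: Path, model: str = "minicpm-v") -> str:
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT,
                   "images": [str(image_path)]}],
    )
    return response["message"]["content"].strip()

for img in sorted(Path("images").glob("*.jpg")):
    print(f"{img.name}: {draft_alt_text(img)}")
```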

  1.  

    1.

      Neat work! Maybe use Segment Anything (or something similar) and do better classification on the detected objects (rough sketch below)? Captioning is tricky, and given that it’s basically a preprocessing step I wouldn’t sweat using commercial models.

      Also, please watch the self-promotion. Maybe post some other stuff.
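
      A rough sketch of that route, pairing the segment-anything package with a CLIP classifier from transformers; the checkpoint path, label set, and model choices are placeholders, not anything from the original experiment:

      ```python
      # Rough sketch: segment objects with SAM, then label each crop with CLIP.
      # Checkpoint path, label list, and model names are placeholders.
      import numpy as np
      import torch
      from PIL import Image
      from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
      from transformers import CLIPModel, CLIPProcessor

      sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
      mask_gen = SamAutomaticMaskGenerator(sam)

      clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
      proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
      labels = ["a person", "a laptop", "a dog", "a building"]  # placeholder set

      image = np.array(Image.open("photo.jpg").convert("RGB"))
      for mask in mask_gen.generate(image):
          x, y, w, h = (int(v) for v in mask["bbox"])  # XYWH bounding box
          crop = Image.fromarray(image[y:y + h, x:x + w])
          inputs = proc(text=labels, images=crop, return_tensors="pt", padding=True)
          with torch.no_grad():
              probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
          print(labels[int(probs.argmax())], round(float(probs.max()), 3))
      ```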

      1.

        The post prior to this one has a header image with the alt text:

        A person works on a laptop as a hologram of an AI agent hovers beside them.

        Is this an accurate description of the image’s subject? Sure, but does it add any value to the reader beyond some eye candy to settle on before the post, plus the og:image needed for better social media previews? I don’t see how this or similar images would help visually impaired folks in any way when read aloud. Without commenting too heavily on the larger trend of blog posts shoehorning in an AI-generated or Creative Commons image where one might not be needed: can the LLM recognize when an image isn’t really ‘relevant’ and instead mark it as <img src="…" alt="">, treating it as a decorative image? (One possible approach is sketched after the quoted guidance below.)

        Decorative images don’t add information to the content of a page. For example, the information provided by the image might already be given using adjacent text, or the image might be included to make the website more visually attractive.

        In these cases, a null (empty) alt text should be provided (alt="") so that they can be ignored by assistive technologies, such as screen readers. Text values for these types of images would add audible clutter to screen reader output or could distract users if the topic is different from that in adjacent text. Leaving out the alt attribute is also not an option because when it is not provided, some screen readers will announce the file name of the image instead.
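
        One way to act on that question would be to ask the model explicitly whether the image adds information before captioning it. A minimal sketch, assuming an Ollama-served vision model; the model name and prompts are my assumptions, not anything the original experiment did:

        ```python
        # Sketch: ask a local vision model whether an image is decorative given
        # the surrounding text, and emit alt="" when it is. Model name and
        # prompts are illustrative assumptions.
        import ollama

        MODEL = "minicpm-v"

        def ask(image_path: str, prompt: str) -> str:
            response = ollama.chat(
                model=MODEL,
                messages=[{"role": "user", "content": prompt,
                           "images": [image_path]}],
            )
            return response["message"]["content"].strip()

        def alt_attribute(image_path: str, surrounding_text: str) -> str:
            verdict = ask(
                image_path,
                "This image appears on a web page surrounded by the text below.\n"
                f"---\n{surrounding_text}\n---\n"
                "Does the image add information that the text does not already "
                "convey? Answer only YES or NO.",
            )
            if verdict.upper().startswith("NO"):
                return 'alt=""'  # decorative: null alt so screen readers skip it
            caption = ask(image_path, "Write concise, one-sentence alt text.")
            return f'alt="{caption}"'
        ```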

        1.

          These captions let screen reader users know that the illustration is just eye candy and they’re not missing important information.

          1.

            That doesn’t follow the W3C guidelines, which say to null out the alt attribute.

        2.

          I’m curious: how long does it take to scan 9k photos like this?

          1.

            You may want to try Moondream: it’s great, lightweight, and can run locally or via a cloud API, in 2B or 0.5B sizes. https://moondream.ai/
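
            For reference, a minimal caption call with the moondream pip client might look like this; the method names follow the project’s quickstart as I understand it, so treat them as assumptions and check the linked docs:

            ```python
            # Sketch using the moondream Python client; the weights filename and
            # method signatures follow the project's quickstart and may have changed.
            import moondream as md
            from PIL import Image

            model = md.vl(model="moondream-0_5b-int8.mf")  # downloaded local weights
            image = Image.open("header.jpg")
            result = model.caption(image, length="short")
            print(result["caption"])
            ```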