1. 5

    Most recently I’ve published a basic web app called Vizor for interacting with Google’s Vision API to extract text from images. I have a small backlog of public domain books I’d like to digitize with it, and give them the hypertext treatment on llll.ro.

    1. 3

      Cool project. Are you uploading the books to archive.org?

      By the way, I’ve found that the Microsoft Vision API is just a tad bit better at properly identifying text in images: https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/

      Don’t even bother with Amazon Rekognition; it’s horrible at OCRing text.

      1. 1

        Eventually, I’d like to. But I need to figure out a rig to take proper photographs — the kinds I’m producing right now are merely for OCR purposes.

        Thanks, I’ll check it out! I’m doing OCR on Romanian texts that use older orthography and diacritical marks, so Google’s offering was the only one giving remotely adequate results among the options I explored. It’d be great if I could squeeze some extra accuracy out of the process.

        1. 3

          You could always just upload the OCR text results and not worry about the photos. As for Microsoft Vision, I only tested with English, so not sure how it’ll perform with your use case. Good luck!

    1. 27

      For the past year I’ve been converting archive.org CD/disk rips into modern formats.

      I love that archive.org make these available. My goal is to make them accessible. Lots of art, music and animations are locked away in ancient formats that few can access today.

      I extract the original CD/disk and recursively extract and convert all sub files. Archives (ZIP/ZOO/SIT) I extract. Images (PCX/TGA/PICT/TIFF) to PNG. Music/Audio (MOD/MID/S3M/AIFF/AU) to MP3. Video (FLC/FLI/AVI/SMK) to MP4. Documents (DOC/WP5/WRI) to PDF. Fonts to OTF/TTF & PNG preview. A single CD ISO can balloon to over 100,000 files.
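
      To give a rough idea of the routing described above, here is a minimal sketch in Python. It is not how the converter actually works: the names `TARGETS`, `ARCHIVES`, `classify` and `plan` are hypothetical, the table only covers the formats listed in this comment, and matching on file extension is a simplifying assumption.

      ```python
      from pathlib import Path

      # Hypothetical routing table built only from the formats named above.
      TARGETS = {
          "png": {"pcx", "tga", "pict", "tiff"},
          "mp3": {"mod", "mid", "s3m", "aiff", "au"},
          "mp4": {"flc", "fli", "avi", "smk"},
          "pdf": {"doc", "wp5", "wri"},
      }
      ARCHIVES = {"zip", "zoo", "sit"}


      def classify(path):
          """Return 'extract', a target format, or None for unrecognized files."""
          # Assumption: the extension identifies the format (case-insensitive).
          ext = Path(path).suffix.lower().lstrip(".")
          if ext in ARCHIVES:
              return "extract"
          for target, sources in TARGETS.items():
              if ext in sources:
                  return target
          return None


      def plan(root):
          """Walk root recursively and yield (path, action) for recognized files."""
          for p in sorted(Path(root).rglob("*")):
              if p.is_file():
                  action = classify(p)
                  if action is not None:
                      yield p, action
      ```

      Anything classified as `extract` would be unpacked into its own directory and fed back through `plan`, which is where the recursion and the file count explosion come from.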

      The converter part currently supports 608 formats: https://github.com/Sembiance/dexvert/blob/master/SUPPORTED.md

      The website part is a work in progress. Likely another year of work before I put it live so folks can explore all this content. It has faceted search with a full text index including OCR’ed text from images.

      The website does not require JS/CSS/HTML5/HTTPS so it works well in text based and vintage browsers. This allows retro system users like Amiga/Win95/AtariST to directly access the website and download useful shareware/freeware.