1.

I’m looking for a way to archive some images and PDFs for a search-engine-like application I’m building. Concretely, I’d like some sort of hybrid HTTP cache / document store that ideally:

  1. caches URLs persistently and indefinitely
  2. supports some form of versioned updates, i.e., refresh the source document but keep the original around if it has changed

I’m prepared to roll my own, but it feels like this problem might have been solved in a reusable way before. Any pointers?
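
In case it makes the question clearer, this is roughly the interface I’d build if I did roll my own (Go, since that’s probably what I’d use; all names are made up, just a sketch):

```go
package archive

import (
	"context"
	"io"
	"time"
)

// Version identifies one stored snapshot of a URL.
type Version struct {
	ID        string    // opaque version identifier
	FetchedAt time.Time // when this snapshot was taken
	ETag      string    // validator from the origin, if any
}

// Store is the hybrid HTTP cache / document store described above:
// everything fetched is kept indefinitely, and refreshing a URL adds
// a new version instead of overwriting the old one.
type Store interface {
	// Fetch returns the latest cached copy of url, fetching and
	// persisting it first if it has never been seen.
	Fetch(ctx context.Context, url string) (io.ReadCloser, Version, error)

	// Refresh re-fetches url; if the document changed, a new version
	// is stored and the previous ones remain retrievable.
	Refresh(ctx context.Context, url string) (Version, error)

	// Versions lists all stored snapshots of url, newest first.
	Versions(ctx context.Context, url string) ([]Version, error)

	// Open retrieves one specific stored snapshot by version ID.
	Open(ctx context.Context, url, versionID string) (io.ReadCloser, error)
}
```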

  1.

    S3, or any of the S3-like services out there?

    1.

      That’s a good pointer, thanks! I’d ignored that part of the space due to not actually wanting to put things on the cloud, but the API does seem to be close to what I want. https://github.com/minio/minio in particular looks promising.
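
      To check my understanding of the fit: the versioned-update requirement seems to map onto S3 bucket versioning, roughly like this (aws-sdk-go pointed at a local minio, whose versioning support depends on how it’s deployed; endpoint, credentials, and bucket name are placeholders, untested sketch):

      ```go
      package main

      import (
          "bytes"
          "fmt"
          "log"

          "github.com/aws/aws-sdk-go/aws"
          "github.com/aws/aws-sdk-go/aws/credentials"
          "github.com/aws/aws-sdk-go/aws/session"
          "github.com/aws/aws-sdk-go/service/s3"
      )

      func main() {
          // Point the SDK at a local minio rather than AWS; endpoint and
          // credentials here are placeholders.
          sess := session.Must(session.NewSession(&aws.Config{
              Region:           aws.String("us-east-1"),
              Endpoint:         aws.String("http://localhost:9000"),
              S3ForcePathStyle: aws.Bool(true),
              Credentials:      credentials.NewStaticCredentials("minioadmin", "minioadmin", ""),
          }))
          svc := s3.New(sess)
          bucket := aws.String("archive") // assumes the bucket already exists

          // Enable versioning once; after that, re-PUTting the same key keeps
          // the old object around under its own version ID.
          if _, err := svc.PutBucketVersioning(&s3.PutBucketVersioningInput{
              Bucket: bucket,
              VersioningConfiguration: &s3.VersioningConfiguration{
                  Status: aws.String("Enabled"),
              },
          }); err != nil {
              log.Fatal(err)
          }

          // "Refresh" a document: just PUT it again under the same key.
          out, err := svc.PutObject(&s3.PutObjectInput{
              Bucket: bucket,
              Key:    aws.String("example.com/doc.pdf"),
              Body:   bytes.NewReader([]byte("fetched document bytes")),
          })
          if err != nil {
              log.Fatal(err)
          }
          fmt.Println("stored as version:", aws.StringValue(out.VersionId))

          // Earlier snapshots stay listable and retrievable by version ID.
          versions, err := svc.ListObjectVersions(&s3.ListObjectVersionsInput{Bucket: bucket})
          if err != nil {
              log.Fatal(err)
          }
          for _, v := range versions.Versions {
              fmt.Println(aws.StringValue(v.Key), aws.StringValue(v.VersionId))
          }
      }
      ```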

      EDIT: Sadly, minio seems to have ballooned in complexity; its go.mod alone is about 100 lines now. I guess that’s what you need to do if you want to build a company around it.

    2.

      S3 or similar would also be my default, but if you really don’t want to use the cloud, this might be a decent use case for Git Large File Storage (LFS) or git-annex.

      1.

        There is a file format for this, WARC [0], and plenty of software that works with it, like the Wayback Machine by the Internet Archive [1]. They also provide a service [2] that does it all for you. wget [3] will create WARC-formatted files for you [4]; then it’s just a matter of storing the WARC files somewhere. S3 and similar object stores are definitely one way to do it; a ZFS pool would be another.
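
        If your application wants to drive the crawls itself, shelling out to wget is only a few lines, e.g. something like this in Go (just a sketch; assumes a wget build with WARC support is on PATH, flags are from the wget manual):

        ```go
        package main

        import (
            "log"
            "os"
            "os/exec"
            "time"
        )

        func main() {
            // Crawl a single site into a timestamped WARC; wget writes the
            // result to <prefix>.warc.gz.
            prefix := "crawl-" + time.Now().Format("2006-01-02-150405")
            cmd := exec.Command("wget",
                "--mirror",
                "--page-requisites",
                "--warc-file="+prefix,
                "https://example.com/",
            )
            cmd.Stdout = os.Stdout
            cmd.Stderr = os.Stderr
            if err := cmd.Run(); err != nil {
                log.Fatal(err)
            }
            // The resulting <prefix>.warc.gz is what you would then push to
            // S3, a ZFS pool, or wherever.
        }
        ```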

        If you really just want to cache it, but not necessarily keep it around long-term, check out varnish [5].

        0: https://en.wikipedia.org/wiki/Web_ARChive

        1: https://www.archive.org/

        2: https://archive-it.org/

        3: https://www.gnu.org/software/wget/

        4: https://www.archiveteam.org/index.php?title=Wget

        5: https://varnish-cache.org/