1. 19

In the recent story about archive.is on 1.1.1.1 I asked that the cached link below the story title point back at the Internet Archive’s Wayback Machine. Responses were limited, and pushcx suggested starting a meta thread. Here’s the text of my proposal:

While we’re at it: can we please switch the cached link to archive.org?

  • they are a well funded non-profit
  • their archive is open and mirrored elsewhere

Meanwhile, archive.is

There is an old ticket on GitHub (#331), but that change was reverted without discussion a few months later.

pushcx: you seem to have implemented that on barnacl.es, can this please also be considered for lobste.rs again? I’m happy to submit the PR.

@cnst has brought up the counterargument that archive.is apparently is “faster, simpler, seems to require less JavaScript, has integration with all the other archives.”

I’d like to gather some feedback about that issue from a wider crowd of crustaceans here.

  2. 12

    I favor using Wayback because the whole point of an archive is to keep things around a long time. Wayback has done, is doing, and will continue to do that better. By the odds at least.

    1. 3

      If taken out of context, I 100% agree with your statement. One of the benefits of archive.org is that they don’t do database-driven non-semantic URL shortening, which means that even if they do go down, you could still use the links you’ve saved for any archived page with any other archival service, including archive.today (because each archive.org link already has a date and the original URL within the text of each link itself).
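
      To illustrate the point about self-describing links, here is a minimal sketch in Python that recovers the capture date and the original URL from a Wayback link. It assumes the common `https://web.archive.org/web/<14-digit timestamp>/<original URL>` form; the names `WaybackLink` and `parse_wayback_link` are made up for this example and are not part of any archive.org API.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class WaybackLink(NamedTuple):
    """Hypothetical container for the two facts embedded in a Wayback link."""
    captured_at: datetime
    original_url: str


# Assumption: links look like https://web.archive.org/web/YYYYMMDDhhmmss/<url>.
# Other variants (shorter timestamps, modifier suffixes) are not handled here.
_WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_link(link: str) -> Optional[WaybackLink]:
    """Return the capture time and original URL, or None if the link
    does not match the assumed Wayback form."""
    match = _WAYBACK_RE.match(link)
    if match is None:
        return None
    captured_at = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return WaybackLink(captured_at, match.group(2))
```

      With the date and original URL in hand, the same page could be looked up in any other archive. An opaque short link carries neither piece of information, so the function returns `None` for it.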

      However, what is the extent of our integration with archive.is? I think we only use it as a search engine for archived copies, using semantic URLs, which we could also change at any moment in the future, should they go down, without any ill effects or consequences. (And I doubt anyone would actually bother to save the results of their search, either (e.g., bookmark the link that could go down), as it should be trivial to repeat the search at any time.)

      Anyone really disliking archive.is could also take over the domain name locally and easily redirect deterministic URLs (from lobste.rs) to archive.org, and non-deterministic ones (outside of lobste.rs) to archive.today (their other domain). It wouldn’t be possible to do the same thing with archive.org domain without a more complicated proxy setup, because they don’t seem to have alternative domains. If there’s any interest for a sample nginx config to take over archive.is locally in this way with lobste.rs integration in mind, I’d be happy to make one shortly.
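
      The redirect rule described above could look roughly like this. It is a sketch in Python rather than an actual nginx config, and it rests on two assumptions: that the deterministic links embed the (possibly percent-encoded) original URL directly in the path, and that `https://web.archive.org/web/<url>` without a timestamp leads to a capture of that URL. The function name `redirect_target` is hypothetical.

```python
from urllib.parse import unquote


def redirect_target(request_path: str) -> str:
    """Decide where a request to a locally taken-over archive.is should go.

    request_path is the path of the incoming request, e.g.
    "/https%3A%2F%2Fexample.com%2Fpost" or "/AbCdE".
    """
    rest = request_path.lstrip("/")
    decoded = unquote(rest)
    # Deterministic link: the original URL is embedded in the path,
    # so it can be handed to archive.org (assumed URL form, see above).
    if decoded.startswith(("http://", "https://")):
        return "https://web.archive.org/web/" + decoded
    # Non-deterministic link (opaque short ID): only the archive itself
    # can resolve it, so pass it to the alternative domain unchanged.
    return "https://archive.today/" + rest
```

      For example, `redirect_target("/AbCdE")` passes the short ID through to archive.today, while a path containing a full URL is sent to archive.org.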


      So, I think that the only consideration for selecting whether to remain with archive.today or switch to archive.org should be which service works better as a short-term archival search engine.


      In my opinion, it’s archive.today:

      • archive.today integrates with all the other archival services, including archive.org, whereas archive.org doesn’t integrate with anyone else at all;

      • archive.today seems to require way less JavaScript than the new design of archive.org search, and is way less bloated than archive.org search, which stopped working for me at one point in one of the browsers;

      • archive.today seems to work much faster for me in general than archive.org;

      • archive.today provides very nice previews on the history of each page. Etc. Basically, just way more features.

      As per above, given our specific use-case, I don’t see any compelling reasons to switch away from archive.is and specifically to archive.org.

      1. 1

        I didn’t even know about archive.today. The proxy idea is pretty good, too. Thanks for the insightful comment!

      2. 1

        Thanks! That’s exactly the conclusion I wanted to arrive at, but forgot to write down :|

      3. 4

        How about a dropdown menu (like the downvote dropdown) with both + e.g. Google cache and others?

        1. 3

          Personal opinion, regarding the counterarguments: I don’t see how the Internet Archive is ‘slower’ or ‘more complicated’. It’s true that it requires more JavaScript, but on the other hand it also saves and serves the website with all its features, instead of a glorified screenshot that breaks any interactivity (archive.is seems to even break links on the mirrored page).

          1. 4

            Bit offtopic, but when did the Internet Archive switch to this better experience, as you claim? For all my old stuff (>15 years, sometimes >10) all I ever get is a really slow landing page with assets missing. Thus I’m really not using the Internet Archive a lot.

            1. 2

              I use both, TBH. Some pages work better in one, and some in the other. archive.is does archive the whole page, not just a screenshot; some pages just don’t have all the resources archived properly.

              BTW, I just tried both from lynx, and archive.org is 100% broken. Nothing about the website you’re looking for at all. Archive.is just works. Archive.org is also much slower than archive.is even if you do have JavaScript, and doesn’t even have thumbnail screenshots to see which version you’re looking for if there are multiple versions of the same page.

              Since lobste.rs itself works fine in lynx, if we move to archive.org, it’ll basically break the cached link for anyone without JavaScript. (Why on earth a non-profit has to use JavaScript to present links to archived webpages is beyond me.)

            2. 3

              I also ideologically prefer using the wayback machine, but I’ll just point out that it was originally changed to archive.is for technical reasons according to the git commit that changed it: https://github.com/lobsters/lobsters/commit/b3ed12e09166c5577dba71248a65351cdd9dba8b

              1. 2

                Not to mention that archive.org’s search for any given date is atrocious. It’s so slow and JS-heavy, and doesn’t even have any previews. archive.today has very nice previews, so, you know what you’d be getting. archive.today is much faster, too.

              2. 3

                Does the Wayback Machine still have the problem of retroactively applying robots.txt? That’s a misfeature imho.

                1. 1

                  I believe they stopped. BoingBoing puts it in clearer words: “To be clear, they’re ignoring robots.txt even if you explicitly identify and disallow the Internet Archive.”

                  Edit: when you submit an archival request (at least through the website), they use your user agent now.

                  1. 2

                    Ah, cool. For a while there was the thing where, if a site updated its robots.txt, the IA would zorch the old crawls iirc.

                    1. 1

                      I think archive.org also leaks your IP address in one of the headers. I’m not sure it’s even explained anywhere, BTW. It sure surprised me when I tried doing a tcpdump on them once.

                  2. 2

                    Pull request opened here. Thanks for all your input!

                    1. 2

                      Wait, so, you just go ahead and ignore all the input? What’s the purpose of this pull request when there’s hardly any informed consensus against archive.is and in favour of archive.org here?