1. 32

Over fifteen years ago I coined my own rule #0 of the internet: you have to constantly back it up!

As time goes by, I feel it’s even more relevant than before. The internet is not a truly safe place for any kind of content. What you read a few years ago may not be available or searchable today. (That’s also why bookmarks aren’t good enough in the long term.) Content vanishes for various reasons: some are possibly justified, many are not, and a bunch of them come down to content creators missing, breaking, or not caring about things when moving between the platforms, blogs, or websites they publish on. But this is not a story about vanishing stuff. It’s a story about preserving stuff from the internet on your own, so you can read, listen to, or watch it off-line even if it’s no longer on the internet.

I’m not talking about backing up the whole internet, which is not feasible for any mere mortal (nor, I believe, for the great majority of companies with beefy equipment and network links), but only about parts of it, usually very specific parts you are personally interested in for some reason (or think you may be interested in later).

Paradoxically, I think I was often better at backing up stuff way back then. In the dial-up modem days, software for mirroring whole sites (Teleport Pro was one example; there were others, but I cannot remember them now) was often the most effective way to grab useful content in as little time as possible. Usually you grabbed a bit more than you really wanted, but you could browse it freely later, without the clock ticking and the bills growing.

Nowadays I rarely back up text content, which I sometimes regret when I can no longer find what I was reading earlier (to refresh my memory). And when I do back up text content, I usually go with on-line services like archive.is or the Wayback Machine, and they aren’t guaranteed to last forever either, so I’m not sure it even counts as a backup.

Most often I locally back up stuff that is apparently safe, or at least looks so, as there is no obvious reason for it to vanish any time soon. But you never know. The stuff in question is recordings and slides from conference presentations. I don’t always have good internet on my mobile phone, so I load it with some of the downloaded talks (of which I have many more than I’ll ever watch), so I can kill time while commuting or whatever, turning killing time into something more informative.

I already gave an example in the past here: downloading videos from slcon 2016.

So what do I back up? Some YT channels, for instance.

I back them up with youtube-dl using the following settings:

youtube-dl \
    -f "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best" --restrict-filenames \
    -w -o "%(upload_date)s-%(id)s-%(uploader)s-%(title)s.%(ext)s" \
    --download-archive DOWNLOADED

and keep videos from different channels in their own directories. I deliberately choose MP4 so that hardware acceleration can be used on my mobile phone and the battery drains less.

What do you back up?


  2. 5

    I’m considering paying Pinboard for their web archiving feature, but so far it’s not been a huge pain point.

    1. 6

      I use Pinboard’s archiving, just for articles I’ve read and other things where I’d only be mildly annoyed if I lost them; it’s a bit too unreliable for anything else. The archiving time is sporadic: some things get archived in a couple of hours, others can take weeks, and many of my bookmarks say they’re archived but trying to open the archived page just causes an error.

      I still use it because it’s the only one I’ve found that will archive PDFs and direct links to images. Well, that, and because I paid 5 years in advance.

      1. 1

        Thanks for the review. It’s sad they don’t do the archiving at the moment of bookmarking. That’s what I feel is the best approach, but maybe they have so many users that reaching the front of the queue takes a week or so?

        Considering you don’t think that highly of Pinboard, I’m wondering why you bought a 5-year service from the beginning.

        1. 2

          I already had a standard Pinboard account, grandfathered in from when it was a one-off fee, and I had been happy enough with that when I upgraded to an archiving account. My thought process was that I’d pay in advance, have everything archived, and not have to worry about it again for 5 years. I didn’t consider that it would turn out to be less reliable than I’d like.

      2. 2

        I pay for it and use it – my only regret is activating it so late, after having added bookmarks for years – that meant many many bookmarks had already vanished. (Thankfully Pinboard lists all such errors and the specific HTTP code that caused it)

        1. 1

          I like that they provide all error and HTTP codes. Are there logs too, so you can actually tell when the page stopped being reachable?

          1. 2

            No, just the error and an option to manually trigger a retry.

            It’s added as a machine tag like code:403

        2. 2

          I joined Pinboard almost exactly 7 years ago and it has already saved my butt a bunch of times. According to my profile page, about 5% of my bookmarks are dead links at this point.

          1. 1

            It has to be reassuring. Well, they’re not only providing fun statistics, they’re also proving their value to you. I hadn’t heard about Pinboard until today. If there were a local client for syncing the archived content locally, I could consider buying the service and using it, but first I would need to restore my habit of bookmarking, which I somehow lost many years ago.

          2. 1

            Interesting. I guess some bookmark-like service on top of archive.is / web.archive.org could be created. Or maybe such a thing already exists for free.

          3. 5

            Many of us back up some git repositories, or at least their main branches (like master). I love this feature of any decentralized VCS.

            In my home directory I have a git directory laid out by the hosts and paths of the repos I clone from, which I think is quite a good organization. The first level therefore looks like:

            • bearssl.org
            • bitbucket.org
            • gist.github.com
            • git.alpinelinux.org
            • git.buildroot.net
            • git.busybox.net
            • git.denx.de
            • git.fedorahosted.org
            • git.freescale.com
            • git.gnome.org
            • git.kernel.org
            • git.linaro.org
            • git.linuxfoundation.org
            • git.midipix.org
            • git.musl-libc.org
            • git.skarnet.org
            • git.sr.ht
            • git.suckless.org
            • git.yoctoproject.org
            • github.com
            • gitlab.com
            • sourceware.org

            in my case. It takes almost 10 GB.
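            A minimal sketch of that layout in Python: it maps a clone URL to a local path of the form git/<host>/<path>. The function name local_repo_path, the default root "git", and the stripping of a trailing ".git" are assumptions made for illustration, not something the comment above describes.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_repo_path(url, root="git"):
    """Map a clone URL to root/<host>/<path>, dropping a trailing '.git'."""
    if "://" not in url and ":" in url:
        # scp-like syntax, e.g. git@host:group/project.git
        host_part, path = url.split(":", 1)
        host = host_part.rsplit("@", 1)[-1]
    else:
        parsed = urlparse(url)
        host = parsed.hostname or ""
        path = parsed.path
    path = path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return PurePosixPath(root) / host / path
```

            For example, local_repo_path("git://git.busybox.net/busybox") gives git/git.busybox.net/busybox, which lands under the git.busybox.net entry of the list above.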

            1. 3

              +1 to that organization of repos. That is the same scheme that go get and ghq use.

            2. 5

              I tend towards archiving, sorting and cataloguing all my data. So I have full site rips of several science fiction blogs as well as textfiles.com, Bruce Schneier’s blog, and quite a few others. Along with:

              • UC Berkeley’s online video lectures
              • All of C3
              • All of Blackhat
              • All of Defcon
              • Several TB of full TV series
              • Several thousand movies
              • Twenty thousand plus books
              • Full archives of byte, mondo, 2600 magazine, etc.

              And much, much more.

              1. 2

                Admirable dedication.

                I have to add that I hate it when there are online video courses I bought that I cannot easily download, like from Thinkific. I know I can watch them anytime I want, but what if the site goes down, or the creator closes the account with all the courses they were selling? I don’t like it (just like I don’t like the subscription model in software, because if I buy something, I want to have it accessible perpetually and obviously offline too).

                I know it’s to prevent piracy, but the typical thing with anti-piracy protections is that they make users’ lives harder, while pirates will somehow grab the content anyway if they really want to.

              2. 5

                I am trying to minimize manually checking websites for updates, so I just download everything and look at the list in a text editor. Also, I try to generally increase the fraction of things online that I read by downloading, converting to text using document.documentElement.innerText (with some minor extra scripts to put hyperlink targets inside the text), and opening the result in an editor. Of course I don’t bother to delete either HTML or text afterwards (and I — well, my scripts — do record source URLs).
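                A rough offline approximation of that kind of text dump, sketched in Python. To be clear about the swap: this is not the browser’s innerText, just the standard-library html.parser, and the names TextWithLinks and html_to_text, the " <url>" notation for link targets, and the choice of tags that start a new line are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextWithLinks(HTMLParser):
    """Collect visible text, appending each link's target after its text."""

    SKIPPED = {"script", "style"}
    LINE_BREAKING = {"p", "br", "div", "li"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.href_stack = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "a":
            self.href_stack.append(dict(attrs).get("href"))
        elif tag in self.LINE_BREAKING:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a" and self.href_stack:
            href = self.href_stack.pop()
            if href:
                # Put the hyperlink target inside the text, right after the link.
                self.parts.append(f" <{href}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with link targets kept inline."""
    parser = TextWithLinks()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

                Unlike a real browser, this ignores CSS and anything generated by scripts, so it only approximates what innerText would return for a rendered page.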

                1. 2

                  Why did you decide to browse this way?

                  1. 5

                    Well, there are multiple things.

                    I consider most web design actively harmful, as in: a text dump has minor inconveniences, but the site as designed is usually even less readable. Comment threads are sometimes an exception, and in the case of comment threads Elinks is usually better than innerText (but it has other drawbacks; maybe I should find a way to combine the best of both worlds).

                    I want to have tools that gradually reduce the attack surface of web browsing. The grab-then-read workflow (and once a Firefox instance exits, all processes of the corresponding user are killed) will hopefully let me gradually increase the sandboxing.

                    This workflow means that if I save something, I actually see what I have saved.

                    Most of the sites see almost-normal Firefox visits; and I do have an option to apply the grabbing script to something opened in an interactive Firefox instance (which is still UID-isolated, network-namespaced etc.), which might be in a state that is hard to obtain automatically (for example, some subset of threads is loaded to a greater depth).

                    1. 2

                      I’m working towards something similar, except as much as possible I want to send the resulting data either to my printer (or, I might get an e-reader for christmas?), as a batch job every morning. I was planning on using a dockerized chrome that I found somewhere. How are you automating Firefox to do this? Selenium? The print-to-pdf seems to be missing from the Selenium API, so I might have to use another tool to get my pdfs.

                      1. 4

                        No, I cut out the middle man. I just use Marionette and the official Marionette Python client from Mozilla. Which is used to execute Javascript code sometimes generated by bash scripts, but oh well. I also use networking namespaces to allow each instance to have port 2828 for Marionette.

                        Marionette allows execution of Javascript code in the context of Firefox UI. For example, (the code is lifted from Firefox tests, which are the main use of Marionette) Components.classes["@mozilla.org/gfx/printsettings-service;1"].getService(Components.interfaces.nsIPrintSettingsService) seems to evaluate to an instance of nsIPrintSettingsService. Hopefully some browsing through XUL reference could give you a solution for printing in the current Firefox release; no guarantees when something will change…

                        Another option is to run Firefox in its own (virtual) X session, run window.print() then find the print dialog and send it the needed input events.

                        1. 1

                          Are your scripts available somewhere? Does a write-up of your method exist? I’d be a huge fan of using that.

                          1. 2

                            A separate problem is that I need to create/delete a ton of users, and that requires root access, and my current permission-check code for that is a part of a Lisp project where I use sinit as PID 1 and most of the system management stuff is performed inside an SBCL process.

                            I hope to clean up and write up that Lisp part at some point…

                            Are you interested enough to participate in cleanup of the part relevant to your interests (and probably in implementation of an alternative UID-spawning backend, as you are interested only in the Firefox part)?

                            1. 1

                              I’m interested, sure, but I can’t say in all honesty that I’d have enough time to inject significant effort in the project. Is the code already on a public repository somewhere, or is that in the future too? I’d rather not promise anything, but I really would like an opportunity to touch some Lisp code.

                              1. 1

                                Well, there are too many assumptions to just put it in the public repository and hope anyone could reproduce it (some parts assume my Lisp daemon, some parts assume Nix — the package manager — is available, various parts use my scripts initially written for other reasons, etc.) Have I mentioned I feel significantly less comfortable when more than one assumption is broken on something I use as a laptop/desktop?

                                I could set up a repository for that, put it there layer by layer and ask you to check if simple tests work in your environment (for each layer). Then at some point the simple test will be «please check if it correctly downloads 20 entries from Schneier’s blog starting with {URL}». I am not asking you to write much code for that, but I need some feedback (and some positive reinforcement) to undertake it.

                                If you are willing to skim and run as root a trimmed-down version of my Common Lisp system-management daemon (you don’t run as root code provided by conspicuously pseudonymous strangers on the Web without skimming the code first, right?), I would just need to separate the relevant parts of my setup without writing much new code.

                                In any case I plan to eventually publish some rewritten version of all that; hopefully in February 2018, as a write-up to submit to European Lisp Symposium (this would hopefully be about Lisp-for-system-policy, and controlling Firefox would be one of the features).

                                1. 1

                                  I think you underestimate the value of reading code even without the ability to run it.

                                  But deciding to publish it at any date is generous of you, so don’t read this as me pressuring you to up your schedule :)

                                  1. 2

                                    OK, I tried to look what can be pushed as-is. But even cleaning up the first step (network namespace wrapper script, with passthru done via socat) turns out to be already not completely trivial… It still leaves socat behind from time to time (doesn’t matter as much for my specific case where lack-of-persistency incentivizes me to run a reaper anyway, but obviously should be cleaned up, and I failed to do it cheaply)


                        2. 1

                          I understand where you are coming from. Thanks for sharing your approach, which I guess is most likely unpopular.

                    2. 4

                      I’ve been relying on pinboard, but eventually want to use something with higher-fidelity archives. Googling revealed this ‘awesome list’: https://github.com/iipc/awesome-web-archiving

                      Other things I’ve archived in one-off scenarios:

                      • Myspace-era songs where the artist is no longer active and isn’t available for sale.
                      • Talks I liked usually hosted outside of youtube. I’ve noticed some of them disappear over the years, or just want an easier way to stream/sync them (e.g. Engelbart’s Unfinished Revolution: http://purl.stanford.edu/gd223nv1866 )
                      • Academic papers if they’re personally interesting.
                      • Old out-of-print books. I’m more worried about this stuff disappearing than youtube videos.

                      My problem isn’t so much the act of archiving, it’s figuring out how to organize it all to consume it or aggregate it for searching later - I’d like to at least sit down for a few days to comb through all my pinboard tags and give it some better structure (which leads to another thing - I kinda wish pinboard supported tag autocomplete).

                      1. 1

                        Thanks for sharing the awesome list and your backup achievements.

                        I perfectly understand your worry about old out-of-print books.

                        I also agree on how hard good organizing is, i.e. organizing that is useful for further processing, like the consuming or indexing you mentioned.

                        One of my main problems with the videos I back up is that I have no solution for properly tracking what I have already watched. I watch stuff on various devices (mobile phone, laptop, etc.), and because a full channel backup can take a lot of space, I have to move most of it to a disk that isn’t on-line all the time, because my home server, from which I watch stuff, has limited storage.

                        1. 1

                          Have you used plex? It sorta tries to track watching, at least to allow you to continue where you left off, and the mobile apps let you sync to watch offline. XBMC/kodi may also do this.

                          1. 1

                            No, I haven’t used plex.

                      2. 3

                        The eBooks and CompSci papers I’ve saved are close to 50GB per my properties tab. I’d like to have copies of certain web sites because they’re disappearing offline even in Wayback. It’s disturbing. I’m too overloaded with stuff to do to email all the admins I see setting up methods to keep them up. So, saving local copies might be the only thing to do.

                        1. 2

                          You could possibly have a script that checks when they 404 from the source site and makes them publicly available on your own hosting with a message about providing it under fair use and it isn’t available anywhere else, email me to have it taken down…
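                          The checking half of such a script could look roughly like the Python sketch below. It only reports which source URLs now answer 404; publishing the local copy is left out. The names http_status and gone_from_source, the use of HEAD requests, and the 10-second timeout are assumptions for illustration, and some servers reject HEAD, so a real script might need to fall back to GET.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10.0):
    """Return the HTTP status for url, or None if the host is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx answers arrive as exceptions; the code is what we want.
        return error.code
    except (urllib.error.URLError, OSError):
        return None


def gone_from_source(urls, status_of=http_status, gone_codes=(404,)):
    """Return the URLs whose source site now answers with a 'gone' code."""
    return [url for url in urls if status_of(url) in gone_codes]
```

                          An unreachable host (status None) is deliberately not counted as gone here, since a temporary outage is not the same as the page having been removed. The status_of parameter exists so the logic can be tried without touching the network.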

                          1. 1

                            I never thought about mailing admins of sites, but usually you have no direct contact to them.

                            Asking them for some archive is possibly the easiest way to back things up. I guess it could work for some smaller services, if admins were open for cooperation in that regard, obviously.

                            Thanks for this simple idea. Sometimes obvious solutions are overlooked.

                            1. 1

                              Usually the author’s email is on their academic page. The old school sites often have a webmaster email on the bottom. Some work.

                          2. 2

                            I am also wondering if there would be some value in creating a site where people could state what they back up, so in case some content goes down, you would know where to look for help and file a request for a reshare.

                            1. 2

                              I’ve been thinking of a distributed YouTube archive… basically an app that youtube-dl’s videos and exposes them on IPFS or dat (or good old BitTorrent? These seem more convenient.) And some kind of metadata index that would let you find the hashes of the videos.

                              1. 1

                                The potential problem with this approach is that people can download videos in different ways (and for good reasons). Some may get them using youtube-dl with options like the ones I’ve shown in my post (to be more mobile-friendly), but others may go with the default best, so they’ll get different results (e.g. VP9 instead of H264). Even if the options are the same, when audio and video are combined the results can vary slightly because of different ffmpeg versions and so on. Mere BitTorrent isn’t good enough, because file hashes will differ too easily. There would be a need for some container- and codec-aware P2P, where checksumming is done at the video data level rather than the whole-file level, since files can differ even if the actual video or audio bitstream is the same. But as I wrote, even this could only sort of work for videos downloaded with the same settings, because if the bitstreams differ, you can’t derive one from the other.

                                Thanks for mentioning the dat protocol. I heard about IPFS, but somehow never about dat (or I simply forgot).

                              2. 1

                                This sounds like a torrent tracker, or a DHT, or something.

                                1. 1

                                  It could be kind of that, I guess (please check the other comment here and my reply), but it’s not what I had in mind.

                                  I was thinking more about a metadata-sharing service than a data-sharing service. What I mean is that the service itself wouldn’t provide any way of sending files, videos, etc. It would only let people advertise what they are backing up from the internet (possibly also how, and how much of it they presently have), obviously in some very organized way (e.g. YouTube channel, YouTube playlist, FTP mirror, etc.), so others could search for the same thing if in need. This advertising should be repeated regularly, so some kind of tracking app could be needed, possibly with many plugins (a youtube-dl download archive reader, etc.).

                                  If what you search for is already in the database (i.e. some users advertise that they’re backing it up), you would be able to send a message to those users to figure out how you could obtain the content from them. I simply wouldn’t concentrate on providing concrete sharing solutions, as I think that wouldn’t be the point here.
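                                  The "download archive reader" plugin mentioned above could start out as small as the Python sketch below. It assumes the file written by youtube-dl’s --download-archive option holds one "<extractor> <video id>" pair per line, which matches the files I have seen but should be checked against your own; the function names are made up for illustration.

```python
from collections import Counter


def read_download_archive(text):
    """Parse download-archive contents into (extractor, video_id) pairs."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank or malformed lines instead of guessing at their meaning.
        if len(fields) == 2:
            entries.append((fields[0], fields[1]))
    return entries


def summarize(entries):
    """Count archived items per extractor."""
    return dict(Counter(extractor for extractor, _ in entries))
```

                                  Note that the archive file records only extractor and ID, not the channel or playlist an item came from, so a tracking app would have to get that from somewhere else, for example from the per-channel directory the archive file sits in.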

                              3. 2

                                Whenever I come across a PDF I always save a copy before opening it, since they’re very cheap to store and often useful to refer to. I try to add them to a large Bibtex file too, but that takes more effort so I’ve only gotten through a fraction of them.

                                I have a directory of “talks” for lectures, presentations, etc. which I’ve found interesting enough to keep (mostly saved from YouTube). I only move stuff in there after watching, and deciding whether or not I might ever watch it again. I have a subdirectory for TED talks, since (a) I’ve saved loads and (b) their short/shallow nature is rather different than the long/deep nature of a lecture.

                                I don’t trust sites like Github with hosting code; they’re useful as mirrors, but I keep clones of all the software I’ve ever written, along with all the projects I’ve ever bothered checking out or cloning (outside of build scripts).

                                1. 1

                                  Yeah, I also save most of my PDFs. When I don’t save one, it’s mostly by accident, as I can skim it quickly in the browser and forget to hit the download button. Sadly, I don’t organize them immediately after downloading, so they pile up to the point where it becomes a pain and I start organizing them, but I usually fail to finish.

                                  Some conferences used to provide content on their FTP servers, or on HTTP servers with a file index, and in such cases lftp is invaluable, because it works not only with FTP, obviously, but also with a file index over HTTP, which is not a widely known feature. So yes, you can use it to mirror a file server served over HTTP. I find wget with its recursive features a bit more clunky here. With lftp I can, for instance, estimate the needed free space by invoking the du -hs command, and it will visit each subdirectory to do the calculation. Sometimes a non-standard file index is used, so it may not always work, but it’s still a great feature.

                                  GitHub is not better or worse than most other code hosting solutions (if we’re talking about sole repository hosting that you access using your favorite tools remotely). But as long as a decentralized VCS is used under the hood, you can easily fully clone it (git clone --mirror and similar solutions), which is a great thing. But I very rarely mirror repositories, and the main branch is usually good enough.

                                2. 2

                                  I’ve been running a backup of my shaarli bookmarks but neglecting it over the last months. Sadly a few of the bookmarks were already offline or require flash player or other silly stuff.

                                  I was working on a project to replace the weird python tool I was using for this since it didn’t screenshot the entire page, only a 1920x1080 section of it (or some other screen setting). I might redo it considering firefox now offers the option to screenshot a page at full height.

                                  I also archive a lot of stuff on my nextcloud folder, I have some arxiv on there. The biggest part might be my image collection of memes and fandom art material (some fanfiction and other artwork sites are also backed up), which is totalling about 47.5GB of data.

                                  1. 1

                                    I haven’t heard about Shaarli before. Thanks for mentioning!

                                    Ideally any on-line bookmarking service should archive the current version of the bookmarked page along with the URL. Local bookmarking apps should provide a similar dump locally, but if the user is okay with syncing their data to the service’s server (*), then a snapshot should be requested on-line too, e.g. using the earlier mentioned archive.is and/or web.archive.org, so you would have both an off-line and an on-line backup.

                                    (*) it never should be enforced, I am fed up with mobile apps wanting to keep all my data in the cloud - it may be fine for some stuff, but not necessarily for everything

                                    1. 1

                                      Shaarli has a plugin to open an archive.org page but it’s sadly not automatic.

                                      Personally I don’t think bothering archive.is and archive.org about this would be misplaced, especially if you backup a lot (I have about 2GB worth of site data atm, it’s probably double that by now), I did have the entire thing online somewhere though…

                                  2. 2

                                    Most of the code I write goes on GitHub, and then is usually cloned to my home desktop and my laptop. I’ve been meaning to upload everything to my Bitbucket account, but haven’t yet.

                                    I keep my photos on an external hard drive, which I backup to another external drive every so often (which I should probably do sometime soon). The ‘good ones’ get uploaded to SmugMug and often 500px.

                                    I personally don’t see the point in hoarding content from the web. I use Pinboard to bookmark interesting content when I find it, and I’ll download PDFs for offline viewing, but that’s generally as far as I go.

                                    1. 1

                                      If you want to have multiple git repository mirrors, just in case, then I would also consider hosts that are known to be reliable but are not necessarily as well known, or don’t provide as many of the UX features people know from GitHub or Bitbucket.

                                      A nice point about them is that their framework is open source, so you can host them on your own, if you want.

                                      Thanks for mentioning SmugMug and reminding me about 500px - I totally forgot about it.

                                      Another Pinboard user. I wonder how I haven’t heard about it so far if it’s so popular?

                                    2. 2

                                      My archives show long stretches of using Firefox’s Scrapbook Autosave, which would save every page I visited. But there are also multiple interruptions caused by the extension breaking, or by (the last time) Firefox breaking it.

                                      1. 1

                                        Thanks for mentioning this Firefox add-on.

                                      2. 2

                                        I still use jwz’s venerable youtubedown.pl script for archiving YouTube, even though I have youtube-dl installed for mpv. Still works with nary an update.

                                        If a site is small or eclectic enough I’ll spider it with wget, but there’s been a few times where I’ve had to spend a few hours finding a single archive.org link to save. It’s on my to-do list to write automation for going through my pinboard XML to find broken/moved links, and to find forum posts: ltehacks.com went down a few months ago, and now that I’ve got a Calyx hotspot it would be useful to have those posts for reference.
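                                        A possible starting point for that automation, as a Python sketch. It assumes the Pinboard XML export is a <posts> root with one <post> element per bookmark, carrying href, description and tag attributes; that is my recollection of the format and should be checked against an actual export. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def bookmarks_from_export(xml_text):
    """Yield (url, title, tags) for each <post> element in the export."""
    root = ET.fromstring(xml_text)
    for post in root.iter("post"):
        url = post.get("href")
        if url:
            title = post.get("description", "")
            tags = post.get("tag", "").split()
            yield url, title, tags


def bookmarks_on_host(xml_text, host):
    """Return the bookmarks whose URL points at the given host."""
    return [
        bookmark
        for bookmark in bookmarks_from_export(xml_text)
        if urlparse(bookmark[0]).hostname == host
    ]
```

                                        Calling bookmarks_on_host(export_text, "ltehacks.com") would then list the forum posts to look up on archive.org, and the URLs from bookmarks_from_export could be fed to whatever link checker is used to find the broken or moved ones.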

                                        1. 1

                                          I have never used youtubedown.pl, but it’s possibly not as configurable as youtube-dl, so I doubt it will change anytime soon. youtube-dl works fine, is quite well maintained, supports a lot of other sites besides YouTube, etc.

                                          Ok, another Pinboard user. It’s like at least half of the people commenting here use this service. Isn’t finding broken links already too late (unless you have an archival account)? From what I read in other comments, it seemed that Pinboard shows if a link is no longer reachable, so why your own tool for that?

                                          1. 1

                                            I have a grandfathered one-time account and no archiving service. Pinboard does not check the links nor add tags in that case, and I have over 20k bookmarks.

                                            Sometimes the content has been moved slightly, (esp. if it’s an academic site, ie transitioning away from tilde-user directories) in which case I can usually manually find the content again. Some of them are also “read later” shortened Twitter/newspaper links from when I’m on my phone on the run, so a dead link in that case is “oh well, delete”.

                                        2. 1

                                          I don’t actually do this, but I’ve always wanted local copies of stack overflow and wikipedia.

                                          1. 2

                                            There is also a project called sotoki that seemingly allows bringing StackExchange sites to Kiwix by converting their dumps into zim files.

                                            1. 1

                                              apparently the ipfs folk have wikipedia up as something you can mirror through ipfs but they’re working on a dynamic version that doesn’t need to be manually uploaded which would be amazing

                                            2. 1

                                              I have the KiwiX wiki backup and it works great.