1. 73
  1. 20

    In one of my browsers, I have distrusted Let’s Encrypt. I opened this page in that browser, which gave me a warning. I tried using http:// instead of https://, but that only redirected (HTTP/1.1 301 Moved Permanently, according to cURL).
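
    The check described above can be reproduced without following the redirect. This is a minimal Python sketch using only the standard library; the function names `forces_https` and `probe_plain_http` are made up for illustration.

    ```python
    import http.client
    from typing import Optional, Tuple


    def forces_https(status: int, location: Optional[str]) -> bool:
        """Return True if a plain-HTTP response is a redirect to an https:// URL."""
        is_redirect = status in (301, 302, 303, 307, 308)
        return is_redirect and bool(location) and location.lower().startswith("https://")


    def probe_plain_http(host: str, path: str = "/") -> Tuple[int, Optional[str]]:
        """Send one HEAD request over plain HTTP without following redirects."""
        conn = http.client.HTTPConnection(host, 80, timeout=10)
        try:
            conn.request("HEAD", path)
            response = conn.getresponse()
            return response.status, response.getheader("Location")
        finally:
            conn.close()
    ```

    Something like `forces_https(*probe_plain_http("example.com"))` would then report whether a site still serves content over plain HTTP or only bounces visitors to HTTPS.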

    This led me to wonder if websites designed to last should always offer an unencrypted (plain HTTP) version of their content. They can offer HTTPS, too, of course. But use <link rel="canonical"> to point to the unsecured version. Why? Today, Internet Explorer 6 cannot be used for anything HTTPS anymore. It supports TLS 1.0 if you enable it, otherwise it’s capped at SSL 3.0. Nobody uses Internet Explorer 6 anymore[citation needed], but it can still be used to retrieve an HTML page over HTTP, because that protocol hasn’t changed in decades.

    Likewise, if you run a webserver today with crypto settings that worked for Internet Explorer 6, browsers might rightfully call you out on it and present the user with bright red blinking lights. You have to keep your software up to date and trust that your software is still supported in the future.

    If you want to offer HTTPS, you need to have the correct crypto settings, and you need to keep renewing your certificate. That adds to the work required to keep your content online. If you really want to design your page to last, you should keep the number of things that can go wrong to a minimum. Don’t assume you will have a secure, widely accepted TLS engine in the future, and don’t assume you will have a trusted certificate in the future.

    1. 10

      Flagging future SSL concerns[1] is a weak argument at best. Today we already deploy tools such as proxies to bridge legacy systems to the new and nothing prevents content mirroring either.

      Concerns about the transport layer pale into insignificance compared to the question of whether something at the other end can still digest and read the content. Push your mind ten or fifty years into the future: what would you find harder to interact with, an IE6-compatible, SSL-served single-page Flash SWF app, or something that uses TLSv5.6-with-chookity-extensions and serves just HTML/CSS?

      It is not only archivists who have noticed that stone tablets and paper seem to retain accessibility better than a bunch of digital cruft produced in the last two decades. The problem being tackled is not the transport but the encoding used.

      I do have a gripe and think the items ‘Obsessively compress your images’ and ‘Stick with the 13 web safe fonts +2’ have no place on that page and have no bearing on accessibility which is where I think ‘correctness’ is pushing aside ‘importance’. :)

      [1] especially so as here the author did not list it as an item on their manifesto

      1. 2

        You’re addressing the client side perfectly. I’ve seen people use Amigas for a lot of cool things; they use proxies on Raspberry Pis. As long as the site is available over HTTPS, you can use a proxy to make it available over HTTP.

        But this article is about the server-side as well. I’ve been in the situation where I’ve botched a webserver by misconfiguring SSL, where everything else was configured correctly. The server doesn’t start. What are the chances that your configuration file today can start your webserver next year? In 5 years? 10 years? If I have an httpd.conf from 5 years ago, will it work without modifications? Of course there are more things than HTTPS that can break, but my point is that HTTPS is something that adds to the list of things that can make your site unreachable.

        And besides, if I wanted to read a website today that’s only available through SSL 3.0, how would I do it? I think most current agents will tell me to take a hike. Not to mention SSL 2.0. Meanwhile, plain ol’ HTTP is as easy to reach as ever. TLS 1.0 is deprecated now; how long until its code paths are removed from the most common TLS libraries?
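
        How a current client ends up refusing old protocol versions can be sketched with Python’s standard `ssl` module. This is only an illustration: the helper names `make_modern_context` and `allows` are invented here, and whether the local OpenSSL build still contains the old code paths at all is a separate question this sketch does not answer.

        ```python
        import ssl


        def make_modern_context() -> ssl.SSLContext:
            """Client context that refuses anything older than TLS 1.2."""
            ctx = ssl.create_default_context()
            ctx.minimum_version = ssl.TLSVersion.TLSv1_2
            return ctx


        def allows(ctx: ssl.SSLContext, version: ssl.TLSVersion) -> bool:
            """Return True if `version` falls inside the context's configured range."""
            if version < ctx.minimum_version:
                return False
            # MAXIMUM_SUPPORTED is a sentinel meaning "no explicit upper bound".
            unlimited = ctx.maximum_version == ssl.TLSVersion.MAXIMUM_SUPPORTED
            return unlimited or version <= ctx.maximum_version
        ```

        With a context like this, a server that only speaks SSL 3.0 or TLS 1.0 fails the handshake, which is the “take a hike” behaviour described above.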

      2. 7

        You are correct. HTTPS itself is continually rotting, and even a 10-year-old browser may not be able to connect to today’s HTTPS sites. You can forget about IE6 and prior.

        1. 6

          Agree with all this, but I’m expecting that in the near future modern browsers will distrust unencrypted content completely. At that point things get more complex: it’ll be relatively simple to have unencrypted content available for IE6, and comparatively more complex to have content available for “current” browsers, but over time we’ll end up with a donut hole in the middle which depends on the deployment of security technology that is considered insecure by the bleeding edge.

          We might even end up with a “shadow web” where historical content is unavailable to modern browsers due to security demands they impose.

          1. 5

            Just curious, why did you go to the trouble to distrust Let’s Encrypt?

            1. 2

              It’s ok to let go of foundational software that’s been wholly replaced vertically.

              1. 1

                A very easy way to bridge across TLS differences is to navigate through a proxy which will only handle the S part of HTTPS and leave the rest to your old browser.

              2. 15

                I kinda dream of a blogging and/or bookmarking engine that would scrape the target page at the time of linking and archive it locally (ideally in WARC format, though a “readability-like” text extract would be a good first attempt too); then, occasionally, re-crawl the links, warning me about 404s and other potential changes; I could then click the suspicious ones to check them manually, and for those I tick off as confirmed bitrot, the engine would then serve the local archived copies to the readers. Even more ideally, all of this stuff could then be stored in IPFS. And yes, I know of pinboard.in, and am using it, but would still prefer a self-hosted (static) blog-like solution, ideally with IPFS support. It’s on my super long TODO list, but I have too many projects already started, so I don’t think I’ll get to it in this life, and additionally it has quite a few nontrivial pieces to it, I think.

                edit: Even more ideally, the IPFS copies of the websites could then be easily cloned by the blog readers, forming a self-interest-driven worldwide replicated Web Archive network.
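
                The re-crawl step of such an engine could look roughly like this. It is a minimal sketch, not WARC-aware, and it assumes each archived link was stored together with a SHA-256 digest of its body; the names `fetch` and `recrawl` are hypothetical.

                ```python
                import hashlib
                import urllib.error
                import urllib.request
                from typing import Callable, Dict, Tuple

                # A fetcher returns (HTTP status, body); status 0 means the request failed outright.
                Fetcher = Callable[[str], Tuple[int, bytes]]


                def fetch(url: str) -> Tuple[int, bytes]:
                    """Default fetcher built on urllib; swap in anything with the same shape."""
                    try:
                        with urllib.request.urlopen(url, timeout=15) as resp:
                            return resp.status, resp.read()
                    except urllib.error.HTTPError as err:
                        return err.code, b""
                    except (urllib.error.URLError, OSError):
                        return 0, b""


                def recrawl(known: Dict[str, str], fetcher: Fetcher = fetch) -> Dict[str, str]:
                    """Compare each link against the digest stored at archive time.

                    `known` maps URL -> hex digest of the archived copy. The result maps
                    URL -> "ok", "changed", or "gone" so suspicious links can be reviewed by hand.
                    """
                    report = {}
                    for url, archived_digest in known.items():
                        status, body = fetcher(url)
                        if status != 200:
                            report[url] = "gone"
                        elif hashlib.sha256(body).hexdigest() != archived_digest:
                            report[url] = "changed"
                        else:
                            report[url] = "ok"
                    return report
                ```

                A byte-exact digest will flag any page with dynamic parts as “changed”, so in practice the comparison would probably need to run on an extracted text version instead.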

                1. 6

                  and for those I tick off as confirmed bitrot, the engine would then serve the local archived copies to the readers

                  I think this would violate copyright law in Europe. Not sure about the US, though. archive.org somehow does not seem to have problems.

                  In Germany archiving written material (also the web) is the job of the National Library. But to my knowledge they only archive all books and a small portion of web pages. And even they say: “Due to copyright reasons access to the collected websites is usually only possible from our reading halls in Leipzig and Frankfurt am Main”.

                  1. 2

                    Uhhhhhh. Sadly, a good point. One would have to talk to r/datahoarders or The Archive Team and ask what they think about it, would they have some ideas how to do this legally. Still, I’d certainly want to have those archived copies available to myself for sure. This cannot be illegal, I can already do “File / Save as” in my browser.

                  2. 1

                    Many years ago when I attended university (in early 2000s) quoting from internet sources wasn’t accepted unless the quoted source was included as an appendix item along with the essay.

                    Since then if I find a digital source for referencing I create a personal archive of it including all relevant meta data for referencing purposes. This has helped combat the effect of “digital decay” on my work where the internet archive may not have managed to grab a snapshot.

                    1. 1

                      warning me about 404s and other potential changes; I could then click the suspicious ones to check them manually

                      This was one of the few uses I had for deep learning. Sometimes, I’d get 404s on part of a page but not all. Sites might also do weird stuff, like serving GIFs in place of the 404. Local copies with smart 404 detection would be a big help to counter the death of the Old Web.

                    2. 10

                      A few other suggestions:

                      • include a date of publication and date of the last revision;

                      • if you rendered the HTML after all, consider making the source documents available as well (e.g. by replacing .html with .md in URL);

                      • if you want to allow people to archive and redistribute your work:

                        • explicitly choose a license and put it on a well-visible space;

                        • consider generating UUID which could be used to reference or search for your article regardless of the URL;

                        • consider assigning a digital signature so readers can verify the mirror hasn’t been tampered with; this assumes they already know your public key and I’m not sure how useful it really is, but it’s better than nothing;

                        • consider using relative links; this is a bit controversial, because they are more difficult to get right and tend to break, but once set up correctly, they allow you to migrate entire sites really easily (also, replacing URLs would break the signatures should you decide to use them);
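
                      The UUID and tamper-check suggestions above can be sketched in a few lines of Python. Note the assumption: a plain SHA-256 digest only shows that a mirror is byte-identical to a published copy; it is not a digital signature and says nothing about who wrote the text. The function names are made up for illustration.

                      ```python
                      import hashlib
                      import uuid


                      def new_article_id() -> str:
                          """Generate a random UUID once, at publication time, and keep it forever."""
                          return str(uuid.uuid4())


                      def digest(content: bytes) -> str:
                          """SHA-256 digest of the exact bytes being published."""
                          return hashlib.sha256(content).hexdigest()


                      def mirror_matches(mirrored: bytes, published_digest: str) -> bool:
                          """True if a mirrored copy is byte-identical to what the author published."""
                          return digest(mirrored) == published_digest
                      ```

                      This also illustrates the caveat about relative links: rewriting even one URL in a mirrored page changes its bytes and therefore its digest.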

                      1. 1

                        Thanks, these are pretty interesting suggestions, and I’ve thought about some of these.

                        If it’s not a dynamic website, doing something like alert(document.lastModified) might help the user retrieve the date of the last revision. Putting it explicitly on the website could make sense in many cases, but I also imagine some cases where the author doesn’t want it (like a restaurant website, where visitors might think, rightly or wrongly, that the restaurant is out of date if it doesn’t keep updating its page).
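
                        For a static page, the value behind document.lastModified normally comes from the server’s Last-Modified response header, so the same date can be read outside the browser. A small sketch, assuming the header value has already been obtained; the function name `last_revision` is hypothetical.

                        ```python
                        from datetime import datetime, timezone
                        from email.utils import parsedate_to_datetime
                        from typing import Optional


                        def last_revision(last_modified_header: Optional[str]) -> Optional[datetime]:
                            """Parse an HTTP Last-Modified header value into a datetime, or None."""
                            if not last_modified_header:
                                return None
                            try:
                                return parsedate_to_datetime(last_modified_header)
                            except (TypeError, ValueError):
                                # Malformed header: treat the revision date as unknown.
                                return None
                        ```

                        When the header is missing, browsers fall back to the current time for document.lastModified, so an explicit “unknown” is arguably the more honest answer.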

                        UUID/signature – I imagine it’d be uncommon for someone to have the UUID but not the website saved. But to both of these comments, I feel that generally you’re thinking of a use case where a website remains static and we need to preserve that copy, but I’m thinking of use cases where websites should be continuously updated over time. I’m okay with older content being revised and not having the previous edits (if they don’t use the quick-backup scheme I mentioned), if the upside is that there is less overhead to updating the website.

                        Relative links are fine, but in my opinion, putting the other content on the single page and using skip page navigation like “#references” would be a bit more maintainable.

                        I like the idea of having a .md version of each .html page (though as I note, I think just writing out the html/css is preferred). But an .md is a better version of “view source” if the html is generated.

                        1. 1

                          If it’s not a dynamic website, doing something like alert(document.lastModified) might help the user retrieve the date of the last revision.

                          I didn’t know of this, thanks. I’d still prefer to see date on the page, though. Filesystem attributes are not always entirely reliable or semantically meaningful.

                          I also imagine some cases where the author doesn’t want it (like a restaurant website, where visitors might think, rightly or wrongly, that the restaurant is out of date if it doesn’t keep updating its page).

                          Of course. I was mostly concerned about blog posts and articles. I hate when I encounter an undated article. Sometimes, the date itself tells half the story.

                          If restaurants had websites that were designed to last, that wouldn’t hurt, but I would mostly appreciate it for an improved UX (incl. lower CPU usage and potentially better parsability) instead of longevity per se.

                          UUID/signature – I imagine it’d be uncommon for someone to have the UUID but not the website saved.

                          I was thinking in terms of references (bibliography). Books are referenced by ISBN. For online content, we currently only have URLs. But a URL is necessarily tied to a single server (or datacenter) which just happens to be serving that particular content. This creates centralization, and once that technical infrastructure collapses, the URL is only good for feeding into archive.org. It would be rather unfortunate to read a paper that references a book “that you can borrow from the guy with the black cape who is seen on the local market every second Tuesday from around 7 am to 9 am”.

                          Linking to a concrete website is fine, but providing a UUID in the bibliography would be even better, because users have the option to copy-paste it into a search engine and try to find a mirror.

                          In conjunction with digital signatures and hashes, readers would be able to assert they read the same copy, or a modified copy, but at least written (signed) by the same person that wrote the original linked paper.

                          I’ll write about it more in the near future.

                          Relative links are fine, but in my opinion, putting the other content on the single page and using skip page navigation like “#references” would be a bit more maintainable.

                          There are also images and stylesheets. Also, I was thinking of entire blogs rather than single documents.

                          though as I note, I think just writing out the html/css is preferred

                          Well, that depends on the use-case (and also who’s sitting behind the keyboard). I hate writing HTML with a passion, because in my opinion, it’s too technical and frankly, it just looks ugly. I always tend to use the proper formatting/indentation I’d use for XML, but I end up with so many levels of indentation that it’s unbearable, so I just fall back to a mess and try to pretend it’s OK to use no indentation in the <body>. And then the “writer’s block” comes into play, and looking at “a bunch of characters that just need to be there for no obvious reason” (it’s certainly not nearly as much document-related or semantics-related as it is design- or programming-related) doesn’t help. When I decide to change a headline, I want to just press CTRL+D and type # a new headline, not think about selecting all the content between <h1></h1>. In the end, I forget the closing tag somewhere anyway.

                          You have some good points about writing HTML by hand, but for me, the disadvantages are too much. I can imagine doing it for certain projects, and I’ve done it in the past, but I mostly have personal blogs in mind (because I’m slowly but surely working on mine right now; I’ve run away from Jekyll because of its complexity, hopefully my own toolchain will be better).

                      2. 7

                        One thing I think the page got “right” but didn’t explicitly call out is to use URL names that are robust to changing technology. Like this page, I’ve settled on using directories to serve content, to avoid exposing a file name extension indicating the technology generating the content. And like this page, having your own DNS becomes important, since it gives you the ability to relocate when your current hosting inevitably ends. Personally I keep my main DNS name renewed for the maximum 10 year duration at all times, even though I’m highly doubtful that it will keep pointing in the same place for that length of time.

                        I’m really curious how much of today’s Internet will still exist ten years into the future. If anything it looks like the dependence on third party hosting, long dependency chains, and constantly moving security targets means the longevity of the web in the future will be less than in the past. That’s very sobering considering that the past didn’t last all that well outside of a handful of major publishers. Like this author, I’m striving to be an exception.

                        1. 4

                          These are all good suggestions. Nothing can change how the web works: when you request a website, it goes off to the internet and fetches it from someone else’s machine. I don’t think it’s reasonable to expect a business to run forever. So hosting your site on a self-managed server or a hosting platform will eventually require you to move away. The author’s points about keeping files small and simple will help with this. Another way to tackle this problem is to make the hosting service more resilient. IPFS won’t help because there’s no guarantee those files will be hosted by someone and accessible. What we need is a public institution that provides, say, 100 MB of storage and 10 GB of traffic per month for everyone on the planet and guarantees that the files won’t go away. Note that though we still have the risks of files being deleted due to censorship, accidental loss, or catastrophe, we have removed the failure mode of a business failing. Another step would be to have this system be content-addressable so old versions of files can be retrieved as well. (For the case when a URL refers to a page that has changed.) If Neocities was run by a well-funded, public-interest, globally distributed nonprofit, it would be something similar to what I’m asking for.

                          1. 3

                            I like how the page loads super fast, but also this is typical advice for other geeks. Building a site that lasts for decades shouldn’t require me to faff about with all these manual steps. There should be simple software to do these things for me and it should be trivial to host static content for non-geeks.

                            1. 3

                              Jeff, I like where you’re going with this. But, I don’t think you’re done yet. If you’d like, I’ll tell you everything I know, freely. I don’t have the same talent for speaking to an unknown audience that you do. But, I can speak to you, if you’d like.

                              May I ask you a personal question? The oldest machine I ever spent a year with was a Commodore 64. It was a hand-me-down by then. My next machine was a Pentium (classic). When my family got dial-up at home, it was a connection to AOL at 28.8 kbps. By this time, most modems would NOT disconnect when the user viewed a web page that described certain AT commands… That personal question: how OLD are you? And I don’t mean in terms of revolutions around the sun. I mean, I can see that you are like me: you’ve used computers non-stop since you first started. And in what generation of hardware and networks was that, when you started?

                              My next question is, what does line 6 of the html source of your post do?

                              <link rel="preload" href="fonts/open-sans-v17-latin-regular.woff2" as="font">

                              What is a woff2 file? What is inside it? Can it be extracted? What is the license? How many implementations of software exist to interpret that file?

                              May I proceed in this manner down the file, line by line? (Including the lines of prose.)

                              Thank you, cheers, –David

                              1. 2

                                WOFF2 is a compressed font format. It’s supported by all modern browsers and its reference implementation is available under the MIT license.

                                The Open Sans font is available under the Apache license.

                                Now Jeff only needs to answer the first question. ;)

                              2. 3

                                And if you follow this advice, it’s much easier for a service like pinboard.in to store your pages. For $25 a year every link I want will be preserved forever (well, limited to something like 30MB per link, I believe, and assuming I back up my stored stuff, as pinboard itself could of course disappear at some point).

                                1. 1

                                  The solution needs to be multi-faceted because no one solution can solve the entire problem. pinboard.in can’t reliably store all the pages if the pages exponentially grow in size as they currently are doing.

                                2. 2

                                  I’m imagining a mashup in my head between “built to last web site” and one of those “hard workin’ American man” Budweiser commercials.

                                  1. 2

                                    Don’t minimize that HTML

                                    Okay, sure

                                    Minify your SVGs


                                    1. 1

                                      Surprised there was no mention of using content-addresses instead of server addresses. An IPFS content hash is more permanent than a server path because content hash never changes as long as the content is the same but server paths can change even if the content doesn’t change which ends up breaking links.
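
                                      The difference between the two kinds of address can be shown with a toy store. This is a deliberately simplified sketch: it uses a bare SHA-256 hex digest as the address, whereas IPFS has its own identifier format, and the class name `ContentStore` is invented for illustration.

                                      ```python
                                      import hashlib
                                      from typing import Dict


                                      class ContentStore:
                                          """Toy content-addressed store: the key is derived from the bytes themselves."""

                                          def __init__(self) -> None:
                                              self._blobs: Dict[str, bytes] = {}

                                          def put(self, content: bytes) -> str:
                                              """Store content and return its address (the hex digest of the bytes)."""
                                              address = hashlib.sha256(content).hexdigest()
                                              self._blobs[address] = content
                                              return address

                                          def get(self, address: str) -> bytes:
                                              """Look content up by address; raises KeyError if nobody stored it."""
                                              return self._blobs[address]
                                      ```

                                      The same bytes always map to the same address no matter where they are hosted, but the `KeyError` case is the catch: an address stays valid only as long as someone still keeps the content.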

                                      1. 3

                                        I don’t know about other people but I simply have no confidence that any of the content addressable technologies will still work in a few months.

                                        1. 2

                                          IPFS has been active and working for years and its usage is growing more than ever.

                                      2. 1

                                        Another thing I like to do is to make pages as self-contained as possible and not needlessly rely on external assets.

                                        On my Jekyll site, I {% include ... %} the CSS and few lines of JS. Images are likewise {% include %}’d with data:image/png;base64,.... For other pages I wrote a tool to do this automatically a while ago: https://github.com/arp242/singlepage
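
                                        The base64 inlining mentioned above boils down to something like the following. This is a generic Python sketch, not how the linked singlepage tool or the Jekyll include actually does it; the function name `to_data_uri` is made up.

                                        ```python
                                        import base64


                                        def to_data_uri(content: bytes, mime_type: str) -> str:
                                            """Encode raw bytes as a data: URI usable in an <img src="..."> attribute."""
                                            encoded = base64.b64encode(content).decode("ascii")
                                            return f"data:{mime_type};base64,{encoded}"
                                        ```

                                        Base64 makes the payload roughly a third larger than the original bytes, which is one reason to keep bigger, non-essential files external.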

                                        There are some exceptions: I have a picture of me on every site (~20K) and load some webfonts (~45k, it just looks much nicer) which I don’t include as they’re non-essential and would add a lot of overhead.

                                        1. 1

                                          One day, Medium, Twitter, and even hosting services like GitHub Pages will be plundered then discarded when they can no longer grow or cannot find a working business model.

                                          I believe this is an insight that many developers choose to ignore.

                                          1. 1

                                            Edit suggestion: under point 7, the link to “monitoring services” is broken.

                                            1. 1

                                              Thank you, fixed.

                                            2. 1

                                              I wonder what your thoughts are on Web Accessibility? I know I don’t always practice what I preach, but I assume it’s good to keep in mind.