1. 21
  1.  

  2. 24

    It was fun to see all the “is it down” websites being down too.

    1. 14

      This postmortem seems unusually bad by Cloudflare’s standards.

      They did a BGP change and stuff almost immediately changed. But tracing the outcome back to the change took >20 minutes. That seems strange. How come? Don’t they have a log of actions taken on the routers, that would show that something was done on one of the affected routers a few seconds before the graphs went up to full or down to zero?

      I’m also surprised that the relevant POPs continued to export the world-facing anycast routes when their success rate dropped to near zero.

      1. 10

        Maybe people will reconsider using MiTMflare if we get a few more outages like this.

        1. 7

          Can you suggest some comparable services with better uptime or, failing that, better postmortems?

          1. 6

            What’s the use case you have?

            I just use … “my web host” (which happens to be Dreamhost, which does offer optional Cloudflare integration, but I intentionally leave it off). It has survived all the HN traffic news spikes just fine, as well as spikes from Reddit, lobste.rs, from an O’Reilly newsletter, and from what I think is some weird Google content suggestion thing (based on user agent).

            It has worked fine for 10 years. The system administration seems very competent. I don’t know the details, but they have their own caches.

            I noticed this guy said the same thing about Dreamhost: https://www.roguelazer.com/2020/07/etcd-post-follow-up/ i.e. that it’s worked for 15 years.

            I feel like a lot of people are using Cloudflare for some weird “just in case” moment that never happens. I’m not saying you don’t have that use case, but I think many people talking about and using Cloudflare don’t.

            To me Cloudflare is just another layer of complexity and insecurity. I would consider using something like it if I had a concrete use case, but not until then. Computers are fast and can serve a lot of traffic.

            1. 3

              The use case is free caching, and free bandwidth if you use some services for hosting (like backblaze). Which cuts down a lot of costs depending on the website you’re running.

              1. 3

                Where is the original site hosted? Why does it need caching?

                (I’m not familiar with Backblaze – is it a web host or is it an object store or both?)

                My point is that, depending on the use case, you probably don’t need caching, so it doesn’t matter if it’s free. There is a downside in security and complexity, which is not theoretical (as this outage shows, and as MITM attacks by state actors and others have shown.)

                1. 2

                  (I’m not familiar with Backblaze – is it a web host or is it an object store or both?)

                  Backblaze has a backup service, as well as a service called “b2” which is basically an s3 like object storage service.

              2. 1

                For the use cases I’ve had, I have (we have) used Fastly, a local Varnish/Apache/Nginx, or Rails middleware. The goals were some combination of a) overriding the backend’s declared cache lifetime b) speeding up page response c) letting the client cache various things even if not cachable by intermediates.

                Cloudflare combines all that with good DDOS protection and good performance globally. I can see how that’s an attractive feature set to many people, and while it’s a shame that VCs haven’t funded three dozen copycats, suggestions like that of @asymptotically that people just shouldn’t use it are stupid. It’s a fine combination of features, and telling people to just not want it, without suggesting alternatives, is IMO offensive and stupid.

              3. 4

                I don’t think so. I think that Cloudflare’s offerings are very good, they got this whole thing fixed in 30 minutes and explained how they’re making sure nothing similar happens again.

                The main problem I have with Cloudflare is their size. What good is a decentralised internet if we just connect through the Cloudflare VPN, resolve a domain via Cloudflare DNS and then get our requests proxied through Cloudflare?

                I also hate the captchas that you are occasionally forced to do.

                1. 3

                  the captchas that you are occasionally forced to do

                  Or all the time when connecting through Tor. Privacy Pass barely works :/ and it’s really silly that you need captchas to just view public pages! If they want to prevent comment spam and whatnot, why not restrict captchas to non-GET requests by default >_<

                2. 1

                  DNS or anti-DDoS? Doesn’t OVH have anti-DDoS servers, for example?

                  1. 6

                    Cloudflare is a CDN with DDOS features (and has some related products, such as a registrar). It offers quick page access anywhere in the world, excellent support for load spikes, and DDOS protection.

                    A lot of ISPs offer anti-DDOS features for their products (which may be a product like Cloudflare’s or a different one, like OVH), but the feature is often one that displeases the victim: dropping packets to the attacked IP address until the attacker grows bored and goes away. I don’t know what OVH means by anti-DDOS, and their description page sounds a little noncommittal to my ears.

                    1. 3

                      OVH’s anti-ddos will trigger on legitimate traffic and then people will say your website has been “hugged to death” when it’s just OVH that shut down all incoming connections.

                      1. 2

                        OVH, the service from which 1/3 of my current bot-attacks come..

                        1. 1

                          Okay. Never used their services myself and don’t know how bots affect their anti-ddos or DNS.

                    2. 2

                      My impression was that BGP problems (specifically BGP leaks, I think) were not just a problem for a CDN like Cloudflare, but also allowed mistakes by small players to make huge numbers of people temporarily lose internet access.

                      Is there a difference in what happened here, and if so, is it a difference of scale, or some other kind of difference?

                      1. 3

                        This incident is related to internal BGP, not eBGP, and could’ve happened with any internal routing protocol.

                    3. 5

                      Meta opinion about Cloudflare: I tried to avoid Cloudflare, but if I have to use a CDN, I must allow some sort of pseudo-MITM. I just chose CF instead of something else for my blog.

                      1. 4

                        You can ask browsers to verify that your CDN is returning what you want instead of giving them everything.

                        1. 1

                          Yeah, but only for assets/media. These days it’s common/desirable/??? for the web page itself to come through the CDN.

                          1. 2

                            If your page is dynamic, then it doesn’t make any sense for the web page to come through the CDN. If your page is just a bunch of links for your JS SPA, you should still serve it yourself for the security, and because it isn’t that big anyways. If your page is fully static, you should probably still serve it yourself, just to be sure that nobody is inserting anything unwanted into your page.

                            1. 2

                              If your page is dynamic, then it doesn’t make any sense for the web page to come through the CDN

                              The route through the CDN’s private network should usually be faster than over the public internet.

                              1. 1

                                It’s not that expensive to have several servers in multiple locations to handle load; you just need to architect for it. Paying for access through the CDN’s private network costs about the same.

                            2. 1

                              Exactly - my use case is the caching of the HTML web page itself.

                          2. 2

                            Where is your blog hosted? Why would it need a CDN? I would be surprised if it does. My other comment in this thread gives some color on why I’ve never needed to MITM my own site:

                            https://lobste.rs/s/xbl6uc/cloudflare_outage_on_july_17_2020#c_nt8atu

                            1. 4

                              Without a CDN your site and its assets are stored in a single location worldwide. If you happen to have readers who do not live close to where the site is hosted, latency will likely be noticeable. Is it using TLS? 3 round trips for the handshake. That’s before even downloading html, css, js and images.

                              A quick check on https://wondernetwork.com/pings/Los+Angeles/Barcelona and ping between Western Europe and US west coast is around 150ms. That’s not nothing!

                              As you pointed out, it’s not necessary, but it does visibly improve the user experience for a lot of people worldwide.

                              1. 1

                                WAN latency can be an issue for really optimized sites, but most sites aren’t really optimized.

                                Example: I just went to webpagetest.org and put in nytimes.com.

                                https://www.webpagetest.org/result/200722_CA_fc5e1e19a8f5c3402fc5cb91be7b4824/

                                It took 15 seconds to load the page. nytimes presumably has all the CDNs in the world.

                                Then I go to lobste.rs, and it takes less than a second:

                                https://www.webpagetest.org/result/200722_7S_ec97779d98563256c0140997a431e19f/

                                lobste.rs is hosted at a single location as far as I know (because CDNs would break the functionality of the site, i.e. the dynamic elements. It’s running off a single traditional database AFAIK).

                                So the bottleneck is elsewhere, and I claim that’s overwhelmingly common.


                                So it could be an issue, but probably not, and that’s why I asked for specific examples. Most sites have many other things to optimize first. If it were “free”, then sure, use a CDN. But it also affects correctness and has serious downsides security-wise (e.g. a self-MITM).

                              2. 3

                                It’s self hosted on a box in my house, so having a CDN ensures my ISP doesn’t hate me if I get too much traffic, and other services I host aren’t impacted.

                                For folks concerned, I also provide a tor onion address for the blog which totally bypasses CF.

                            2. 3

                              It is fine, nobody’s pacemaker is directly connected to Cloudflare-backed services right? Right?

                              1. 1

                                i’m always impressed with Cloudflare’s transparency with their postmortems…. this one’s quite boring but they’re usually pretty interesting
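
For reference, the suggestion earlier in the thread that browsers can verify what a CDN returns most likely refers to Subresource Integrity (SRI): the page carries a hash of each asset, and the browser rejects a CDN response whose hash does not match. Below is a minimal Python sketch of computing such a value. The asset bytes, the `cdn.example.com` URL, and the helper names `sri_hash` and `script_tag` are made up for illustration and do not come from the thread.

```python
import base64
import hashlib


def sri_hash(content: bytes, algorithm: str = "sha384") -> str:
    """Return an SRI value of the form '<algorithm>-<base64 digest>'."""
    # SRI allows only these three hash functions.
    if algorithm not in ("sha256", "sha384", "sha512"):
        raise ValueError(f"unsupported SRI algorithm: {algorithm}")
    digest = hashlib.new(algorithm, content).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def script_tag(url: str, content: bytes) -> str:
    """Build a <script> tag that asks the browser to check the CDN's response."""
    return (
        f'<script src="{url}" integrity="{sri_hash(content)}" '
        'crossorigin="anonymous"></script>'
    )


if __name__ == "__main__":
    # Hypothetical asset and CDN URL, for illustration only.
    asset = b"console.log('hello');\n"
    print(script_tag("https://cdn.example.com/app.js", asset))
```

As the replies point out, this covers only subresources such as scripts and stylesheets. It does not protect the HTML page itself when that page is served through the CDN.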
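
The latency argument made earlier in the thread is simple arithmetic: three handshake round trips at the roughly 150 ms quoted for Los Angeles to Barcelona. Here is a rough sketch of that calculation. The 20 ms round trip to a nearby CDN edge is an assumed figure for comparison, not a measurement, and `handshake_delay_ms` is a made-up helper name.

```python
def handshake_delay_ms(rtt_ms: int, round_trips: int = 3) -> int:
    """Time spent on connection setup before any HTML is requested."""
    return rtt_ms * round_trips


if __name__ == "__main__":
    origin_rtt_ms = 150  # figure quoted in the thread for Los Angeles to Barcelona
    edge_rtt_ms = 20     # assumption: a CDN edge close to the reader

    print("direct to origin:", handshake_delay_ms(origin_rtt_ms), "ms")
    print("via nearby edge: ", handshake_delay_ms(edge_rtt_ms), "ms")
```

This counts only connection setup. As the reply about page load times argues, a slow or heavy page can easily dwarf the difference.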