1. 5
  1.  

  2. 3

    This article ignores, and unfortunately the author does not put the energy into describing, the operational constraints on, and the solutions available to, those having to maintain a globally highly available service. To evaluate these statements, you really do have to bring to the table your own operational experience of handling and responding to whole-datacentre failover[1].

    What are the benefits of the short TTL? Not many, actually. You have the flexibility to change your IP address, but you don’t do that very often, and besides…

    [snipped]

    This means the DNS servers are hit constantly, which turns them into a real single point of failure and “the internet goes down” just after a few minutes of DDoS.

    These statements suggest that maybe the author has not had the ‘fun’ experience of 3am pages maintaining such a service, or of the solutions network/system administrators turn to in an effort to improve their sleeping patterns. :)

    He is an experienced developer, which might mean he views such problems through a development lens rather than an operational one; though maybe he has used gdnsd (IMO the ‘Royale with Cheese’ of authoritative servers) or the like and simply does not want to admit to it on his CV, lest he be condemned with these problems at his next gig :)

    As for the suggested 24-hour TTL, unfortunately, judging from the DDoS events we see reported, this would not in practice stop an outage; it would just postpone it.

    Using Zytrax as a reference is good, but there are no numbers to back up the ‘nameserver load increase’ statement, which I think the author may have taken too literally, suspecting (probably like all of us did initially) that there is an exponential curve at play. Back in 2010 I remember attending a JANET NWS presentation on just this, and the difference between a large TTL and a zero TTL was about a 2.5x request rate. Great, my CPU load just went from 0.01 to 0.02; not really anything to write home about :)

    Global DNS load balancing does have its problems, but for me the interesting question here, and more crucially the actionable one, is why the majority of resolvers out there expired a cached lookup in the face of a SERVFAIL when attempting to refresh it. This to me is ‘unexpected’ behaviour and not something I would have catered for in my deployments; I would be interested to hear if others have. Maybe this is technically correct behaviour, and so maybe we need to revisit its rationale now that low-TTL DNS deployments are the norm?
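
    For what it is worth, resolvers have started growing knobs for exactly this; unbound, for example, has serve-expired options. A minimal unbound.conf sketch, assuming a version recent enough to support them (the 86400 cap is just an illustrative value):

    ```
    server:
        # Serve an expired cached answer when the authoritative
        # servers cannot be reached or are returning SERVFAIL.
        serve-expired: yes
        # Optional cap on how long past expiry an entry may still
        # be served, in seconds (here: one day).
        serve-expired-ttl: 86400
    ```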

    I await folks pointing out how I, too, have not put the energy into answering the ‘unexpected behaviour’ question :)

    [1] BGP is maybe technically a better option, letting you retain a 24-hour TTL, but would you recommend that someone trying this acquire the equipment, skill set, upstream peering and at least one /24 IPv4 block needed to do it…or just recommend a low DNS TTL?

    1. 3

      One reason not to cache past TTL is it can make it much harder to detect problems. People are saying that e.g. Comcast should serve stale entries until they get new info. What happens when Comcast bungles their router config and their DNS server loses its uplink? All the sites that are in cache will continue to be fed to users. So like 99% of the Internet will work, and some few sites won’t. How long until that’s properly resolved?

      I spent weeks living with an issue where unbound had a single broken entry in its cache. Would not recommend.

      1. 2

        This is a different problem: caching stale entries when the upstream authoritative server is responsive (i.e. not returning SERVFAIL) is completely incorrect and of course does break things.

        1. 2

          How does one differentiate between a dead upstream server and a disconnected network cable?

          1. 2

            Not really able to see the relevance of this question. Dyn were returning SERVFAIL, though even if they had just been timing out, the response returned by a cache to its clients would still be the same.

            The ‘same’ here for now seems to be “nope, entry expired, you are getting nothing”, except for the relayed SERVFAIL response from upstream (or a locally generated one if it timed out). I am suggesting that maybe re-using the last known good reply is not such a bad thing when faced with a situation where there is no way to revalidate that cached entry.
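
            As a sketch of what I mean, here is a toy resolver cache in Python that falls back to the last known good answer only when a refresh attempt fails; all names and the fixed 300s TTL are hypothetical, not taken from any real resolver:

```python
import time

SERVFAIL = "SERVFAIL"

class StaleCache:
    """Toy cache that serves stale entries only on upstream failure."""

    def __init__(self, max_stale=86400):
        self.entries = {}           # name -> (answer, expires_at)
        self.max_stale = max_stale  # how long past expiry we will serve

    def put(self, name, answer, ttl):
        self.entries[name] = (answer, time.time() + ttl)

    def resolve(self, name, query_upstream):
        answer, expires_at = self.entries.get(name, (None, 0))
        now = time.time()
        if answer is not None and now < expires_at:
            return answer                      # fresh cache hit
        result = query_upstream(name)          # entry expired: refresh
        if result not in (None, SERVFAIL):
            self.put(name, result, ttl=300)    # illustrative fixed TTL
            return result
        # Upstream is down or SERVFAILing: fall back to the last
        # known good answer, if it is not too stale.
        if answer is not None and now < expires_at + self.max_stale:
            return answer
        return SERVFAIL
```

            The point of the sketch is the ordering: the TTL is always honoured when upstream is healthy, so this does not reintroduce the “stale entry while upstream is responsive” problem from the parent comment.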

            The problem you seem to be describing is either:

            1) an authoritative server returns the wrong result (bug or fat-fingering, it does not matter), and a cache then stores that for hours/days/months

            2) the resolver ignores the TTL and keeps an old stale entry for far longer than it should, even though upstream is responsive

            Both of these are problems, but different problems from the one regarding the choice of behaviour for a cache when there is no usable response from upstream.

            For a caching application, whether for DNS or other protocols (CDNs caching HTTP are a classic example), there is in practice no difference, from its clients’ point of view, between an unresponsive upstream server and one that is returning errors (SERVFAIL, 5xx, …). Of course, the twist is that the cache has the option of serving possibly stale data, if it has any, over serving nothing at all.
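
            On the HTTP side this option is actually standardised: RFC 5861 defines Cache-Control extensions that let an origin opt in to exactly this behaviour. A minimal example response header (the values are illustrative):

            ```
            Cache-Control: max-age=600, stale-while-revalidate=30, stale-if-error=86400
            ```

            i.e. serve fresh for ten minutes, serve stale while refreshing for up to 30 seconds, and serve stale for up to a day if the origin is erroring.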

            As to your question, which looks like baiting: ARP has a tendency not to run too well with no L1 :) You can use the three or so different combinations of responses ({host,net}-unreachable and the TTL value) to figure out what failed and where, but why bother when having that answer gives you no new options for handling the failure?

      2. 1

        The point of the cache is to cache the “correct” answer. If the resolver is down, the correct answer is SERVFAIL. The alternative is that when I shut down my resolver, all my records continue to live, in an unevenly distributed way, forever (or for some arbitrary period of time I have no control over).

        OpenDNS does do some form of this, but I don’t know what their criteria are for their “smart cache” feature.