1. 5

    In most async systems … you end up in a world where you chain a bunch of async functions together with no regard of back pressure.

    yup. Back pressure doesn’t compose in the world of callbacks / async. It does compose if designed well in a coroutine world (see: Erlang).

    async/await is great but it encourages writing stuff that will behave catastrophically when overloaded.

    yup. It’s very hard, in larger systems impossible, to do back pressure right with the callbacks / async programming model.

    This is how I assess software projects I look at. How fast the database is, is one thing. What does it do when I send it 2 GiB of requests without reading the responses? What happens when I open a bazillion connections to it? Will previously established connections have priority over handling new connections?
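
    A hypothetical sketch of that kind of probe in Python (host, port, and request bytes are placeholders): connect, keep sending requests, and never read a single response, then watch what the service does.

    import socket

    # Hypothetical backpressure probe: flood a TCP service with requests and
    # never read the responses, to see whether it pushes back or just buffers
    # until it falls over. Host, port and payload are placeholders.
    HOST, PORT = "127.0.0.1", 6379
    payload = b"PING\r\n" * 1024

    sock = socket.create_connection((HOST, PORT))
    sent = 0
    while sent < 2 * 1024 ** 3:        # aim for ~2 GiB of unanswered requests
        sent += sock.send(payload)     # a server applying backpressure makes this block
    print("service accepted ~2 GiB without pushing back")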

    1. 6

      Even in the Erlang world, there are technical risks at every point where you reintroduce asynchrony. See https://ferd.ca/handling-overload.html for a long post on all the ways you can find to handle overload in the Erlang ecosystem, from the simplest one to more complex systematic approaches.

      1. 2

        In Monte, where I/O is managed by callbacks, my “streamcaps” stream library has perfect backpressure. Each callback generates a promise, and each promise is waited upon before more data is generated.

        The main downside to perfect backpressure is that flow is slow and managed. Many APIs have “firehose” or “streaming” configurations that can send multiple packets of data as fast as possible, but such an API is not possible when backpressure is perfectly communicated.
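
        Not Monte, but a rough asyncio analogue of that contract in Python (all names here are made up): each sink callback returns an awaitable, and the producer awaits it before generating the next chunk.

        import asyncio

        # Rough analogue of "each callback returns a promise that is awaited
        # before more data is generated". Names are invented for illustration.
        async def slow_sink(chunk: bytes) -> None:
            await asyncio.sleep(0.1)      # pretend the consumer is slow

        async def produce(sink, chunks) -> None:
            for chunk in chunks:
                # Backpressure: chunk N+1 is not produced until the promise
                # for chunk N has resolved.
                await sink(chunk)

        asyncio.run(produce(slow_sink, [b"x" * 1024] * 10))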

      1. 4

        If the intent is to redirect users to a different network just to display a fail whale page when absolutely everything else is on fire, having more than 1 minute delay is probably acceptable.

        I disagree. During an L3 volumetric DDoS you might need to nullroute/blackhole an IP address and migrate your service away. It’s totally valid to null-route and move to another IP; many L3 attacks do not follow DNS. It’s very important to be able to shift your service between many IP addresses promptly.

        Also, a 1 minute TTL means that if the authoritative DNS servers are hosed for more than 1 minute, no one would be able to access the dependent services any longer.

        This train of thought seems to imply that the reliability of the authoritative DNS is the same as the reliability of the L7 application. This is not true… You can assume with high confidence that if you are using a reasonable DNS auth provider, its reliability is good. There are PNI links between OpenDNS and major players, or between Google DNS and major authoritative DNS servers. I would argue the world’s DNS system is pretty healthy, surprisingly reliable, and definitely NOT the most vulnerable part of your stack.

        In other words - DNS resolvers caching your DNS answers for a long time is not a pragmatic “improved reliability” argument.

        1. 1

          Won’t the DDoS attackers just cache the old DNS entries? They know your IP, and they can just direct the DoS at that instead of the new provider.

          1. 1

            I’m talking from a provider’s point of view - they move services across IPs all the time.

            If you are a simple website, for sure you should not expose your direct server IP to the public internet ever.

        1. 5

          Always interesting to see how networks work in real life. If you just want to get rid of a connection on Linux, there’s something not shown in the article: repair mode, a fairly obscure setsockopt feature, lets you take down and move connections between systems. Here’s an article about it: https://lwn.net/Articles/495304/

          “Finally, if a connection is closed while it is in the repair mode, it is simply deleted with no notification to the remote end.”
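
          A minimal sketch of that silent-close behaviour on Linux in Python; the TCP_REPAIR constant is hardcoded here (19, from <linux/tcp.h>) since the socket module may not expose it, and CAP_NET_ADMIN is required:

          import socket

          TCP_REPAIR = 19   # assumed value from <linux/tcp.h>; not exposed by the socket module

          def silent_close(sock: socket.socket) -> None:
              # Put the connection into repair mode, then close it: it is
              # deleted locally with no FIN or RST sent to the remote end,
              # exactly as the LWN quote above describes.
              sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1)
              sock.close()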

          1. 4

            This is great. Ok, I’ll bite.

            close(): the socket lingers in the background as usual.
            shutdown(SHUT_RD): no network side effect; discards the read buffer.
            shutdown(SHUT_WR): equivalent to sending FIN.
            SO_LINGER: if the timeout is non-zero, close() blocks until the write buffer is flushed; if the timeout is zero, close() immediately sends RST.

            the trick you described: immediately discard a socket with no network side effects.

            Then there is the ss --kill command to forcefully close a socket from outside the owning process. It is done with the netlink SOCK_DESTROY command.
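
            For reference, the zero-timeout SO_LINGER case from the list above looks roughly like this in Python:

            import socket
            import struct

            # SO_LINGER with l_onoff=1, l_linger=0: close() skips the normal
            # FIN handshake and TIME_WAIT and sends an immediate RST instead.
            sock = socket.create_connection(("example.com", 80))
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                            struct.pack("ii", 1, 0))
            sock.close()   # the RST goes out here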

            1. 2

              On BSD we have tcpdrop, too

          1. 9

            In the meantime I’m trying to convince Antirez to use io_submit in Redis: https://twitter.com/antirez/status/1081197002573139968

            There was a problem with that… I don’t remember what exactly, but it was like, the structures you fill did not match how Redis stored the data or something. I need to try again sooner or later because the speedup in Redis would be huge.

            1. 11

              @majke thank you for a very interesting article and analysis!

              I’m trying to reproduce and follow your analysis on my mac and would appreciate some help. On my mac I can get the code to compile by removing sections 1 and 2 and by removing the MAP_POPULATE and MAP_LOCKED flags. Is it OK to do so? (revised code)

              When I run this on my mac, the mode is 0 ns, with 1000 ns the next most common, and an occasional 3000 ns or higher. This pattern is much less smooth than yours and I wonder why.

              I’m trying to follow your analysis. As far as I can tell, the loop duration variable is redundant, since you have a timestamp and the loop duration merely stores the diff to the last timestamp. So you have point events with their times. You’ve tried to smooth these point events by convolving them with a triangular window, or rather something slightly unorthodox that amounts to literal linear interpolation between the points.

              The linear interpolation generates a lot of high frequencies in the Fourier transform because of the corners.

              In neuroscience (see, I knew that esoteric training would come in useful someday) we have a similar data set that comes from neuronal firing. A popular way to perform Fourier analysis on them is to convolve the train of deltas with gaussians. This is like dropping a cloth over a set of spikes - you get pointy heads where the spikes are and a graceful tail where they are not. This leads to smooth curves which behave more politely in the frequency domain.

              There is a hypothesis behind doing this in neuroscience (neurons act in concert, with a slight jitter between them, blah blah) but basically smoothing delta trains is a dirty deed most practical scientists will let you do by looking the other way.
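
              To make that concrete, here is a rough numpy/scipy sketch of what I mean by smoothing the delta train; the grid resolution, the sigma, and the assumption that the CSV holds one loop duration in ns per line are all mine:

              import numpy as np
              from scipy.ndimage import gaussian_filter1d

              # Gaussian-smoothed delta train: place a unit spike at each event
              # time on a regular grid, smooth, then look at the spectrum.
              durations_ns = np.loadtxt("example-data.csv")   # assumed format
              event_times_ns = np.cumsum(durations_ns)

              dt_ns = 50                                       # grid step: 50 ns
              grid = np.zeros(int(event_times_ns[-1] / dt_ns) + 1)
              np.add.at(grid, (event_times_ns / dt_ns).astype(int), 1.0)

              smoothed = gaussian_filter1d(grid, sigma=20)     # sigma = 1 us
              spectrum = np.abs(np.fft.rfft(smoothed))
              freqs = np.fft.rfftfreq(len(smoothed), d=dt_ns * 1e-9)
              peak = spectrum[1:].argmax() + 1                 # skip the DC bin
              print(f"dominant frequency: {freqs[peak]:.0f} Hz")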

              Since I can’t regenerate the data I’m requesting you to retry your analysis with gaussian smoothing of your delta train and/or give me some pointers as to how to get proper results out of my mac.

              Thank you very kindly!

              -Kaushik

              PS. Also, if you are not inclined to help figure out how it would work on a mac, if you could send me your data set I could try out the gaussian convolution and send the results back to you.

              1. 9

                Raw data at your service: https://raw.githubusercontent.com/cloudflare/cloudflare-blog/master/2018-11-memory-refresh/example-data.csv

                ~/2018-11-memory-refresh$ cat example-data.csv |python3 ./analyze-dram.py 
                [*] Input data: min=111 avg=176 med=167 max=11909 items=131072
                [*] Cutoff range 212-inf
                [ ] 127893 items below cutoff, 0 items above cutoff, 3179 items non-zero
                [*] Running FFT
                [*] Top frequency above 2kHz below 350kHz has magnitude of 7590
                [+] Top frequency spikes above 2kHz are at:
                127884Hz	4544
                127927Hz	5295
                255812Hz	7590
                383739Hz	5799
                511624Hz	6932
                639551Hz	5911
                767436Hz	6001
                895363Hz	5682
                1023248Hz	4774
                1151175Hz	5107
                1406987Hz	4263
                

                (a) trying to run it on mac: why not, but the power saving settings may introduce even more jitter. Also - is there a reliable fast clock_gettime(CLOCK_MONOTONIC) on mac these days?

                (b) I’m very much not an expert on DSP and signal analysis. Please do explain why and how to use the suggested gaussian smoothing.

                1. 6

                  @majke very cool, thanks!

                  My interpretation of your data is in this notebook

                  In brief, I did a simple time domain analysis first by plotting the interval histogram. The histogram shows a prominent periodicity at 16.7 us with some slower components.

                  When I do a frequency domain analysis by smoothing the delta train with a gaussian I see this prominent period with higher harmonics. I’ve forgotten how to interpret the higher harmonics, but the base frequency is consistent with the 16.7 us periodicity.

                  This is roughly twice the 7.8 us you report in your article.

                  Treating this whole thing as a black box, I’d say it typically takes 16.7 us to complete one cycle of operations, though there are instances when things take a lot longer; those are less common by a factor of about 100.

                  Tag! You’re it :)

                  1. 1

                    It looks like you’re interpreting the data differently from @majke; one analysis is on the duration of each event (167 ns), the other on the delta between events (7818 ns).

                    1. 1

                      Every cycle only one timestamp is recorded (rt1 = realtime_now();). There is no differentiation between the duration of an event and the delta between events. It’s not a square wave with a duty cycle.
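
                      For clarity, a Python rendering of the measurement model as I read it (the original is C; the clock and iteration count here are placeholders):

                      import time

                      # One timestamp per iteration; the "loop duration" is just
                      # the diff to the previous timestamp, so the events and the
                      # gaps between them come from the same series.
                      prev = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
                      durations_ns = []
                      for _ in range(131072):
                          now = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
                          durations_ns.append(now - prev)
                          prev = now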

                      1. 2

                        Well, I don’t care how long the long stall was. All I care about is the gap between long stalls. If you make a histogram of the durations between long stalls (where “long” is avg*1.4 or higher), then indeed I think you will find the 7.8us period with a simple histogram. Having said that, this will depend on the noise in the data. I’ve had some runs of the data over which the simpler analysis failed.
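
                        A minimal sketch of that simple approach, assuming the CSV holds one loop duration in ns per line:

                        import numpy as np

                        # Find "long" iterations (> 1.4 * average), take the gaps
                        # between consecutive long ones, and histogram those gaps.
                        durations_ns = np.loadtxt("example-data.csv")
                        event_times_ns = np.cumsum(durations_ns)

                        stalls = event_times_ns[durations_ns > 1.4 * durations_ns.mean()]
                        gaps_ns = np.diff(stalls)

                        counts, edges = np.histogram(gaps_ns, bins=np.arange(0, 20000, 200))
                        peak = counts.argmax()
                        print(f"most common gap: {edges[peak]:.0f}-{edges[peak + 1]:.0f} ns")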

                        1. 1

                          @majke ah very interesting! Thanks again for a very educational article.

                          1. 1

                            Here’s a histogram of “durations between long loop runs”

                            https://uploads.disquscdn.com/images/4d3d1472115285539d85539d3c84b8f5e8e56821cd2e8d84686c7031b51335b2.png

                            You can definitely see the spike at 7800ns, but I’m not sure how to extract it algorithmically without cheating.

                2. 1

                  I’m no expert, but I would assume that ASLR might screw with your results.

                1. 1

                  Neat post. Do you have any custom or unusual (e.g. Cavium) hardware needed for your DDoS mitigation activities? Or is it 100% vanilla boxes from Intel/AMD with Linux running software solutions like in the article?

                  1. 13

                    In our architecture every server is identical both in hardware and software. The more servers we add, the larger DDoS capacity we have. The servers are pretty standard. We do use Solarflare network cards, and occasionally offload parts of iptables into userspace. We are working on replacing this custom piece of software with NIC-vendor agnostic XDP.

                    1. 1

                      Wow. Impressive how far things have come in not relying on custom stuff. Thanks for the reply!

                      1. 1

                        Lovely! Any particular reason for moving away from, or not choosing, DPDK for this?

                        1. 1

                          DPDK is great, but it’s really meant to take over the whole NIC[1]. That puts a lot of constraints on what other functions each server can perform. Fortunately, the netdev guys are taking a lot of cues from DPDK and applying them to XDP and related kernel infrastructure. Comparable performance is coming to Linux sooner than you’d think!

                          [1] bifurcating & SR-IOV aren’t applicable for this particular use case

                    1. 5

                      Author here. This was a fun bit. I don’t think many people write eBPF bytecode manually. The need for large clang/bcc dependencies usually discourages people from using eBPF for smaller things, which is a pity.
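
                      To illustrate how small “by hand” can get (a generic sketch, not the code from the post): each eBPF instruction is 8 bytes (opcode, dst/src registers, a 16-bit offset, a 32-bit immediate), so a trivial program can be assembled with nothing but struct.pack:

                      import struct

                      # Hand-assembled eBPF: opcode, dst|src registers (4 bits each,
                      # dst in the low nibble), 16-bit offset, 32-bit immediate.
                      def insn(opcode, dst=0, src=0, off=0, imm=0):
                          return struct.pack("<BBhi", opcode, (src << 4) | dst, off, imm)

                      BPF_MOV64_IMM = 0xb7   # BPF_ALU64 | BPF_MOV | BPF_K
                      BPF_EXIT      = 0x95   # BPF_JMP | BPF_EXIT

                      # The two-instruction program "r0 = 0; exit", with no clang/bcc involved.
                      prog = insn(BPF_MOV64_IMM, dst=0, imm=0) + insn(BPF_EXIT)
                      print(prog.hex())      # raw bytecode for bpf(BPF_PROG_LOAD, ...)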

                      1. 2

                        Probably, people tend to assume going low level makes things more complicated.

                        Thanks for the post! I have wanted to learn about ebpf bytecode for a while, and this was a great intro. I’m curious, you could have easily compiled the bpf code offline, pasted the resulting bytecode into your Go program, and parametrized it with the map descriptor at runtime. Is there a reason you chose not to do so?

                        1. 3

                          I’m frankly not sure. Most of the eBPF examples out there compile the .c into an ELF. The resulting ELF has the bytecode and map metadata (what maps, what parameters), and can be loaded with some magical userspace helper.

                          This for example: https://github.com/nathanjsweet/ebpf/blob/master/examples/sockex1-user.go#L46

                          I don’t think I’ve seen a .c -> ELF -> bytecode workflow yet. I was told the new objdump is able to read/dump those magical BPF ELFs though, so maybe it’s simple.