1. 18

    Worth reading to the end just for the totally evil code snippet.

    It was kind of foreshadowed to be evil when the author named it “skynet.c” I guess.

    1. 4

      Reminds me of the Java code we used to see around 2000.

      With a RuntimeException try-catch at the top that just printed the exception and continued like nothing happened.

      How many bad bugs, data corruption and weirdness did that practice cause?

      1. 1

        How is that any different from Kubernetes and “just restart it”? It’s mostly the same practice ultimately, though with a bit more cleanup between failures.

        1. 2

          I guess it depends on whether you keep any app state in memory. If you’re just funnelling data to a database maybe not much difference.

      2. 2

        Now I start to wonder what the correct code should look like (as opposed to jumping 10 bytes ahead).

        Read DWARF to figure out next instruction?

        Embed a decompiler to decode the faulty opcode length?

        1. 4

          Increment the instruction pointer until you end up at a valid instruction (i.e., you don’t get SIGILL), of course ;)

          1. 6

            I have code that does this by catching SIGILL too and bumping the instruction pointer along in response to that. https://github.com/RichardBarrell/snippets/blob/master/no_crash_kthxbai.c
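
            For the flavour without clicking through, here is a minimal sketch of the approach on Linux/x86-64 (my reconstruction, not the linked file; emphatically not for real use):

                #define _GNU_SOURCE
                #include <signal.h>
                #include <ucontext.h>

                /* On SIGILL/SIGSEGV, nudge the instruction pointer one byte
                 * forward and resume. Linux/x86-64 specific. */
                static void skip_one(int sig, siginfo_t *si, void *ctx)
                {
                    (void)sig; (void)si;
                    ucontext_t *uc = ctx;
                    uc->uc_mcontext.gregs[REG_RIP] += 1;
                }

                int main(void)
                {
                    struct sigaction sa = {0};
                    sa.sa_sigaction = skip_one;
                    sa.sa_flags = SA_SIGINFO;
                    sigaction(SIGILL, &sa, NULL);
                    sigaction(SIGSEGV, &sa, NULL);
                    /* ...program now shrugs off bad opcodes, one byte at a time... */
                    return 0;
                }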

            1. 2

              Brilliant. I’m simultaneously horrified and amused.

            2. 1

              SIGILL

              That’d be a pretty great nerdcore MC name.

            3. 1

              If you want to skip the offending instruction, à la Visual Basic’s “On Error Resume Next”, you determine the instruction length by looking at the code and then increment by that.

              Figuring out the length requires understanding all the valid instruction formats for your CPU architecture. For some it’s almost trivial, say AVR has 16 bit instructions with very few exceptions for stuff like absolute call. For others, like x86, you need to have a fair bit of logic.

              I am aware that the “just increment by 1” suggestions below are intended as a joke. However, I still think it’s instructive to point out that incrementing blindly might lead you to start decoding in the middle of an instruction. That might still be a valid instruction, especially for dense instruction set encodings. In fact, jumping into the middle of operands was sometimes used on early microcomputers to achieve compact code.
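
              A hedged sketch of that “resume next” handler, where insn_length() is a hypothetical decoder (a real one would wrap a disassembler library such as Zydis or XED):

                  #define _GNU_SOURCE
                  #include <signal.h>
                  #include <stddef.h>
                  #include <stdint.h>
                  #include <ucontext.h>

                  /* Hypothetical: decode the instruction at p, return its size in bytes. */
                  extern size_t insn_length(const uint8_t *p);

                  static void on_sigill(int sig, siginfo_t *si, void *ctx)
                  {
                      (void)sig; (void)si;
                      ucontext_t *uc = ctx;
                      const uint8_t *rip = (const uint8_t *)uc->uc_mcontext.gregs[REG_RIP];
                      uc->uc_mcontext.gregs[REG_RIP] += insn_length(rip); /* skip the whole instruction */
                  }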

            4. 2

              Here’s a more correct approach: https://git.saucisseroyale.cc/emersion/c-safe

              1. 1

                Just don’t compile it with -pg :)

              1. 9

                Meh, I often drive my tests by code coverage. I often design the code (especially error paths) to be explicitly code-coverageable. For example, an uncovered open() error in the code? Easy: add a --testonly-break-open command line option to the program to go there (fault injection). Fault injection can be achieved in other ways too; for example “strace” allows me to fail some syscalls (mmap!).
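
                A minimal sketch of that kind of fault-injection hook (the flag name and wrapper function are assumptions for illustration, not from a real codebase):

                    #include <errno.h>
                    #include <fcntl.h>

                    /* Set by a hypothetical --testonly-break-open option. */
                    static int testonly_break_open;

                    /* Wrapper around open() whose error path can be forced in tests. */
                    static int checked_open(const char *path, int flags)
                    {
                        if (testonly_break_open) {
                            errno = EIO;      /* forced failure for coverage */
                            return -1;
                        }
                        return open(path, flags);
                    }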

                Having said that, I’m not orthodox. There is little point in testing non-hot-path things; for example, do I really need code coverage for the --help printing code? Or for malloc errors in all paths? Some things can fail, and while I want to see a warning, I don’t really care if it’s code-covered (madvise, for example).

                Perfect is the enemy of good.

                1. 7

                  The author agrees that you don’t need to test literally everything, but argues that developers should be explicit about what is excluded from testing. Thus 100% coverage should be achieved by either testing or explicitly NOT testing each line.

                  The author even has recommendations for which things should usually be excluded from testing:

                  What can/should be excluded from coverage? It depends on your codebase and your quality expectations. Also, as for any rules, there are exceptions…

                  • Simple error handling
                  • Proxy functions
                  • Code as configuration
                  • Simple early return/continue
                  • Code that is too difficult to test (rare)
                  • Debugging/manual test code
                  1. 3

                    I often drive my tests by code coverage

                    Careful about that, I was maintaining some tests written by a colleague who was doing that.

                    Alas, it was a horror.

                    Why? Because the test cases and assertions were based on a branch in the implementation, not on a required behaviour.

                    And if it failed? What was it testing? Why was it that value asserted?

                    Not a clue.

                    It didn’t help that it was a rambling test that violated the principle of testing a single required behaviour per test case.

                    After that nightmare I became even more keen on “Given blah When foo Then …” format for test cases.

                  1. 5

                    In most async systems … you end up in a world where you chain a bunch of async functions together with no regard of back pressure.

                    Yup. Back pressure doesn’t compose in the world of callbacks / async. It does compose if designed well in the coroutine world (see: Erlang).

                    async/await is great but it encourages writing stuff that will behave catastrophically when overloaded.

                    Yup. It’s very hard, and in larger systems impossible, to do back pressure right with the callbacks / async programming model.

                    This is how I assess software projects I look at. How fast the database is, is one thing. What does it do when I send it 2GiB of requests without reading responses? What happens when I open a bazillion connections to it? Will previously established connections have priority over handling new connections?

                    1. 6

                      Even in the Erlang world, there are technical risks at every point where you reintroduce asynchrony. See https://ferd.ca/handling-overload.html for a long post on all the ways you can find to handle overload in the Erlang ecosystem, from the simplest one to more complex systematic approaches.

                      1. 2

                        In Monte, where I/O is managed by callbacks, my “streamcaps” stream library has perfect backpressure. Each callback generates a promise, and each promise is waited upon before more data is generated.

                        The main downside to perfect backpressure is that flow is slow and managed. Many APIs have “firehose” or “streaming” configurations that can send multiple packets of data as fast as possible, but such an API is not possible when backpressure is perfectly communicated.

                      1. 4

                        If the intent is to redirect users to a different network just to display a fail whale page when absolutely everything else is on fire, having more than 1 minute delay is probably acceptable.

                        I disagree. On L3 volumetric DDoS you might need to nullroute/blackhole an IP address, and migrate your service away. It’s totally valid to null + move to another IP. Many L3 attacks do not follow DNS. It’s very important to be able to shift your service between many IP addresses promptly.

                        Also, a 1 minute TTL means that if the authoritative DNS servers are hosed for more than 1 minute, no one would be able to access the dependent services any longer.

                        This train of thought seems to imply that the reliability of the DNS auth is the same as the reliability of the L7 application. This is not true… You can with large confidence assume that if you are using a reasonable DNS auth provider, its reliability is good. There are PNI links between OpenDNS and major players, or Google DNS and major DNS auth servers. I would argue the world’s DNS system is pretty healthy, surprisingly reliable and definitely NOT the most vulnerable part of your stack.

                        In other words, DNS resolvers caching your DNS answers for a long time is not a pragmatic “improved reliability” argument.

                        1. 1

                          Won’t the DDoS just cache the old DNS entries? They know your IP, and they can just direct the DoS at that instead of the new provider.

                          1. 1

                            I’m talking from a provider point of view; they move services across IPs all the time.

                            If you are a simple website, for sure you should not expose your direct server IP to the public internet ever.

                        1. 5

                          Always interesting to see how networks work in real life. If you just wanted to get rid of a connection on Linux, then there’s something not shown in the article: repair mode, a fairly obscure setsockopt thing, lets you take down and move connections between systems. Here’s an article about it: https://lwn.net/Articles/495304/

                          “Finally, if a connection is closed while it is in the repair mode, it is simply deleted with no notification to the remote end.”
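
                          A hedged sketch of that repair-mode trick on Linux (needs CAP_NET_ADMIN; TCP_REPAIR comes from <linux/tcp.h>):

                              #include <linux/tcp.h>     /* TCP_REPAIR */
                              #include <netinet/in.h>
                              #include <sys/socket.h>
                              #include <unistd.h>

                              /* Put an established TCP socket into repair mode, then close it:
                               * the kernel deletes it silently, sending neither FIN nor RST. */
                              static int silent_close(int fd)
                              {
                                  int on = 1;
                                  if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)) < 0)
                                      return -1;         /* requires CAP_NET_ADMIN */
                                  return close(fd);
                              }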

                          1. 4

                            This is great. Ok, I’ll bite.

                            • close(): socket will be lingering in background as usual
                            • shutdown(SHUT_RD): no network side effect, discards read buffer
                            • shutdown(SHUT_WR): equivalent to FIN
                            • SO_LINGER socket: if timeout is non-zero, blocks until write buffer is flushed; if timeout is zero, immediately sends RST

                            the trick you described: immediately discard a socket with no network side effects.

                            Then there is the ss --kill command to forcefully close a socket from outside the process. It is done with the netlink SOCK_DESTROY command.
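
                            As a C sketch of the SO_LINGER item in the list above (zero timeout means close() resets instead of the usual FIN handshake):

                                #include <sys/socket.h>
                                #include <unistd.h>

                                /* Zero linger timeout: close() sends an immediate RST
                                 * and discards any unsent data. */
                                static void close_with_rst(int fd)
                                {
                                    struct linger lin = { .l_onoff = 1, .l_linger = 0 };
                                    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin));
                                    close(fd);
                                }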

                            1. 2

                              On BSD we have tcpdrop, too.

                          1. 9

                            In the meantime I’m trying to convince Antirez to use io_submit in redis: https://twitter.com/antirez/status/1081197002573139968

                            There was a problem with that… I don’t remember what exactly, but it was something like: the structures you fill did not match how Redis stored the data. I need to try again sooner or later because the speedup in Redis would be huge.
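
                            For context, the io_submit interface under discussion looks roughly like this (a sketch using the libaio wrappers, error handling mostly omitted; the win for something like Redis is batching many iocbs into a single io_submit() call):

                                #include <libaio.h>   /* link with -laio */

                                /* Submit one read and wait for its completion. */
                                static int read_async(int fd, void *buf, size_t len, long long off)
                                {
                                    io_context_t ctx = 0;
                                    struct iocb cb, *cbs[1] = { &cb };
                                    struct io_event ev;

                                    if (io_setup(1, &ctx) < 0)
                                        return -1;
                                    io_prep_pread(&cb, fd, buf, len, off);
                                    io_submit(ctx, 1, cbs);
                                    io_getevents(ctx, 1, 1, &ev, NULL);  /* block until done */
                                    io_destroy(ctx);
                                    return (int)ev.res;                  /* bytes read */
                                }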

                            1. 11

                              @majke thank you for a very interesting article and analysis!

                              I’m trying to reproduce and follow your analysis on my mac and would appreciate some help. On my mac I can get the code to compile by removing sections 1 and 2 and by removing the MAP_POPULATE and MAP_LOCKED flags. Is it OK to do so? (revised code)

                              When I run this on my mac my mode is 0 ns with 1000 ns following and an occasional 3000 ns or higher. This pattern is much less smooth than yours and I wonder why.

                              I’m trying to follow your analysis. As far as I can follow, the loop duration variable is redundant, since you have a timestamp and the loop duration merely stores the diff to the last timestamp. So you have point events with their times. You’ve tried to smoothen these point events by convolving a triangular window with them; well, doing something slightly unorthodox, which amounts to literal linear interpolation between points.

                              The linear interpolation generates a lot of high frequencies in the Fourier transform because of the corners.

                              In neuroscience (see, I knew that esoteric training would come in useful someday) we have a similar data set that comes from neuronal firing. A popular way to perform Fourier analysis on them is to convolve the train of deltas with gaussians. This is like dropping a cloth over a set of spikes - you get pointy heads where the spikes are and a graceful tail where they are not. This leads to smooth curves which behave more politely in the frequency domain.
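
                              In symbols (my notation, not from the article): with event times $t_i$ and a chosen jitter width $\sigma$, the spike train $s(t) = \sum_i \delta(t - t_i)$ is replaced by its Gaussian-smoothed version

                              $$\hat{s}(t) = \sum_i \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(t - t_i)^2 / 2\sigma^2}$$

                              which is smooth everywhere and therefore decays gracefully in the frequency domain.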

                              There is a hypothesis behind doing this in neuroscience (neurons act in concert, with a slight jitter between them, blah blah) but basically smoothing delta trains is a dirty deed most practical scientists will let you do by looking the other way.

                              Since I can’t regenerate the data, I’d ask you to retry your analysis with Gaussian smoothing of your delta train and/or give me some pointers as to how to get proper results out of my mac.

                              Thank you very kindly!

                              -Kaushik

                              PS. Also, if you are not inclined to help figure out how it would work on a mac, you could send me your data set and I would try out the Gaussian convolution and send the results back to you.

                              1. 9

                                Raw data at your service: https://raw.githubusercontent.com/cloudflare/cloudflare-blog/master/2018-11-memory-refresh/example-data.csv

                                ~/2018-11-memory-refresh$ cat example-data.csv |python3 ./analyze-dram.py 
                                [*] Input data: min=111 avg=176 med=167 max=11909 items=131072
                                [*] Cutoff range 212-inf
                                [ ] 127893 items below cutoff, 0 items above cutoff, 3179 items non-zero
                                [*] Running FFT
                                [*] Top frequency above 2kHz below 350kHz has magnitude of 7590
                                [+] Top frequency spikes above 2kHz are at:
                                127884Hz	4544
                                127927Hz	5295
                                255812Hz	7590
                                383739Hz	5799
                                511624Hz	6932
                                639551Hz	5911
                                767436Hz	6001
                                895363Hz	5682
                                1023248Hz	4774
                                1151175Hz	5107
                                1406987Hz	4263
                                

                                (a) Trying to run it on mac: why not, but the power saving settings may introduce even more jitter. Also, is there a reliable fast clock_gettime(CLOCK_MONOTONIC) on mac these days?

                                (b) I’m very much not an expert on DSP and signal analysis. Please do explain why and how to use the suggested gaussian smoothing.

                                1. 6

                                  @majke very cool, thanks!

                                  My interpretation of your data is in this notebook

                                  In brief, I did a simple time domain analysis first by plotting the interval histogram. The histogram shows a prominent periodicity at 16.7 us with some slower components.

                                  When I do a frequency domain analysis by smoothing the delta train with a gaussian I see this prominent period with higher harmonics. I’ve forgotten how to interpret the higher harmonics, but the base frequency is consistent with the 16.7 us periodicity.

                                  This is roughly twice the 7.8 us you report in your article.

                                  Treating this whole thing as a black box, I’d say it typically takes 16.7us to complete one cycle of operations though there are instances when things take a lot longer, though this is less common by a factor of about 100.

                                  Tag! You’re it :)

                                  1. 1

                                    It looks like you’re interpreting the data differently from @majke; one analysis is on duration of each event (167ns); the other is on the delta between events (7818ns).

                                    1. 1

                                      Every cycle only one timestamp is recorded (rt1 = realtime_now();). There is no differentiation made between the duration of the event and the delta between them. It’s not a square wave with a duty cycle.

                                      1. 2

                                        Well, I don’t care how long the long stall was. All I care about is the gap between long stalls. I think if you make a histogram of durations between long stalls (where “long” is avg*1.4 or higher), then indeed you will find the 7.8us period with a simple histogram. Having said that, this will depend on the noise in the data. I’ve had some runs of the data over which simpler analysis failed.
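
                                        A quick sketch of that histogram pass (array names, the threshold and record_gap() are assumptions for illustration):

                                            #include <stddef.h>
                                            #include <stdint.h>

                                            /* Hypothetical histogram-bin helper. */
                                            extern void record_gap(uint64_t ns);

                                            /* dur[i] is iteration i's duration, ts[i] its end timestamp (ns). */
                                            static void gaps_between_long_stalls(const uint64_t *dur,
                                                                                 const uint64_t *ts,
                                                                                 size_t n, double avg)
                                            {
                                                uint64_t prev = 0;
                                                for (size_t i = 0; i < n; i++) {
                                                    if (dur[i] > 1.4 * avg) {   /* "long" stall threshold */
                                                        if (prev)
                                                            record_gap(ts[i] - prev);
                                                        prev = ts[i];
                                                    }
                                                }
                                            }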

                                        1. 1

                                          @majke ah very interesting! Thanks again for a very educational article.

                                          1. 1

                                            Here’s a histogram of “durations between long loop runs”

                                            https://uploads.disquscdn.com/images/4d3d1472115285539d85539d3c84b8f5e8e56821cd2e8d84686c7031b51335b2.png

                                            You can definitely see the spike at 7800ns, but I’m not sure how to extract it algorithmically without cheating.

                                2. 1

                                  I’m no expert, but I would assume that ASLR might screw with your results.

                                1. 1

                                  Neat post. Do you have any custom or unusual (e.g. Cavium) hardware needed for your DDoS mitigation activities? Or is it 100% vanilla boxes from Intel/AMD with Linux running software solutions like in the article?

                                  1. 13

                                    In our architecture every server is identical both in hardware and software. The more servers we add, the larger DDoS capacity we have. The servers are pretty standard. We do use Solarflare network cards, and occasionally offload parts of iptables into userspace. We are working on replacing this custom piece of software with NIC-vendor agnostic XDP.

                                    1. 1

                                      Wow. Impressive how far things have come in not relying on custom stuff. Thanks for the reply!

                                      1. 1

                                        Lovely! Any particular reason for moving away from, or not choosing, DPDK for this?

                                        1. 1

                                          DPDK is great, but it’s really meant to take over the whole NIC[1]. That puts a lot of constraints on what other functions each server can perform. Fortunately, the netdev guys are taking a lot of cues from DPDK and applying them to XDP and related kernel infrastructure. Comparable performance is coming to Linux sooner than you’d think!

                                          [1] bifurcating & SR-IOV aren’t applicable for this particular usecase

                                    1. 5

                                      Author here. This was a fun bit. I don’t think many people write eBPF bytecode manually. The need for large clang/bcc dependencies usually discourages people from using eBPF for smaller things, which is a pity.
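
                                      For the curious, hand-assembled eBPF can be as small as a bare array of struct bpf_insn. A sketch using only the uapi header (this trivial program just returns 0; loading it is then a single BPF_PROG_LOAD syscall, no clang or bcc needed):

                                          #include <linux/bpf.h>

                                          /* r0 = 0; exit: the smallest possible eBPF program,
                                           * written as raw instructions instead of compiled C. */
                                          struct bpf_insn prog[] = {
                                              { .code = BPF_ALU64 | BPF_MOV | BPF_K,
                                                .dst_reg = BPF_REG_0, .src_reg = 0, .off = 0, .imm = 0 },
                                              { .code = BPF_JMP | BPF_EXIT,
                                                .dst_reg = 0, .src_reg = 0, .off = 0, .imm = 0 },
                                          };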

                                      1. 2

                                        Probably, people tend to assume going low level makes things more complicated.

                                        Thanks for the post! I have wanted to learn about ebpf bytecode for a while, and this was a great intro. I’m curious, you could have easily compiled the bpf code offline, pasted the resulting bytecode into your Go program, and parametrized it with the map descriptor at runtime. Is there a reason you chose not to do so?

                                        1. 3

                                          I’m frankly not sure. Most of the eBPF examples out there compile .c into an ELF. The resulting ELF has the bytecode and map metadata (what maps, what parameters). The resulting ELF can be loaded with some magical userspace helper.

                                          This for example: https://github.com/nathanjsweet/ebpf/blob/master/examples/sockex1-user.go#L46

                                          I don’t think I’ve seen a .c -> ELF -> bytecode workflow yet. I was told the new objdump is able to read/dump the magical BPF ELFs though, so maybe it’s simple.