Threads for majke

  1. 8

    To me, this looks like a great example of the cascading complexity that results from a mismatch between limits imposed by an API (in this case a protocol) and actual hardware limits. If TCP had a 4 byte or 8 byte port range, this would presumably not be an issue. There would be no need to share ports.

    How can we design protocols and APIs that can adapt to increasing hardware capabilities? Perhaps we should always use variable-length integers? There presumably will always be some absurdly large limit that we can safely impose, e.g. the number of atoms in the observable universe (roughly 2^265, or 256 bits for a nice round number), so we don’t necessarily need to allow any arbitrary integer.

    1. 5

      We had one protocol that allowed more flexibility and it didn’t end well - ipv6. All the “flexible” bits are pretty much deprecated because it’s impossible to implement a fast router when you don’t know where the fields (like tcp ports) are.

      I wouldn’t draw the conclusion that 16 bits for a port is too limiting.

      IMO the conclusion is that the BSD sockets API kinda requires the 2-tuple to be locked at bind() time, while the user doesn’t want that. For a connected socket we expect the 4-tuple to be locked. It would be nice to have an API that can express that. For TCP we have IP_BIND_ADDRESS_NO_PORT; for UDP we don’t have anything, plus there is the overshadowing issue
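
      Roughly, this is what the TCP option buys you - bind to a source IP without burning an ephemeral port, and let connect() pick the port per 4-tuple. A sketch with placeholder addresses and error handling trimmed:

      #include <arpa/inet.h>
      #include <netinet/in.h>
      #include <sys/socket.h>
      #include <unistd.h>

      #ifndef IP_BIND_ADDRESS_NO_PORT
      #define IP_BIND_ADDRESS_NO_PORT 24   /* value from linux/in.h, for older headers */
      #endif

      /* Bind to a specific source IP but defer port selection to connect(),
       * so the kernel can pick a port per 4-tuple instead of per 2-tuple. */
      int connect_from(const char *src_ip, const char *dst_ip, int dst_port)
      {
          int fd = socket(AF_INET, SOCK_STREAM, 0);
          int one = 1;
          setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one));

          struct sockaddr_in src = { .sin_family = AF_INET, .sin_port = 0 };
          inet_pton(AF_INET, src_ip, &src.sin_addr);
          bind(fd, (struct sockaddr *)&src, sizeof(src));   /* no port picked yet */

          struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(dst_port) };
          inet_pton(AF_INET, dst_ip, &dst.sin_addr);
          if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
              close(fd);
              return -1;
          }
          return fd;
      }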

      1. 3

        With IPv6 you don’t even need ports, you could just use the last N bytes of the address as the port :D

        More practically it might actually make sense to “expand” the ephemeral port range into the address, i.e. just use lots of different source addresses under the subnet your machine has.

        1. 3

          This is one of the reasons why it’s good to have a /64. One of the privacy options for IPv6 recommends that you keep a small number of stable IPv6 addresses for incoming connections and periodically cycle the one that you use for outbound connections. With a /64 and SLAAC you can do this more or less by picking a new 64-bit random number. In the most extreme configuration, you pick a new IPv6 address for every outbound connection. This means that a server shouldn’t be able to distinguish between two connections from different machines in the subnet and two from the same machine. This doesn’t help much for home networks (where everyone on the /64 is likely to be in the same family, at least), but is good for networks with more users.

          I believe this is very rarely done in practice because the higher-level protocols (for example, HTTP) provide tracking information (cookies, browser fingerprinting, and so on) that’s so much more useful than IP address that tracking based on IP is of fairly negligible value.

          1. 2

            This is done for DNS resolvers already, it got popular after the security issues around lack of space for entropy in the ID field. You route a /64 to the resolver and tell it to prefer IPv6 and to use that whole /64 as the outgoing source.

            See, eg, Unbound’s outgoing-interface setting.
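
            A minimal unbound.conf sketch of that setup (the prefix here is just a placeholder for whatever /64 is routed to the resolver):

            server:
                prefer-ip6: yes
                # pick random source addresses from the routed /64 for outgoing queries
                outgoing-interface: 2001:db8:1:2::/64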

        1. 3

          Take a look at the associated code: https://github.com/cloudflare/cloudflare-blog/tree/master/2022-02-connectx

          There is quite some value in the tests. These kinds of things are notoriously hard to reason about, hard to get all the corner cases right, and hard to test.

          killtw.py is a bit buggy though; there is one corner case around TIME-WAIT sockets I haven’t been able to figure out yet.

          1. 36

            I wasn’t afraid to ask. I asked several times. I just never got an answer.

            1. 9

              Nice trolling with the UDP joke :)

              In writing this blog post I’ve spent like three days trying to understand what IP_RECVERR does. It’s amazing how little documentation there is.
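
              For anyone else digging: the rough shape is that you enable it with setsockopt() and then read ICMP errors back from the socket’s error queue with MSG_ERRQUEUE. A simplified sketch (not from the post):

              #include <linux/errqueue.h>
              #include <netinet/in.h>
              #include <stdio.h>
              #include <sys/socket.h>

              void drain_errqueue(int fd)
              {
                  /* Ask the kernel to queue ICMP errors (e.g. port unreachable)
                   * on the socket instead of silently dropping them. */
                  int one = 1;
                  setsockopt(fd, IPPROTO_IP, IP_RECVERR, &one, sizeof(one));

                  char data[512], ctrl[512];
                  struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
                  struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                                        .msg_control = ctrl, .msg_controllen = sizeof(ctrl) };

                  /* Errors sit in a separate queue; returns -1/EAGAIN if it is empty. */
                  if (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) < 0)
                      return;

                  for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
                      if (c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_RECVERR) {
                          struct sock_extended_err *ee = (void *)CMSG_DATA(c);
                          printf("errno=%d origin=%d type=%d code=%d\n",
                                 ee->ee_errno, ee->ee_origin, ee->ee_type, ee->ee_code);
                      }
                  }
              }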

              1. 9

                Odd, I never asked but was told repeatedly anyway, just because I happened to be listening.

              1. 2

                This code stresses the BTB to an extreme - it just consists of a chain of jmp +2 statements (i.e. literally jumping to the next instruction)

                Nit: those are jmp +0. You can see that in the encoding (eb 00): eb is a short jump, and 00 is the displacement. The displacement of a branching instruction is added to the instruction pointer, which points to the instruction after that currently being executed.
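
                For the curious, the pattern is easy to reproduce with inline asm on x86 - each jmp 1f; 1: pair should assemble to exactly that eb 00, a taken branch to the very next instruction (rough sketch, not the article’s generator):

                #include <stdio.h>

                int main(void)
                {
                    for (long i = 0; i < 100 * 1000 * 1000; i++) {
                        /* four taken branches, each jumping to the next instruction */
                        __asm__ volatile(
                            "jmp 1f\n\t" "1:\n\t"
                            "jmp 2f\n\t" "2:\n\t"
                            "jmp 3f\n\t" "3:\n\t"
                            "jmp 4f\n\t" "4:\n\t");
                    }
                    puts("done");
                    return 0;
                }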


                The call/ret chart is funny [https://blog.cloudflare.com/content/images/2021/05/pasted-image-0--17-.png]

                Ha, that is amazing!

                1. 4

                  I know, but then I speak about block sizes. This is getting too much into details, and explaining that jmp+14 is block size 16 on x86 and jmp+12 is block size 16 on aarch64 would not be fun. Thanks for the nudge though :)

                1. 5

                  The article notes that a never-taken branch doesn’t get an entry in the BTB, and is automatically predicted to not be taken. However, my understanding was that a branch with no BTB entry would be assumed to be not taken only for a forward jump – backward branches would be assumed to be taken (since such are typically loop guards).

                  It looks like their test only tests forward branches, so I’m not sure if my information is out of date (or incomplete), or if they just didn’t test this situation.

                  1. 4

                    Take a look at the code, there is a backward not-taken branch test as well. I did run it and it didn’t show anything interesting. But think about it: the CPU needs to fetch instructions before the jmp is decoded. It would still create a bubble if the CPU had to wait for the backward branch to decode.

                    1. 2

                      Certainly the bubble would still exist; my only question was if backward branches had their cases reversed – i.e. the CPU would only create an entry in the BTB for it if it wasn’t taken. It sounds like that’s not the case, however, so presumably my understanding was just wrong.

                    2. 3

                      It depends a lot. In a big superscalar core, it’s common to do predictions per fetch granule, rather than per instruction. This is why the article sees a benefit from adding nops. The fetch stage grabs a chunk of data from i-cache and the branch predictor then has to tell it what to fetch next. This prediction has to happen before decode, which makes it difficult to key the decision on anything that happens after decode. You may get very small bubbles if decode tells you that there are no branch instructions but the branch prediction came from some stale state and suggested the next fetch granule was somewhere other than immediately following the current one.

                      This gets even more complex when you have things like trace caches. Decoding x86 is so complex that it’s common to store the micro-ops in an internal cache, rather than decode the raw instructions every time. If you’re in a hot code segment, you’re probably skipping decode and going straight from the trace cache. There’s an old heuristic (I don’t actually know how true it is today) that 90% of compute time is spent in hot loops (which is why compilers try so hard to optimise loops) and most time in a loop you’ll be fetching from the trace cache. The trace cache can store extra information to hint the branch predictor, including just storing the next most-likely fetch granule in the buffer immediately after the current one so that you don’t need to do prediction at all, you just lay out the trace caches so that the hot code path is contiguous. That kind of thing is a bit less common on Arm implementations, where decode is cheap, but nVidia’s Project Denver series does something even more aggressive and JITs hot traces as contiguous blocks of VLIW code so that the fast paths are really fast but the cost of leaving the trace can be disproportionately high.

                      TL;DR: Modern CPUs are indistinguishable from magic. Any heuristic that you think you know is probably wrong.

                    1. 11

                      Surprised they don’t mention the singular value decomposition at all for dimensionality reduction. A friend also pointed me to the Johnson-Lindenstrauss Theorem, which “shows that a set of n points in high dimensional Euclidean space can be mapped into an O(log n/e^2)-dimensional Euclidean space such that the distance between any two points changes by only a factor of (1 +- e)”.

                      Maybe it’s because their solution needs to run online? But they don’t mention that in their blog post.

                      EDIT: Sorry, didn’t want to come across as totally negative. I definitely think the solution they ended up going with is clever. I just think it would be more interesting if they’d talked about more prior art.

                      1. 4

                        My understanding of SVD is that for it to make sense the dimensions need to be correlated in some way. This is not the case for our workload - for all intents and purposes the dimensions are pretty random.

                        Johnson-Lindenstrauss Theorem - amazing. I’m not sure what the implications are; I would need to think about it. Thanks for the link!

                      1. 3

                        Thanks for the very interesting post! I am not very familiar with image search, but I wonder if (Optimized) Product Quantization would help here. In word embeddings, we have found that OPQ can compress vectors an order of magnitude with barely any l2 loss. Another benefit is that distances between vectors can be computed by using a precomputed table of pairwise distances of subquantizer centroids.

                        1. 2

                          This post is less about image search and more about just “nearest neighbor” in 144 dimensions. Or to be more precise - proving that nearest neighbor within given threshold doesn’t exist :)

                          We tried various centroids, but the data is quite random so no obvious clusters emerge.

                          1. 3

                            This post is less about image search and more about just “nearest neighbor” in 144 dimensions.

                            Which is what PQ-based NN search will give you, except that you can do the computations in the lower-dimensional quantized space.

                            We tried various centroids, but the data is quite random so no obvious clusters emerge.

                            I am not sure what you mean here. The vectors must have some underlying structure, otherwise it wouldn’t be useful to compute distances between vectors. As you mention in the blog post, the problem is more that with such large dimensionalities you run into the curse of dimensionality and all data points tend to be equidistant, which is problematic for k-means.

                            However, PQ uses k-means clustering in low-dimensional spaces (for each subquantizer).

                        1. 4

                          I have the book on my shelf and I regularly peek into it, although for me it is more of a reference. For example it perfectly answers why we have an 8-way set associative cache and not more - the book gives a, well, quantitative answer.

                          More examples: how DRAM works, basics of NVIDIA Fermi architecture (or: SIMD vs GPU).

                          The book is rather dense though, and I would not recommend reading it cover to cover.

                          1. 19

                            Worth reading to the end just for the totally evil code snippet.

                            It was kind of foreshadowed to be evil when the author named it “skynet.c” I guess.

                            1. 4

                              Reminds me of the Java code we used to see around 2000.

                              With a RuntimeException try-catch at the top and then just print it and continue like nothing happened.

                              How many bugs, and how much data corruption and weirdness, did that practice cause?

                              1. 1

                                How is that any different from Kubernetes and “just restart it”? It’s mostly the same practice ultimately, though with a bit more cleanup between failures.

                                1. 2

                                  I guess it depends on whether you keep any app state in memory. If you’re just funnelling data to a database maybe not much difference.

                              2. 2

                                Now I’m starting to wonder what the correct code should look like (as opposed to jumping 10 bytes ahead).

                                Read DWARF to figure out the next instruction?

                                Embed a disassembler to decode the faulty instruction’s length?

                                1. 4

                                  Increment the instruction pointer until you end up at a valid instruction (i.e., you don’t get SIGILL), of course ;)

                                  1. 7

                                    I have code that does this by catching SIGILL too and bumping the instruction pointer along in response to that. https://github.com/RichardBarrell/snippets/blob/master/no_crash_kthxbai.c
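
                                    The gist of it as a standalone x86-64 Linux sketch (not the linked file - the ud2 demo and names here are made up):

                                    #define _GNU_SOURCE
                                    #include <signal.h>
                                    #include <stdio.h>
                                    #include <ucontext.h>

                                    /* On SIGILL/SIGSEGV, nudge RIP forward one byte and retry. We may land
                                     * mid-instruction and fault again; the handler just keeps nudging until
                                     * something decodes and runs. */
                                    static void skip_byte(int sig, siginfo_t *info, void *ctx)
                                    {
                                        (void)sig; (void)info;
                                        ucontext_t *uc = ctx;
                                        uc->uc_mcontext.gregs[REG_RIP] += 1;
                                    }

                                    int main(void)
                                    {
                                        struct sigaction sa = { .sa_sigaction = skip_byte, .sa_flags = SA_SIGINFO };
                                        sigaction(SIGILL, &sa, NULL);
                                        sigaction(SIGSEGV, &sa, NULL);

                                        /* ud2 is a guaranteed SIGILL; the nops give the handler somewhere safe to land. */
                                        __asm__ volatile("ud2; nop; nop; nop; nop; nop; nop; nop" ::: "rdx", "memory");
                                        puts("carried on regardless");
                                        return 0;
                                    }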

                                    1. 2

                                      Brilliant. I’m simultaneously horrified and amused.

                                    2. 1

                                      SIGILL

                                      That’d be a pretty great nerdcore MC name.

                                    3. 1

                                      If you want to skip the offending instruction, à la Visual Basic’s “on error resume next”, you determine the instruction length by looking at the code and then increment by that.

                                      Figuring out the length requires understanding all the valid instruction formats for your CPU architecture. For some it’s almost trivial, say AVR has 16 bit instructions with very few exceptions for stuff like absolute call. For others, like x86, you need to have a fair bit of logic.

                                      I am aware that the “just increment by 1” below are intended as a joke. However I still think it’s instructive to say that incrementing blindly might lead you to start decoding at some point in the middle of an instruction. This might still be a valid instruction, especially for dense instruction set encodings. In fact, jumping into the middle of operands was sometimes used on early microcomputers to achieve compact code.

                                    4. 2

                                      Here’s a more correct approach: https://git.saucisseroyale.cc/emersion/c-safe

                                      1. 1

                                        Just don’t compile it with -pg :)

                                      1. 9

                                        Meh, I often drive my tests by code coverage. I often design the code (especially error paths) to be explicitly code-coverageable. For example, an uncovered open() error path in the code? Easy: add a --testonly-break-open command line option to the program to reach it (fault injection). Fault injection can be achieved in other ways too, for example “strace” allows me to fail some syscalls (mmap!).

                                        Having said that, I’m not orthodox. There is little point in testing non-hot-path things; for example, do I really need code coverage for the --help printing code? Or for malloc errors in all paths? Some things can fail, and while I want to see a warning, I don’t really care if it’s covered (madvise for example).

                                        Perfect is the enemy of good.
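
                                        For illustration, the kind of hook I mean looks roughly like this (the flag name is the one above, everything else is a made-up toy):

                                        #include <errno.h>
                                        #include <fcntl.h>
                                        #include <stdio.h>
                                        #include <string.h>

                                        static int testonly_break_open;   /* set by --testonly-break-open */

                                        static int do_open(const char *path)
                                        {
                                            if (testonly_break_open) {
                                                errno = EMFILE;           /* inject a failure for the test */
                                                return -1;
                                            }
                                            return open(path, O_RDONLY);
                                        }

                                        int main(int argc, char **argv)
                                        {
                                            for (int i = 1; i < argc; i++)
                                                if (strcmp(argv[i], "--testonly-break-open") == 0)
                                                    testonly_break_open = 1;

                                            int fd = do_open("/etc/hostname");
                                            if (fd < 0) {
                                                perror("open");           /* this is the error path we want covered */
                                                return 1;
                                            }
                                            return 0;
                                        }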

                                        1. 7

                                          The author agrees that you don’t need to test literally everything, but argues that developers should be explicit about what is excluded from testing. Thus 100% coverage should be achieved by either testing or explicitly NOT testing each line.

                                          The author even has recommendations for which things should usually be excluded from testing:

                                          What can/should be excluded from coverage? It depends on your codebase and your quality expectations. Also, as for any rules, there are exceptions…

                                          • Simple error handling
                                          • Proxy functions
                                          • Code as configuration
                                          • Simple early return/continue
                                          • Code that is too difficult to test (rare)
                                          • Debugging/manual test code
                                          1. 3

                                            I often drive my tests by code coverage

                                            Careful about that, I was maintaining some tests written by a colleague who was doing that.

                                            Alas, it was a horror.

                                            Why? Because the test cases and assertions were based on branches in the implementation, not on required behaviour.

                                            And if it failed? What was it testing? Why was it that value asserted?

                                            Not a clue.

                                            It didn’t help that it was one long rambling test that violated the principle of testing a single required behaviour per test case.

                                            After that nightmare I became even more keen on “Given blah When foo Then …” format for test cases.

                                          1. 5

                                            In most async systems … you end up in a world where you chain a bunch of async functions together with no regard of back pressure.

                                            yup. Back pressure doesn’t compose in the world of callbacks / async. It does compose if designed well in the coroutine world (see: Erlang).

                                            async/await is great but it encourages writing stuff that will behave catastrophically when overloaded.

                                            yup. It’s very hard, and in larger systems impossible, to do back pressure right with the callbacks / async programming model.

                                            This is how I assess software projects I look at. How fast the database is, is one thing. What does it do when I send it 2GiB of requests without reading the responses? What happens when I open a bazillion connections to it? Will previously established connections have priority over handling new connections?

                                            1. 6

                                              Even in the Erlang world, there are technical risks at every point where you reintroduce asynchrony. See https://ferd.ca/handling-overload.html for a long post on all the ways you can find to handle overload in the Erlang ecosystem, from the simplest one to more complex systematic approaches.

                                              1. 2

                                                In Monte, where I/O is managed by callbacks, my “streamcaps” stream library has perfect backpressure. Each callback generates a promise, and each promise is waited upon before more data is generated.

                                                The main downside to perfect backpressure is that flow is slow and managed. Many APIs have “firehose” or “streaming” configurations that can send multiple packets of data as fast as possible, but such an API is not possible when backpressure is perfectly communicated.

                                              1. 4

                                                If the intent is to redirect users to a different network just to display a fail whale page when absolutely everything else is on fire, having more than 1 minute delay is probably acceptable.

                                                I disagree. Under an L3 volumetric DDoS you might need to null-route/blackhole an IP address and migrate your service away. It’s totally valid to null-route and move to another IP. Many L3 attacks do not follow DNS. It’s very important to be able to shift your service between many IP addresses promptly.

                                                Also, a 1 minute TTL means that if authoritative DNS servers are hosed for more than 1 minute, no one would be able to access the dependant services any longer.

                                                This train of thought seems to imply that the reliability of the authoritative DNS is the same as the reliability of the L7 application. This is not true… You can assume with high confidence that if you are using a reasonable authoritative DNS provider, its reliability is good. There are PNI links between OpenDNS and major players, or Google DNS and major authoritative DNS servers. I would argue the world’s DNS system is pretty healthy, surprisingly reliable, and definitely NOT the most vulnerable part of your stack.

                                                In other words - DNS resolvers caching your DNS answers for a long time is not a pragmatic “improved reliability” argument.

                                                1. 1

                                                  Won’t the DDoS attackers just cache the old DNS entries? They know your IP, and they can just direct the DoS at that instead of the new provider.

                                                  1. 1

                                                    I’m talking from a provider point of view - they move services across IPs all the time.

                                                    If you are a simple website, for sure you should not expose your direct server IP to the public internet ever.

                                                1. 5

                                                  Always interesting to see how networks work in real life. If you just wanted to get rid of a connection on Linux then there’s something not shown in the article. Repair mode, a fairly obscure setsockopt thing, lets you take down and move connections between systems. Here’s an article about it: https://lwn.net/Articles/495304/

                                                  “Finally, if a connection is closed while it is in the repair mode, it is simply deleted with no notification to the remote end.”
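
                                                  A rough sketch of that trick (Linux-specific, needs CAP_NET_ADMIN; not from the article):

                                                  #include <netinet/in.h>
                                                  #include <netinet/tcp.h>
                                                  #include <sys/socket.h>
                                                  #include <unistd.h>

                                                  #ifndef TCP_REPAIR
                                                  #define TCP_REPAIR 19   /* value from linux/tcp.h, for older headers */
                                                  #endif

                                                  /* Put an established socket into repair mode and close it: the kernel
                                                   * drops the connection locally without sending FIN or RST. */
                                                  int silent_close(int fd)
                                                  {
                                                      int on = 1;
                                                      if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)) < 0)
                                                          return -1;   /* not privileged, or kernel too old */
                                                      return close(fd);
                                                  }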

                                                  1. 4

                                                    This is great. Ok, I’ll bite.

                                                    • close(): socket will be lingering in the background as usual
                                                    • shutdown(SHUT_RD): no network side effect, discards the read buffer
                                                    • shutdown(SHUT_WR): equivalent to FIN
                                                    • SO_LINGER: if the timeout is non-zero, close() blocks until the write buffer is flushed; if the timeout is zero, it immediately sends RST
                                                    • the trick you described: immediately discard a socket with no network side effects

                                                    Then there is the ss --kill command to forcefully close a socket from outside the process. It is done with the netlink SOCK_DESTROY command.

                                                    1. 2

                                                      On BSD we have tcpdrop, too

                                                  1. 9

                                                    In the mean time I’m trying to convince Antirez to use io_submit in redis: https://twitter.com/antirez/status/1081197002573139968

                                                    There was a problem with that… I don’t remember what exactly, but it was like the structures you fill did not match how Redis stored the data, or something. I need to try again sooner or later because the speedup in Redis would be huge.

                                                    1. 11

                                                      @majke thank you for a very interesting article and analysis!

                                                      I’m trying to reproduce and follow your analysis on my mac and would appreciate some help. On my mac I can get the code to compile by removing sections 1 and 2 and by removing the MAP_POPULATE and MAP_LOCKED flags. Is it OK to do so? (revised code)

                                                      When I run this on my mac my mode is 0 ns with 1000 ns following and an occasional 3000 ns or higher. This pattern is much less smooth than yours and I wonder why.

                                                      I’m trying to follow your analysis. As far as I can tell the loop duration variable is redundant, since you have a timestamp and the loop duration merely stores the diff from the last timestamp. So you have point events with their times. You’ve tried to smooth these point events by convolving a triangular window with them - well, doing something slightly unorthodox which amounts to literal linear interpolation between points.

                                                      The linear interpolation generates a lot of high frequencies in the Fourier transform because of the corners.

                                                      In neuroscience (see, I knew that esoteric training would come in useful someday) we have a similar data set that comes from neuronal firing. A popular way to perform Fourier analysis on them is to convolve the train of deltas with gaussians. This is like dropping a cloth over a set of spikes - you get pointy heads where the spikes are and a graceful tail where they are not. This leads to smooth curves which behave more politely in the frequency domain.

                                                      There is a hypothesis behind doing this in neuroscience (neurons act in concert, with a slight jitter between them, blah blah) but basically smoothing delta trains is a dirty deed most practical scientists will let you do by looking the other way.
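
                                                      To make that concrete, the smoothing step is roughly this (a sketch; parameter values are just illustrative):

                                                      #include <math.h>
                                                      #include <stddef.h>

                                                      /* Turn a train of event timestamps into a smooth signal sampled every
                                                       * dt_ns, by adding a Gaussian bump (sigma_ns wide, truncated at 4 sigma)
                                                       * for each event. The result is much friendlier to the FFT. */
                                                      void gaussian_smooth(const double *times_ns, size_t n_events,
                                                                           double *out, size_t n_samples,
                                                                           double dt_ns, double sigma_ns)
                                                      {
                                                          for (size_t i = 0; i < n_samples; i++)
                                                              out[i] = 0.0;

                                                          for (size_t e = 0; e < n_events; e++) {
                                                              long lo = (long)((times_ns[e] - 4 * sigma_ns) / dt_ns);
                                                              long hi = (long)((times_ns[e] + 4 * sigma_ns) / dt_ns);
                                                              if (lo < 0) lo = 0;
                                                              if (hi > (long)n_samples - 1) hi = (long)n_samples - 1;

                                                              for (long i = lo; i <= hi; i++) {
                                                                  double d = i * dt_ns - times_ns[e];
                                                                  out[i] += exp(-(d * d) / (2.0 * sigma_ns * sigma_ns));
                                                              }
                                                          }
                                                      }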

                                                      Since I can’t regenerate the data I’m requesting you to retry your analysis with gaussian smoothing of your delta train and/or give me some pointers as to how to get proper results out of my mac.

                                                      Thank you very kindly!

                                                      -Kaushik

                                                      PS. Also, if you are not inclined to help figure out how it would work on a mac, if you could send me your data set I could try out the gaussian convolution and send the results back to you.

                                                      1. 9

                                                        Raw data at your service: https://raw.githubusercontent.com/cloudflare/cloudflare-blog/master/2018-11-memory-refresh/example-data.csv

                                                        ~/2018-11-memory-refresh$ cat example-data.csv |python3 ./analyze-dram.py 
                                                        [*] Input data: min=111 avg=176 med=167 max=11909 items=131072
                                                        [*] Cutoff range 212-inf
                                                        [ ] 127893 items below cutoff, 0 items above cutoff, 3179 items non-zero
                                                        [*] Running FFT
                                                        [*] Top frequency above 2kHz below 350kHz has magnitude of 7590
                                                        [+] Top frequency spikes above 2kHz are at:
                                                        127884Hz	4544
                                                        127927Hz	5295
                                                        255812Hz	7590
                                                        383739Hz	5799
                                                        511624Hz	6932
                                                        639551Hz	5911
                                                        767436Hz	6001
                                                        895363Hz	5682
                                                        1023248Hz	4774
                                                        1151175Hz	5107
                                                        1406987Hz	4263
                                                        

                                                        (a) trying to run it on mac: why not, but the power saving settings may introduce even more jitter. Also - is there a reliable fast clock_gettime(CLOCK_MONOTONIC) on mac these days?

                                                        (b) I’m very much not an expert on DSP and signal analysis. Please do explain why and how to use the suggested gaussian smoothing.

                                                        1. 6

                                                          @majke very cool, thanks!

                                                          My interpretation of your data is in this notebook

                                                          In brief, I did a simple time domain analysis first by plotting the interval histogram. The histogram shows a prominent periodicity at 16.7 us with some slower components.

                                                          When I do a frequency domain analysis by smoothing the delta train with a gaussian I see this prominent period with higher harmonics. I’ve forgotten how to interpret the higher harmonics, but the base frequency is consistent with the 16.7 us periodicity.

                                                          This is roughly twice the 7.8 us you report in your article.

                                                          Treating this whole thing as a black box, I’d say it typically takes 16.7us to complete one cycle of operations, though there are instances when things take a lot longer; those are less common by a factor of about 100.

                                                          Tag! You’re it :)

                                                          1. 1

                                                            It looks like you’re interpreting the data differently from @majke; one analysis is on duration of each event (167ns); the other is on the delta between events (7818ns).

                                                            1. 1

                                                              Every cycle only one timestamp is taken (rt1 = realtime_now();). There is no differentiation made between the duration of an event and the delta between events. It’s not a square wave with a duty cycle.

                                                              1. 2

                                                                Well, I don’t care how long the long stall was. All I care about is the gap between long stalls. I think if you make a histogram of durations between long stalls (where “long” is avg*1.4 or higher), then indeed you will find the 7.8us period with a simple histogram. Having said that, this will depend on the noise in the data. I’ve had some runs over which the simpler analysis failed.

                                                                1. 1

                                                                  @majke ah very interesting! Thanks again for a very educational article.

                                                                  1. 1

                                                                    Here’s a histogram of “durations between long loop runs”

                                                                    https://uploads.disquscdn.com/images/4d3d1472115285539d85539d3c84b8f5e8e56821cd2e8d84686c7031b51335b2.png

                                                                    You can definitely see the spike at 7800ns, but I’m not sure how to extract it algorithmically without cheating.

                                                        2. 1

                                                          I’m no expert, but I would assume that ASLR might screw with your results.

                                                        1. 1

                                                          Neat post. Do you have any custom or unusual (e.g. Cavium) hardware needed for your DDoS mitigation activities? Or is it 100% vanilla boxes from Intel/AMD with Linux, running software solutions like in the article?

                                                          1. 13

                                                            In our architecture every server is identical both in hardware and software. The more servers we add, the larger DDoS capacity we have. The servers are pretty standard. We do use Solarflare network cards, and occasionally offload parts of iptables into userspace. We are working on replacing this custom piece of software with NIC-vendor agnostic XDP.

                                                            1. 1

                                                              Wow. Impressive how far things have come in not relying on custom stuff. Thanks for the reply!

                                                              1. 1

                                                                Lovely! Any particular reason for moving away from, or not choosing, DPDK for this?

                                                                1. 1

                                                                  DPDK is great, but it’s really meant to take over the whole NIC[1]. That puts a lot of constraints on what other functions each server can perform. Fortunately, the netdev guys are taking a lot of cues from DPDK and applying them to XDP and related kernel infrastructure. Comparable performance is coming to Linux sooner than you’d think!

                                                                  [1] bifurcating & SR-IOV aren’t applicable for this particular usecase

                                                            1. 5

                                                              Author here. This was a fun bit. I don’t think many people write eBPF bytecode manually. The need for large clang/bcc dependencies usually discourages people from using eBPF for smaller things, which is a pity.
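
                                                              For a taste of what “manually” means, this is roughly what a hand-assembled program looks like in C - a trivial “r0 = 0; exit” (loading it still needs the bpf() syscall, which I’ve left out):

                                                              #include <linux/bpf.h>

                                                              /* Two hand-written eBPF instructions; field layout is struct bpf_insn
                                                               * from <linux/bpf.h>. */
                                                              static const struct bpf_insn prog[] = {
                                                                  { .code = BPF_ALU64 | BPF_MOV | BPF_K,          /* r0 = 0    */
                                                                    .dst_reg = 0, .src_reg = 0, .off = 0, .imm = 0 },
                                                                  { .code = BPF_JMP | BPF_EXIT,                   /* return r0 */
                                                                    .dst_reg = 0, .src_reg = 0, .off = 0, .imm = 0 },
                                                              };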

                                                              1. 2

                                                                Probably, people tend to assume going low level makes things more complicated.

                                                                Thanks for the post! I have wanted to learn about ebpf bytecode for a while, and this was a great intro. I’m curious, you could have easily compiled the bpf code offline, pasted the resulting bytecode into your Go program, and parametrized it with the map descriptor at runtime. Is there a reason you chose not to do so?

                                                                1. 3

                                                                  I’m frankly not sure. Most of the eBPF examples out there compile .c into an ELF. The resulting ELF has the bytecode and map metadata (what maps, what parameters). The resulting ELF can be loaded with some magical userspace helper.

                                                                  This for example: https://github.com/nathanjsweet/ebpf/blob/master/examples/sockex1-user.go#L46

                                                                  I don’t think I’ve seen a .c -> ELF -> bytecode workflow yet. I was told new objdump is able to read/dump the magical BPF ELFs though, so maybe it’s simple.