Threads for revert

  1. 3

    I didn’t really understand this at first, but this technique is based on the fact that a struct inside a struct is essentially flattened out in terms of memory representation. The technique is called an intrusive data structure and it lets you do things generically (with macros) with only one linked list struct. I am used to making a new linked list struct for every data type in C - so it’s a pretty clever hack!

    1. 2

      These macros originated with 4BSD, as far as I can tell. Modified versions are present in the *BSD, Linux, and the NT kernel. They’re slightly terrifying because they require pointer arithmetic that removes type information, including using offsetof to identify the location of the field and then subtracting it.

      1. 1

        They’re quite useful! Like you said, it lets you write generic functions easily (you could write generic versions of a non intrusive list, but with indirection and allocation overhead). The other thing that’s very nice is they allow for local manipulation - if you have a reference to the struct, you can eg. remove it in O(1) from the containing list without traversing the list or even needing a reference to the list head. This can make cleanup code much simpler (like a C++ destructor could remove the object from lists it’s present in), and it also makes cases where you have a single item that needs to be present in multiple lists much easier to manage. They tend to show up a decent amount in systems and games programming.
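
        To make the O(1) local removal concrete, here’s a rough sketch (not from any particular library; it assumes a circular doubly linked list with a sentinel node, so prev/next are never NULL):

        struct list_node {
            struct list_node *prev, *next;
        };
        /* Unlink a node given only a pointer to it - no list head needed. */
        static void list_remove(struct list_node *n)
        {
            n->prev->next = n->next;
            n->next->prev = n->prev;
            n->prev = n->next = n;   /* leave the node self-linked ("detached") */
        }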

        1. 1

          When I first saw this I thought it was genius. The macro to convert a list pointer to its parent is a fun piece of pointer arithmetic to unpick. It involves casting the literal 0 to a pointer.

          Another significant advantage is normal use (outside of that controlled and tested macro) doesn’t need casts: with the traditional list-struct-on-the-outside, you end up having to use void pointers for the payload, and there are many more opportunities to make a type error. Also the list structure and parent structure can be easily contiguous in memory.
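
          For reference, a hedged sketch of that macro (the struct names here are made up; offsetof from <stddef.h> is the portable spelling of the cast-literal-0 trick):

          #include <stddef.h>
          struct list_node { struct list_node *next; };
          struct task {
              int id;
              struct list_node link;   /* list node embedded inside the payload struct */
          };
          /* Recover a pointer to the containing struct from a pointer to its member. */
          #define container_of(ptr, type, member) \
              ((type *)((char *)(ptr) - offsetof(type, member)))
          /* Given a struct list_node *n known to be the link field of a task: */
          /* struct task *t = container_of(n, struct task, link); */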

        1. 7

          Long ago I contributed some Levenshtein distance code to apache commons. There are two interesting optimizations that aren’t touched on in the post:

          1. The iterative grid solution doesn’t need to maintain the full grid, only two rows (which can be swapped back and forth). This can significantly reduce memory usage on large strings.
          2. Using an upper-bound algorithm and iteratively increasing the bound until completion lets you bring the general case down from O(nm) (where n and m are the lengths of the two strings) to O(km) (where k is the edit distance between the two strings). The performance gain can be significant, especially for similar or long strings.

          Most of this I have in a comment starting on line 175. I also recommend this computational biology book for anyone with an interest in string algorithms.
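
          A rough C sketch of the two-row trick from point 1 (not the Apache Commons code; malloc error handling omitted for brevity):

          #include <stdlib.h>
          #include <string.h>
          size_t levenshtein(const char *a, const char *b)
          {
              size_t n = strlen(a), m = strlen(b);
              size_t *prev = malloc((m + 1) * sizeof *prev);
              size_t *cur = malloc((m + 1) * sizeof *cur);
              for (size_t j = 0; j <= m; j++)
                  prev[j] = j;                 /* distance from the empty prefix of a */
              for (size_t i = 1; i <= n; i++) {
                  cur[0] = i;
                  for (size_t j = 1; j <= m; j++) {
                      size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                      size_t del = prev[j] + 1, ins = cur[j - 1] + 1, sub = prev[j - 1] + cost;
                      cur[j] = del < ins ? del : ins;
                      if (sub < cur[j]) cur[j] = sub;
                  }
                  size_t *tmp = prev; prev = cur; cur = tmp;   /* swap the two rows */
              }
              size_t d = prev[m];              /* prev now holds the last computed row */
              free(prev); free(cur);
              return d;
          }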

          1. 5

            Every DNS company I’ve worked at (which includes the one mentioned here) has at some point informally kicked around the idea of being a general key/value store. I think it’s a natural thing to happen upon when thinking about where to grow the business outside of DNS - first there’s the “let’s rack some caches” (CDN), then there’s “let’s sell the network” (VPN), and then there’s “well DNS is basically an eventually consistent key/value store.”

            It is very niche. Fun idea, but not widely practical. Your data needs to be public (DoH/DoT is only last mile), your data needs to be okay with being tampered with (can’t rely on resolvers validating DNSSEC), your data needs to be extremely read heavy, your reads require a global footprint, and you can tolerate periodic outages for writes (even though every DNS company worth its salt has a 100% uptime SLA on the dataplane, no one has that on the control plane - expect control plane latency to vary from seconds best case to minutes/tens of minutes during significant outages/DDoS).

            1. 3

              What are the benefits wrt another high performance GC? O(1) allocation and de-allocation is the norm.

              1. 7

                Thank you for asking. This implementation provides short O(1) pause times, regardless of working set size, and it also compacts. Maybe such things do already exist, but I couldn’t find any when I searched while thinking about this project, so I thought this would be new. I’m happy to compare to another implementation if you could point me to one.

                1. 3

                  I see. Given the expensive read barrier, I think there are other ways to achieve similar characteristics with better trade-offs (I can find references tomorrow, but it is a bit late here, sorry). However, your design seems to make a lot of sense for a compacting database (the expensive read barrier can be factored in the cost to access external pages, storage overhead might be a problem though).

                  Small question: when the GC thread is active, what happens if the main thread fills the region before the GC can complete?

                  1. 1

                    Thanks I’ll be happy to read them. Regarding your question, I use the term “region” in the code to refer to a range within the CB/ring-buffer. That region is allocated by the main thread for use by the GC thread, and then the main thread never touches it. If you’re referring to the CB/ring-buffer itself, as in “what happens if the main thread fills the CB before the GC completes?” then the answer is that the main thread will resize the CB upwards to the next power-of-2, copy all of the contents of the old CB (including the still-being-written GC consolidation region), and then when the main thread observes the GC to have completed, it will copy forward the data from the old CB’s consolidation region to the equivalent range in the new CB. In this way, the GC thread only knows about the CB that was given to it during the start of the GC/consolidation, and the main thread is responsible for the shift of the CB to a larger one.

                    Thanks! –Dan

                2. 2

                  Yeah, it looks like an exercise to write such a GC. Other pauseless GCs have similar characteristics and are arguably cheaper in many cases. This implementation requires a hash table on the side for lookups and most likely requires a read-barrier (to rewire objects moved from zone A to B to C).

                  1. 3

                    Thanks for taking a look! Yes it does require a read-barrier for the objects which have shifted locations. I’d be interested if you could link me to the best implementations you know of.

                    1. 3

                      Reading what you did led me immediately to Azul’s Pauseless GC. Both do tons of work on the read side to ensure the GC thread can finish marking and compacting in a predictable way. In their case, I think they trap pointer reads to outdated pages to update the reference (in the compacting phase).

                      Edit: thinking through their marking phase and yours (from my understanding, your zone B is for marking). One thing I don’t quite understand is how you can make sure mutations in zone A have no pointers back to zone B / C objects, so that you can do the marking phase correctly without consulting zone A?

                      1. 2

                        I cheat by not using pointers but instead handles. Each allocation is just given an increasing integer ID, which is “dereferenced” indirectly through the O(log32(N)) objtable. This should be a reasonably shallow data structure, but is definitely not free, so adds overhead to the dereferences.
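
                        To illustrate the general idea of handles (a rough sketch only - not this project’s objtable, which is a persistent O(log32(N)) structure rather than a flat array):

                        #include <stdint.h>
                        #include <stddef.h>
                        typedef uint64_t handle_t;   /* monotonically increasing allocation ID */
                        struct obj_table {
                            void **slots;            /* slots[id] holds the object's current location */
                            size_t capacity;
                        };
                        /* The GC can move an object and only update its table slot; handles stay valid. */
                        static void *deref(const struct obj_table *t, handle_t id)
                        {
                            return t->slots[id];
                        }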

                        1. 1

                          Yeah, by pointers, I meant handles. I guess whenever zone A tries to reference an object already moved to zone B, you copy it out of zone B back to zone A? Otherwise I can’t see how you avoid missing some objects referenced from zone A in the marking phase.

                          1. 1

                            Yeah, the vision is that all objects are either “small” (e.g. tuples under some arity [envisioned, but not yet present in this implementation]) and are copied from either B or C into A when they need to become mutable, or they are larger (e.g. maps/collections) but are persistent objects which are mutated via path-copying of the path-to-mutated-leaf, where such copied path is written out in A and is similarly considered “small”.

                      2. 3

                        If you haven’t seen it already, you might find some of the work on Shenandoah interesting - specifically how they’ve optimized forwarding pointers/read barriers.

                        https://developers.redhat.com/blog/2019/06/27/shenandoah-gc-in-jdk-13-part-1-load-reference-barriers/ https://developers.redhat.com/blog/2019/06/28/shenandoah-gc-in-jdk-13-part-2-eliminating-the-forward-pointer-word/

                  1. 19

                    There is little of needless code in standard library. And what there is, believe me, over the decades was really polished.

                    realloc, all of locale.h, strtok, strcpy, strcat, gets… I like C a lot, but I wouldn’t call the stdlib polished. It’s a product of its time and has various footguns and bad APIs.

                    1. 12

                      Yeah, and all of this (and POSIX) has been designed before threads were a concern. For example, getenv is theoretically impossible to use in a multi-threaded environment, which makes time.h dangerous, because it may read the TZ env var.
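
                      A rough illustration of the hazard (contrived names and timing; the point is that localtime() may call getenv("TZ") while another thread modifies the environment, and POSIX gives you no way to synchronise the two):

                      #include <pthread.h>
                      #include <stdlib.h>
                      #include <time.h>
                      static void *mutate_env(void *arg)
                      {
                          (void)arg;
                          setenv("TZ", "UTC", 1);           /* modifies the environment... */
                          return NULL;
                      }
                      int main(void)
                      {
                          pthread_t t;
                          pthread_create(&t, NULL, mutate_env, NULL);
                          time_t now = time(NULL);
                          struct tm *lt = localtime(&now);  /* ...while this may call getenv("TZ") */
                          (void)lt;
                          pthread_join(t, NULL);
                          return 0;
                      }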

                      1. 7

                        The poor and confusing stdlib is one of the least friendly bits of C. Nothing is ever dropped from the stdlib and only rarely dropped from the OS APIs, so all the bad stuff just hangs around forever to confuse new programmers.

                        1. 4

                          You haven’t lived until you’ve called gets.

                          1. 3

                            Strings.

                            1. 1

                              Why don’t you like realloc?

                              1. 2

                                There are two problems with realloc - scope creep and its API.

                                Scope creep - realloc attempts to resize a buffer, but will also act as malloc if you pass it a special value (NULL ptr) and as free if you pass it a special value (size 0). This makes it difficult to tell what any given realloc call does when reading code and since it’s rare to actually want that functionality (because malloc/free exist) it tends to paper over bugs when accidentally triggered. It’s likely that the implementation drove the design; it would have been better to disallow resizing a null allocation and disallow resizing an allocation to 0, but it probably ‘fell out’ of how the initial implementation was written so became part of the functionality instead of keeping it tight and focused.
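
                                To make the special cases concrete (a hedged sketch; note the size-0 behaviour is implementation-defined in newer C standards):

                                #include <stdlib.h>
                                void demo(void)
                                {
                                    void *p = realloc(NULL, 64); /* no prior allocation: behaves like malloc(64) */
                                    p = realloc(p, 128);         /* the intended use: resize an existing allocation */
                                    p = realloc(p, 0);           /* size 0: historically acts like free(p) and returns NULL */
                                }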

                                API - the intuitive way to use realloc is incorrect.

                                void *buf = malloc(old_size); // some existing allocation
                                buf = realloc(buf, new_size);

                                That will leak memory if the resize fails - NULL is returned and it overwrites the original buf pointer.

                                The correct way to use it requires this song and dance, which is not a common or intuitive pattern.

                                void *buf = malloc(old_size); // some existing allocation
                                void *tmpbuf = realloc(buf, new_size);
                                if (tmpbuf != NULL)
                                    buf = tmpbuf;

                                realloc often gets a wrapper to prevent misuse because this is such a common issue. FreeBSD has reallocf that will automatically free the passed in buf pointer on allocation failure.

                                It would have been better if realloc had separated the error signaling from the buffer parameter instead of trying to collapse both into a single return value. An example alternate definition:

                                int realloc(void **buf, size_t size); // hypothetical alternate definition
                                void *buf = malloc(old_size); // some existing allocation
                                if (!realloc(&buf, size)) {
                                    free(buf); // resize failed
                                }
                                
                                1. 4

                                  Realloc has other problems. Its efficiency depends heavily on the design of the allocator. Early malloc implementations stored the size in the header word and so could easily either expand the size if the memory after the allocation was free or contract it by carving the space into two runs. This family of allocators is generally slow and difficult to make efficient in a multithreaded setting. Modern allocators tend to be sizeclass allocators, which round the requested size up to some fixed-size bucket and then use a slab allocator of that size for the allocation. These are much faster and scalable but typically end up transforming realloc calls into malloc, memcpy, free sequences, syntactically hiding the fact that realloc is now a very slow operation.
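
                                  A rough sketch of why that happens (illustrative only - size_class() and the explicit old_size parameter are simplifications; a real allocator looks the old size up in its own metadata):

                                  #include <stdlib.h>
                                  #include <string.h>
                                  /* Purely illustrative size-class rounding; real allocators use a table. */
                                  static size_t size_class(size_t n) { size_t c = 16; while (c < n) c *= 2; return c; }
                                  void *sc_realloc(void *p, size_t old_size, size_t new_size)
                                  {
                                      if (size_class(old_size) == size_class(new_size))
                                          return p;                  /* still fits in the same bucket */
                                      void *q = malloc(new_size);
                                      if (q == NULL)
                                          return NULL;
                                      memcpy(q, p, old_size < new_size ? old_size : new_size);
                                      free(p);
                                      return q;
                                  }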

                                  The realloc function needs to be modelled as a free in user code: it invalidates all pointers to existing allocations. A shocking amount of software suffers from memory-safety violations when realloc actually reallocates. If you have any aliasing of the object, code that calls realloc must go and rewrite all of the pointers after the call. That’s easy if there is one pointer (but the C type system doesn’t give you any help ensuring that this is the case) and very hard in the general case.

                                  To fix this problem, a lot of code depends on undefined behaviour, using something like:

                                  void *old = somePtr;
                                  void *new = realloc(old, newSize);
                                  if (new != old)
                                  {
                                     // Go and rewrite pointers that are equal to old.
                                  }
                                  

                                  The problem here is that it is UB in C to compare anything to a freed pointer. If new and old are equal, this is well-defined behaviour. If the realloc call freed the old object, then this is UB, so it’s completely fine for a compiler to assume that the body of this if statement is never reached (reaching it is only possible if your program contains UB, and programs that contain UB are in an undefined state where anything is allowed).

                            1. 17

                              My bigger gripe with that change is that it is “unannounced”. I don’t give a flying “care” about the new name (though I’m on the side that the old one was fine, and that this is overreacting to a non-problem), and the change itself is OK with me. But not having a period of warnings during repo creation and the like is IMHO disrespectful to the users. There should be two or three minor versions with a warning message, and then the change should be introduced on a major version bump.

                              TL;DR

                              I am OK with the change (though I think it is unnecessary), but I am very much not OK with the way it was announced.

                              1. 44

                                It’s an unlanded patch on the mailing list - it’s not the announcement. 🙂

                                Per Junio Hamano’s feedback (core maintainer): ‘(requires) a well thought-out transition plan (including user education and publicity), with the final “flip of the fallback default”’

                                1. 3

                                  Ah, good to hear. My reading of the threads suggested that it was nigh.

                                  1. 3

                                    I had the same feelings as @friendlysock, especially after reading some other threads on the mailing list

                                  2. 4

                                    As always, changes in Git’s core will minimize disruption for Git’s users and will include appropriate deprecation periods.

                                    https://sfconservancy.org/news/2020/jun/23/gitbranchname/

                                    This was the notice I received, when updating Git for Windows today.

                                  1. 14

                                    the differences in terms of performance

                                    It’s extremely complicated in practice, and unfortunately I think a question like this often gets hot takes based on which ISA is simpler on the surface and people’s basic understandings of traditional RISC/CISC design tradeoffs.

                                    There are ways in which the ARM ISA is simpler due to it being newer and some choices made in its initial design that were closer to RISC. But it has also gone the path of anything else over time and accumulated its own baggage. The x86 has a legendary amount of baggage in its instruction set.

                                    The reason there isn’t a straightforward general answer is because the realization of those ISAs in hardware matters. The effect of the x86 instruction set can be mitigated to varying degrees in microcode, with complex OOO designs, etc. It’s also made more complex by the fact that up until recently many ARM chips were targeting very different spaces than x86 so there wasn’t a good comparison to be made.

                                    If you want to answer this question, I think it makes more sense to focus on two specific cores and try to understand the differences between them, including some of the hardware. A good starting point would be anandtech’s recent article on the A14/M1 that highlights some of the differences in decode width between it and x86 cores.

                                    1. 8

                                      While I agree that different microarchitectures have a huge difference, there are some microarchitectural decisions that the ISA forces. The big one for x86 is that you absolutely need to have an efficient microcode engine: it is completely impractical to implement the entire ISA in fixed-function silicon. This has a huge impact on the rest of the microarchitecture.

                                      There are a bunch of different ways of implementing microcode. The simplest is to just crack it into a bunch of normal instructions using a little state machine that pumps out decoded instructions. You typically don’t want to use architectural registers for these, so you add a few names that the rename engine can use that aren’t exposed to non-microcoded instructions. This is very easy to do but has a couple of downsides. The first is that those registers, because they are not part of architectural state, cannot be saved on context switch. This means that you need to either ensure that the microcoded instruction sequences are idempotent, or you need to disable interrupts across the instructions. When you go down this path, you often have to pause the decoder and issue a single microcoded instruction at a time.

                                      This approach has a very low impact on the silicon, but it only works well if your common workloads contain very few microcoded instructions. If you want to be able to execute multiple microcoded instructions in parallel then you need to have a lot more extra logic (for example, keeping enough state that you can completely roll back all side effects of an interrupted multi-cycle instruction).

                                      In AArch32, about the only instructions that were typically microcoded were the load and store multiple. These were a big win in code size, because they were often a single 32-bit instruction for frame setup and tear-down, but they were microarchitecturally painful. They could span pages and so might fault in the middle, which led to some horrible complexity in implementations. AArch64 replaces these with load/store pair instructions that are valid only within a stack frame. These don’t give quite as dense code but are vastly simpler to implement. x86, on the other hand, has a lot of common instructions that need to be implemented in microcode and so you need an efficient microcode engine.

                                      There’s also complexity in terms of the decoder. This can be quite painful from a power perspective, because the decoder has to be powered almost all of the time. The only time that you don’t need it on x86 is in high-end systems when you’re in a hot loop that lives entirely in the trace cache. Arm has three instruction sets: AArch32, Thumb-2, and AArch64. The first and last of these are fixed-width encodings and so are fairly trivial to decode; Thumb-2 instructions are 16 or 32 bits, but can all be fairly easily expanded to single AArch32 instructions. AArch64 has some complexity around SVE (it’s effectively a two-width instruction set, they just pretend it isn’t).

                                      As a colleague once said, x86 chips don’t have an instruction decoder, they have an instruction parser. Any instruction is 1-15 bytes. You need a fairly complex state machine to decode it and, because of the way that prefixes can be chained, you need to parse the whole thing before you can figure out where the next one starts. Doing that on a superscalar chip that wants to issue multiple instructions is really hard and so Intel chips don’t actually do that, they decode into trace caches that contain fixed-width micro-ops and then try to execute from there. Arm cores don’t need any of that logic and can typically cache either raw instructions or the result of some very simple expansion.

                                      The original RISC project was designed by looking at the instructions that C compilers actually generated on CISC systems and building an ISA optimised for those sequences. AArch32 was designed the same way. AArch64 used a variety of workloads including some managed languages to go through the same process. Somewhat depressingly, RISC-V did not. x86 gradually accreted over time. AArch64 and AArch32 in ARMv7, for example, have very powerful bitfield insert and extract instructions. These are really useful for a bunch of things (such as NaN boxing in JavaScript JITs), but are present only in very recent x86 chips.

                                      Arm does not have subregisters. On x86, AL, AH, AX, EAX, and RAX all update the same register. For anything shorter than RAX, this means you need a read-modify-write operation on the rename register. This adds complexity in the rename engine. Arm briefly had a floating-point mode that let you treat the FPU registers as either 32 32-bit or 16 64-bit floating-point registers. This caused similar pain and was abandoned (32-bit FPU ops needed to track the enclosing register pair so that they correctly half-overwrote or were overwritten by 64-bit operations). Recent Intel chips make updating subregisters quite fast, but at the expense of microarchitectural complexity (i.e. power).

                                      The Arm instruction set is designed to avoid data-dependent exceptions. Only loads and stores will trap, and they trap only on the address, not the data (unless you have ECC memory or are Morello). x86, in contrast, has a load of instructions that can trap based on the value (e.g. integer division by zero). This means that you need to start a new speculation window every time you hit a divide instruction in x86, because you may have to reset the pipeline to the state at that instruction if you discover that it traps.

                                      In summary, there are some fundamental differences between the two that mean that, within a fixed power and area budget, you should be able to make AArch64 faster. This matters somewhat less at the very high end, because a lot of these overheads are more-or-less fixed and so there isn’t a huge amount of difference when you scale everything else up. That said, instruction scheduling / register rename are some of the biggest fixed costs on a high-end core and anything that adds to that can end up being a bottleneck. x86 is made entirely out of things that add rename and scheduler complexity. If you’re building a laptop / tablet / phone core, these fixed costs are a very noticeable part of your power / area budget. Even without being especially clever, you can spend all of the saved transistor budget on extra cache and get a big speedup. It looks as if Apple does this as well as some more clever things.

                                      1. 1

                                        It’s a shame this fantastic answer came so long after the original thread; thanks for taking the time to write it. What sort of other clever things are you talking about at the end? I’d be interested to hear your thoughts on The Mill given what you’ve said here, too.

                                        1. 4

                                          I don’t know exactly what Apple is doing, but I don’t think their perf results are just from having large caches. I believe they’ve spent more of their power budget on the branch predictor than some competitors (this makes a lot of sense if your workloads are dynamic-language-heavy). I don’t know what else they’re doing, but given how much effort Apple put into AltiVec and SSE optimisations for a bunch of fast paths I wouldn’t be surprised if they’ve doubled up some of the pipelines for Neon operations and carefully tweaked hand-written standard library assembly and the compiler cost model to get the most out of this. They ask developers to upload LLVM IR to the app store and so can aggressively tune their generated code for the specific device that the app is being installed on (I wanted to do this with FreeBSD packages too, but we never had enough contributors to build out all of the infrastructure), so they can get the last 2-10% out of any given microarchitecture.

                                          The Mill feels a lot like VLIW in origin: leave difficult things out of the architecture and punt them to the compiler. This generally gives you great benchmark numbers and terrible real-world performance. The compiler can spend a lot more time optimising things than the CPU can (it’s fine for the compiler to spend a few tens of milliseconds on a function, that latency in the CPU would make performance completely unacceptable) so in theory it can do more, but in practice it can’t take advantage of any dynamic behaviour. The Mill doesn’t do register rename at all, so can be almost entirely execution logic, but it punts a lot to the compiler and it’s not clear to me that it will do any better in the real world than EDGE architectures.

                                          Modern ISAs are really dataflow encodings. Registers aren’t really registers, they’re just locally scoped names for data flow. This is something that really bugs me, because a typical compiler IR is an SSA form, which is then mapped into some fixed-number-of-registers encoding, and the register rename engine then tries to re-infer an SSA representation. It feels as if there really ought to be a better intermediate between the two SSA forms, but so far every attempt that I’m aware of has failed.

                                          The rule of thumb for C/C++ code is that you get, on average, one branch per 7 instructions. This was one of the observations that motivated some of the RISC I and RISC II design, and I’ve had students verify experimentally that it’s more or less still true today. This means that the most parallelism a compiler can easily extract on common workloads is among those 7 instructions. A lot of those have data dependencies, which makes it even harder. In contrast, the CPU sees predicted and real branch targets and so can try to extract parallelism from more instructions. nVidia’s Project Denver cores are descendants of TransMeta’s designs and try to get the best of both worlds. They have a simple single-issue Arm decoder that emits one VLIW instruction per Arm instruction (it might do some trivial folding) and some software that watches the instruction stream and generates optimised traces from hot code paths. Their VLIW design is also interesting because each instruction is offset by one cycle and so you could put a bunch of instructions with data dependencies between them into the same bundle. This makes it something more like a VLIW / EDGE hybrid, but the software layer means that you can avoid all of the complicated static hyperblock formation problems that come from EDGE architectures and speculatively generate good structures for common paths that you then roll back if you hit a slow path.

                                    1. 2

                                      There’s a lot of great content from previous years as well!

                                      I particularly recommend Forsyth’s talk on how AVX-512 came about, the discussion with Pat Wyatt about networking in games, and Fabian’s talk on graphics pipelines.

                                      A playlist with most of the content from the 2015/2016 iteration of this conference is here.

                                      1. 1

                                        My very first Rust project was a raytracer, and my second one a path-tracer. I do tend to agree with the OP that the borrow checker might feel like more of a hindrance on this kind of code, compared to simply using Asan.

                                        However, I really liked using rayon instead of having to write my own thread pool as I did in C99. Going from iter to par_iter on my vector and having everything just work was great.

                                        1. 1

                                          You could likely use OpenMP to do something similar without too much effort in C or C++.

                                          1. 2

                                            Yup! This requires one parallel, schedule(guided) pragma.
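
                                            Something along these lines (a hedged sketch; render_row() and the dimensions are stand-ins for whatever the ray tracer’s per-scanline work is):

                                            #include <omp.h>
                                            void render_row(int y, int width);   /* hypothetical per-scanline kernel */
                                            void render(int width, int height)
                                            {
                                                #pragma omp parallel for schedule(guided)
                                                for (int y = 0; y < height; y++)
                                                    render_row(y, width);        /* each scanline is independent */
                                            }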

                                            It does come with trade offs though. I find asm output with OpenMP in the mix more difficult to follow, and it’s compiler dependent. Generally I’d favor writing the basic thread pool or parallel for myself.

                                            1. 1

                                              I have two things to say about that: first, OpenMP is not portable, being a compiler extension, whereas rayon works on any Rust target and with any Rust compiler.

                                              Secondly, I got way better results (pretty much an 8x speedup on my 8-core system) using rayon, whereas OpenMP only roughly tripled the performance. I had to write my threading using pthread to actually make use of my cores correctly.

                                              1. 1

                                                That’s quite a drop in performance! Did you get a sense of where the bottlenecks were in your OpenMP version?

                                                Anecdotally, I’ve always needed to tweak job sizing/assignment and fix a few instances of false sharing when applying the pragmas to previously single threaded code.

                                                1. 1

                                                  Yeah, I would have had to fix a few things, which happened when I refactored to use pthread. My point was mostly to explain that rayon really has that magical feeling of modifying one line and having it just work.

                                                  And that comes from Rust imposing the architectural decisions that lead to good threading support, compared to C where you’re free to do whatever.

                                                  1. 2

                                                    And that comes from Rust imposing the architectural decisions that lead to good threading support, compared to C where you’re free to do whatever.

                                                    I can’t say I agree with this. The architecture rayon guides you towards is about equivalent to the architecture OpenMP’s parallel-for pragma guides you towards. I think rayon has some better defaults, and I’ve found its codebase easier to dig through than OpenMPs, but don’t really see any large narrative there.

                                                    1. 1

                                                      Yeah, but OpenMP isn’t C. C by itself doesn’t guide you to any architecture. I’m not talking about OpenMP.

                                          1. 6

                                            Two of the most critical changes introduced were … HTTP pipelining.

                                            I’m not sure I’d agree with that statement. Pipelining was a bit of a failure and basically not seen in the wild.

                                            While beneficial in theory, this feature rarely seen in practice, since it requires a server to understand the structure of the HTML it serves, which is rarely the case.

                                            This is not why push isn’t used. It has to do with the complexity of predicting what data the client will definitely need, the need to keep buffers shallow so the server can respond to client changes, the need for the client to accurately cancel requests, and the often very high cost of getting it wrong and over pushing.

                                            A bit of a nitpick: 0-rtt concerns are orthogonal to quic. You can run quic as 1-rtt, and you can run http1 or 2 with a 0-rtt TLS 1.3 handshake (and TFO if you want to remove the transport layer round trip).

                                            Two other things I’ll mention that I think are interesting: while both 2 and 3 do header compression, there were significant changes between HPACK and QPACK. HPACK header compression introduces head-of-line blocking, which was irrelevant when running over TCP but made it a bad fit for running over QUIC. And one of the other big advantages of QUIC is having finer-grained control over the state timers built into TCP - for example, losing the initial SYN is extremely painful, more so than packets in an established session, and QUIC gives you the ability to be more aggressive in retrying that initial loss (which helps with bringing in tail latencies on lossy networks).

                                            1. 1

                                              I’m not sure I’d agree with that statement. Pipelining was a bit of a failure and basically not seen in the wild.

                                              HTTP/1.1 pipelining definitely exists in the wild and has extremely clear performance benefits. To the degree that you can visibly identify pipelining on certain types of websites like image galleries.

                                              1. 1

                                                It looks like it does have more adoption on mobile than I gave it credit for, so my original statement was too strong.

                                                I do still consider it mostly a failed stopgap to http2 though, and I wouldn’t consider the benefits extremely clear - especially given all major non mobile browsers I’m aware of have either turned it off by default or removed it.

                                                1. 1

                                                  Given that buggy servers and proxies prevented widespread adoption in desktop browsers, I suppose I’m forced to agree the technology ultimately failed. But I’m definitely salty about it.

                                            1. 2

                                              I think there are three interesting ideas in this kind of system:

                                              The first is the observation that underlies bloom filters, skip lists, and a lot of other data structures: If you have a very cheap way of getting a half-arsed almost answer, then you can often do an O(n) thing to get from your approximation to the right answer and still outperform a formally correct algorithm because your guess reduces n to a much smaller value. That’s not a particularly new observation but it is often overlooked.

                                              The second is that most data is not a uniform random distribution. If you know something about the distribution of the data then you can often do better than an algorithm that is optimal for a uniform distribution. Again, this is not particularly novel, and is often used in real-world sorting algorithms in systems software, to special-case common values. Outperforming a radix sort is nice, but it’s quite common in systems software to do a bucket sort, which is O(n), to get the most common values out of the way and then do something slower on the values that end up in the slow-path unusual-case bucket.

                                              The third, which was actually the conclusion of some of my PhD work, so one that I’m quite biased towards, is that you can often apply some fairly simple and cheap machine learning to figure out enough of the characteristics of the data that you can apply the first two observations to get a rough guess that’s mostly good enough, and fall back to something slower for when the guess is wrong. If your data has characteristics that you can look at to figure out how to design an optimised data structure, then you may well be able to have a machine infer those characteristics.

                                              The thing I really like about some of the recent work in this area is that they’re starting to separate these three aspects.

                                              1. 1

                                                Do you have any suggested materials or papers for number three? I’ve seen simple techniques applied in my problem domain very successfully a handful of times, but trying to develop intuition around what set of techniques to consider and when is difficult since so much of the literature is devoted to much grander forms of ML.

                                                  1. 1

                                                    I don’t, unfortunately, other than the ones cited in the article.

                                                1. 1

                                                  Something I wonder about QUIC is whether it could have other interesting applications beyond serving HTTP. It seems to be optimized for things like stateless requests (or “streams”). There are quite a few protocols and libraries out there utilizing UDP, but also “re-inventing” (or picking) some features and guarantees from TCP. However, these often exist fairly isolated, sometimes in single language communities, frameworks or tools.

                                                  If QUIC gets widely implemented - and there already are quite a few implementations, though currently often tightly tied to HTTP/3 - it might be interesting for other use cases as well. Maybe it’s a bit too early to ask, but has anyone played with that? I wonder whether there are situations where this makes sense, or whether one would in such situations always go for HTTP/3 on top of QUIC.

                                                  1. 1

                                                    This is a good question! Most of the focus has been on getting HTTP/3 out the door, but there is talk of “what application protocol is next for QUIC” - it is expected that QUIC will support multiple protocols on top of it. DNS-over-QUIC (which is different from DNS over HTTP/3) is the one I’m aware of that’s furthest along.

                                                    or whether one would in such situations then always go for HTTP/3 on top of QUIC

                                                    In some ways I think HTTP/3 is “the minimal sane application transport you’d build on top of QUIC anyways”, so while you can build new protocols on QUIC directly, I think a lot of currently separate application protocols should instead be interfacing with HTTP/3 with small modifications. Two examples: for a bidirectional streaming or pubsub use case like websockets or MQTT, HTTP/2 and 3 can support it with very minimal changes (https://tools.ietf.org/id/draft-xie-bidirectional-messaging-01.html). For video streaming protocols there’s interesting opportunity to take advantage of selective reliability - eg. making keyframes/i-frames lossless and p/b-frames lossy. There’s no real reason for that to happen at the QUIC layer versus the HTTP/3 layer once APIs exposing delivery reliability are created.

                                                  1. 2

                                                    The thing that’s a problem with QUIC is that you’ll have a hard time (like with HTTP/2) getting it running between an application and your reverse proxy. So it’s user <HTTP/3> server <HTTP/1.1> application for 90% of what people tend to run?

                                                    Got a Nextcloud behind a global nginx reverse proxy? Great, now everything up to the nginx is streamed and QUIC, but not so between nginx and Nextcloud. Or where are your localhost certificates coming from?

                                                    1. 6

                                                      The reason NGINX gives for not implementing HTTP/2 for reverse http traffic is that it wouldn’t improve the performance on low-latency networks that are normally used with this type of setup. Not sure if this would change for HTTP/3 though.

                                                      1. 1

                                                        Ah thanks, didn’t know this was the case. I thought streaming / avoiding TCP would give you the same improvements locally.

                                                      2. 5

                                                        HTTP/1 turned out to be good enough for upstream communication. You lose only two things: H/2 push and detailed multiplexing/prioritization.

                                                        However, H/2 push seems to be dead. Browser implementations are too weird and fragile to be useful. Experiments in automating H/2 push at CDNs turned out to give “meh” results (lots of conditions must be just right for a perf benefit, but a less-than-ideal guess makes it a net negative).

                                                        Prioritization and multiplexing can be done at the edge. H/2 server can by itself decide how to prioritize and mix upstream H/1 connections, and this can be done well enough with heuristics.

                                                        So I expect this trend to continue. You can use H/1 server-side for simplicity, and H/3 for the last mile to get through slowest links.

                                                        1. 3

                                                          I tried to deploy http/2 push in a way that improves performance and it’s just hard, even though this was within the context of a server that understood and optimized HTML. Here’s how it typically goes:

                                                          I want to push js/css to the browser. But what file? Maybe I can configure my server to push /app.js, but my build system uses cache busters so now I need to integrate some sort of asset db into my web server config. What if the homepage, search, and product team all have different incompatible systems? Assuming that problem is solved, what happens if app-$SHA.js is already in the browser cache?

                                                          For certain websites you start looking at heuristics. Like if you cookie a user and a request comes in for a page without a cookie, you can probably assume it’s their first visit and that they have a cold cache. But without some sort of asset db for your versioned assets, you have to examine the response to see which assets are referenced. Now you might have to add a layer of gzip/brotli decoding and buffering.

                                                          It’s hard.

                                                          1. 3

                                                            Indeed. There has been proposal for “cache digest” that browser would send to signal what it has in the cache: https://calendar.perfplanet.com/2016/cache-digests-http2-server-push/

                                                            but it doesn’t seem to be going anywhere. It’s more complexity, potentially yet another tracking vector, and it’s still 1RTT win in the best case.

                                                          2. 2

                                                            Interesting to hear push is dead. Kinda like that tbh.

                                                          3. 4

                                                            HTTP/3 is really catered towards client-facing edge servers, especially talking to mobile clients. There might be a future where it makes sense for traffic within a datacenter or between services, but I’m skeptical. In any case, that will probably be awhile because more work needs to be done to bring QUIC’s server-side CPU footprint down before you’d want to try shoving it everywhere.

                                                            Generally I think HTTP/2 is the right choice for that sort of revproxy to server communication. You can use it without multiplexing and basically get HTTP1 with compressed headers.

                                                            1. 2

                                                              Not familiar with Nextcloud - would nginx connect to it over the public internet? Because on reliable internal networks HTTP/2 and 3 give diminishing returns (packet loss is much less of an issue).

                                                            1. 2

                                                              Some of these seem not very useful

                                                              [alias]
                                                                diff = diff master
                                                              

                                                              This seems like a no-op according to the git-config(1)

                                                              To avoid confusion and troubles with script usage, aliases that hide existing Git commands are ignored.

                                                              It also seems like a strange use-case. From my command history, I mostly use git diff, git diff --cached, and git diff $somefile. I’m not sure that I would want to default to comparing against master instead of the default of comparing against HEAD.

                                                                pom = push origin master
                                                                merged-branches = "!git branch --merged master"
                                                                sync = "!git fetch -p && git rebase origin/master"
                                                              

                                                              Why not just !git fetch && git rebase? If you’ve set your upstream, you don’t need to specify to rebase onto a particular branch all the time.

                                                              1. 2

                                                                I use git diff master all the time - I tend to create a series of small semantic commits and often want to see “what all have I changed in this feature branch”, especially before submitting a PR.

                                                                1. 1

                                                                  I tend to use git log (-p) for that sort of thing. If the changes are committed, then I want to use the structure I created for myself!

                                                              1. 4

                                                                This is very handy - I hit this problem just last week and had “figure out how to make git aliases work for both main/master” on my todo list. Thank you!
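
                                                                  For anyone else looking, one common approach (hedged: it relies on origin/HEAD being set, which git clone does by default and git remote set-head origin -a can restore):

                                                                  [alias]
                                                                    default-branch = "!git symbolic-ref --short refs/remotes/origin/HEAD | sed 's|^origin/||'"
                                                                    sync = "!git fetch -p && git rebase origin/$(git default-branch)"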

                                                                1. 5

                                                                  I use Brazil on a day-to-day basis and the thing I’d liken it to most is nixpkgs. Personally, I think that most of the benefits of Brazil can be realized with something like target/lorri and Cargo-style build systems.

                                                                  1. 1

                                                                      I’m not sure if I understood the “version set” concept completely. It seems similar to Cargo.lock files, where the transitive closure of your dependencies is recorded. However…

                                                                    When version sets are imported into Apollo, all of the test and build dependencies are stripped away and the remaining graph is passed in.

                                                                      That sounds like it additionally stores the information about which dependencies are only used for testing and build purposes but are unnecessary for deployment. That sounds more like a package manager such as Nix or Apt. In some sense, Amazon software is treated like an extended Linux distro?

                                                                    If you create a package, say the IDL for your service API or config for clients to connect, you specify the interface version (say, “1.0”). Every time you build, the generated package is given a concrete build version (“1.0.1593214096”). That build version is what’s stored in the version set for each package node. Developers never manage the build version either as an author or consumer.

                                                                    That sounds unlike any package manager I know.

                                                                    1. 3

                                                                        I’m not sure if I understood the “version set” concept completely. It seems similar to Cargo.lock files, where the transitive closure of your dependencies is recorded. However…

                                                                        Yup, you can think of version sets as a server-managed Cargo.lock which enables some neat things with the internal CI/CD system, such as “where has this commit been deployed to?”. From a development standpoint, version sets are also tied to workspaces, which consist of several cloned packages. You can think of workspaces as akin to Cargo workspaces.

                                                                      That sounds like it additionally stores the information which dependencies are only used for testing and build purposes but unnecessary for deployment.

                                                                      That’s right.

                                                                      That sounds more like package manager like Nix or Apt. In some sense, Amazon software is treated like an extended Linux distro?

                                                                        It really depends on the language, but that’s not wholly inaccurate. With npm or Cargo-based projects, language-idiomatic tooling is used for builds, with Brazil serving system dependencies. At that point, it’s less a build system and more a packaging and artifact provenance system.

                                                                      That sounds unlike any package manager I know.

                                                                      Yup, that part is unique due to the norms around package versioning (you don’t ever cut a new version.)

                                                                      1. 1

                                                                        Yup, that part is unique due to the norms around package versioning (you don’t ever cut a new version.)

                                                                        What do you mean by “cut”?

                                                                        1. 1

                                                                          My bad. In this instance, releasing a package at a new version. A bump from 1.0 to 1.1 is considered to be a really big deal and is highly discouraged. Brazil treats any version bump as a breaking change, as version numbers are treated as opaque strings.

                                                                    2. 1

                                                                      How are security issues in dependencies handled? Does the insecure version get blacklisted and consumers are forced to move to a later version, or do the dependency authors backport the fix to older versions?

                                                                      1. 6

                                                                        When I was there, such things would be surfaced in various interfaces, but if it was truly critical it would require one team going and wrangling all the other teams into updating their version sets (and possibly doing the work for them).

                                                                        The upside of version sets is that each team owns their whole dependency chain and get to choose when they want to take changes from other teams - it gives every team significant independence. The downside is that version sets accumulate cruft (good luck fully rebuilding one; quite possible some of the artifacts you’ve been using for years no longer compile), and if you vend libraries to other teams you can expect a long tail of old versions in use. It’s very difficult to go make the kind of large, company-wide crosscutting changes that you can at eg. Google/FB, and that includes things like compiler/runtime upgrades.

                                                                        I don’t really think there’s a better or worse between the Amazon and Google/FB style build/deploy/source control systems, it’s primarily a reflection of the org/team structure and what they prioritize - there’s tension between team independence/velocity and crosscutting changes that optimize the whole codebase.

                                                                    1. 4

                                                                      For the very basics, here’s an ultra-simple CLI tool that basically just hooks up some existing code to a command line interface.

                                                                      1. 1

                                                                        I’m curious what other people who haven’t used Rust think of the syntax and its general feel. I get what’s going on, but there’s something syntactic I find repellent (but I can’t put my finger on it – it’s not like parentheses in Lisp), and was wondering if it’s just me.

                                                                        1. 2

                                                                          but there’s something syntactic I find repellent

                                                                          Purely guessing here because it’s hard to know your tastes without knowing your PL history, but often what I see people react to if there’s nothing specific they can point to is the expression instead of statement focus - implicit returns, assigning to an if/else or block expression, etc.

                                                                        2. 1

                                                                          I’m not familiar with Rust, but all those for loops look more like Go to me — is that really idiomatic?

                                                                          1. 1

                                                                            For loops aren’t inherently bad - I find them more readable than a long iterator chain in lots of cases where there’s a large body, side effects, etc. One thing I like about Rust is that it doesn’t lock you into either loops or higher-order functions.

                                                                            That said, the trues function is a good example of something that would likely be more legible as a filter and count (see the sketch below).

                                                                            Edit: in general I think CLI arg parsing code won’t be a great intro example for any language though. It’s almost always going to be a large wad of conditional blocks and not very interesting or representative.
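
                                                                            Something along these lines, assuming trues just counts the true values in a slice of bools (the name and signature are guesses at what the original code does):

                                                                            ```rust
                                                                            // Hypothetical stand-in for the tool's `trues` function: count `true` values in a slice.
                                                                            fn trues(flags: &[bool]) -> usize {
                                                                                flags.iter().filter(|&&flag| flag).count()
                                                                            }

                                                                            fn main() {
                                                                                let flags = [true, false, true, true];
                                                                                assert_eq!(trues(&flags), 3);
                                                                            }
                                                                            ```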

                                                                            1. 1

                                                                              I didn’t mean to imply that for loops are inherently bad — you give some valid examples.

                                                                              I would, though, have expected Rust to have something like .contains() to check for a true value in a collection.

                                                                              1. 1

                                                                                Rust collections do have .contains(), and so do slices, so it is widely available. Iterators also have a variety of functions to filter / map / collect values, so if you don’t want to use a for loop you don’t have to.
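
                                                                                A trivial standalone example (not taken from the tool under discussion):

                                                                                ```rust
                                                                                fn main() {
                                                                                    let flags = [true, false, true];

                                                                                    // `contains` is defined on slices, so arrays and Vec get it too.
                                                                                    assert!(flags.contains(&true));

                                                                                    // Equivalent iterator form, handy when the values aren't already in a slice.
                                                                                    assert!(flags.iter().any(|&flag| flag));
                                                                                }
                                                                                ```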

                                                                        1. 4

                                                                            I start a new job on Monday and will be going from C++ to Rust, so I’m brushing up on the language: finishing a second book on it, finishing rustlings, that sort of thing.

                                                                          I ported my C99 ray tracer to Rust for some extra practice. After a little tweaking it’s ~15-20% slower, so if I find time I want to dig into that and see where I’m losing performance. A cursory look didn’t show anything stupendously wrong (bounds checks aren’t killing me, inlining looks pretty good, etc.), so I need to diff the asm and see what’s going on. It feels like something in my ball of floating point calculations is very slightly worse, and that adds up quickly.
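
                                                                            Pure speculation about someone else’s code, but one classic source of small floating-point differences when porting from C is contraction: many C compilers will fuse a * b + c into a single FMA by default (depending on flags), while Rust only does so if you ask explicitly via mul_add. A minimal sketch of the two spellings:

                                                                            ```rust
                                                                            fn dot(a: &[f32; 3], b: &[f32; 3]) -> f32 {
                                                                                // Written the obvious way: each `*` and `+` rounds separately, no fusing.
                                                                                a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
                                                                            }

                                                                            fn dot_fma(a: &[f32; 3], b: &[f32; 3]) -> f32 {
                                                                                // `mul_add(x, y)` computes self * x + y as a single fused operation where the
                                                                                // target supports it, which can change both performance and rounding.
                                                                                a[2].mul_add(b[2], a[1].mul_add(b[1], a[0] * b[0]))
                                                                            }

                                                                            fn main() {
                                                                                let (a, b) = ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]);
                                                                                println!("{} {}", dot(&a, &b), dot_fma(&a, &b));
                                                                            }
                                                                            ```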

                                                                          1. 2

                                                                              I’m learning Rust at the moment (literally taking a tea break from working through chapter 12 of the Rust book as I write this) and I hadn’t come across rustlings, but it looks fantastic! Thanks for mentioning it :)

                                                                          1. 19

                                                                            Not necessarily a technique, but I brushed off using a debugger for way too long. “if I need a debugger to understand the code then it was too complex to begin with”, “printfs are fine”, etc.

                                                                              Using a debugger with disassembly and a memory view greatly increased my productivity. It’s also very useful to learn the features beyond simple breakpoints - e.g. conditional breakpoints and watch variables.

                                                                            1. 9

                                                                                A thing I see happening over and over again is that there are two camps: those who only use debuggers and those who only use printing, logging, tracing, etc.

                                                                                Typically the arguments of the first group are that “printf debugging” is unprofessional and inefficient. The latter group often says that they don’t want to learn another tool, or that needing a debugger is a warning sign of a hard-to-understand code base.

                                                                                I politely disagree with both camps. The two tools have different use cases, and there are problems that are hard to attack with only one of them.

                                                                                In my opinion, debuggers shine at two things. The first is state visualization: given a certain point in the execution of your program, a debugger will show you the program’s state. When I hear that a program segfaults, the state at the segfault is the first thing I want to see. A debugger will not tell me how the program got into this state*. To find that out, I resort to reasoning, strategically adding print/log/trace statements, or, in simple cases, breakpoints that catch conditions before the program crashes. The second use case is working with an unfamiliar code base. Debuggers are good at answering questions like “where is this function called from?” or “is this code ever executed?” without the need to change the program.

                                                                                Prints/logs/traces shine for transient visualization. Say you are debugging an FIR filter that starts to oscillate. Breaking into the code with a debugger while it is oscillating will probably be useless: you now see the system in a misbehaving state, but you don’t know how it got there. Stepping may also be ineffective, because the error condition builds up over thousands of runs of the filter. Printing information as you go can give you a good idea of how the system drifts into an undesirable state. As a bonus, when you add prints you can add extra logic: check for certain conditions, print marker messages like “here it looks like an accumulator overflowed”, or print data with a compact machine representation in an easily human-parsed shape that lets you spot spurious behavior at a glance (a rough sketch of the conditional-print idea follows after the footnote).

                                                                              In summary, debuggers and prints have an overlap in their functionality, but I feel it’s most productive to combine the two in an intertwined way of working.

                                                                                * Barring reverse debugging, which I often can’t use because I work with embedded systems.
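
                                                                                As a toy illustration of that kind of conditional trace print (a generic sketch, not based on any real filter):

                                                                                ```rust
                                                                                fn main() {
                                                                                    // A leaky accumulator standing in for some feedback loop we suspect is drifting.
                                                                                    let mut acc: f32 = 0.0;
                                                                                    for step in 0..10_000 {
                                                                                        let sample = (step as f32 * 0.01).sin();
                                                                                        acc = 0.999 * acc + sample;

                                                                                        // Only print when a suspicious condition holds, with a clear marker,
                                                                                        // so thousands of quiet iterations don't drown out the interesting ones.
                                                                                        if acc.abs() > 50.0 {
                                                                                            eprintln!("step {:5}: accumulator looks like it's running away: {:+.3}", step, acc);
                                                                                        }
                                                                                    }
                                                                                }
                                                                                ```
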
                                                                              1. 6

                                                                                conditional breakpoints and watch variables

                                                                                These improved my ability to understand programs using a debugger at least ten-fold. Expression evaluation is another significant feature I use.

                                                                                1. 2

                                                                                  I was very slow to pick up watchpoints. It took me fixing a bug in a debugger front-end to even understand what they were, and why they were useful.

                                                                                1. 2

                                                                                  I used a similar workflow in Mercurial and quite liked it - but the big difference was that if I added a commit to one of the base branches, I could merge/rebase the entire stack of branches at once. Is there an easy way to do that in git, instead of having to go through the git checkout / git merge dance for each stacked branch whenever a base is modified?