Threads for pkhuong

    1. 4

      So if I want to generate an actually crypto-safe random value, like for example an IV, how should I do that in Rust? Also, the lack of an “I want a random number in a way that’s safe, stable, and doesn’t rely on a C/C++ library” option feels important; I feel like std should just provide this tbh.

      1. 6

        OsRng uses kernel entropy (via the getrandom crate). Probably start with that until you’re sure you need something with fewer syscalls per random byte.
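
        For an IV, something like this is probably all you need (a minimal sketch, assuming the getrandom 0.2-style API; newer releases have renamed the entry points, so check the docs):

            // Fill a 12-byte GCM-style IV straight from the OS CSPRNG.
            fn make_iv() -> Result<[u8; 12], getrandom::Error> {
                let mut iv = [0u8; 12];
                getrandom::getrandom(&mut iv)?; // one call, no /dev/urandom fd to manage
                Ok(iv)
            }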

        1. 4

          Or just go ahead with reading from /dev/urandom, assuming your target platform implements it in a cryptographically strong fashion (as Linux does). It keeps the supply chain small, which might be helpful in a cryptographic application.

          1. 6

            There are some annoying issues with /dev/urandom

            • You have to ensure the special file is present in containers or chroot() jails

            • It requires multiple system calls and a retry loop to get a few bytes of randomness

            • It can fail if a process hits its file descriptor limit

            It’s much less troublesome to use getentropy()
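
            Roughly, the getentropy() path looks like this (a sketch via the libc crate, assuming a target that exposes it, e.g. Linux with a recent glibc, macOS, or the BSDs):

                // No special file, no fd, no retry loop for short reads.
                fn fill_entropy(buf: &mut [u8]) -> std::io::Result<()> {
                    assert!(buf.len() <= 256); // getentropy() caps each request at 256 bytes
                    let rc = unsafe { libc::getentropy(buf.as_mut_ptr().cast(), buf.len()) };
                    if rc == 0 {
                        Ok(())
                    } else {
                        Err(std::io::Error::last_os_error())
                    }
                }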

            1. 3

              Valid points! Thank you for the clarifications.

          2. 3

            Oh, in my mind getrandom and rand were the same crate; alright, just a blip.

        2. 4

          I understand that a changelog is technical and quite “clinical”, but I have to admit that I expected the upgrade guide to be more welcoming. If I only used the rand crate to generate some numbers using ThreadRng or shuffling some Vec, what should I do or expect on upgrade? A very simple before/after example for the most obvious cases would help beginners like me a lot!

          1. 5

            I think you’re looking for the “Updating” section of the rust-random book: https://rust-random.github.io/book/update-0.9.html
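
            For the simplest cases, the before/after is roughly this (names from memory of the 0.9 renames, so treat it as a sketch and double-check against the guide):

                use rand::seq::SliceRandom;
                use rand::Rng;

                fn example(data: &mut [u32]) -> u32 {
                    // 0.8: let mut rng = rand::thread_rng();
                    let mut rng = rand::rng();
                    // 0.8: let x: u32 = rng.gen();
                    let x: u32 = rng.random();
                    // shuffling a slice/Vec is unchanged, still via rand::seq::SliceRandom
                    data.shuffle(&mut rng);
                    x
                }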

            1. 2

              I was talking about this exact page, yeah.

          2. 6

            I recall reading a rant about an assortment of time shenanigans, back in 2017 or 2018, that had a short aside envying how $BIGCO avoids 23:59:60 entirely by stretching slightly-longer seconds across a day. I remember wondering how the heck you orchestrate that. Now I know!

            Also, I’m not intending this as a “well actually” response to the claim that no one outside of the cat picture factory has heard what they do. I think the $BIGCO in the rant I read was a different one. Even better, because it means multiple places arrived at the same solution to a hairy problem, which is really cool.

            1. 24

              IIRC Google was first, and they blogged about it and it was discussed fairly widely in time nuts and ops circles. My link log says that was 2011 https://dotat.at/:/?q=leap+smear

              In 2015, Facebook and Amazon did leap smear. In 2016 it was pretty widespread.

              The backstory is that around 25 years ago there started to be grumbles about abolishing leap seconds. This first round of discussions and proposals led to a decision to keep the status quo at the ITU-R world radiocommunication conference in 2007. (The official definition of UTC is ITU recommendation TF.460, because radio time broadcasts are under the ITU-R’s remit.) Steve Allen has a very detailed timeline https://www.ucolick.org/~sla/leapsecs/onlinebib.html

              At the same time, there was a long gap in leap seconds, 1998-2005, which happened to cover much of the exponential growth phase of the internet and open source. Far fewer systems were precisely synced to time for the 1990s and earlier leap seconds. The leap second code in NTP and operating system kernels was immature and poorly tested. When leap seconds resumed, it was a shitshow of appalling bugs and outages.

              So there was a good deal of disappointment that leap seconds would not be abolished, and a realisation that much engineering work would be necessary to deal with them. Some of that work went into improving how operating system kernels handle leap seconds (and time in general), and how NTP distributes leap seconds and shares them with the kernel.

              But there was not a lot of confidence that this would be a reliable approach, hence leap smear as a workaround. I think leap smear became popular because it was demonstrated to work and it’s a relatively straightforward way to completely avoid leap second bugs: just hack your NTP servers, no need to audit and test all your time handling code. (Time bugs are painful to test!)

              And the ongoing discussions about the future of leap seconds in the ITU and CGPM/BIPM, especially at the treaty conferences in 2012 and 2015, showed a great reluctance to make any changes. They kept kicking the question back to committees for further study.

              I strongly suspect that the reason leap seconds are being abolished now, when they were not before, has very little to do with any technical matters, and much more to do with people retiring.

              1. 3

                (Time bugs are painful to test!)

                They can be both painfully slow and painfully fast to test, and you don’t know when to verify.

                1. 3

                  Wow, thanks for the detailed history!

                  I am reminded of the quip (quote?): “Science progresses one funeral at a time.”

                  1. 2

                    The idea is due to Max Planck, though the cute version seems to have been coined by Paul Samuelson.

                  2. 1

                    I strongly suspect that the reason leap seconds are being abolished now

                    I was not aware of this! If anyone else is curious about this: https://www.nature.com/articles/d41586-022-03783-5

                    1. 5

                      multiple places arrived at the same solution to a hairy problem

                      There was (is?) a sort of revolving door between Google and Facebook last decade, especially for kernel and infrastructure folks. This could very well be cross-pollination (or even one of smearing’s initial proponents at Google leaving for Facebook) rather than independent discovery.

                      1. 10

                        Google’s 2011 blog post on leap smear says they started doing it in 2008. Rachel Kroll was an SRE at Google before she moved to Facebook.

                    2. 9

                      Interesting that E is cited under “capabilities”, but not under “loosen up the functions”. E’s eventual-send RPC model is interesting in a number of ways. If the receiver is local then it works a bit like a JavaScript callback in that there’s an event loop driving execution; if it’s remote then E has a clever “promise pipelining” mechanism that can hide latency. However E didn’t do anything memorable (to me at least!) about handling failure, which was the main point of that heading.

                      For “capabilities” and “A Language To Encourage Modular Monoliths”, I like the idea of a capability-secure module system. Something like ML’s signatures and functors, but modules can’t import, they only get access to the arguments passed into a functor. Everything is dependency injection. The build system determines which modules are compiled with which dependencies (which functors are passed which arguments).
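
                      A rough Rust rendering of the idea, with hypothetical Clock/Store traits and generics standing in for functor arguments: the “module” can only touch the capabilities it is handed.

                          // Hypothetical capabilities; nothing is ambient or imported from globals.
                          trait Clock { fn now(&self) -> u64; }
                          trait Store { fn put(&mut self, key: &str, val: &[u8]); }

                          // The "module": its only authority is what the wiring layer passes in.
                          struct AuditLog<C: Clock, S: Store> { clock: C, store: S }

                          impl<C: Clock, S: Store> AuditLog<C, S> {
                              fn record(&mut self, event: &str) {
                                  let key = format!("audit/{}", self.clock.now());
                                  self.store.put(&key, event.as_bytes());
                              }
                          }

                          // The build/wiring layer (the "functor application") decides which
                          // Clock and Store implementations each instance is compiled against.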

                      An existing “semi-dynamic language” is CLOS, the Common Lisp object system. Its metaobject protocol is designed so that there are clear points when defining or altering parts of the object system (classes, methods, etc.) at which the result is compiled, so you know when you pay for being dynamic. It’s an interesting pre-Self design that doesn’t rely on JITs.

                      WRT “value database”, a friend of mine used to work for a company that had a Lisp-ish image-based geospatial language. They were trying to modernise its foundations by porting to the JVM. He had horror stories about their language’s golden image having primitives whose implementation didn’t correspond to the source, because of decades of mutate-in-place development.

                      The most common example of the “value database” or image-based style of development is in fact your bog standard SQL database: DDL and stored procedures are very much mutate-in-place development. We avoid the downsides by carefully managing migrations, and most people prefer not to put lots of cleverness into the database. The impedance mismatch between database development by mutate-in-place and non-database development by rebuild and restart is a horribly longstanding problem.

                      As for “a truly relational language”, at least part of what they want is R style data frames.

                      1. 4

                        Something like ML’s signatures and functors, but modules can’t import, they only get access to the arguments passed into a functor. Everything is dependency injection. The build system determines which modules are compiled with which dependencies (which functors are passed which arguments).

                        MirageOS does exactly this with Functoria! I think MirageOS unikernels largely use this pattern to ensure that they can be run on a wide range of targets, be it hypervisors, bare metal, or Unix binaries, without modifications to the application code, but they could also be seen as a form of capabilities.

                        1. 2

                          [The CLOS MOP is] an interesting pre-Self design that doesn’t rely on JITs.

                          I believe all reasonably fast MOP implementations rely on compiling dispatch and method combination lambdas at runtime, in response to actual method call arguments. When is runtime compilation not just-in-time? ;)

                          Of course, with image dumping, it’s possible to warm CLOS caches and save the result to disk, so runtime code generation isn’t mandatory.

                          1. 2

                            Yeah, maybe I should have qualified JIT with Self-style too :-) The distinction I was getting at is whether compilation happens at class or method mutation time (which is roughly what CLOS is designed for) or at call time (which is the Self / Strongtalk / Hotspot / v8 tradition).

                            1. 1

                              When you call a CLOS generic with a combination of types that you haven’t used before, it may need to figure out a new effective method, which could involve compiling a new lambda and caching it.

                        2. 1

                          This is super nitpicky I know but it bothers me that this post and the official documentation both say “long-pole” instead of “long-poll”.

                          1. 8

                            Long-pole is definitely what I mean :)

                            1. 7

                              it’s “long pole”, like the thing that holds up the center of a tent or marquee. here a long-pole test would be a test that takes, say, 1 minute to run while all the rest take 10 seconds total. the test suite can’t complete until all the tests are done, so it’s the slow test that’s holding everyone up

                              1. 5

                                That’s because it is pole, like a long stick.

                              2. 4

                                Back in the days of the Pentium chips, the CPU would remember if it was certain that the high 16 bits of a 32-bit register had been explicitly cleared (e.g. by XORing it with itself), and if so then a subsequent operation on the low 16 bits only would still be fast. It sounds like something similar is happening here. An explicit zeroing of the high bits of the 64-bit register (e.g. via an XOR, or a MOV using the 32-bit version of the register) gets the fast shift; otherwise the CPU falls back on a slower shift process that looks at all bits of the register.

                                1. 5

                                  Yeah I was thinking along those lines at first too. I believe there is some amount of renaming of subregisters (e.g. top/bottom half of RAX) that goes on. However, that isn’t the whole picture, because the top half doesn’t need to be 0 for it to be fast! E.g. this is still fast:

                                          MOV RCX, -1
                                          MOV RCX, RCX
                                  

                                  Regardless, it is very strange for SHLX to care at all about the top bits. Only the lowest 6 bits actually affect the instruction. There’s no reason for a dependency on the high bits at all.

                                  1. 3

                                    The good news is that compilers are unlikely to emit a 64-bit immediate load for the shift count.

                                    1. 2

                                      True! But things like this still trigger it, which might be more likely:

                                      INC RCX
                                      SHL RAX, CL
                                      
                                      1. 4

                                        Wait, so it isn’t a SHLX anomaly, but a general shift anomaly for certain ways of creating the shift amount???

                                        In some ways, this makes sense… it’s not like there are two unrelated ports to execute on…

                                        I actually wonder if part of the problem is having to use CL for the non-SHLX variant…

                                        1. 6

                                          I actually wonder if part of the problem is having to use CL for the non-SHLX variant…

                                          That’s the most plausible explanation I’ve seen so far. If you designed for SHL then maybe you only put 16 wires in for the shift port in the barrel shifter.

                                          The fast sequences seem to be ones where you can tell at the decode stage that the register is 16 bits, with a few missing cases. These are all annoying in register rename because you have to handle writes to AH as masked stores generating a new rename register; if you know that the AH bits are zero then you can do a little bit more reordering. Possibly the RAX case is ignored because you rarely write RAX and AH as the same rename register.

                                          If you then add SHLX, reusing an existing SHL pipeline, the simplest thing to do is read the L subregister. If the rename engine knows that it has a consistent view of L, this gets forwarded faster. If not, perhaps there’s another interlock? I can’t imagine you’d actually want to move the whole 32 or 64 bits for an operation that uses, at most, the low 6 bits. This doesn’t seem very likely.

                                          I wondered if there was some overflow behaviour, but both SHL and SHLX are specified to mask the shift operand. If SHL trapped on CL having more bits, I can imagine an SHLX implementation needing all of the bits in the register to check for this trap condition and then discarding the result if it’s the X variant.

                                          My most likely guess is that Intel uses different-sized rename registers and has a pool of 16-bit ones. I’ve met a microarchitect who liked doing this: the area savings are usually small because the control logic is big for register rename and vector registers dwarf any other savings, but it might be that functional units that take 16-bit operands are next to a pool of rename registers for H and L registers and, if you decode an E / R register instruction with the H or L variants live, you insert a move from the small pool to the large pool. If the 16-bit registers are close to the SHL pipeline and SHLX is reusing the pipeline, you have to get the register from further away and simply getting a signal from one side of the core to the other is a multi-cycle operation at these speeds. If they’ve optimised layout for these functional units getting 16-bit values, that seems plausible.

                                          Even without that, there may only be a single 64-bit path from the big rename file to the SHL unit, so you can fetch a 64-bit and 16-bit operand in parallel but need to fetch two 32-bit ones sequentially.

                                          Three cycles is a bit odd, so my guess is that there’s actually one 16-bit path from 16-bit rename registers and one 32-bit path from the 32/64-bit file(s?). You can do the two halves of a shift in parallel so doing the two fetches of the first source in adjacent cycles still lets you dispatch one per cycle (and skip the second half of the fetch for the 32-bit version). If you need to fetch the shift operand from the 32/64-bit register file then it takes two cycles.

                                          This is pure speculation and is the kind of thing I wouldn’t expect anyone to design, but if you’re favouring reuse (which you often do to reduce DV costs, which are easily 90% of CPU design costs) then I can kind-of see how you’d end up there. Especially if the split rename file design is a speedup when running mostly 32-bit code (and last time I ran Windows, it still shipped with a bunch of 32-bit system services).

                                          1. 6

                                            The idea of knowing that AH is zero was what I thought of until I saw that even this sequence (from up thread) was fast:

                                                    MOV RCX, -1
                                                    MOV RCX, RCX
                                            

                                            Which… clearly has non-zero AH bits…

                                            And it’s not just the 32/64-bit register file read split, because non-immediate writes to 64-bit registers (like the second MOV above) are fast. It seems to be something to do with 64-bit immediate operand writes – maybe the sign extension to the high 32 bits causes an extra stall? But if that’s it, I’d expect to see it in many more places than just shifts…

                                            Maybe an Intel person can chime in on the LLVM bug…

                                            1. 3

                                              I wouldn’t be surprised if all zeroes and all ones are both special cases, so writing -1 to RAX gives you a -1 AH and the canonical all-ones rename register (or even just the canonical all-ones rename register, which you don’t need to fetch because its ID is hard coded). The other cases confuse me though.

                                              1. 1

                                                While MOV RCX, -1 gets handled at the renamer level, MOV RCX, RCX does not (moves with the same src/dst never get eliminated on Intel). The issue here definitely seems related to operand sources that come from the renamer instead of the output of an execution unit. Besides SHL(X)/SHR(X), BZHI also has the same behavior. Furthermore, the issue is not related to the shift amount: replacing SHLX RAX, RAX, RCX with SHLX RAX, RCX, RAX has the same behavior.

                                                In particular you can avoid immediates altogether and get the same behavior:

                                                XOR RCX,RCX
                                                INC RCX
                                                

                                                will get the performance hit because in Alder Lake / Golden Cove arithmetic with small operands gets handled at the renamer level. Yes, Golden Cove lets you run INC RCX; INC RCX; INC RCX; INC RCX; INC RCX; INC RCX; in a single cycle! But for some reason this optimization only works for 64-bit registers; if you use INC ECX it goes back to old behavior.

                                      2. 2

                                        True. That argues that, whatever the root cause, it’s a mistake in the design.

                                    2. 6

                                      The estimation-based algorithm he uses supposedly finds shorter solutions than an exact TSP solver, which is a very strong indicator that he’s got a bug somewhere.

                                      1. 2

                                        The comparison with Concorde is on an instance under the EUC_2D distance metric, which (perhaps counter-intuitively if you’re not writing a MIP solver) rounds the euclidean distance to the nearest integer. There’s no way a solution can have fractional weight with that distance function.

                                        Perhaps not a bug so much as a misreading (missed reading?) of the spec.
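
                                        For reference, EUC_2D is (as I read the TSPLIB spec) the Euclidean distance rounded to the nearest integer, so every tour length under it is integral:

                                            // TSPLIB EUC_2D edge weight: nint(sqrt(dx^2 + dy^2)).
                                            fn euc_2d(a: (f64, f64), b: (f64, f64)) -> i64 {
                                                let (dx, dy) = (a.0 - b.0, a.1 - b.1);
                                                (dx * dx + dy * dy).sqrt().round() as i64
                                            }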

                                      2. 2

                                        I introduced a Dynamic Lookahead Insertion strategy, which works as follows:

                                        • Simulate Insertions: For each remaining city, simulate inserting it between every edge in the current cycle.
                                        • Estimate Total Distance: For each simulated insertion, estimate the total distance of the complete TSP path if that city were inserted at that position.
                                        • Choose the Best Insertion: Select the insertion that results in the least total distance and update the cycle accordingly.
                                        • Repeat: Continue this process iteratively for all cities and for all edges until the path is complete.

                                        this is literally just the cheapest insertion: it greedily inserts at every step and never backtracks to find better solutions. i don’t know why they called it dynamic lookahead since it’s clearly fixed to 1
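
                                        for reference, plain cheapest insertion is just this (rough sketch, assuming a full symmetric distance matrix and at least two cities):

                                            fn cheapest_insertion(d: &[Vec<f64>]) -> Vec<usize> {
                                                let n = d.len();
                                                let mut tour = vec![0, 1]; // seed with an arbitrary pair of cities
                                                let mut remaining: Vec<usize> = (2..n).collect();
                                                while !remaining.is_empty() {
                                                    // (index into remaining, insertion position, insertion cost)
                                                    let mut best = (0usize, 0usize, f64::INFINITY);
                                                    for (ri, &c) in remaining.iter().enumerate() {
                                                        for i in 0..tour.len() {
                                                            let (a, b) = (tour[i], tour[(i + 1) % tour.len()]);
                                                            let cost = d[a][c] + d[c][b] - d[a][b];
                                                            if cost < best.2 {
                                                                best = (ri, i + 1, cost);
                                                            }
                                                        }
                                                    }
                                                    let city = remaining.swap_remove(best.0);
                                                    tour.insert(best.1, city);
                                                }
                                                tour
                                            }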

                                        i am somewhat surprised that it does significantly better than the existing solvers

                                        1. 4

                                          i am somewhat surprised that it does significantly better than the existing solvers

                                          I don’t see any evidence that it does any better than classic 2/3-opt (both by looking at the algorithm description and noticing the lack of citation), let alone the usual benchmark primal heuristic solver, http://webhotel4.ruc.dk/~keld/research/LKH/

                                          EDIT: also “random” instances can be misleading. A lot of NP-complete problems are hard only around a phase transition from trivially feasible to obviously infeasible instances. I’d look at instance sets like http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/

                                          1. 1

                                            A lot of NP-complete problems are hard only around a phase transition from trivially feasible to obviously infeasible instances.

                                            Can you expand what you mean by this?

                                          2. 2

                                            Yeah, this is actually very similar to—but substantially simpler than!—the Sparse A* search algorithm, which is the kind of thing a naïve recent graduate with effectively no knowledge of CS can implement quickly and easily. (That’s not a hand-wave statement; it’s a description of me a bit over 15 years ago!) It’s surprising it does as well as it does on these cases, but like you and @pkhuong’s sibling comment, it would be extremely surprising to me if this were what it claims. Indeed: I would be rather surprised, given the history of the problem, if it hadn’t been tried before, likely by precocious undergrads! (I have been surprised before, but even so: suspicion is appropriate.)

                                          3. 14

                                            SQLite comes with its own issues: concurrency and multi-tenancy. Since I/O is synchronous, thus blocking, this makes applications running on the machine compete for resources. This also increases latency.

                                            This makes little sense to me. It’s as though they’re ignoring threads. Just run each task on a separate thread and they no longer “compete”.

                                            The benefits become noticeable only at p999 onwards; for p90 and p99, the performance is almost the same as SQLite.

                                            If I read this correctly, 99.9% of the time there’s no speedup, but for some reason 0.1% of queries are much slower in regular SQLite. That’s interesting, and I’d love to know why — something to do with thread scheduling? — but to me, avoiding a one-in-a-thousand slowdown isn’t a big win that makes it worth completely rewriting the DB engine.

                                            1. 14

                                              I had the same question and looked at the paper.

                                              Their test setup for SQLite:

                                              To simulate a multi-tenant serverless runtime, the microbenchmark creates a thread per tenant to represent a resident and executable function in the runtime. In each thread, we open a connection to a separate 1 MiB SQLite database file, and we execute the SQL query.

                                              So they were using threads, and the threads weren’t even contending over the same files.

                                              [With the above setup] we observe jumps in the latency percentiles when the number of threads is close to a multiple of the number of cores, which highlights that synchronous I/O limits multi-tenancy.

                                              It is disappointing that they don’t have a good theory for why this happens. (Also it says “We plan to investigate the reason for the fall in latency [with SQLite]” that happened from 70->80 workers on a 12 core machine, which is also pretty strange.)

                                              I looked at their benchmark setup and noticed that they include the database “open” time in the measurement. It would be pretty disappointing if the difference was explained by something dumb like SQLite using a different system API for file-level locking.

                                              I think to truly measure what they’re trying to measure they ought to use the same code (maybe their new Rust one) with both sync and async system calls.

                                              1. 17

                                                SQLite using a different system API for file-level locking.

                                                It’s worse than just using different syscalls. Opening a SQLite DB with the default Unix VFS takes time linear in the number of DB files open because the VFS doesn’t use OFD file locks and thus mustn’t close the same file (inode) while a lock is held. The files are tracked in a synchronised doubly linked list https://github.com/sqlite/sqlite/blob/master/src/os_unix.c#L1183-L1207, and each open performs a linear search to find collisions (https://github.com/sqlite/sqlite/blob/master/src/os_unix.c#L6091-L6094).

                                                None of that is needed on recent Linux or OS X (or BSD, but untested theory), since they support OFD locks (https://lwn.net/Articles/586904/): https://github.com/pkhuong/verneuil/blob/main/c/linuxvfs.c#L1590-L1634, for example.
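
                                                For reference, the OFD variant is just the F_OFD_* fcntl commands; a rough sketch (assuming Linux and the libc crate’s constants):

                                                    use std::os::unix::io::AsRawFd;

                                                    // The lock belongs to the open file description, so closing some other
                                                    // fd on the same inode can't silently drop it; that's what forces the
                                                    // default unix VFS to keep its global inode list.
                                                    fn lock_whole_file(f: &std::fs::File) -> std::io::Result<()> {
                                                        let mut fl: libc::flock = unsafe { std::mem::zeroed() };
                                                        fl.l_type = libc::F_WRLCK as _;
                                                        fl.l_whence = libc::SEEK_SET as _;
                                                        fl.l_start = 0;
                                                        fl.l_len = 0; // 0 = the whole file; l_pid must stay 0 for OFD locks
                                                        let rc = unsafe { libc::fcntl(f.as_raw_fd(), libc::F_OFD_SETLK, &fl as *const libc::flock) };
                                                        if rc == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) }
                                                    }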

                                                But, looking at https://github.com/tursodatabase/limbo/blob/9aaf0869aed54e6d90ffc377ce2c8a1660b6f7a4/perf/latency/rusqlite/src/main.rs#L23-L34, I think they only measure queries? It’s a bit weird to report p999 with 1000 iterations per tenant though?

                                                1. 2

                                                  Thinking about this more: even with the benchmark only measuring queries, it will still need to grab a lock per each iteration’s query execution, right? So my worry above about a difference in locking mechanisms is still plausible?

                                                  1. 1

                                                    I think they only measure queries

                                                    You are right, I misread the code! I would update my comment if I could, sorry!

                                                2. 4

                                                  If I read this correctly, 99.9% of the time there’s no speedup, but for some reason 0.1% of queries are much slower in regular SQLite.

                                                  Issues to do with scheduling concurrency, table/row locks, the write-ahead log, reader/writer priority starvation, and other matters of this nature. I.e., issues inherent to the design and architecture of the DBMS at large.

                                                  For example, look at my (old but still relevant) benchmark of SQLite to get a baseline performance reading of how it holds up executing the exact same query but with m readers and n writers, playing with knobs that SQLite itself already exposes for adjusting how these simultaneous events affect one another, then extrapolate to new algorithms or mechanics that would become possible with a different backend altogether: https://github.com/mqudsi/sqlite-readers-writers

                                                  I report p95, p99, and p99.9 – perhaps it would be of interest to also include p90, though I personally view a metric that discounts a full 10% of your workload as virtually useless.

                                                  1. 2

                                                    avoiding a one-in-a-thousand slowdown isn’t a big win

                                                    If we want to be nitpicky, this is really relative. The environment and the scale all these things are measured at and prepared for indicate something way beyond regular SQLite scenarios. If you have significant scale, you’re making a thousand people pay that latency cost for every million requests. For most use cases, it’s probably not even that bad of a cost: they wait a second more for their sort to run on a table of football scores.

                                                    But to Amazon, this would probably be a big, big deal, not to oversell the one-in-a-thousand thing.

                                                  2. 1

                                                    Does anyone know why GCM’s message integrity check uses a custom digest function based on finite fields, instead of something standard like CRC or SHA-1?

                                                    1. 4

                                                      If you’re asking about GHASH: GHASH is a standard universal hashing construction. Universal hash functions offer proven collision probability even against computationally omnipotent adversaries, as long as we have access to random bits not controlled by or visible to that adversary. CRC is trivial to break, and SHA’s threat model is completely different.
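
                                                      To illustrate the flavour, here’s a toy polynomial-evaluation hash over the Mersenne prime 2^61 - 1 (the same shape of construction as GHASH, which works in GF(2^128) instead): for a randomly chosen key, any two fixed distinct equal-length messages collide with probability on the order of len/p, provided they were chosen without seeing the key.

                                                          const P: u128 = (1u128 << 61) - 1;

                                                          // Assumes each block is already reduced below P (e.g. 60-bit chunks).
                                                          fn poly_hash(key: u64, blocks: &[u64]) -> u64 {
                                                              let k = (key as u128) % P;
                                                              let mut acc: u128 = 0;
                                                              for &b in blocks {
                                                                  // Horner's rule: acc = (acc + block) * k mod p.
                                                                  acc = ((acc + b as u128) % P) * k % P;
                                                              }
                                                              acc as u64
                                                          }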

                                                      1. 3

                                                        I believe it’s for performance: each block just needs a multiplication as well as AES encryption, which is faster than AES+SHA256. (CRC is not cryptographically secure so it isn’t suitable as a MAC; SHA-1 is broken.)

                                                      2. 8

                                                        No one expects it to be hard to break MurmurHash, FarmHash, or CityHash – least of all their creators. They are very explicitly labelled in their documentation as NON-cryptographic hashes, not to be used when there is someone with bad intentions supplying the data. They are designed to be fast and give good hashing on real-world data, and they are.

                                                        If you need a secure hash then better use SHA … which is even pretty fast these days with dedicated hardware instructions.

                                                        1. 17

                                                          I think you missed the issue here, despite it being spelled out multiple times: these are default hash functions for various hashmaps, and some are vulnerable even when seeded, this means all the corresponding hash maps are vulnerable to hashdos-type attacks.

                                                          1. 6

                                                            I think what the OP alluded to is that non-cryptographic hash functions were not marketed as hashdos-resistant and that typically any kind of resistance is reserved for cryptographic hash functions. Therefore it is not fair to begin using non-cryptographic hash functions for performance and then to criticize them for not having cryptographic hash function resistance properties.

                                                            That said, I do think that we need more nuanced classification of hash functions rather than just cryptographic or non-cryptographic, so that wyhash and similar hash functions are recognized to be in the middle. SMHasher(3) has seemingly defined such a class of quasi-cryptographic hash functions.

                                                            1. 12

                                                              But all of that is addressed in the article. The author doesn’t think that they are marketed as anything else, but knows that they are used as the default for a lot of stuff, and thus wants to demonstrate that this might not be a good idea.

                                                              hashdos-resistant and that typically any kind of resistance is reserved for cryptographic hash functions

                                                              That said, I do think that we need more nuanced classification of hash functions rather than just cryptographic or non-cryptographic so that wyhash and similar hash functions are recognized to be in the middle.

                                                              Just a note: SipHash isn’t a cryptographic hash function, so hashdos resistance shouldn’t be thought of as a property of cryptographic hash functions, and a more nuanced classification scheme should account for this.

                                                              1. 9

                                                                Wikipedia to this day starts the MurmurHash article with: “MurmurHash is a non-cryptographic hash function suitable for general hash-based lookup”

                                                                I think I tried changing this at some point, including scientific references, but it got reverted, and I wasn’t motivated for an edit war. It’s that these hashes are explicitly advertised for a use case where they cause a security risk.

                                                            2. 9

                                                              “Simply don’t” is as good security advice as it is sex-education advice.

                                                              The boring, improved advice might be “it depends on your threat model”.

                                                              What I mean to say here is that these hash functions come with a somewhat weak assumption that a collision is only going to be a perf problem for the insert/search operations of the hashmap, whereas the real world usage of the data structure might make this anything between “totally a-OK” and “catastrophic”.

                                                              Better understanding of how and why they can be easily broken is probably very useful for those who want to rely on them. The blog post needed a lot of introduction and context to explain the algorithms, the multiplicative inverse, etc., but I thoroughly enjoyed reading it.

                                                              I would even go so far as to say this content would make a great bachelor’s thesis.

                                                              1. 5

                                                                The OP points to universal hashing as a level of hashing reliability that’s far easier to achieve than (and is actually orthogonal to) cryptographic strength. I wrote one such family (umash) for a real database use case. OP has his own. There are a few other throughput-oriented implementations (CLHash, halftime hash).

                                                                This trade-off between speed and guarantees on seed-independent collision rates is a false choice.

                                                                1. 1

                                                                  On a slightly-related subject: in your UMASH post you say it’s unsuitable for

                                                                  hash tables where attackers control the keys and can extrapolate the hash values. A timing side-channel may let attackers determine when keys collide; once a set of colliding keys is known, the linear structure of UMASH makes it trivial to create more collisions by combining keys from that set.

                                                                  Consider the case of a cache used for a web application. The entire point of a cache in this case is to create a side-channel, so that condition is met. Depending on the URL, the keys may be partially attacker-controlled (e.g. /post/12456). I assume this is safe since there are not enough degrees of freedom? But what about search strings? Say we want to cache something like /search?q=foobar. A common strategy when caching arbitrary-length attacker-controlled strings is to use a cryptographic hash. To preserve performance for non-attacker-controlled keys we may want to use a hash like UMASH and just replace the attacker-controlled portion with a cryptographic hash (e.g. a key like /search?q=[sha256 hash of the original q]). Is that safe? Would it be better to use an HMAC?

                                                                  On a further note, when should the hash be re-keyed? Re-keying more often (such as on hash table resize) seems like it would be most robust, since an attacker exploiting the hash for a DoS would likely also trigger hash table resizes as part of the process. However, that would also affect performance for hash table resizes in a major way. On the other hand, e.g. Python only re-keys when the program is started.

                                                                  1. 3

                                                                    On regenerating parameters, when I deployed an OLAP in-memory cache with UMASH, I had code to:

                                                                    1. detect when there were too many hash table collisions (or when hash-based algorithms performed exceedingly poorly) and use a different approach a bounded number of times (e.g., keep some items in an overflow area). I picked the thresholds so they’d trigger with probability < 10^-15 for a fully random hash function, similar to estimated HW failure probability in any millisecond.
                                                                    2. log and restart the server if it happens too many times.

                                                                    On taking a non-cryptographical hash of a crypto hash: if you’re going to sha256 part of the lookup key, maybe you can sha256 the whole thing? If the rest of the key is long, the usual pattern is to use something like umash for the benign data, and sha256 that.

                                                                    But what’s the threat model? In the UMASH post, I wanted to highlight the exact scope of the universal hashing framework: its guarantees only hold if we assume the keys are generated without knowledge of (independently from) the universal hash function’s parameters. That assumption is violated as soon as the keys can be generated with timing information from operations on older keys (no need for hash table collisions, the UMASH implementation in C isn’t even constant time). At least, unlike murmurhash, we know there aren’t keys that collide for all parameters (seeds).

                                                                    Let’s say an attacker can now generate colliding keys on demand. What’s the next step? I think causing the service to regenerate hashing parameters (by restarting the process! crash-only is probably what we want here) is probably fine as long as:

                                                                    1. either we have multiple replicas or the blast radius is confined to the attacker
                                                                    2. it takes much longer to extract enough information to generate colliding keys than it does for the service to come back to its steady state after a restart

                                                                    I think universal hashing helps with 2, because a proof of universality is a proof that an attacker needs to interact with the hash function’s parameters to reliably generate collisions. Now, maybe the attacker doesn’t need to interact a lot (this is where k-independence might come into play, for the worst case where one literally divulges the hash value); that depends on a lot of factors but at least raises the bar compared to seed-independent collisions.

                                                                    1. 2

                                                                      But what’s the threat model?

                                                                      Threat model is that the attacker is running a DoS attack on your service. In order for the cache to reduce server load, it must perform no worse than having no cache altogether. In practice, the attacker will be requesting more URLs than the working set size of the cache (“busting” it), so we’re going to be a bit slower than having no cache at all, since we have to perform a cache miss on each request (as well as updating the cache once we’ve generated the response). But we want to do no worse than that (or ideally still keep the “real” working set in cache). One way the attacker could slow things down is by forcing repeated hash table resizes/rehashes/restarts, turning an O(1) structure into an O(n) one. So we want to make it difficult for the attacker to choose URLs that all hash to the same bucket.

                                                                      I think universal hashing helps with 2, because a proof of universality is a proof that an attacker needs to interact with the hash function’s parameters to reliably generate collisions.

                                                                      So I guess my question is: is a cryptographic MAC much better than a universal hash for this use case? There’s still a timing channel, and the attacker just needs to search for bucket collisions instead of hash collisions (and there are many fewer buckets than hashes). The constant-time behavior of a MAC would probably make such an attack less efficient, but I don’t know if it’s worth the performance loss.

                                                              2. 4

                                                                I’ve had to write protobuf decoding code that nearly matched the performance of reading the C structs directly, for messages that actually mapped well to flat structs, and the main challenge there was having the field metadata interleaved with the data. Unless there are requirements around streaming parts of the top level object or playing tricks with appending messages without parsing them, I’d consider moving to the header all the metadata that describes the object’s shape (the parts that are constant for a given “type,” in particular).

                                                                This will let decoders match the header against known patterns and dispatch to hardcoded fast paths, while encoders for simple messages could concatenate pre-computed headers with the data.

                                                                1. 3

                                                                  Really interesting. I am a customer of rsync.net, and I hope that the protocol ends up being such that installing the sqlite3-rsync binary on their servers is something they’d consider. For reference: https://www.rsync.net/resources/howto/remote_commands.html

                                                                  Also I’m a personal/casual sqlite user, so reaching for Litestream always felt a bit like overkill, especially since I’ve never had a reason to put any sqlite database in WAL mode, ever.

                                                                  1. 2

                                                                    Verneuil works with regular rollback (non-WAL) sqlite DBs. Does https://gist.github.com/pkhuong/555be66f5e573a2d56186d658fd80881 sound interesting? There’s no need for a remote service, just an endpoint that speaks S3.

                                                                    1. 1

                                                                      Fascinating approach. I didn’t know about VFS extensions. I will bookmark this, because I wasn’t aware of verneuil. I don’t want to use S3-style storage for my personal use cases, though.