What strikes me more, and isn’t discussed in this post, is that the ppv_lite86 crate is mostly a hacked-together version of 10% of std::simd’s functionality, specialized for Intel (as the name suggests), and with little activity. Ultimately it will go away when the portable_simd feature is stabilized. I cannot tell when that will happen given how big of a beast it is, but in my experience it works well already, so it’s just a matter of time. And I think it’s good that the stdlib is taking its time to make sure such things get done right rather than included too quickly.
What the post also missed is how complex some things have become within rand itself, which is the price to pay for performance. Take [T]::shuffle(): its implementation first defers to partial_shuffle and IncreasingUniform with already non-trivial functionality, which then calls random_range somewhere else. This is defined in the SampleRange trait, which defers to the SampleUniform trait, itself defined for integers via a macro. Lastly, we arrive at the implementation of sampling an integer in a range begin..end, which seems obscure at first. There is an explanation comment, but it doesn’t seem to match the implementation and is indeed copy-pasted from the previous version. The CHANGELOG leads to PR #1287 and PR #1272 that mention “Lemire’s trick”. Looking up “lemire uniform random range” on a search engine finally leads to the algorithm description (which is pretty neat).
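For illustration, here is roughly what the user-facing end of that chain looks like (a sketch assuming a recent rand 0.9-style API; names may differ in older versions):

use rand::seq::SliceRandom;
use rand::Rng;

fn main() {
    let mut rng = rand::rng();
    let mut items = [1, 2, 3, 4, 5];
    // shuffle() is the entry point of the chain described above...
    items.shuffle(&mut rng);
    // ...which ultimately bottoms out in uniform range sampling like this:
    let index = rng.random_range(0..items.len());
    println!("{items:?}, random index: {index}");
}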
All this to say: this is inherently complex code that definitely has its place in the rand crate regardless of dependencies. Sampling a uniform integer in a range is fairly common, and asking users to convert uniform bytes into uniform integers in a range themselves is a recipe for disaster, as the naive x % range sampling is biased (and division/modulo is slow).
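As a minimal illustration of that bias (my own toy example, not code from rand): mapping a uniform byte into 0..3 with a modulo makes one value slightly more likely, because 256 isn’t a multiple of 3.

fn main() {
    // Count how often each value of (byte % 3) occurs over all 256 byte values.
    let mut counts = [0u32; 3];
    for byte in 0u32..256 {
        counts[(byte % 3) as usize] += 1;
    }
    // Prints [86, 85, 85]: the value 0 is slightly over-represented.
    println!("{counts:?}");
}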
Would I have done it more simply? Probably not: all these intermediate traits and macros serve their purpose. And if I had to choose between a naive algorithm and a slightly more complex optimized algorithm that performs 10x faster, for a foundational crate I’ll take the faster algorithm. Surely some things could be documented better, but we’re mostly volunteers here on open source, and rewriting the wheel doesn’t necessarily mean better documentation.
A Fisher-Yates shuffle should be 4 lines of code.
I am not sure they have implemented Lemire’s algorithm correctly: it looks like the rand code is calculating the rejection threshold first, which involves a slow % operation. Lemire’s algorithm avoids the % in most cases by using the limit argument as an over-estimate for the rejection threshold; if the first sample passes the rough check it can be returned directly. Otherwise the exact threshold is calculated with %, then the first sample is re-checked in the first iteration of the rejection sampling loop. For performance I like to inline the fast path (up to the rough check) and move the slow path (% onwards) to a separate function.
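For reference, here is a sketch of that nearly-divisionless rejection sampling as I understand it from Lemire’s paper (next_u32 is a stand-in for whatever RNG is used; this is not the actual rand code):

// Sample a uniform integer in 0..range (range > 0), using a 32x32 -> 64 bit
// widening multiplication and rejecting only a tiny fraction of samples.
fn uniform_below(range: u32, mut next_u32: impl FnMut() -> u32) -> u32 {
    let mut m = (next_u32() as u64) * (range as u64);
    let mut low = m as u32;
    // Fast path: `range` over-estimates the rejection threshold, so most
    // samples are accepted here without any division or modulo.
    if low < range {
        // Slow path: compute the exact threshold, 2^32 mod range, with a single %.
        let threshold = range.wrapping_neg() % range;
        // Re-check the first sample against the exact threshold, then keep
        // drawing until a sample is accepted.
        while low < threshold {
            m = (next_u32() as u64) * (range as u64);
            low = m as u32;
        }
    }
    (m >> 32) as u32
}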
This was an amazing read, great job on both detail and design! One thing I do wonder about the sorting-and-cloning part: could moving the memory into a sorted order to match the sorted root data structures, instead of just naively cloning, reduce memory pressure/usage and therefore execution time? I don’t know if this is possible in safe Rust though, so it may not be an option.
As mentioned in the article, a principled method over naive cloning would be to use an arena to precisely control in which order allocated leaves are organized in memory, which is possible in Rust although usually clunky due to the self-referential nature of arenas (and it’s definitely more code and refactoring than adding a simple .clone() call).
I’m not sure I understood your question though, perhaps you meant whether it would be possible to swap the leaf objects in place, rather than cloning them all and then deleting them all? If it could be done, the main benefit would be to limit the peak memory usage. However, it would be difficult to do in practice, because the main reason for having allocated objects is that they have a variable size, so swapping them would be like trying to fit a square peg in a round hole. In theory, it should be possible to write an in-place defragmentation algorithm, but in practice that’d be lots of unsafe Rust and would likely end up slower than a clone (for those who can afford the peak memory of a clone).
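To make the arena idea a bit more concrete, here is a minimal sketch with a made-up Leaf type (not the data structures from the article): leaves live contiguously in a Vec and are referenced by index, so cloning them in key order also lays them out in traversal order.

#[derive(Clone)]
struct Leaf {
    key: u64,
    payload: Vec<u8>, // variable-size data, the reason leaves are heap-allocated
}

// A trivial index-based arena: the tree would store `usize` indices into
// `leaves` instead of Box/Rc pointers, sidestepping self-referential lifetimes.
struct Arena {
    leaves: Vec<Leaf>,
}

impl Arena {
    // Clone into a fresh arena, writing the leaves out in key order so that an
    // in-order traversal of the new tree walks memory mostly sequentially.
    fn clone_sorted(&self) -> (Arena, Vec<usize>) {
        let mut order: Vec<usize> = (0..self.leaves.len()).collect();
        order.sort_by_key(|&i| self.leaves[i].key);
        let leaves = order.iter().map(|&i| self.leaves[i].clone()).collect();
        (Arena { leaves }, order)
    }
}

fn main() {
    let arena = Arena {
        leaves: vec![
            Leaf { key: 2, payload: vec![0; 8] },
            Leaf { key: 1, payload: vec![0; 16] },
        ],
    };
    let (sorted, order) = arena.clone_sorted();
    assert_eq!(order, vec![1, 0]);
    assert_eq!(sorted.leaves[0].key, 1);
}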
The AWS announcement is more interesting than the Foundation’s announcement: it has a small survey of formal verification tools for Rust, and more details of the logistics. They seem to be aiming the challenges at academics; there aren’t any rewards beyond getting to publish a paper about it, as far as I can see.
From the AWS announcement:
In the last 3 years, 57 soundness issues have been filed in the Rust standard library and 20 CVEs have been reported.
I’d be curious to see a more specific reference for this. Since Rust 1.56, released just 3 years ago, there have been only 13 point releases to Rust stable, with CVEs mentioned only a handful of times (and not all related to the stdlib).
Does this mean that:
Most of these issues get detected before hitting Rust stable? That would imply that the nightly/beta process is working well. And yes, nightly is offered as the “raw snapshot of today” without guarantee, so using a nightly compiler for production isn’t a good security posture anyway.
Anyone not using the latest stable Rust is prone to known security issues, unfixed in their version of the compiler? As far as I can tell, when a CVE is reported, a fix is only proposed on top of the latest stable Rust; nothing is backported to previous Rust versions.
Looking at the Debian changelog for rustc, it’s not like Debian stable is backporting any fixes either.
I see a lot of crates go out of their way to support the oldest MSRV possible, but is that relevant if only the latest stable Rust is supported for security fixes? (Yes some applications like medical devices require the code to still compile in 20 years, but that’s on them to vendor dependencies that compile with the toolchain they pin, and to backport security fixes.)
The AWS announcement is incorrect, or misleadingly inflating some numbers?
I have a couple of questions that I couldn’t see the answers to on a quick skim through the article: what was the copying trick that made it faster? The graphs seem to say that the rayon version of the code got 10x faster between the first chart and the second chart, but I didn’t see anything about that? The second chart seems to say the custom rayon replacement is 1.1x faster?
As the introduction briefly mentions, this (and many other optimizations that made the runtime improve by 10x) will be covered in a follow-up post that isn’t published yet :)
The graphs seem to say that the rayon version of the code got 10x faster between the first chart and the second chart, but I didn’t see anything about that? The second chart seems to say the custom rayon replacement is 1.1x faster?
What’s perhaps not obvious is that the left axis scale changed in the second chart. If you look closely, the total runtime went down from 300ms to 30ms with 1 thread, for example. So there was a 10x speed-up, but it wasn’t just from switching away from Rayon: the Rayon backend got a 10x speed-up too!
So in short: the switch from Rayon wasn’t the main factor for a 10x speed-up (far from it), and the other optimizations will be covered in the second post. I decided to split the article into 2 posts as it was getting quite long, but I realize that it’s perhaps not the clearest now as some pieces are missing. Stay tuned for the second part next week!
At first the pattern (locking for read, checking a condition and then if needed locking again for write) seemed like a variant of error-prone patterns like TOCTOU bugs. Rust usually avoids this with patterns like HashMap::entry(), which allows looking up a hashmap entry only once and, if needed, modifying it (e.g. let value = map.entry(key).or_default()).
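A minimal example of that pattern (my own illustration, not code from the article):

use std::collections::HashMap;

fn main() {
    let mut map: HashMap<&str, u32> = HashMap::new();
    // Single lookup: insert a default value if the key is absent,
    // then get a mutable reference to the entry either way.
    let value = map.entry("hits").or_default();
    *value += 1;
    assert_eq!(map["hits"], 1);
}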
For read-write locks, the “nice” way would be to lock only once with write() for the whole scope of the function, but that would be detrimental for performance if there are many readers and writers are rare. The best way would be to lock with read() at the beginning (avoiding contention in the happy path), and upgrade the read lock to a write lock if needed. Unfortunately, the standard library doesn’t provide such upgradable locks, but fortunately the parking_lot crate does!
use parking_lot::{RwLock, RwLockUpgradableReadGuard};

fn main() {
    let map: RwLock<Option<u32>> = RwLock::new(Some(2));
    // RwLockUpgradableReadGuard: shared access that can later be upgraded.
    let num = map.upgradable_read();
    if num.is_some() {
        eprintln!("There's a number in there: {}", num.unwrap());
    } else {
        // RwLockWriteGuard: upgrade to exclusive access before writing.
        let mut num = RwLockUpgradableReadGuard::upgrade(num);
        *num = Some(5);
        eprintln!("There will now be a number {:?}", *num);
    }
    eprintln!("Finished!");
}
parking_lot’s upgradable read lock can sometimes help, but in practice whenever I’ve needed that kind of functionality, the result would’ve been the same as taking a write lock from the start, as there can only be one upgradable read lock held at a time.
Ah, I didn’t know the details, just found the API in parking_lot. What’s the point of an upgradable read lock then, if it’s exclusive anyway?
An upgradeable read excludes writers but not readers. Possibly useful if a writer needs to do some slow read-only preparatory work before excluding the readers for a brief mutation.
Two things caught my eye.
The microsecond scaling (multiply by a frequency and divide by 1 million). This 32-bit CPU lacks a dedicated division instruction, so the long division on 64-bit integers indeed generates quite a lot of code.
However, the division in hertz * us / 1_000_000 can normally be optimized by the compiler because the divisor is a compile-time constant. In this case, the division can be replaced by a (widening) multiplication followed by a right shift (https://en.wikipedia.org/wiki/Division_algorithm#Division_by_a_constant), which is much less code and more efficient than the traditional long division algorithm.
The proposed subtraction loop is sub-optimal, and even though this CPU lacks 128-bit multiplication (perhaps the reason why LLVM didn’t try to optimize this division), the following “magic” expression generates less code (whether optimizing for speed or for size): https://godbolt.org/z/qY9v7qh3a
fn div_by_million(x: u64) -> u32 {
    // 4835703278458516699 == ceil(2^82 / 1_000_000), so this computes x / 1_000_000.
    ((x as u128).wrapping_mul(4835703278458516699) >> 82) as u32
}
The early return in print_full_process(), where an if true leads to dead code elimination. However, the existing CONFIG.debug_panics is a const expression, so setting that to false should likewise cause the whole function to be treated as dead code. And being able to eliminate dead code was part of the discussion when this config was added (https://github.com/tock/tock/pull/1443).
Here I wonder if the authors didn’t try to change this config (arguably, it’s not trivial to find how to change Tock’s config among all the Makefiles), or if the compiler failed to optimize it away (more concerning).
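A toy sketch of that idea, loosely mirroring how a const config can gate code (simplified, not Tock’s actual structure):

struct Config {
    debug_panics: bool,
}

// With a literal `false` in a const, the optimizer can prove the early return
// is always taken and drop everything after it as dead code.
const CONFIG: Config = Config { debug_panics: false };

fn print_full_process() {
    if !CONFIG.debug_panics {
        return;
    }
    // Everything below is dead code when CONFIG.debug_panics is false.
    println!("... full process state dump ...");
}

fn main() {
    print_full_process();
}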