This year, some members of the Rust Programming Language Community Server on Discord set out to solve AoC in under 1ms. I’m pleased to announce that through the use of look-up tables (LUTs), SIMD, more-than-questionable unsafe, assertions, LLVM intrinsics, and even some inline ASM, that goal has been reached (for real this time)!
*As of today, our solutions are able to solve all 49 problems in <1ms!*
I have obtained consent from all the top participants to post their solutions to a shared repo, for the community to review and learn from. All solutions are now available at the linked GitHub repo!
Our solutions have a total runtime of *988,936ns*!
Context/Caveats
All submissions were run on the same hardware (Ryzen 5950X) to ensure consistency, with the same compiler flags and features available. This was on rustc nightly (updated throughout the course of the contest), with the CPU speed capped at 3400 MHz and boost clock disabled.
AVX-512 was not available on the machine so none (?) of the solutions utilize that particular set of accelerated instructions, but there is plenty of other SIMD in use.
All submissions were run against the same inputs to ensure consistency.
To prevent cheating and/or trivial solutions like a Map<Input, Output>, caching anything derived from the input was not allowed.
For the same reason, inputs were not directly available to the participants, and were not provided at compile-time.
Participants were allowed to use compile-time tricks in their answers. Due to limitations in the benchmark bot, the runtime of these optimizations could not be measured. This was considered acceptable because the compiled binaries were still expected to work correctly for arbitrary inputs: participants could use look-up tables (LUTs) in their answers, but those LUTs had to work for arbitrary inputs, not just the specific ones they were benchmarked against.
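For a sense of what "compile-time tricks" can mean here, below is a minimal, made-up sketch of the idea: a LUT built by a const fn so all the work happens during compilation, while the table still answers correctly for every possible input.

```rust
// A made-up example of the compile-time LUT idea (not any participant's
// actual table): the table is computed entirely at compile time by a
// `const fn`, yet still answers correctly for every possible input byte.
const fn build_popcount_table() -> [u8; 256] {
    let mut table = [0u8; 256];
    let mut i = 0usize;
    while i < 256 {
        table[i] = (i as u8).count_ones() as u8;
        i += 1;
    }
    table
}

static POPCOUNT: [u8; 256] = build_popcount_table();

fn popcount_byte(b: u8) -> u8 {
    POPCOUNT[b as usize] // runtime cost: a single indexed load
}
```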
I/O is trivial, and was thus not measured as part of the benchmark. That is, participants were provided with an &str or &[u8] input (their choice) and expected to provide an impl Display as part of their result. Therefore, input parsing was measured (but not I/O).
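Concretely, a solution's harness-facing shape looked roughly like this (a sketch with a made-up function name, not the bot's exact API):

```rust
use std::fmt::Display;

// Sketch of the interface described above (the function name is made up):
// the harness does the I/O and hands the solution a borrowed input, and
// only the call itself is timed, so parsing counts but I/O does not.
fn solve_day01_part1(input: &[u8]) -> impl Display {
    // ...parse `input` and compute the real answer here...
    input.len() // placeholder so the sketch compiles
}
```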
If you are interested, join us in #advent-of-code-2024 on the Discord server for further discussion :)
Further Reading
If you would like a more in-depth explanation of some of the optimization techniques used, I highly recommend you check out this article by ameo (one of our participants). It covers the process they used to optimize their solution for Day 9 Part 2 and how they got it to the top of our leaderboard, and it is an excellent look at both high-level optimization and micro-optimization.
Credits
Thank you to the members of the Rust Programming Language Community and Serenity-rs Discord servers and everyone else who participated in the challenge!
Thank you to Eric Wastl for hosting AoC every year!
Thank you to Noxim for writing the original version of our benchmark bot.
Extra special thank you to yuyuko, bend-n, and giooschi for their help in maintaining and improving our benchmark bot.
Some of the results say they took 3 nanoseconds, and one says it took 1 nanosecond. I take it this is just timing of the inner loop or something like that? L2 cache couldn’t even do that.
Roughly, you are correct. We are only timing how fast the solution runs in a loop, with some black_box to prevent certain compiler optimizations.
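The timing loop is roughly this shape (a generic sketch, not the bot's actual harness code):

```rust
use std::hint::black_box;
use std::time::Instant;

// Generic sketch of a timing loop (not the bot's actual harness code).
// black_box stops the optimizer from hoisting the call out of the loop
// or discarding the unused result.
fn bench(solve: impl Fn(&[u8]) -> u64, input: &[u8], iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(solve(black_box(input)));
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}
```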
A sub-20ns result is usually because there is a look-up table (LUT) approach that can be used to solve the problem for that day. In such a scenario, the LUT can be generated at compile time, or even baked into the source. Our benchmark does not measure compile time, only runtime. If you have a LUT, the runtime performance comes down to how fast you can parse the input and how fast you can look up a value in the table. For example, alion02’s 1ns solution to day 17 part 2 is, excluding the LUT, a single asm! block: roughly 6 assembly instructions (actually 13 according to Godbolt, when accounting for the bits of Rust surrounding the asm! macro).
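To illustrate the general parse-then-look-up shape such a solution takes, here is a hypothetical sketch in plain Rust (NOT alion02's actual code, which is inline asm; see the repo link further down):

```rust
// Hypothetical shape of a LUT-based solution in plain Rust (NOT alion02's
// actual code). The table size matches the 16384 u64s discussed below.
static LUT: [u64; 16384] = [0u64; 16384]; // imagine this baked in at compile time

fn solve(input: &[u8]) -> u64 {
    // Fold a few input bytes into a 14-bit key...
    let key = input
        .iter()
        .take(14)
        .fold(0usize, |k, &b| (k << 1) | (b & 1) as usize);
    // ...then "solving" is a single table load.
    LUT[key & 0x3FFF]
}
```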
The bot is locked to a clock speed of 3.4 GHz, which works out to 3.4 clock cycles per nanosecond. Modern CPUs can execute multiple instructions per cycle in certain cases, and as with all benchmarking there will be some measurement error (likely ± a few nanoseconds in this case). On the 5950X, the L1 cache latency is 0.792ns. Also, on Zen 3, I believe the hardware may even optimize out certain instructions entirely.
alion02’s LUT is 16384 u64s, which comes out to 131,072 bytes (roughly 131KB). This fits comfortably inside the 5950X’s 512KB L2 cache. Presumably, given that we run the benchmark in 100 batches of N iterations, the relevant parts of the LUT may stay in L1 (under-reporting the real timing by a bit, I suppose). Given all that, if the core sustains around 4 instructions per cycle, 13 instructions come out to 3–4 cycles, i.e. about 1 nanosecond at 3.4 cycles per nanosecond, so solving the problem in 1ns (± measurement error) is plausible.
Feel free to examine the benchmark code for more details, or try running the solution against your own AoC inputs in your benchmark harness of choice :)
All that being said, I may be wrong in my interpretation of the results, and I welcome any feedback or any improvements to our benchmark harness! The number does seem suspiciously low but I can’t think of any reason for the under-reporting.
EDIT: Compiled solution + benchmark harness under Godbolt found here. Relevant lines start at 2400.
The one that says 1 nanosecond is 6 instructions long: https://github.com/indiv0/aoc-fastest/blob/main/2024/d17p2.rs#L367
Definitely something going on with measurement, but in an absolute sense that’s not much measurement error.
The benchmark bot is running the solution in a loop to get an average time, after a warmup period (I think, from skimming the benchmark bot’s code). I would assume all relevant data is in L1 cache. 1ns is still too fast.
Edit: Actually, I’m not convinced that 1ns is too fast, especially considering that a reported 1ns presumably includes anything up to 1.5ns.
Edit2: The benchmarking code looks like it uses naive (truncating) integer division, so I think “1ns” actually means “up to 2ns”.
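For what it’s worth, here is the truncation effect that last edit describes, with made-up numbers:

```rust
fn main() {
    // Hypothetical totals: 1999 ns measured across 1000 iterations.
    let (total_ns, iters) = (1_999u64, 1_000u64);
    // Naive integer division truncates: 1.999 ns/iter reports as "1 ns",
    // so a printed 1ns can mean anything up to (but not including) 2ns.
    assert_eq!(total_ns / iters, 1);
}
```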