I think we should stop using this on everything.
Or use it on literally everything. Blazingly fast coffee. Blazingly fast summer vacations. Blazingly fast third degree burns. Blazingly fast global warming. Blazingly fast sex. 🔥🔥🔥
Blazingly fast spinlocks and deadlocks.
Blazingly fast bankruptcy
Blazingly fast federal investigation
Blazingly fast PR rejection
“Blazingly fast” “federal” is an oxymoron.
I was certainly surprised by the speed at which a certain Congressmember was indicted after being elected amid lots of lies. Might be a new record.
While we’re at it let’s do supercharged too.
I keep wondering how an IO-bound application can be made blazingly fast by reimplementing it in Rust (assuming you have asyncio and threading support in the existing implementation).
Then you have this gang wanting to write the Python FFI bindings to an underlying C library in Rust and promising performance improvements…
Fighting this myth occupies part of my day at work as a principal.
I get a lot of perf gains at my company rewriting components or applications in Rust, but it’s usually replacing Java (which doesn’t really have a good async story save for Netty), Node.js, or Python that wasn’t using asyncio. We have some Python applications using asyncio and I must confess to being unimpressed.
e.g. I replaced a library built on an HTTP client used in Node.js (this was async I/O) with a native Rust library that exposed a NodeJS API via FFI. The Rust library emitted data using a Kafka Producer client instead of HTTP. Tail and median response latencies, reliability, overall efficiency all greatly improved. A big differentiator is that I’m not just able to do the work faster (parallelization, concurrency), I’m able to do more with less (efficiency).
Another example: I have an application that processes a Kafka data stream that decompresses to about 5.5 GB/second. The Java applications consuming this same exact topic have 64, 128, or 256 instances, each an individual server deployment w/ a whole JVM. The 128-instance deployment has 1,100 cores and 2.5 TB of RAM allocated.
I’m processing that in Rust doing very similar work with about 5 seconds of consumer lag with 10 instances, ~65-75 CPU cores utilized during peak throughput and 100 cores allocated. 7.8 GB - 8 GB RAM utilized, 15 gigabytes allocated. The Java applications were written by some of the best backend developers at my (not small) company.
Another example: I rewrote a relatively low-level library function that looks up IP addresses. The Node.js version took double-digit microseconds; the Rust and Java versions I wrote both benchmarked at ~130-150 nanoseconds. I’m not even sure how to write an efficient bitwise prefix trie representation in Node.js, but if such a library exists for Node, I couldn’t find it. I’d probably just reuse the Rust library in a Node app. This library function gets invoked trillions of times per day at my company, so this isn’t small potatoes.
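For illustration only, here is a minimal sketch of the kind of bitwise prefix trie being described, written for IPv4 keys with a made-up u32 payload; the actual library isn’t shown in the thread, so the node layout and API here are assumptions.

```rust
// Hypothetical sketch: a bitwise binary trie for IPv4 longest-prefix match.
// Node layout, value type (u32), and names are illustrative assumptions.

#[derive(Default)]
struct Node {
    children: [Option<Box<Node>>; 2],
    value: Option<u32>, // e.g. an ASN or region id stored at this prefix
}

#[derive(Default)]
struct PrefixTrie {
    root: Node,
}

impl PrefixTrie {
    /// Store `value` under the prefix `addr/len`.
    fn insert(&mut self, addr: u32, len: u8, value: u32) {
        let mut node = &mut self.root;
        for i in 0..len {
            let bit = ((addr >> (31 - i)) & 1) as usize;
            node = node.children[bit]
                .get_or_insert_with(Box::default)
                .as_mut();
        }
        node.value = Some(value);
    }

    /// Longest-prefix match: walk the address bits, remembering the last
    /// node that carried a value. A lookup is at most 32 branch steps.
    fn lookup(&self, addr: u32) -> Option<u32> {
        let mut node = &self.root;
        let mut best = node.value;
        for i in 0..32 {
            let bit = ((addr >> (31 - i)) & 1) as usize;
            match node.children[bit].as_deref() {
                Some(child) => {
                    node = child;
                    if node.value.is_some() {
                        best = node.value;
                    }
                }
                None => break,
            }
        }
        best
    }
}

fn main() {
    let mut trie = PrefixTrie::default();
    trie.insert(u32::from(std::net::Ipv4Addr::new(10, 0, 0, 0)), 8, 1);
    trie.insert(u32::from(std::net::Ipv4Addr::new(10, 1, 0, 0)), 16, 2);
    let ip = u32::from(std::net::Ipv4Addr::new(10, 1, 2, 3));
    assert_eq!(trie.lookup(ip), Some(2)); // the most specific prefix wins
}
```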
I have many examples for this because much of my work has been upgrading/replacing unreliable and inefficient systems and libraries the last few years. It’s not a panacea, you have to measure and you have to understand how things work well enough to accurately judge where wins can be obtained. I don’t think it being non-trivial justifies prejudicially ripping out a chunk of your engineering decision tree.
Your job as a principal is not only to be Dr. No but also to equip people to accurately judge, test, and evaluate engineering outcomes themselves and w/ minimum time staked.
There definitely are a lot of valid use cases where you can get visible performance gains. I guess it will mostly be when you move some implementation from byte code (virtual machine) based languages to natively compiled languages.
I am particularly wary of using Rust as a Python FFI to call into an underlying C library when there is decent tooling to use Python CFFI to call into the C library without the intermediary Rust bindings.
It’s not a panacea
Yes, you need a good reason to rewrite something.
you have to measure and you have to understand how things work well enough to accurately judge where wins can be obtained.
This is the key, fully agree.
Your job as a principal is not only to be Dr. No but also to equip people to accurately judge, test, and evaluate engineering outcomes themselves and w/ minimum time staked.
Not being a luddite here trying to push back on anything Rust. When there is a proposal, I would like to see a data-driven approach (prototype & measure).
I guess it will mostly be when you move some implementation from byte code (virtual machine) based languages to natively compiled languages.
That’s part of it but not even most of it. There are natively compiled languages out there like GHC Haskell, Golang, and OCaml that can’t get super close to what I can do in Rust. I say this as someone well known for teaching and using Haskell professionally. The bitwise prefix trie in Java and Rust being the exact same in a benchmark is one exception, but Java isn’t a natively compiled language, not that I think it really matters here.
On the side of GC, tail latencies, etc. Discord’s blog post here is good: https://discord.com/blog/why-discord-is-switching-from-go-to-rust
For efficiency/CPU usage, https://andre.arko.net/2018/10/25/parsing-logs-230x-faster-with-rust/
IME benchmarks like the programming language shootout understate and minimize the differences, compared to the typical outcomes I see when I make changes to an existing production application.
I am particularly wary of using Rust as a Python FFI to call into an underlying C library when there is decent tooling to use Python CFFI to call into the C library without the intermediary Rust bindings.
If that makes sense for your application that’s great, but it usually doesn’t shake out as nicely as you’d hope. It’s not enough to have librdkafka bindings; you need something to manage the producer in a thread-safe manner, do some data processing and serialization before sending it over the wire, etc. etc. etc. It’s faster and more reliable if I just do that in Rust.
Speaking as someone who patched librdkafka recently, I am pretty sure I could make a faster Kafka Consumer in and of itself in pure Rust and leverage that to make the aforementioned application processing 5.5 GB/second more efficient.
You’d be surprised how often you don’t want the limitations of Python’s runtime, GC, and inefficiency. There are a number of things I can do more efficiently and safely with Rust bindings to librdkafka than you could with Python C FFI talking to librdkafka directly. For one thing, I can read and share data with C safely and without copying, e.g. https://docs.rs/rdkafka/latest/rdkafka/message/struct.BorrowedMessage.html
I leverage BorrowedMessage in the aforementioned application that processes 5-12 gigabytes/second in real-time.
Further, in Rust I have things like tokio, async/await, etc. I can do things with ease that would be tremendous work to make bug-free in C and would be virtually unmaintainable for most working programmers.
Rust isn’t adding any overhead in your alternative. It’s logic that would end up living in either Python (slow, untyped) or C (unsafe, error prone) anyhow. The FFI between C and Rust isn’t like Python and C.
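As a concrete illustration of the zero-copy point (not the poster’s actual application), here is a minimal sketch of consuming through rdkafka’s BorrowedMessage, assuming the rdkafka crate with its tokio-backed StreamConsumer; the broker address, group id, and topic name are placeholders.

```rust
// Hypothetical sketch, not the application described above.
// Assumes rdkafka (with its tokio feature) and tokio as dependencies.
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{Consumer, StreamConsumer};
use rdkafka::message::Message;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let consumer: StreamConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092") // placeholder broker
        .set("group.id", "example-group")           // placeholder group
        .create()?;
    consumer.subscribe(&["example-topic"])?;        // placeholder topic

    loop {
        // recv() yields a BorrowedMessage: payload() is a &[u8] view into
        // librdkafka's own buffer, so the bytes are read without first
        // copying them into Rust-owned memory.
        let msg = consumer.recv().await?;
        if let Some(payload) = msg.payload() {
            // Process the borrowed bytes here; they are only valid while
            // `msg` is alive, which is what keeps the zero-copy read safe.
            println!("got {} bytes at offset {}", payload.len(), msg.offset());
        }
    }
}
```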
All that aside, not much of my Rust work involves wrapping a C library in the sense you mean unless you’re counting glibc or something. The Kafka clients are the main thing for me there. The native Rust libraries are often better when they have everything you need.
I’m not trying to personify the Rust Evangelism Strike Force here, it’s just that it felt like you were judging from experience in industry that didn’t include exposure to the kinds of applications and workloads I deal with.
In the era of SSDs and multi-gigabit connections I/O bound isn’t what it used to be. You can get accidentally CPU-bound on silly things, especially if you use a high level web framework.
True, we saw this happen in NFS over NVMe. Saturating the available network became the new challenge in the NFS server implementation.
FreeBSD’s GEOM stack hit a lot of this a decade or so ago. It was beautifully designed with clean message-passing abstractions between layers. They added some overhead but with 10ms disk response times the latency was in the noise. NVRAM-based storage suddenly turned them into bottlenecks and there was a load of work done to allow a single thread to do synchronous calls through the stack.
Rust - love it. So safe and blazingly fast, in particular compilation.
In particular I like to have so many dependencies: https://github.com/orhun/rustypaste/blob/master/Cargo.lock Feels cozy.
Ouch, that’s a truckload of dependencies.
https://github.com/orhun/rustypaste/blob/master/Cargo.toml#L26
It doesn’t seem unreasonable. A web server. TLS. Globs, regex, date/time processing.
Note that Rust libraries are typically split into tiny components, so transitively there’s many of them, e.g. cookie parser is a crate instead of being copy pasted code in a web server monolith. It actually helps code reuse.
Complaints about number of deps to me sound like “I can’t eat a pizza cut into 12 slices, that’s too many slices! Cut it in four instead!”
Two web servers
https://github.com/orhun/rustypaste/blob/f20e2d8d12ceecf65ac60a0acb4b59277b148fac/Cargo.lock#L476
This is a pitfall of looking at the Cargo.lock. axum is not used in this project. It’s not even compiled. It’s there because it’s part of a disabled optional feature of shuttle-common that supports multiple back-ends.
It seems both are used:
I’m looking at the latest commit (f20e2d8d), and I’m not seeing axum in either cargo tree or cargo build output. Maybe those symbols are from some panicking placeholders that just print “axum is disabled”?
The number of transitive dependencies is quite important when you consider the attack vectors of a rogue library. To take your analogy back, each slice is cooked by a different chef. Each has a chance to be poisoned, and you eat all the pizza.
Here’s the list of chefs (and their subordinates) you need to trust:
You can be absolutely certain that at least one of these organizations/users is within reach of a malicious state actor.
That is true, but you have to consider what the alternatives are. Usually it’s either write the whole thing yourself from scratch, which is highly impractical for most projects, or use dependencies from your OS’s package manager.
Do you know how many maintainers an average distro has? I don’t think any distro does vetting that could catch a state actor. And even good maintainers can’t do much beyond just “LGTM” the tons of code they pull in. And you usually don’t get fewer transitive dependencies, it’s just that package managers using precompiled binaries are worse at showing you the full tree.
In the end transitive dependencies are a form of a web-of-trust, so it’s not a big set of random people. If you’re really at risk of state attacks, you need to review every line of code you use. Rust has cargo-vet and cargo-crev for these.
You’re usually better off using the stdlib and then implementing the pieces of what you need. Really the only things you should ever pull in are a cryptography library and a hardware abstraction library.
“Web-of-trust” only works where entities are verified in real life, like what GPG does. This is not the kind of web of trust package systems have… it’s a web of “hopefully I’m too small to matter to be attacked”.
That by itself isn’t, but you need to look at the actual crates.
Many crates are way too bloated, with unnecessary dependencies, and they’re being pulled in by other crates without concern for efficiency.
It’s a cultural problem. Why is everyone pulling in serde, when nanoserde already covers 90% of use cases? Answer: laziness.
I’m uncomfortable nowadays with large crate dependency graphs, but I think this is a poor example with an overconfidently uncharitable conclusion.
nanoserde may suffice for an application that works only with the few, concrete serialization formats supported by nanoserde, but serde, as a generic infrastructure for supporting all formats, seems more useful for a library.
I’m sure many crates have dependencies that could be switched out to reduce bloat, but one would need to remove the heavier dependency from the entire graph, which, for a crate with a large dependency graph, may be difficult for such a standard choice as serde. If one doesn’t, then using an uncommon choice like nanoserde could worsen the bloat.
Attribution to laziness is itself a lazy explanation.
serde is not just a crate, but an ecosystem. You can serialize almost any data type to any format, because everyone supports serde. OTOH nanoserde is just your structs to/from its couple of built-in formats, and only if you do it manually, because your web framework’s extractors use serde.
serde is well-documented in both its own docs and on 3rd party Q&A sites.
serde is maintained by a well-known, respected author, and is very very widely used. It’s in over 45,000 crates. nanoserde is in 28 total.
And in the end, there isn’t that much difference between them in code size. I’ve tried a simple roundtrip of a struct via JSON, and with strip+LTO nanoserde gave me essentially the same executable size as serde.
There isn’t even a practical difference in dependencies. nanoserde’s “no deps” sounds nice, but all of serde’s dependencies are by the same author, they’re just not copy-pasted into the same folder. Plus they are so commonly used (in 60% of all crates) that they’re de facto the standard library and you’d have them in any non-trivial project even if you used nanoserde.
So you use a less known, less supported, worse documented, less interoperable library, wrangle with your whole dependency tree to avoid pulling in serde anyway, write more glue code, and you save a couple of KB from your executable? So lazy.
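For context, a minimal sketch of the derive-based workflow being contrasted here, assuming serde with its derive feature plus serde_json as dependencies; nanoserde exposes similar derives, but only for its own handful of built-in formats.

```rust
// Illustrative only; assumes serde = { features = ["derive"] } and serde_json.
use serde::{Deserialize, Serialize};

// One pair of derives, many formats: the same struct works with serde_json,
// bincode, serde_yaml, and any web-framework extractor that speaks serde.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Event {
    id: u64,
    name: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let event = Event { id: 7, name: "deploy".into() };
    let json = serde_json::to_string(&event)?; // {"id":7,"name":"deploy"}
    let back: Event = serde_json::from_str(&json)?;
    assert_eq!(event, back);
    Ok(())
}
```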
The problem with bloated dependencies isn’t executable size, it’s compile times.
there isn’t even a practical difference in dependencies.
There’s a big difference. Serde pulls in syn, proc-macro2, and quote. That alone bumps up my clean compile times by 15-20 seconds.
All the points you mentioned are valid, but I doubt that the majority of library authors actually make an informed choice about their dependencies, evaluate alternatives, and measure their compilation times.
Most people follow the herd, and pull in serde, clap, actix without doing their research on alternatives. This is a textbook definition of laziness.
It’s your assumption that compile times matter more than run-time performance, file size, development time, functionality, or reliability. This is not universal, and people who don’t have the same values as you aren’t necessarily lazy.
Just to avoid some misunderstanding here: I’m not calling everyone who uses serde lazy. Obviously there are good reasons to use that crate. It’s awesome. I also don’t believe that compile times matter more than the other things you mentioned.
My point is that many, many library authors pull in large dependencies for no net benefit. Compile times are one of the most noticeable examples, because they directly impede my productivity.
I’m talking about cases where they use maybe 1-2% of functionality of the crate. Cases where picking a lighter dependency increases runtime performance, without being brittle or poorly documented. This is what I’m referring to.
I think the way you’ve presented this situation is a false dichotomy, where picking the “standard choice” gives you all these other benefits.
This is how Rust is: a small std was the best decision, and I add only what I need.
I’m not sure if there are any disk space limits for a Shuttle deployment and we couldn’t figure it out on Discord, so let me know if you know anything about this.
That scares me. I know they don’t use Kubernetes, but I hope there’s some equivalent to a limit on ephemeral-storage: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#local-ephemeral-storage
By default, there are no limits, and I’ve seen many-a-cluster have whole nodes lock up due to one container filling up the disk!