Threads for kt315

  1. 3

    I suggest taking a look at Stressgrid (disclaimer: I am a contributor). You can create elaborate behaviours with Elixir scripts, run them at scale in public clouds, and collect detailed telemetry.

    1. 2

      Very useful benchmark, thanks for the analysis!

      I’m curious about the behaviour under overload – how would the Erlang servers manage to keep latency constant? Do they reject requests with 503 or similar? Is this tracked by the test setup?

      Also, I’d love to see how the servers would do if the test continued after that hour while keeping request rate constant.

      1. 4

        I can really only speak about Erlang here, but I’ll give a quick rundown of what cowboy (and presumably mochiweb) does and how the BEAM helps keep latency low.

        When a request comes into cowboy, it follows this procedure (with a bit of hand-waving):

        1. Accept the socket using a central socket acceptor (central here meaning there is only one, because only a single process can listen on a given port)
        2. Start an Erlang process to handle the actual work of the request
        3. On the new process, do the work of the given endpoint (in the case of this test, it just sleeps for ~100ms)
        4. The new process has access to the socket (or port in Erlang terms) and sends the response.

        This is roughly what all web servers do.
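
        For concreteness, here is a rough sketch of that flow using the Cowboy 2.x API. The module and route names are made up for illustration; this is not the code from the benchmark.

        ```erlang
        %% Minimal sketch of the procedure above (Cowboy 2.x); names are illustrative.
        -module(sleep_handler).
        -behaviour(cowboy_handler).
        -export([start/0, init/2]).

        start() ->
            Dispatch = cowboy_router:compile([{'_', [{"/", ?MODULE, []}]}]),
            %% Step 1: the listener owns the listening socket and accepts connections.
            {ok, _} = cowboy:start_clear(http_listener,
                                         [{port, 8080}],
                                         #{env => #{dispatch => Dispatch}}).

        %% Steps 2-4: init/2 runs in the Erlang process handling this request.
        init(Req0, State) ->
            timer:sleep(100),  %% the endpoint's "work" in this test: sleep ~100 ms
            Req = cowboy_req:reply(200,
                                   #{<<"content-type">> => <<"text/plain">>},
                                   <<"ok">>,
                                   Req0),
            {ok, Req, State}.
        ```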

        The specific part of Erlang that helps here with low response times, even under load, is that Erlang uses a preemptive scheduler. What this means is that if you have N cores, you can have up to N processes actively doing work at any given time. The special part, though, is that after some number of reductions (1 reduction is roughly equal to 1 function call), the scheduler will stop that process and start/resume a different one (assuming you have more than N processes trying to do work). The whole point of this is that each process gets roughly the same amount of CPU time.
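
        A toy way to see the effect (my own illustration, not from the article): start far more CPU-bound processes than scheduler threads and note that a freshly spawned process still gets CPU time almost immediately, because each busy process is suspended once its reduction budget is spent.

        ```erlang
        -module(preempt_demo).
        -export([run/0]).

        %% CPU-bound loop; every recursive call costs reductions, so the
        %% scheduler can suspend the process at any of those calls.
        busy(0) -> ok;
        busy(N) -> busy(N - 1).

        run() ->
            %% Start far more busy processes than there are scheduler threads.
            NBusy = 4 * erlang:system_info(schedulers_online),
            [spawn(fun() -> busy(200000000) end) || _ <- lists:seq(1, NBusy)],
            %% A freshly spawned process is still scheduled quickly.
            Parent = self(),
            T0 = erlang:monotonic_time(millisecond),
            spawn(fun() -> Parent ! {pong, erlang:monotonic_time(millisecond) - T0} end),
            receive {pong, Ms} -> io:format("pong after ~p ms~n", [Ms]) end.
        ```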

        So in this test, when the sleep is called (or if it were an actual database call), the scheduler takes that process off the CPU and starts doing work on another process. When the processes “wake up” from their sleep, they are scheduled again and immediately send off their response.
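
        (As an aside, timer:sleep/1 is essentially just a receive with a timeout, so a sleeping process is parked and costs no scheduler time until the timeout fires. Roughly:)

        ```erlang
        %% Roughly what timer:sleep/1 does: block in a receive until the timeout fires.
        sleep(Ms) ->
            receive
            after Ms ->
                ok
            end.
        ```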

        There is actually an interesting talk by Saša Jurić on the difference between preemptive scheduling (Erlang/Elixir) and cooperative scheduling (golang) and how it can affect the performance of your application. It is, unsurprisingly, presented from the perspective of an Erlang/Elixir developer, but it is interesting to watch nonetheless.

        1. 1

          This is actually a very good question! While no 503s were observed in the test, there were timeouts. I updated the article with the corresponding graph.

          We will add a sustained phase to future tests to see what happens at a constant request rate.

        1. 3

          I wonder how a webserver in Rust using futures would cope.

          1. 3

            Do you have a recommendation for a Rust webserver to include in the benchmark?

            1. 2

              Take a look here: https://www.techempower.com/benchmarks/#section=data-r18&hw=ph&test=fortune&l=xhnr73-f

              Languages included: Erlang, Elixir, Go, Java, Javascript, Rust.

              Rust’s actix / actix-web is inhumanly fast.

            1. 3

              I kinda hate it when an article presents a benchmark like this without even trying to find out the cause of the difference. It’s like a whodunit story with no conclusion.

              1. 7

                We’re working on follow-up benchmarks for some changes; there will be a part 2!

                1. 2

                  Oh snap, thank you!

              1. 3

                SYN, SYN/ACK, and ACK TCP segments will be 64 bytes, which is also the smallest Ethernet frame a driver will send (the nicer ones pad the payload with zeroes if a packet is shorter). This is a good lower bound to use for stress testing, as the connections-per-second rate will be influenced by a system’s ability to process such short packets.

                It is also good to know that there is a PPS budget on EC2 instances. What I find curious is that kernel-bypass solutions are able to go way over this limit, and Amazon published an ENA driver in DPDK for higher loads. With the budget seen here, such a kernel-bypass solution seems useless.
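
                For reference, the arithmetic behind those sizes (assuming IPv4 and a segment with no TCP options, no VLAN tag):

                ```latex
                % Ethernet (14 B) + IPv4 (20 B) + TCP (20 B) headers, no payload;
                % anything under the 60 B minimum is padded, and the 4 B FCS
                % completes the 64 B minimum Ethernet frame.
                14 + 20 + 20 = 54\,\mathrm{B}
                  \;\longrightarrow\;
                \max(54, 60) + 4_{\mathrm{FCS}} = 64\,\mathrm{B}
                ```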

                1. 1

                  I updated the article with results for a 0-byte payload (54-byte packets).

                  Yes, trying DPDK would be interesting. At the packet rates reported in our blog, kernel handling seems perfectly sufficient.

                1. 4

                  It’s not clear why the graphs have an x axis of seconds, since that doesn’t appear to be mentioned as part of the experiment. Also, why does it take up to 300 seconds to warm up to a steady state?

                  1. 4

                    We should have made it clearer. In all rate graphs, the load gradually increases from 0 to 4M packets/s (2M when testing with a server) over a period of 10 minutes.

                    1. 3

                      I’d guess that’s how long an instance takes to launch, including successful cloud-init execution.

                      1. 4

                        Shouldn’t it stay at zero and then shoot up? They’re all perfectly linear.

                    1. 3

                      Wow, node performed amazingly! A single thread mostly keeping up with multi-threaded alternatives! Just run $(ncpus) processes per host and the throughput would soar.

                      But who runs benchmarks on VMs anyway? Seems like you’d want to run it on real hardware to get numbers you could count on.

                      1. 1

                        Yes. The next test will include multiple node processes!

                        There were several runs, each with a new target EC2 instance, to see if there would be a notable difference in the results from “noisy neighbors”. All runs were very consistent.

                      1. 1

                        Would you see a difference if you ran these on Dedicated hosts on AWS?

                        1. 1

                          An interesting test would be to run shared vs. dedicated with everything else being the same. My guess is that the larger the instance, the smaller the advantage of a dedicated host.

                        1. 15

                          I’m curious about the CPU Usage shown for Elixir. Maybe since it wasn’t otherwise fully utilized, that was BEAM’s busy-waits spinning for a bit?

                          1. 11

                            Yes, busy waits are the very likely cause of the CPU saturation, since the BEAM was very responsive. I’d need to run the test with a VM with microstate accounting enabled to confirm this.
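
                            For reference, microstate accounting can be toggled and read back at runtime roughly like this (a sketch; a dedicated busy_wait state only appears if the VM was built with the extra accounting states, otherwise that time is folded into the default states):

                            ```erlang
                            %% Enable microstate accounting, let the load run for a while,
                            %% then dump the per-thread time (in microseconds) spent in each state.
                            erlang:system_flag(microstate_accounting, true),
                            timer:sleep(10000),
                            Stats = erlang:statistics(microstate_accounting),
                            erlang:system_flag(microstate_accounting, false),
                            %% Each entry is a map with a counters map per state
                            %% (emulator, sleep, gc, port, ...).
                            io:format("~p~n", [Stats]).
                            ```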

                            1. 10

                              You could do an easy check by running the emulator with “+sbwt none” and seeing if the saturation disappears.

                              (Also, darn, I didn’t see meredith’s question before I posted my other comment.)

                              Edit: maybe you would also want to include “+sbwtdcpu none” and “+sbwtdio none”? I’m not sure.

                          1. 7

                            I’m curious why the cluster module wasn’t used for Node.js. I’m not saying it’s as good as Elixir or Go; I just think the comparison would perhaps have been fairer.

                            1. 7

                              I wasn’t aware of the cluster module and agree it should be included; something for the next round of tests!

                            1. 2

                              Is node still single threaded?

                              Also, some of the graphs are confusing. The range bars dip below what should be the minimum limit. Sometimes the latency is below the 90 ms (100 - 10%) floor. In one graph, Go receives a request in negative microseconds.

                              1. 3

                                > Is node still single threaded?

                                Yes, Node is still essentially single threaded. (Essentially because you can, as of 11.7, create Worker threads, but it’s nothing like goroutines or BEAM processes)

                                1. 1

                                  Agreed. The bars are based on standard error, and I added them as both positive and negative, even though the actual error around the graphed mean line is likely mostly positive. What would be the right way to do this?

                                  1. 2

                                    A lot of natural processes actually follow a lognormal distribution, which nicely handles the impossibility of values below zero.
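
                                    Concretely (with m and s the mean and standard deviation of the log-latencies), one option is to plot the geometric mean with multiplicative error bars, which by construction stay above zero:

                                    ```latex
                                    \mu_g = e^{m} \quad\text{(geometric mean)}, \qquad
                                    \text{bars at}\;\; \mu_g e^{-s} \;\;\text{and}\;\; \mu_g e^{+s} \;>\; 0
                                    ```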