There is an Erlang tool called Tsung that supports exactly what you need (unless you need some logic there). As for maintaining the tests, you can use the proxy recorder, which makes it fairly easy to create new test suites depending on your needs. And with distributed testing you can really bring your SUT to its knees if you need to.
I love Tsung. Back when I did more devops than C++ it was my go-to for load testing. I’ve written about it, including how to set up a distributed setup so that several servers generate the load: https://raymii.org/s/articles/Basic_Website_load_testing_with_Tsung.html
Combine the distribution with AWS or OpenStack and you can spin up hundreds of servers for an hour-long load test, pay pennies and have a nice report.
There is one tool that I’ve enjoyed: Locust
Locust is pretty nice, although it falls into the same bucket of tools that end up supporting simple workloads and failing to capture more complexity. We used it at work until we couldn’t. We were probably using it wrong but when we integrated WebSocket calls into it, it just became super difficult to maintain our code in it and its performance suffered—we had to use nearly as many resources as we did for the system under test! It’ll be a little more feasible for this stuff when it supports async I think.
This is a problem I am currently facing at work.
We run virtual events of up to fifty participants. For 99% of the time the server stack sits at near-zero load, until the event moves into a new phase and there is a brief tsunami of activity (from ~100 req/min to ~100 req/sec).
I’d like to be able to automate load testing, but it would involve spinning up 50 user agents able to connect a fake video and audio source to our AV solution, and then running through at least one cycle of our event schedule: entering data, navigating, etc.
So far my solution to this has been to get everyone in the office together a few times a month for a real-world load test. If someone can offer a better, more automated alternative I would snap it up.
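For what it’s worth, the request-rate half of that pattern (not the 50 AV clients) can be written down as an open-loop schedule of send times. A minimal Python sketch: `spike_schedule` and its parameters are names made up for illustration, and the 60 s quiet phase and 10 s spike are assumed durations, not numbers from the comment above.

```python
def spike_schedule(baseline_per_min, spike_per_sec, quiet_s, spike_s):
    """Return send times (seconds from start) for a quiet phase
    followed by a short spike. Requests are evenly spaced in each phase."""
    # Quiet phase: baseline_per_min requests per minute for quiet_s seconds.
    n_quiet = int(quiet_s * baseline_per_min / 60)
    quiet_gap = 60.0 / baseline_per_min
    times = [i * quiet_gap for i in range(n_quiet)]

    # Spike phase: spike_per_sec requests per second for spike_s seconds.
    n_spike = int(spike_s * spike_per_sec)
    times += [quiet_s + i / spike_per_sec for i in range(n_spike)]
    return times


# ~100 req/min for one minute, then ~100 req/sec for ten seconds (assumed durations).
schedule = spike_schedule(baseline_per_min=100, spike_per_sec=100,
                          quiet_s=60, spike_s=10)
print(len(schedule), "requests, last one at", schedule[-1], "s")
```

A driver would then fire each request at its scheduled time whether or not earlier ones have finished, which is what makes the spike land as a spike.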
Oh hey, that’s almost identical to the traffic patterns that we have! We have very low traffic until the moment things start, then it’s a huge spike of traffic as everyone starts at the same time.
If you can generate enough load with everyone in the office that seems pretty reasonable (especially since you could do that by dogfooding your product for something like a town hall or internal conference). If you do look into automating more I’d love to have a chat about the problems you run into or how you approach it because it sounds like we’re facing similar problems in different markets.
When you load test like this, it seems like there are a lot of layers between the activity that starts the test, the actual failure, and the symptom that you recognize, not to mention the actual complexity in the system under test. It makes me wonder whether load testing is actually worth the effort (from the article it sounds like you’re trying really hard to get it to work) and whether that effort would be better spent trying to understand the ways that the system could fail (not that you haven’t done that already, but it seems like the more efficient route).
Here’s an article you might find useful: sled simulation guide (jepsen-proof engineering)
The key is to add instrumentation at every layer of the stack and load test, observing all of these monitors/dashboards. Usually simulation testing is even more expensive than adding observability at different points in the stack because it constrains design patterns heavily, but yes this is the gold standard of design. When doing research on distributed systems, discrete event simulation is one of the best ways other than proving bounds on your algorithm to demonstrate desirable properties (such as linearizability), but if great care isn’t taken to either design the system with simulation in mind or to render every aspect of the algorithm into a separate simulation environment, you can end up coding in bugs that your model does not actually have. In this regard, load tests are both simpler and do not risk testing behavior that is different from production behavior. That said, both approaches are fairly time consuming.
I love reading load test articles. We have a team member dedicated to performance testing applications and infrastructure in which we hit 1000s of requests per second.
Our tests are stateful, emulate user traffic in the same profile as real load, and use JMeter… so he must be doing something right!
The other thing to look out for in load testing tools is “coordinated omission”, where the load testing tool gives you unrealistically optimistic latency numbers because it accidentally pauses sending samples when the system under test slows down.
Gil Tene has a pretty good talk on this (*) which is written up at http://highscalability.com/blog/2015/10/5/your-load-generator-is-probably-lying-to-you-take-the-red-pi.html and there’s a video with slides at https://www.youtube.com/watch?v=lJ8ydIuPFeU which I enjoyed watching. He’s a pretty good speaker.
(* and other related measurement errors, e.g. trying to compare latency numbers between systems that are giving unbounded latency because they are terminally oversubscribed. This is analogous to trying to compare the speed of two race cars after they have both crashed and exploded.)
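To make the coordinated omission point concrete, here is a toy Python simulation (not taken from the talk; `simulate` and the numbers are made up for illustration). It models a single-connection generator that means to send one request per second but blocks until each response comes back, and it compares latency measured from the actual send with latency measured from the intended send time.

```python
def simulate(service_times, interval):
    """Simulate a blocking load generator.

    service_times: how long the server takes for each request, in seconds.
    interval: intended gap between request starts, in seconds.
    Returns (naive, corrected) latency lists:
      naive     - measured from the moment the request was actually sent
      corrected - measured from the moment it was supposed to be sent
    """
    naive, corrected = [], []
    clock = 0.0  # time at which the generator is free to send again
    for i, service in enumerate(service_times):
        intended = i * interval
        actual_send = max(clock, intended)  # cannot send while still blocked
        done = actual_send + service
        naive.append(done - actual_send)
        corrected.append(done - intended)
        clock = done
    return naive, corrected


# Ten requests, one per second intended; the third one stalls for 5 s.
naive, corrected = simulate([0.01, 0.01, 5.0] + [0.01] * 7, interval=1.0)
print("slow samples (naive):    ", sum(x > 1.0 for x in naive))
print("slow samples (corrected):", sum(x > 1.0 for x in corrected))
```

With these numbers the naive measurement sees one slow sample, while measuring from the intended send time shows five, because the requests that should have gone out during the stall were quietly delayed instead of being counted.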
I’ve been dreaming for a few years about building a replay harness and then using it to replay real traffic at different speeds. Sadly I’ve never managed to get my $employer interested in it, let alone in OSSing it.
The new replay feature that was released this December for AWS EventBridge does come a bit close.
I was not aware of any tool either, but a quick google shows https://goreplay.org/ and https://github.com/edmunds/shadowreader
FWIW, I was thinking of streaming to Kafka at the nginx layer and replaying from it at a later time. It could eventually track the responses as well, to compare those.
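The "replay at a different speed" part is mostly timestamp arithmetic. A rough Python sketch, assuming the captured traffic is already available as (timestamp, request) pairs sorted by time; `replay_schedule`, `replay` and the `send` callback are hypothetical names, and reading from Kafka or nginx logs is left out entirely.

```python
import time


def replay_schedule(records, speed=1.0):
    """Yield (offset_seconds, request) pairs for a replay.

    records: iterable of (timestamp_seconds, request), sorted by timestamp.
    speed: 2.0 replays twice as fast as recorded, 0.5 at half speed.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    first_ts = None
    for ts, request in records:
        if first_ts is None:
            first_ts = ts
        yield (ts - first_ts) / speed, request


def replay(records, send, speed=1.0):
    """Call send(request) for each record at its scaled offset from now."""
    start = time.monotonic()
    for offset, request in replay_schedule(records, speed):
        delay = start + offset - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        send(request)


recorded = [(100.0, "GET /a"), (101.0, "GET /b"), (104.0, "GET /c")]
replay(recorded, send=print, speed=10.0)  # the 4 s capture replays in about 0.4 s
```

Because `send` is called inline here, a slow target delays later requests; a real harness would probably hand each request to a worker pool so the schedule is kept.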
I can’t really vouch for it (since I’ve used it for a sum total of about 30 minutes) but https://k6.io/ was pretty easy to get going.
That’s about how long I’ve used k6 for as well! We evaluated it before writing our own tool, and we tried so hard to avoid writing our own tool. The scenario we have is basically that we open a WebSocket and also do HTTP calls, and we have to trigger those HTTP calls based on the WebSocket, but we also have a periodic background task that runs every 30 seconds (heartbeat). I couldn’t get all of that to fit into k6, but it could be that I missed something in the docs.
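For anyone trying to picture that scenario, its shape can be sketched with plain Python asyncio. This is not our tool and not k6: the `asyncio.Queue` stands in for the WebSocket, and `http_call` and `send_heartbeat` are hypothetical stubs you would replace with real client calls.

```python
import asyncio


async def heartbeat(send_heartbeat, interval):
    """Background task: call send_heartbeat every `interval` seconds until cancelled."""
    while True:
        await asyncio.sleep(interval)
        await send_heartbeat()


async def virtual_user(ws_messages, http_call, send_heartbeat, heartbeat_interval=30.0):
    """One simulated user: HTTP calls are triggered by incoming socket messages
    while a heartbeat runs in the background. A None message means the socket closed."""
    hb = asyncio.create_task(heartbeat(send_heartbeat, heartbeat_interval))
    try:
        while True:
            msg = await ws_messages.get()
            if msg is None:
                break
            await http_call(msg)  # the HTTP call triggered by the socket message
    finally:
        hb.cancel()
        await asyncio.gather(hb, return_exceptions=True)


async def demo():
    calls, beats = [], []

    async def http_call(msg):  # stub: record instead of calling a server
        calls.append(msg)

    async def send_heartbeat():  # stub heartbeat
        beats.append("beat")

    queue = asyncio.Queue()

    async def feed():  # stands in for the server pushing messages
        for msg in ["phase-1", "phase-2"]:
            await asyncio.sleep(0.05)
            await queue.put(msg)
        await asyncio.sleep(0.05)
        await queue.put(None)

    # The heartbeat interval is shortened so the demo finishes quickly.
    await asyncio.gather(
        feed(),
        virtual_user(queue, http_call, send_heartbeat, heartbeat_interval=0.04),
    )
    return calls, beats


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Running many users is then a matter of gathering many `virtual_user` coroutines, each with its own connection.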