These articles are interesting, but they often gloss over what is, to me, the most interesting part: the decision to go from something off the shelf to something home-grown. Why doesn’t HAProxy scale horizontally? Why not something like nginx? Are there hardware load balancers that could have helped here? What’s the expected ROI of scaling load balancing horizontally instead of vertically? What are other folks doing in the industry?
I’ve noticed an interesting desire amongst engineers to save money and use open source, ignoring the cost of their effort.
A site we run at $work also had performance issues with our HAProxy instances. We simply replaced them with commercial software LBs and went on to other things. We didn’t blog about it because the project took a month or so and wasn’t very glamorous.
I don’t think it’s necessarily a desire to save money, it’s a desire to use software you can understand, modify and enhance as needed. I’m guessing the commercial load balancer you’re using is pretty much a black box - if you have problems you’re going to have to rely on vendor support ([insert horror story here]) instead of being able to fix it yourself. Troubleshooting is a helluva lot easier if you have source code…
Yes, going with a commercial product is better in a lot of cases, but there are always trade-offs.
Agreed - there’s always the risk of bugs and black boxes. On that topic, the question is whether the velocity you gain is worth it. After all, many are comfortable running on EC2 with ELBs, despite both of them being very opaque.
Bug-wise, I can only talk about my experience: we’ve had no major implementation bugs and the experience has been very smooth. We have been running these devices for several years.
This could of course change. As a reference point, I also run a cluster of several hundred nginx proxies which work very well, but we’ve had some showstopper bugs over the years. At those times, having the ability to dive into the code has not been helpful, due to the complexity of the product and the fact that these bugs happen infrequently enough that we don’t have an nginx internals expert on staff. Sure, we can read and fix the code, but the MTTR is still high.
In GH’s case, they now need to keep at least one or two members of staff full time on this codebase, otherwise their knowledge will begin to degrade. The bus-factor risk is high.
For future features, they can at best have a small team working on this problem without it becoming a distraction for a company their size. I do see they plan to open source this, which may reduce the size of that issue, assuming the project gets traction.
In my case, I pay a vendor for a team of hundreds of specialists working full time on this problem. We have gained many features for “free” over the past years.
In terms of debugging - the inspection capabilities of the software we chose have been unmatched by anything else I’ve used. We can do realtime deep inspection of requests. This isn’t anywhere near the blackboxyness of ELBs, which most are comfortable using.
For control, the cluster has a very complete REST API, and to assist teams, somebody wrote a Terraform provider in their 20% time.
We run the devices in a service provider model, meaning we have a centrally managed hardware platform and then we do containerised virtual load balancer deploys so that teams who have unique use cases can get their own instances. The devices are integrated into our BGP routing mesh, so traffic is seamlessly directed to where it should be. This is all vendor supported.
In terms of traffic, we do tens of gigs over millions of connections at peak. We use many protocols, not just HTTP, and a good portion of our requests are short-lived.
As you might infer, I’m very happy with my choice :)