So I ran into almost exactly the same thing at Square around 2012. Our whole production network, including our firewalls, was getting hammered and we didn’t know why. The growth in traffic was alarming, and we thought it was due to huge growth in business.
We even did a substantial network upgrade to mitigate it. But I did the math again and the traffic growth was on track to overwhelm our upgrades in a few months. We didn’t have great instrumentation at the time but we also saw that Redis (we only had one server and a replica, IIRC) was basically saturating its network.
Me and another engineer finally sat down in a big conference room to figure it out. After a bit of tcpdump we realized there was a set that kept getting items added to it and never cleaned up. It was several MB in size and we were pulling it for every API call. Napkin math put it at something like 99% of the Redis traffic we were seeing.
We manually truncated it and the traffic instantly dropped. A small PR and a deploy later, it was all fixed.
This was a particularly crazy case. But I always feel that any unoptimized software system has at least one 10x perf improvement to be found, if you just look.
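For anyone who wants to redo the napkin math from the story, here is a rough sketch. The set size and request rate are made-up placeholders (the comment only says "several MB" and "every API call"), and the function name is hypothetical.

```python
def payload_bandwidth_mbit_per_s(payload_bytes: int, calls_per_second: float) -> float:
    """Network rate, in megabits per second, from fetching one payload on every call."""
    return payload_bytes * 8 * calls_per_second / 1_000_000


# Hypothetical inputs, not figures from the story.
set_size_bytes = 5 * 1_000_000  # "several MB" taken as 5 MB
api_calls_per_second = 20       # assumed, fairly modest request rate

rate = payload_bandwidth_mbit_per_s(set_size_bytes, api_calls_per_second)
print(f"{rate:.0f} Mbit/s")  # 800 Mbit/s
```

Under those assumed numbers, a handful of requests per second is already most of a 1 Gbit/s link, which is why one oversized key can dominate everything else on the wire.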
I’m responsible for approving cpu/ram/storage increase requests from developers and stories like these do kinda make me wonder if I should be as lenient as I am.
I pretty much approve every request because what else am I going to do? Scour the source code of every app for inefficiencies? I did do that once: someone who wanted 200GB of RAM just so they could load a huge CSV file instead of streaming it from disk.
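To make the CSV point concrete, here is a minimal sketch of streaming a file row by row instead of loading it whole. The file layout, column name, and function names are hypothetical; the real app may have needed something different.

```python
import csv
from typing import Iterator


def iter_rows(path: str) -> Iterator[dict]:
    """Yield one parsed row at a time; the whole file is never held in memory."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)


def column_total(path: str, column: str) -> float:
    """Sum a numeric column while keeping only the current row in memory."""
    total = 0.0
    for row in iter_rows(path):
        total += float(row[column])
    return total
```

Memory use here depends on the longest row rather than the file size, so this kind of job fits in a small allocation as long as the work can be done in a single pass.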
Maybe it’s just a thing where trust can be built up or torn down over time.
Asking from ignorance here: I’ve never worked somewhere where you had to request cpu/ram/storage. Instances or VMs, yes, but not asking to have some more RAM and having to say how much up front. How is that managed? You have processes killing containers that use more RAM than the developer asked for? Or more CPU? And… why? Is it a fixed-hardware environment (eg not in cloud) where it’s really hard to scale up or down?
Yes it’s fixed-hardware (grant funded), shared amongst several different teams. My role is mainly to prevent the tragedy of commons, and a little bureaucratic speed bump is the best I could think of.
Figure out how much the hardware will cost, figure out how much developer time will equal that cost, and force them to spend at least that much time profiling and optimizing their app before the request is approved?
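That break-even rule is just a division. A rough sketch, with invented dollar figures and a hypothetical function name:

```python
def break_even_hours(hardware_cost: float, developer_hourly_cost: float) -> float:
    """Hours of developer time that cost the same as the requested hardware."""
    if developer_hourly_cost <= 0:
        raise ValueError("developer_hourly_cost must be positive")
    return hardware_cost / developer_hourly_cost


# Hypothetical numbers: a 3000 dollar RAM upgrade, 100 dollars per developer hour.
print(break_even_hours(3000.0, 100.0))  # 30.0
```

With those assumed figures, the requester would owe about 30 hours of profiling before the request is approved.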
The problem in this case is that Redis is (essentially) single-threaded, so however many CPUs you give it, if something is hammering it at a high rate you’ll still need to solve the root cause.
I know several people who worked at Yahoo in the 00s. To get new hardware you’d have to go to a hardware board that included David Filo.
He would grill you and then literally log in to your servers to check utilization, and if things didn’t look good enough he’d deny your request. In one case I was told he logged in and found misconfigured RAID controllers that wasted a bunch of resources. Request denied.
I’m not suggesting you do this. But thought it was interesting.
What an utterly bananas way for the cofounder to spend their time: hire people they don’t trust then micromanage their decision making.
If you don’t look at the reason behind the request, then the process seems weird… If you don’t know what the app is doing, how can you decide if it should be approved?
I mean, only investigating when something unusual is requested seems like a pretty reasonable heuristic.
With a stronger Observability setup, it feels like this debugging story would have been a non-event.