1. 26
  1.  

  2. 8

    In general, I agree with this. There’s no need to coordinate for something that’s just trying to prevent basic / accidental abuse. There is reason to coordinate in at least two scenarios, though:

    1. I sell access to an API, and my business model is $X requests per time period, strict.
    2. I am trying to prevent coordinated abuse.

    In the second scenario, a local leaky bucket or other simple strategy will result in the coordinated abuser maximizing requests by figuring out when bursts are possible and then doing that across all nodes, either probabilistically or by understanding your DNS / load balancing setup.
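
    To make that concrete, here is a rough token-bucket sketch of the kind of uncoordinated, per-node limiter I mean (a leaky bucket behaves similarly for this argument; names and numbers are made up):

    ```python
    import time

    class LocalTokenBucket:
        """Uncoordinated per-node limiter: each node independently allows
        a burst of up to `capacity` requests, refilled at `rate` per second."""

        def __init__(self, rate: float, capacity: float):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False
    ```

    An abuser who learns the refill rate and how many nodes sit behind the balancer can wait until every bucket is full and burst against all of them at once, getting roughly N × capacity requests through instead of a single node’s burst.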

    1. 4

      The first is that requests are presumably load-balanced across our nodes. That load balancing is either done in as uniform a way as possible, or in a “sticky” way where each request identifier gets sent to the same node (or small group of nodes, then again as uniformly as possible). In either case each node sees either zero requests per identifier or a representative sample of them. So having the nodes make local decisions about identifiers will in aggregate reflect the global behavior of the system.

      I don’t think this is necessarily the case! It’s hard to get uniformity even in the total number of requests sent to each node. That’s why power-of-two-choices load balancing is so useful.
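
      To illustrate, a toy balls-into-bins sketch (my own, not from the article): route each request to the less loaded of two randomly sampled nodes, and the worst-case imbalance drops sharply compared with picking a single node at random.

      ```python
      import random

      def assign(n_nodes: int, n_requests: int, choices: int) -> int:
          """Balls-into-bins: route each request to the least-loaded of
          `choices` randomly sampled nodes; return the max per-node load."""
          loads = [0] * n_nodes
          for _ in range(n_requests):
              candidates = [random.randrange(n_nodes) for _ in range(choices)]
              loads[min(candidates, key=lambda i: loads[i])] += 1
          return max(loads)

      # One random choice vs. the best of two, same total traffic:
      print("random:", assign(10, 100_000, 1), "two choices:", assign(10, 100_000, 2))
      ```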

      The second is that source behavior is usually binary. A given source is either well-behaved at a given moment, or completely batshit insane. Either behavior is perfectly visible at the local level and needs no coordination to figure out.

      I don’t think this is necessarily the case, either. Consider a business account with multiple services hitting a global rate limit: some could be well-behaved while others are going wild.

      I am but a simple country developer, and am probably missing something obvious, but it has never been clear to me why we care about fairness or correctness in these systems.

      I think you’re on to something here. In many systems the rate limit is a soft limit: 1001 requests per second isn’t meaningfully worse than 1k rps, but 10k rps is, so we have to put the cutoff somewhere. Doing local-level rate limiting reduces coordination, which is good, at the cost of letting people sometimes go a bit past the limit, which isn’t too big a deal, as long as people can’t get away with going way past the limit and can reliably go right up to the limit. If local rate limiting cuts people off too early, that’s a bigger problem than cutting them off too late.

      1. 4

        The most successful pattern I’ve seen for this kind of thing is a local copy of the traffic routing data that is periodically refreshed from a central store. When the central store goes down you make a best guess based on the stale local data, and when the central store is up you get slightly better traffic management.
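
        A sketch of that shape (names and intervals invented):

        ```python
        import threading, time

        class CachedRoutingData:
            """Local copy of routing/limit data, refreshed from a central store
            on a timer; on failure, keep serving the last good (stale) copy."""

            def __init__(self, fetch_from_store, refresh_secs: float = 30.0):
                self._fetch = fetch_from_store      # callable that reads the central store
                self._refresh_secs = refresh_secs
                self._data = fetch_from_store()     # could also start from a baked-in default
                threading.Thread(target=self._refresh_loop, daemon=True).start()

            def _refresh_loop(self):
                while True:
                    time.sleep(self._refresh_secs)
                    try:
                        self._data = self._fetch()  # store is up: slightly better traffic management
                    except Exception:
                        pass                        # store is down: best-guess on stale local data

            def get(self):
                return self._data
        ```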

        1. 3

          I generally agree with the thesis, but the assumption of either perfect stickiness or perfectly random load balancing seems fragile. Anyway, if the latter assumption holds, then you can use the negative binomial distribution to derive confidence bounds on the total request rate from the locally observed rate. (This trick isn’t original, but I don’t recall where I saw it, though it seems obvious in retrospect.)
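
          As a rough sketch of how that could look (my reading of the trick, purely illustrative, using scipy): under uniformly random balancing each request lands on a given node with probability 1/N, so the number of requests that went elsewhere before our k-th local one is negative-binomially distributed with parameters (k, 1/N), which turns a local count into an upper bound on the global count.

          ```python
          from scipy.stats import nbinom

          def global_count_upper_bound(local_count: int, n_nodes: int, conf: float = 0.99) -> float:
              """Upper bound on the global request count given this node's count,
              assuming each request hits this node independently with prob. 1/n_nodes.
              Requests that went elsewhere before our k-th local request follow a
              negative binomial distribution with parameters (k, 1/n_nodes)."""
              if local_count == 0:
                  return float("inf")                       # nothing observed, nothing to bound
              p = 1.0 / n_nodes
              elsewhere = nbinom.ppf(conf, local_count, p)  # failures before the k-th success
              return local_count + elsewhere

          # e.g. 50 requests seen locally behind a 10-way random balancer:
          print(global_count_upper_bound(50, 10))
          ```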

          1. 1

            Generally agree. Even without perfect stickiness or load balancing, I’d think that a scheme like this could be “good enough” for most use cases. Or, put another way, I wonder what the worst case would be when using something like this?

            Edit: Especially compared with the worst case for the distributed version (complete external store failure?)

            1. 3

              Not sure if by “worst case” you mean exceeding the global limit or leaving capacity unused (by unnecessary throttling)? Of course you can avoid ever exceeding the global limit by just setting the local rate limit to be 1/N of the global limit, for N nodes. If you have round-robin load balancing, then this conservative approach doesn’t actually leave capacity unused. If you have uniformly random load balancing, then you will have some unused capacity due to one-sided variance, but it shouldn’t be that bad (haven’t worked out the math details).
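
              The one-sided variance effect is easy to eyeball with a toy simulation (entirely my own, parameters invented): offer exactly the global limit’s worth of uniformly random traffic and count how much gets rejected by hard 1/N local limits.

              ```python
              import random

              def throttled_fraction(global_limit=10_000, n_nodes=20, trials=50):
                  """Offer exactly `global_limit` requests per window, uniformly at random
                  across nodes, each enforcing a hard local limit of global_limit / n_nodes.
                  Returns the average fraction of requests rejected despite being in budget."""
                  local_limit = global_limit // n_nodes
                  rejected = 0
                  for _ in range(trials):
                      counts = [0] * n_nodes
                      for _ in range(global_limit):
                          counts[random.randrange(n_nodes)] += 1
                      rejected += sum(max(c - local_limit, 0) for c in counts)
                  return rejected / (global_limit * trials)

              print(throttled_fraction())  # roughly a percent or two for these parameters
              ```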

              A more interesting question IMO is how to set the rate limit in the first place. I did some work on that when I was at AWS around 7 years ago, but never got any interest from managers, and my private implementation never got beyond a trivial Lua prototype in nginx. I have many dozens of pages of notes on that project, though (and looked at maybe a couple hundred control theory papers for it), so I’d like to validate those ideas at some point. Briefly, the idea is to dynamically calculate a target “bottleneck rate” based on latency or other SLOs. Each client’s request rate is monitored independently, but the same bottleneck rate applies to all of them (except for a weighted variant that allows prioritizing clients differently). The bottleneck approach naturally leads to a max-min fair allocation of capacity, much like bitwise round-robin packet queuing. But for all I know, AWS is still hardcoding rate limits for their services, unnecessarily throttling customers and leaving tons of capacity unused.
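
              For anyone unfamiliar with the term, max-min fairness is the usual progressive-filling allocation; a toy sketch (illustrative only, not the AWS work described above):

              ```python
              def max_min_fair(capacity: float, demands: list[float]) -> list[float]:
                  """Progressive filling: clients demanding less than the fair share keep
                  their full demand; leftover capacity is re-split among the rest."""
                  alloc = [0.0] * len(demands)
                  remaining = list(range(len(demands)))
                  cap = capacity
                  while remaining and cap > 0:
                      share = cap / len(remaining)
                      satisfied = [i for i in remaining if demands[i] <= share]
                      if not satisfied:
                          for i in remaining:
                              alloc[i] = share
                          return alloc
                      for i in satisfied:
                          alloc[i] = demands[i]
                          cap -= demands[i]
                      remaining = [i for i in remaining if i not in satisfied]
                  return alloc

              # e.g. a 100 rps bottleneck shared by clients asking for 10, 30 and 200 rps:
              print(max_min_fair(100, [10, 30, 200]))  # -> [10, 30, 60]
              ```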

              1. 1

                Here are a couple of bad cases I can imagine:

                • Overestimating call volume: You promise 100 calls/minute in your API docs, but you have sticky routing over 5 hosts, and the customer complains because they get throttled at 20 per minute.
                • Underestimating call volume: You’re in the middle of resizing the pool, and the currently-active hosts mistakenly think the pool is half the size it actually is. So they mistakenly accept a total of 200 requests/minute, and this kills some shared dependency that (for whatever reason) can’t throttle requests itself.

                In general, enforcing limits locally actually seems more correct to me: if an individual host is underloaded, it’s okay to serve more requests than normal. And if it’s overloaded, formally rejecting a request seems better than a worst-case fail (e.g., trying to service the request crashes the host, increasing load on remaining hosts).

                1. 1

                  Yes, I totally agree! The ultimate decision on whether to reject a request should be made by the edge server handling that request. Upstream rate limiting (in load balancers etc.) should only ensure that the edge servers aren’t overwhelmed by traffic that they can’t reject fast enough (so the rate limit there could be, say, an order of magnitude larger than the rate limit at the servers). One problem I can see with a dynamic rate-limiting strategy, though, is its unpredictability for customers, which is why I think AWS has stuck with their hardcoded limits: they can be documented and increased on an ad-hoc basis for specific customers. This still leaves a lot of utilization on the table when traffic is slow, though.

            2. 3

              A small shout-out to the push-sum family of gossip systems, which you can use to obtain asynchronous, exponentially fast convergence for things like rate-limiting counters, quantiles, etc. https://manishyadav.dev/blog/gossip-push-sum-protocols
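
              For a flavor of how push-sum works, here is a toy synchronous-rounds sketch (real deployments are asynchronous, and everything here is illustrative): each node repeatedly keeps half of its (sum, weight) pair and pushes the other half to a random peer, and sum/weight converges to the global average.

              ```python
              import random

              def push_sum_average(values, rounds=30):
                  """Push-sum gossip: every node repeatedly sends half of its (sum, weight)
                  pair to a random peer; sum/weight at each node converges exponentially
                  fast to the global average of `values`."""
                  n = len(values)
                  s = list(map(float, values))
                  w = [1.0] * n
                  for _ in range(rounds):
                      inbox = [(0.0, 0.0)] * n
                      for i in range(n):
                          target = random.randrange(n)                # random peer (self allowed)
                          half_s, half_w = s[i] / 2, w[i] / 2
                          s[i], w[i] = half_s, half_w                 # keep one half...
                          ds, dw = inbox[target]
                          inbox[target] = (ds + half_s, dw + half_w)  # ...push the other half
                      for i in range(n):
                          s[i] += inbox[i][0]
                          w[i] += inbox[i][1]
                  return [s[i] / w[i] for i in range(n)]

              # e.g. per-node request counts: each node converges on the cluster-wide
              # average, so global count ≈ average * n, with no central store.
              print(push_sum_average([120, 80, 100, 95, 105]))
              ```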

              1. 2

                You’ve raised good points I want to think about and reply to. It’s the weekend so I’m with the family, and given my posting rhythm I feel comfortable promising a reply in the next six months.

                1. 2

                  In my opinion, local rate limiting is necessary but not sufficient for a robust distributed system. All members of a distributed system must be able to protect themselves and apply backpressure when they think they are overloaded. However, some more global notion of distributed quotas and rate limiting is invariably needed to enforce fairness, provide predictable performance, and optimize resource utilization.

                  The first is that requests are presumably load-balanced across our nodes. That load balancing is either done in as uniform a way as possible, or in a “sticky” way where each request identifier gets sent to the same node (or small group of nodes, then again as uniformly as possible). In either case each node sees either zero requests per identifier or a representative sample of them.

                  Larger services sometimes need to load balance requests across multiple load balancers. You wouldn’t put yet another load balancer in front of those, or it’s turtles all the way down, so you use something like DNS instead. But once you use something like DNS, you have clients setting up connections via a single load balancer, and now traffic can be uneven. So load balancing can unfortunately end up sticky but not uniform.

                  The second is that source behavior is usually binary. A given source is either well-behaved at a given moment, or completely batshit insane. Either behavior is perfectly visible at the local level and needs no coordination to figure out. If we need correctness for later accounting or observability, the nodes can report their local counts to a central authority on their own time.

                  Some multi-tenant services, i.e. services that allow multiple customers to use them, need to enforce some notion of fairness. For example, AWS Lambda by default says “You can have a function concurrency of 1,000 and a requests-per-second limit of 10,000”. If there weren’t a global way of enforcing these limits, some customers could be treated unfairly and not get the performance they expect. AWS Lambda also allows you to strictly enforce function concurrency to control costs or protect downstream resources. Other services, like AWS DynamoDB with its instantaneous adaptive capacity, want to adjust customer quotas per partition to optimize utilization, again requiring some global view.

                  Distributed rate limiting solutions are tricky. Sometimes you think “threads or network connections are great, let’s use more of those” and you create a full mesh of threads or network connections between all the servers trying to allocate resources or throttle requests. Then events like this happen. With a central data store, you risk overwhelming it. With gossip protocols, you know they converge, but how soon is soon enough? It’s a hard problem that sometimes isn’t worth solving, but usually you eventually have to try to solve it.

                  “Implementing layers of admissions control - Fairness in multi-tenant systems” is a good summary. I’d be interested to hear other people’s experiences with this problem.

                  1. 1

                    I think it’s interesting that so many people talk about rate limiting when what’s most damaging is high concurrency. The convenient thing is that concurrency limiting is much easier to do statelessly, as mentioned in this article.
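
                    For illustration, a per-node concurrency limiter can be as simple as a semaphore around in-flight requests (a sketch of my own, names invented): no clock, no window, no shared state.

                    ```python
                    import threading

                    class ConcurrencyLimiter:
                        """Per-node concurrency limit: shed work once `max_in_flight`
                        requests are already being processed."""

                        def __init__(self, max_in_flight: int):
                            self._slots = threading.BoundedSemaphore(max_in_flight)

                        def try_handle(self, handler, *args):
                            if not self._slots.acquire(blocking=False):
                                return None          # over the limit: fail fast (429-equivalent)
                            try:
                                return handler(*args)
                            finally:
                                self._slots.release()

                    # Hypothetical usage:
                    # limiter = ConcurrencyLimiter(max_in_flight=100)
                    # result = limiter.try_handle(process_request, request)  # None => shed
                    ```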