This is a good opportunity to plug a technology that’s been around since well before WebSockets, is widely supported on both the client and server sides, was designed for precisely the kind of use case the article is focused on, and, bizarrely, seems virtually unknown in the developer community: Server-Sent Events.
Server-Sent Events are one of my favorite technologies.
SSE is a specific implementation of long-polling, no?
Sort of. The implementations of long-polling I’ve seen are usually, “Keep the connection open until the server has an event to deliver, then deliver it as the response payload and end the request.” The client then immediately makes another long-polling request.
SSE is more of a streaming approach. A single connection stays open indefinitely and events are delivered over it as they become available. In that sense it’s more like WebSockets than traditional long-polling.
It is, with a different interface and more efficient bandwidth usage.
And a built-in protocol for reconnections and catching up on messages you missed while you were out.
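For anyone who hasn’t seen it, the wire format is just “id:” and “data:” lines separated by blank lines, and the browser’s EventSource resends the last id it saw in a Last-Event-ID header when it reconnects. A minimal Python sketch, assuming a made-up get_events_since() helper in place of real storage:

    import json, time
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    def get_events_since(cursor):
        # placeholder: return [(event_id, payload_dict), ...] newer than cursor
        return []

    class SSEHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # resume from wherever the client left off, if it tells us
            last_id = int(self.headers.get("Last-Event-ID", 0))
            self.send_response(200)
            self.send_header("Content-Type", "text/event-stream")
            self.send_header("Cache-Control", "no-cache")
            self.end_headers()
            while True:
                for event_id, payload in get_events_since(last_id):
                    self.wfile.write(f"id: {event_id}\n".encode())
                    self.wfile.write(f"data: {json.dumps(payload)}\n\n".encode())
                    self.wfile.flush()
                    last_id = event_id
                time.sleep(1)  # naive wait; a real server would block on a queue

    ThreadingHTTPServer(("", 8000), SSEHandler).serve_forever()

On the client side it’s just new EventSource(url) and an onmessage handler; reconnection and catch-up come for free.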
I remember discovering Server-Sent Events and taking great joy in the simplicity of just going one way.
Indeed.
You can also get away with implementing an “endless” request delivering JSON lines, like the Twitter Streaming API.
Interesting article, but I disagree with the conclusion. /events and webhooks have very different tradeoffs; one is not just superior to the other. As stated at the beginning, webhooks are there to avoid polling. Even with long polling, your servers have to keep a lot of connections open, so the burden is on the provider. Webhooks shift that burden onto the consumers, and as noted that comes with its own set of drawbacks.
I think the solution you want to go for really depends on your use case and what you are willing to pay and code for.
Yeah, long polling/websockets feel like they are ineffective in a big-O way here. Without thinking too much, it seems like the sweet spot is using /events as a source of truth, and providing a single, best-effort hook, “something’s changed, please re-poll /events”.
That’s an option suggested in the last few paragraphs of the article.
I think it’s a “sweet spot” for a free service, but it’s nigh impossible to offer any kind of reasonable SLA without overbuilding, so I do not recommend it ever.
long polling/websockets feel like they are ineffective in a big-O way here
Feelings can be confusing sometimes.
Polling /events without long-polling means it’s going to be customer-controlled. You may be able to influence this at your load-balancer.
In both polling /events and long-polling, the best-case RTO is about 3x RTT, but the worst-case differs greatly: With polling, it’s the frequency of your customer-controlled poll, whereas for long-polling it’s about 3x RTT (in an ideal implementation[1], and a trivial one[2] can get 3 seconds no problem).
Long-polling has another advantage: Simply route users to the same machine. This makes capacity planning and failure-detection trivial (server simply disconnects old pollers).
An open but idle TCP connection takes around 2 KB each. This can be amortised across multiple machines. The capacity can be planned in advance, so errors (and retries!) visible to the client are uncommon. Polling is capped by a single server setting, and errors (retries!) are visible to clients. Without careful configuration, a misbehaving client can trivially deny service to legitimate clients.
In general, once you weigh everything: long-polling for a fixed set of clients provides the best experience to both producer and consumer, with the fewest number of failure conditions to handle, with a fixed memory cost that can be precomputed. If you’re doing something for free, and so you don’t care about your consumers (i.e. they can’t email you), you can just have people poll, because it is slightly better for the producer.
[1]: If you know the RTT, both sides can set TCP_USER_TIMEOUT to the distance (I like to add 10% or so for fuzz) when polling and enable quickacks while waiting. When sending data, simply turn off this option and turn on NDELAY. I do this in my SDKs, but if I have to publish a REST-style API I find supporting/explaining socket options to random people exhausting.
[2]: Simply turn on TCP keepalives at both ends, and set the interval timer to the lowest.
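Concretely, [1] and [2] look something like this in Python on the consumer side of a long-poll connection (Linux-only options; the host and the numbers are illustrative, not a recommendation):

    import socket

    rtt_ms = 50  # measured round-trip time to the producer
    s = socket.create_connection(("api.example.com", 443))

    # [1] Abort if data sits unacknowledged for longer than roughly RTT + 10%.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, int(rtt_ms * 1.1))
    TCP_QUICKACK = getattr(socket, "TCP_QUICKACK", 12)  # 12 on Linux if not exposed
    s.setsockopt(socket.IPPROTO_TCP, TCP_QUICKACK, 1)   # ack promptly while waiting
    # When actually sending data, drop the timeout and disable Nagle ("NDELAY"):
    #   s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 0)
    #   s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    # [2] The trivial variant: aggressive keepalives at both ends.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 1)   # seconds idle before probing
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 1)  # seconds between probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)    # failed probes before reset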
Thanks for elaborating!
You could have the best of both, and have a webhook that notifies you when to poll /events. That’s still an improvement, because such a webhook would be safe to miss (you’ll catch up next time, and can have a timer as a fallback).
That’s also exactly how the Telegram API expects clients to handle updates: https://core.telegram.org/api/updates#recovering-gaps
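The consumer side of that pattern is pleasantly small. A rough Python sketch, assuming a provider at provider.example and a JSON /events response with “events” and “id” fields (all invented names):

    import threading
    import requests  # third-party HTTP client; anything works here
    from http.server import BaseHTTPRequestHandler, HTTPServer

    SYNC_INTERVAL = 300       # fall back to a plain poll every 5 minutes
    wake = threading.Event()
    cursor = None

    def handle(event):
        print("processing", event)   # real processing goes here

    def sync_loop():
        global cursor
        while True:
            wake.wait(timeout=SYNC_INTERVAL)   # woken by a webhook or by the timer
            wake.clear()
            resp = requests.get("https://provider.example/events",
                                params={"cursor": cursor})
            for event in resp.json()["events"]:
                handle(event)
                cursor = event["id"]           # advance the cursor as we go

    class PingHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # The body is ignored: the webhook is only a "something changed" ping,
            # so a missed delivery just means waiting for the timer instead.
            self.rfile.read(int(self.headers.get("Content-Length", 0)))
            self.send_response(204)
            self.end_headers()
            wake.set()

    threading.Thread(target=sync_loop, daemon=True).start()
    HTTPServer(("", 8080), PingHandler).serve_forever()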
Another strike against webhooks is that they require web servers. I’m sure I’m not the only one who has built a non-web-based system and included an HTTP server, with all the requisite infrastructure to make it accessible to the public Internet and secure against attacks, whose sole purpose is to receive incoming webhooks because there’s no other way to find out when things happen on some third-party service.
A polling-based approach, whether long-polling or not, imposes fewer requirements on the receiver side. Not everything is a public web service. But having both methods available is better still.
Not only that, how can you develop and debug a system using webhooks? You have to either do it on faith (as in, dry-code it based off the docs, which are usually a bit shit) and then test and maybe debug on a staging server (ugh), or set up a complicated system to tunnel from outside into the development box of whoever is currently working on that part of the code (double ugh).
It doesn’t have to be complicated: ngrok, cloudflare tunnels and a bunch of other similar services make it trivial to expose a local endpoint to receive webhooks.
Those are new to me, thanks for the tip!
Easy replay makes /events easier to develop against. Knowing this, there’s an obvious solution: set up a dummy server to capture and persist webhook requests, then replay them for development. It’s not as easy, but once you’ve done the one-time setup you get the same results.
Also, I’ve seen the worst of both worlds in two large production systems: polling APIs which keep track of which events they think the consumer has already processed. This is horrible, because it means that:
Different deployments of a consumer system have to use different user accounts (with associated state/cursor position) on the producer system.
If you restore an older database on the consumer, you are now missing events you cannot get back. This is extra annoying when staging and production need to be in sync to verify that new code which processes this stuff differently works correctly, because you can no longer compare staging and production state reliably.
If your code is incorrect for whatever reason, you will not be able to process old events again to fix things up.
Debugging becomes hell, because you have no idea which events triggered an error state. The solution we came up with at my previous job was to simply copy all the events into a log file from which we could play them back.
When developing, you have to use a different account. But of course, when you come back to work on that piece of the code again later, you get ancient events.
It’s not even easier or better for the producing side, because whenever one account stops reading (let’s say because it was a developer account, or a staging/testing server that was decommissioned), your backlog starts building up like crazy.
Sourcehut has a quite interesting design for webhooks. It implements traditional webhooks, but each webhook also has a UUID, and you can request all of the deliveries of a specific webhook, with their payloads, UUIDs, delivery statuses, etc. This way, you get all of the benefits of webhooks, but you can still get the data if your service had an outage. Worth noting that this API is to be redesigned to work with GraphQL, and it’s still undecided how it will look, but I do expect it to have similar features.
Good read. I agree with most of it, but found the proposed solution a bit weird. Maybe I misunderstood, but isn’t long polling a problem for setups like Django, for example, where you have a limited number of worker processes?
They’re hard to maintain in general. A lot of web technology assumes you have short-lived requests. Not to say it isn’t possible or even increasingly common, but it’s a moderate investment.
I agree with the thrust of the article (webhooks are insufficient) but disagree with the conclusion that only polling or only listening to webhooks is the way. Poll, but also trigger a sync when you receive a webhook. It’s easy to reason about and doesn’t require much in the way of special infrastructure.
Unless you’re a firewalled client. Then it requires special infrastructure, DNS, etc.
I think the cost of long-polling is overestimated by a lot of people. I have systems with over a million open connections each: each handle costs only about 2kb. Maybe long-polling is hard in some frameworks, so perhaps it’s worth adding some middleware?
Long polling, SSE, and web sockets all have one problem in common: the endpoint a client polls needs to know about new events. With plain polling the endpoints simply read from the database when they get a request.
In your long polling setup, how do your client facing endpoints get new events? Do they themselves poll, or have events pushed to them, or something else?
the endpoints simply read from the database when they get a request.
This just kicks the can: how did the database know about the change? Someone did an INSERT statement or an UPDATE statement. If you’re using Postgres they could have additionally done a NOTIFY (or a TRIGGER could have been created to do this automatically as well).
A decade ago, when I used MySQL, I had a process read the replication log and distribute interesting events to fifos that a PHP client would be consuming.
In your long polling setup, how do your client facing endpoints get new events? Do they themselves poll, or have events pushed to them, or something else?
They have events pushed to them: on client connection, a subscription is made to receive updates, and on disconnection the endpoint unsubscribes. If you’re using Postgres this is just LISTEN. If you’re using Erlang, you just use rpc+disk_log (or whatever). If you’re using q, you -11! the client log and hopen the realtime. In PHP, I’ve had a bank of fifos in /clients that we just read the events from. And so on.
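For the Postgres flavour of this, the endpoint side really is only a handful of lines. A sketch with psycopg2, where the “events” channel name and the JSON payload shape are made up:

    import json
    import select
    import psycopg2

    conn = psycopg2.connect("dbname=app")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    conn.cursor().execute("LISTEN events;")   # subscribe when the long-poll client connects

    def wait_for_events(timeout=30):
        # Block until a NOTIFY arrives, or the long-poll times out.
        if select.select([conn], [], [], timeout) == ([], [], []):
            return []                          # timed out: respond with an empty body
        conn.poll()
        payloads = [json.loads(n.payload) for n in conn.notifies]
        conn.notifies.clear()
        return payloads

    # The producer side does something like:
    #   INSERT INTO events (...) VALUES (...);
    #   NOTIFY events, '{"id": 42, "type": "order.paid"}';
    # or a trigger calls pg_notify('events', row_to_json(NEW)::text) automatically.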
isn’t long polling a problem for setups like Django, for example
You have X customers, so you simply need to handle X connections. If you allow fewer than that you have queueing, which may be fine for a while, but really you need to be able to handle that anyway: otherwise your customers get errors and one customer can (trivially) deny another.
If X is larger than fits on one machine, you might want multiple machines anyway. Your load-balancer can route long-polls by customer and disconnect old pollers. Regular polling is a bit harder to capacity-plan.
Long-polling also shifts cost from the client to the provider. The provider has to have sockets open, RAM allocated, and potentially a thread per request.
There are ways to implement this that don’t consume a thread per connection. These usually involve async I/O, which is tricky in vanilla Java, C#, and the like. Actor-based frameworks and languages make it easier, but the RAM and socket costs are still there.
Generally, stop trying to cram everything network-related into a web browser.
Webhooks use HTTP as a transport mechanism, but they don’t have anything to do with browsers.
Ah, I should not use this website when tired…
Ok, I’m game. How do you implement /events backed by a database? I assume you want a forever-log type of service? I guess the real trick is not having it fill your disk? I remember enabling auditing on a database and finding out how much is going on. Is this what OP is implying or other people are taking away? You’d just create a forever log of events (per user maybe), and when they say “the client side cursor” we’re also implying that no previous data was deleted? You can move the cursor back in time to any point?
Yeah, you have to have some database of events or subscriptions or whatever. Though you don’t have to never delete data; I don’t think “give me all events for the past 3 years” is reasonable. Anywhere from an hour to a couple days is probably the threshold you want.
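As a sketch, the storage side can be as small as one table plus a retention job. sqlite here just to keep it self-contained; the column names, per-account scoping and 48-hour retention are arbitrary choices, not the article’s:

    import sqlite3, time

    db = sqlite3.connect("events.db")
    db.execute("""CREATE TABLE IF NOT EXISTS events (
                      id INTEGER PRIMARY KEY AUTOINCREMENT,
                      account_id TEXT NOT NULL,
                      created_at REAL NOT NULL,
                      payload TEXT NOT NULL)""")

    def append_event(account_id, payload):
        # Producer side: every change gets appended as a row.
        db.execute("INSERT INTO events (account_id, created_at, payload) VALUES (?, ?, ?)",
                   (account_id, time.time(), payload))
        db.commit()

    def get_events(account_id, cursor=0, limit=100):
        # What GET /events?cursor=N returns: rows strictly after the cursor.
        return db.execute("""SELECT id, payload FROM events
                             WHERE account_id = ? AND id > ?
                             ORDER BY id LIMIT ?""",
                          (account_id, cursor, limit)).fetchall()

    def trim(retention_seconds=48 * 3600):
        # Retention job: keep a couple of days of events, not a forever log.
        db.execute("DELETE FROM events WHERE created_at < ?",
                   (time.time() - retention_seconds,))
        db.commit()

The client just remembers the highest id it has processed and sends it back as the cursor; rewinding works as long as the rows are still inside the retention window.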
Maybe we need to look beyond HTTP as the protocol-for-everything. There are messaging protocols that are suited to spanning different organizations.