1.
  1.

    I first heard this approach suggested by Colm MacCárthaigh (who is technically a colleague, in that we share an employer) in this Twitter thread. I’ll restate portions of Colm’s thread to better “yes, and…” the story.

    The article argued that poll-based systems are preferable to most event-based systems. While I agree, I’d amend that conclusion with two additions:

    1. Poll-based distributed systems are especially valuable when the data isn’t high volume but is critical to get right—for instance, the dial-up percentage of an A/B test or some traffic routing rules, though I can think of other smaller, domain-specific examples.
    2. Poll-based distributed systems are unreasonably effective when paired with constant work architectures.

    To illustrate why, consider a configuration service. Other services rely on this configuration service to be, uh, configured. Someone makes a change to the configuration. How do you ensure this change is propagated to dependent services?

    • With an event-driven approach, the configuration service is now responsible for distributing the configuration to dependent services. How do you ensure that every downstream service has the newest configuration? How do you handle retries? What happens if some hosts are unavailable due to a deployment? To solve this, you’ll probably need a workflow orchestration system to retry failed deployments, plus some centralized record keeping to determine whether the configuration took hold in the first place. Oh dear.
    • With a poll-based approach, the downstream services can either poll the configuration service directly or read from an object store like S3 every n seconds and apply the configuration regardless of whether it changed (see the sketch after this list). I like this for a few reasons:
      • This system is dead-simple and you know every host will pick up the new configuration.
      • If there’s any propagation lag at all, you’ll barely notice it.
      • The system heals automatically! There is no special “recovery” mode; there is only the “applying the latest configuration” mode.
      • The level of traffic a configuration service receives is extremely predictable, as it’s a function of how many unique hosts you have in your fleet. Adding more hosts adds more traffic; removing hosts reduces traffic.
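
    To make the poll-based option concrete, here is a minimal sketch of that loop in Python. The endpoint, interval, and apply_config hook are all hypothetical; the point is the shape of the loop, not the specifics:

    import time
    import urllib.request

    POLL_INTERVAL_SECONDS = 30  # hypothetical; tune to your lag tolerance

    def fetch_config() -> bytes:
        # This could just as easily be an S3 GetObject; any dumb blob store works.
        with urllib.request.urlopen("https://config.example.internal/current") as resp:
            return resp.read()

    def apply_config(raw: bytes) -> None:
        ...  # parse and atomically swap in the new configuration

    while True:
        try:
            # Apply unconditionally: there is no "did it change?" branch, so the
            # steady-state path and the recovery path are the same code.
            apply_config(fetch_config())
        except Exception:
            pass  # keep the last-known-good config; the next tick retries
        time.sleep(POLL_INTERVAL_SECONDS)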

    This isn’t to say that poll-based systems should always be preferred over event-driven systems—far from it. There are plenty of cases where event-based systems are preferable due to volume or tolerance of data loss. What I am saying is that people underestimate how effective a poll-based system can be.

    1.

      I don’t see event-based and poll (request-reply) approaches as mutually exclusive. A broker is necessary to distribute the published events, and independent subscriptions (and progress tracking) are necessary on the consumer side. Assuming event delivery eventually occurs, there is no real difference between that and a consumer performing a request to fetch the event.

      Request-reply can be useful for bootstrapping clients or (as the article suggests) initiating a sync with the authoritative source of the information. This initiation may simply involve the transfer of a single state object, or a stream of events to replay: whatever is necessary to get in sync (up to some moment). An event-based pub/sub model for general distribution is still superior to server-initiated distribution or client polling once online changes occur.

      In other words, request-reply is useful during startup and when a timeout or fault is detected.

      1.

        Lots of financial exchange market data feeds work this way. They have a “snapshot channel” (which usually pulses on a timer) and an “incremental channel”. At startup (or restart after a crash), you can bootstrap state from the snapshot channel, then consume small deltas from the incremental channel. Since everything comes from a single producer, you can use the timestamps in the messages to know where you are in the stream.
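
        A rough sketch of that bootstrap, assuming each message carries a producer-assigned, monotonically increasing sequence number (the Message shape and field names here are made up for illustration):

        from dataclasses import dataclass
        from typing import Iterable

        @dataclass
        class Message:
            seq: int      # stamped by the single producer, monotonically increasing
            payload: dict

        def bootstrap(snapshot: Message, incrementals: Iterable[Message]) -> dict:
            # Build state from the snapshot, then replay only the deltas that
            # postdate it. This is safe because one producer stamps both channels.
            state = dict(snapshot.payload)
            for delta in incrementals:
                if delta.seq <= snapshot.seq:
                    continue  # already folded into the snapshot; drop it
                state.update(delta.payload)
            return state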

      2.

        You could fix the bug and keep your event-driven system with a little tweak.

        Update ==
            \* Pick some entry u and new value v, update dataSourceA, and
            \* enqueue the whole updated dataSourceA (not just the delta).
            /\ Len(queue) # MaxQueueLength
            /\ \E u \in DOMAIN dataSourceA: \E v \in 1..10:
                dataSourceA' = [dataSourceA EXCEPT ![u] = v]
            /\ queue' = Append(queue, dataSourceA')
            /\ UNCHANGED <<dataSourceB>>

        Receive ==
            \* Apply the head of the queue as the complete new state of
            \* dataSourceB, then dequeue it.
            /\ Len(queue) # 0
            /\ dataSourceB' = Head(queue)
            /\ queue' = Tail(queue)
            /\ UNCHANGED <<dataSourceA>>
        

        Just send the whole dataSource like you do in the poll example.

        That being said, @endsofthreads does bring up several good reasons why you may still want to go with polling.

        1.

          There are two downsides that we’ve run into with poll-only approaches:

          1. It puts undue, often unscalable load on the source system.
          2. It makes for a less real-time system (not always an issue).

          Increasing and staggering the poll interval helps with the former, but amplifies the latter.

          We’ve found that a hybrid approach leveraging a pub/sub model works well. Consumers subscribe to updates from the source system that they care about. When an update occurs, the source system publishes an event stating so, and the consumer can choose at that point to inquire for more information required to sync itself up. The consumer also polls with a long interval to guarantee that, even with degradation, it eventually syncs back up using a traditional, resilient mechanism.

          A general rule of thumb that we’ve followed for events (learned from many financial exchanges’ real-time APIs): events should have enough information to support the majority of use cases without requiring the consumer to make another inquiry, and enough information for the consumer to specifically inquire for the remainder in the minority of use cases.
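
          A sketch of that hybrid, with jitter on the fallback poll so consumers don’t hit the source in lockstep (every function and constant here is hypothetical):

          import random
          import time

          FULL_SYNC_INTERVAL_SECONDS = 300  # long fallback interval

          def fetch_snapshot() -> dict: ...          # full state from the source
          def apply_snapshot(snap: dict) -> None: ...
          def fetch_details(key: str) -> dict: ...   # targeted inquiry for the rest
          def apply_update(update: dict) -> None: ...

          def on_update_event(event: dict) -> None:
              # Push path: the event carries enough to apply directly in the common
              # case, and enough of a key to inquire for the remainder otherwise.
              update = event if event.get("complete") else fetch_details(event["key"])
              apply_update(update)

          def poll_forever() -> None:
              # Pull path, as the safety net: even if every event is lost, the
              # consumer converges within one (jittered) interval.
              while True:
                  apply_snapshot(fetch_snapshot())
                  time.sleep(FULL_SYNC_INTERVAL_SECONDS * random.uniform(0.5, 1.5))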

          1.

            Random thoughts: one “pull” is two “pushes”: a push of the request from consumer to producer, and a push of the response from producer to consumer.
            The difference is who made the decision to push (consumer vs. producer) and why (the consumer requested it vs. something happened on the producer side).
            If you make your consumer resilient to “who asked for this data I received?” and “when should I receive it?”, then you’ve made your system ready for both pull and push.
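
            As a sketch of that resilience (names invented): a single handler, indifferent to who initiated the transfer, guarded by a version check so late or duplicate deliveries are harmless:

            current: dict = {}
            current_version: int = -1

            def on_state(state: dict, version: int) -> None:
                # Called by the poll loop and by the push subscription alike; the
                # consumer doesn't care who decided to send this, only whether
                # it is newer than what it already has.
                global current, current_version
                if version > current_version:
                    current, current_version = state, version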

            1.

              I feel like this is fundamentally about stateful vs. stateless services, not push vs. pull (although this is a great TLA+ example!). The thing that accepts snapshots is effectively stateless, but the thing that has to store some local state and apply deltas to it is very stateful. It’s clear that the stateful thing is harder to manage operationally and likely has more failure modes.

              Replacing long-lived statefulness with short-lived statefulness has gone a long way for the reliability of the software I work on. My favorite example is the IGMP protocol. Consumers of a multicast group must periodically push out a message saying “I want to get packets sent to this group.” Switches only need to “remember” anything about what packets go where for a short interval, so the impact of a switch restart (which drops the what-goes-where state) is minimized.
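
              A sketch of that soft-state idea (constants and names invented; real IGMP timers differ by version and configuration):

              import time

              MEMBERSHIP_TTL_SECONDS = 260  # refreshed by each periodic report

              # group -> expiry time: the switch's only "memory" of who wants what
              memberships: dict[str, float] = {}

              def on_membership_report(group: str) -> None:
                  # Each "I still want this group" message just resets the timer.
                  memberships[group] = time.monotonic() + MEMBERSHIP_TTL_SECONDS

              def forward(group: str, packet: bytes) -> None:
                  # State that isn't refreshed simply expires, so a restart that
                  # wipes the table heals within one report interval.
                  if memberships.get(group, 0.0) > time.monotonic():
                      deliver(packet)  # hypothetical downstream delivery

              def deliver(packet: bytes) -> None: ...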