
    This is really good - I appreciate the in-depth nature of it!

    I remain skeptical about the practicality of event sourcing, but it helps a lot that rather than handwaving over the details of archiving and event schemas, you’ve actually gone into detail about how you do it.

    What I’m still curious about (and most articles/books on event sourcing don’t discuss much) is:

    • How do you handle changing schemas for events (you mention rewriting your S3 store, I assume this is it?)
    • How many events do you have stored?
    • How long does it take you to rebuild a projection, or rewrite an event archive at the moment with that volume? Do you ever anticipate you’ll have too many events to make it practical? If so what will you do then?

      Thanks for the kind feedback!

      There are definitely tradeoffs to doing an event-sourced system. I wrote this post mostly about our experience overall, trying to give enough detail about the system itself to make it understandable. From a business perspective it has made so many things so easy that we’ve found the tradeoffs mostly skewed toward higher productivity and faster feature delivery. We’ll write more about specifics in future posts (not just by me). There is a whole analytics and BI component here that I didn’t even mention.

      To answer your questions above:

      • The schemas are Protobuf, so we generally follow the recommended rules for evolving Protobuf schemas: you can add fields at any time, and you can add new events at any time. We deprecate fields as needed, though this is rare. You never change a field name or type. We have processes around getting new event schemas approved that require review by folks who understand them well and know what works; we have LADRs that explain some of this. If you added a field that needs to be back-calculated for previous events, you would run a backfill to emit it for the entities that were missing it. If it’s more complicated, we can use Spark to rewrite the store (see the sketch after this list).
      • We have somewhere under 10 billion events (keep in mind this is all the public stuff, not all the RPC traffic). At this scale you could obviously use other tools. But we plan to grow on this platform in perpetuity, hence some of the decisions. Generally you are not operating on all of the events, though. You are operating on the subset you care about for your service, which you can usually scope to tens of millions of events. We also have some things that behave like events but which we treat as ephemeral and don’t archive, because there is no utility in replays or audit trails for them.
      • We can run Spark over the whole event store in a couple of hours, less if we want to throw more or bigger instances at it. We can build a new projection for a reasonable service in a couple of hours. The limit here is usually the throughput on RabbitMQ for the size of cluster we replay on; we can pay for more if we need it faster. Snapshots are the longer-term solution, since they let a projection start from an intermediate state rather than from scratch each time (see the second sketch below). We build and use them already, but by specializing them for each model we could massively reduce bootstrap time. I mentioned the first way we will do that in the article.
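
      For a concrete sense of what a backfill or archive rewrite can look like, here is a minimal PySpark sketch. The bucket, column names, and Parquet-on-S3 layout are all assumptions for illustration; the real archive format isn’t described here.

          from pyspark.sql import SparkSession
          from pyspark.sql import functions as F

          # All names below are hypothetical -- this only illustrates the shape of the job.
          spark = SparkSession.builder.appName("event-archive-rewrite").getOrCreate()

          # Assume one row per event, stored as Parquet in S3.
          events = spark.read.parquet("s3://example-event-archive/events/")

          # Back-calculate a field that newer events carry but older ones predate,
          # deriving it from data that already exists on the old events.
          rewritten = events.withColumn(
              "amount_cents",
              F.coalesce(F.col("amount_cents"), (F.col("legacy_amount_dollars") * 100).cast("long")),
          )

          # Write the rewritten archive to a new prefix rather than overwriting in place.
          rewritten.write.mode("overwrite").parquet("s3://example-event-archive/events-rewritten/")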
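
      And here is a rough sketch of why snapshots cut projection bootstrap time: you load the last saved state and replay only the events newer than it, rather than the whole stream. The event types, snapshot format, and fold logic are made up for illustration.

          import json

          def load_snapshot(path):
              # A snapshot is the projection state plus the offset of the last event it covers.
              with open(path) as f:
                  snap = json.load(f)
              return snap["state"], snap["last_offset"]

          def apply_event(state, event):
              # Fold one event into the read model; these handlers are illustrative only.
              if event["type"] == "AccountCredited":
                  state[event["account_id"]] = state.get(event["account_id"], 0) + event["amount"]
              elif event["type"] == "AccountDebited":
                  state[event["account_id"]] = state.get(event["account_id"], 0) - event["amount"]
              return state

          def rebuild_projection(snapshot_path, events_after):
              # Start from the snapshot and replay only events past its offset,
              # instead of replaying the entire history from scratch.
              state, last_offset = load_snapshot(snapshot_path)
              for event in events_after(last_offset):
                  state = apply_event(state, event)
              return state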

      Hope that helps answer some of the questions. We have been really practical about it and solve problems as they arise, using an iterative approach rather than building a lot of support up front. We have a business to run first! :) Generally Protobuf, S3, Spark, and Athena mean we haven’t seen anything frightening yet. Quite the opposite: the tooling seems more than up to the job ahead.

        Thanks so much for getting back to me, that’s really impressive :)