1. 86
    1. 28

      Whatever you think about Discord, this is a pretty entertaining read, and probably not an experience many people will ever get to replicate. From their listed numbers that’s half a petabyte of data to shuffle around.

      We elect to engage in some meme-driven engineering and rewrite the data migrator in Rust.

      …The term “meme-driven engineering” needs to be more common, I love it.

      Several days later, we gather to watch it hit 100%, and we realize that it’s stuck at 99.9999% complete (no, really). Our migrator is timing out reading the last few token ranges of data because…

      See, if this were fiction, I’d call it unrealistic.

      It’s been a quiet, well-behaved database (it’s okay to say this because I’m not on-call this week).

      The listed performance improvements are pretty impressive, I hope it continues working out for them!

    2. 13

      I wish Scylla were more OSS and community-driven rather than vendor-locked. I have the same wish for RedPanda and other GC-free equivalents of these amazing tools. I see a ray of hope in Rust getting more and more sophisticated, so people no longer have to rely on GC-based languages to build these databases and servers.

      1. 3

        How is Scylla not OSS/community driven?

    3. 9

      I’m astonished that they can store and search all the messages on all Discord instances in just 72 nodes with 9TB storage each. I’m on a few and it seems like some people post thousands of messages a day! And they appear to stay forever.

      1. 18

        (I work at Discord on the Persistence Infra team)

        We also run many Elasticsearch clusters to handle searching through all of the messages.

        1. 6

          I would love to hear your thoughts on managing such a big ES cluster. The few times I’ve bumped into ES I felt like it was a very capable database but always a total bear from the operations side, and I imagine your cluster is 10x bigger than anything I’ve dealt with.

        2. 4

          Is that also why you don’t allow exact search (which can be extremely annoying), since that would need a different kind of index?

      2. 5

        I’m quite surprised it’s that big. Text is small. Wikipedia is 86 GiB for all of the text of the current English version. You could easily index and search that in RAM on a single moderately powerful server. I hadn’t realised how much more people type in Discord.

        1. 13

          I would definitely expect Discord to have more text than Wikipedia. Chat is append-only, whereas Wikipedia articles are edited in place, so they don’t necessarily grow with the number of edits. People also socialize more than they write encyclopedia articles.

          1. 3

            IIRC Wikipedia stores all edit history and lets you diff the versions on the website? But yeah that wouldn’t count against the downloadable dump size.

        2. 4

          Discord has over 100 million active users, and you also have overhead per message. That adds up fast with that many people.

    4. 4

      Great post! It has a lot of parallels to some of the classic Twitter engineering posts / talks.

      It’s interesting they went with hash-based load balancing, and sending open requests to a specific instance. It seems like they’re kind of trading hot partitions in Cassandra for hot nodes in the data layer cluster. I wonder if they also load balance on CPU or geo a bit, or just over-provision the worker nodes so much it doesn’t matter. As they continue to scale, I wonder if they will need to in order to spread out requests for the same data to more nodes.

      Related to this, the new DynamoDB paper is interesting, as AWS had to solve some similar issues. I wish Scylla / Cassandra had more advanced features like this, but I suppose a lot of users would have no need for this (it’s highly use case dependent and only an issue at scale).

      It seems like a practical solution for these types of scalability problems could be:

      • Keep data in memory on nodes partitioned by a hash of the keys being looked up (i.e. how any distributed database/cache works).
      • Handle hot partitions with tweaks to the hash partitioning: split the partitions into smaller chunks and allow the hottest partitions to be replicated to more nodes (sketched below). There are tons of great open-source databases to help with partitioning, but I haven’t seen the second point addressed very commonly.
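
      To make that second bullet concrete, here’s a toy sketch of what “replicate the hottest partitions to more nodes” could look like (the names and structure are mine, not anything from the post): every key gets a home node by hash, and keys flagged as hot get a larger fanout so reads rotate across several nodes.

      ```rust
      use std::collections::hash_map::DefaultHasher;
      use std::collections::HashMap;
      use std::hash::{Hash, Hasher};

      /// Hypothetical router: every key has a "home" node chosen by hash, but keys
      /// flagged as hot are spread across extra replicas so reads for the same
      /// partition fan out instead of hammering a single node.
      struct HotAwareRouter {
          nodes: Vec<String>,
          /// hashed key -> how many nodes its data is replicated across (1 = just home)
          hot_fanout: HashMap<u64, usize>,
      }

      impl HotAwareRouter {
          fn new(nodes: Vec<String>) -> Self {
              Self { nodes, hot_fanout: HashMap::new() }
          }

          fn hash_key(key: &str) -> u64 {
              let mut h = DefaultHasher::new();
              key.hash(&mut h);
              h.finish()
          }

          /// Mark a partition as hot so its data is copied to `fanout` nodes.
          fn mark_hot(&mut self, key: &str, fanout: usize) {
              self.hot_fanout.insert(Self::hash_key(key), fanout.min(self.nodes.len()));
          }

          /// Home node plus the next `fanout - 1` nodes on a (very naive) ring.
          fn replicas(&self, key: &str) -> Vec<&String> {
              let h = Self::hash_key(key);
              let fanout = *self.hot_fanout.get(&h).unwrap_or(&1);
              let home = (h as usize) % self.nodes.len();
              (0..fanout).map(|i| &self.nodes[(home + i) % self.nodes.len()]).collect()
          }

          /// Pick one replica for a read, rotating by request id to spread load.
          fn pick_for_read(&self, key: &str, request_id: u64) -> &String {
              let replicas = self.replicas(key);
              replicas[(request_id as usize) % replicas.len()]
          }
      }

      fn main() {
          let nodes = vec!["node-a".to_string(), "node-b".to_string(), "node-c".to_string(), "node-d".to_string()];
          let mut router = HotAwareRouter::new(nodes);

          // An ordinary channel lives on exactly one node.
          println!("{:?}", router.replicas("channel:quiet"));

          // A hot channel is spread across three nodes; reads rotate between them.
          router.mark_hot("channel:announcements", 3);
          for request_id in 0..3u64 {
              println!("{}", router.pick_for_read("channel:announcements", request_id));
          }
      }
      ```
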
    5. 2

      So the primary takeaways are: if you care about performance, don’t use Java; and Rust is decent at quickly bringing up a new service with fewer footguns than the equivalent C/C++.

      Was surprised there apparently wasn’t a caching layer in front of the databases already, too. That seemed like an obvious easy win.

    6. 2

      But why does Discord store trillions of messages? I know that there’s an answer in the linked earlier blog post:

      We decided early on to store all chat history forever so users can come back at any time and have their data available on any device.

      Also in that post:

      Everything on Discord was stored in a single MongoDB replica set and this was intentional…

      Okay. It sounds like most performance issues are ultimately provoked by the multimodal nature of guilds, as detailed in that post.

      At risk of stating the obvious (or worse, giving free advice), why isn’t data keyed or sharded by guild? I recognize that this line of questioning leads to the idea that maybe “servers” should be genuinely isolated from each other, etc.

    7. 2

      Kind of surprised about a lot of these points, in particular the fact that they have design decisions that seem to fall over given the bimodal nature of discord servers.

      The static time buckets, and using a database with cheap writes and more expensive reads, is like… for most systems you can get away with this (and they are getting away with it for the most part, IMO). But given this is their core system, it feels like adding a different sort of indexing system for scrollback, one that allows certain Discords to have different buckets, is very important by now.

      EDIT: honestly it looks like the work is almost there. Bucket sizes being channel dependent seems like an easy win, and maybe you have two bucket fields just built-in so you can have auto-migration to different bucket sizes and re-compact data over time, depending on activity.
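
      For concreteness, a rough sketch of the channel-dependent-bucket idea. The snowflake layout (millisecond timestamp in the top bits, epoch at the start of 2015) is from Discord’s public docs; the per-channel bucket width is purely this comment’s suggestion, not anything Discord says they do.

      ```rust
      // Sketch only: derive the time bucket from a snowflake message ID, but let
      // the bucket width vary per channel so busy channels get narrower buckets.

      /// Discord snowflakes carry a millisecond timestamp (relative to an epoch at
      /// the start of 2015) in their top 42 bits.
      fn snowflake_to_ms_since_epoch(snowflake: u64) -> u64 {
          snowflake >> 22
      }

      /// `bucket_ms` would come from per-channel metadata (hypothetical), e.g.
      /// chosen from recent message volume and stored alongside the channel.
      fn bucket_for(snowflake: u64, bucket_ms: u64) -> u64 {
          snowflake_to_ms_since_epoch(snowflake) / bucket_ms
      }

      fn main() {
          const DAY_MS: u64 = 24 * 60 * 60 * 1000;
          let msg_id: u64 = 1_070_000_000_000_000_000; // an arbitrary example ID

          // A quiet channel keeps a wide (here, ten-day) bucket...
          println!("wide bucket:   {}", bucket_for(msg_id, 10 * DAY_MS));
          // ...while a very busy channel could use one-day buckets instead,
          // so no single (channel, bucket) partition grows unboundedly hot.
          println!("narrow bucket: {}", bucket_for(msg_id, DAY_MS));
      }
      ```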

      I don’t know about Cassandra’s storage mechanisms, but I do know that a lot of people running multitenant systems on Postgres get bitten by how the data is laid out on disk (namely, a query ends up fetching a lot of pages that are mostly data you don’t need). It feels essential for Discord that related data be kept as close together as possible.

      1. 3

        I’m also surprised that they would bias for writes. My intuition is that chat messages are always written exactly once, and are read at least once (by the writer), usually many times (e.g. average channel membership), and with no upper bound. That would seem to be a better match for a read-biased DB. But I’m probably missing something!

        1. 8

          It’s a great question and the answer isn’t straightforward. Theoretically, B-tree-based storage is better than LSM for read-heavy workloads, but almost all distributed DBs (Cassandra, BigTable, Spanner, CockroachDB) use LSM-based storage.

          Some explanations why LSM has been preferred for distributed DBs:

          • Writes become sequential appends (memtable plus write-ahead log, flushed as immutable sorted files), which is cheap on both spinning disks and SSDs.
          • Immutable sorted files are easy to replicate, stream between nodes, and repair.
          • The read penalty can be softened with Bloom filters, block caches, and compaction.
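
          To illustrate the write-path point, here’s a toy LSM sketch (nothing to do with any real engine): writes only touch an in-memory sorted map and get flushed as whole sorted runs, while reads may have to consult the memtable and then every run, newest first.

          ```rust
          use std::collections::BTreeMap;

          /// Toy illustration of why LSM writes are cheap: writes go to an in-memory
          /// sorted map (plus, in a real system, a sequential log), and full sorted
          /// runs are flushed to disk in one sequential pass. Reads are the pricier
          /// side: memtable first, then every run, newest to oldest.
          struct ToyLsm {
              memtable: BTreeMap<String, String>,
              runs: Vec<BTreeMap<String, String>>, // stand-ins for sorted files on disk
              flush_threshold: usize,
          }

          impl ToyLsm {
              fn new(flush_threshold: usize) -> Self {
                  Self { memtable: BTreeMap::new(), runs: Vec::new(), flush_threshold }
              }

              /// O(log n) insert into memory; no random disk I/O on the write path.
              fn put(&mut self, key: &str, value: &str) {
                  self.memtable.insert(key.to_string(), value.to_string());
                  if self.memtable.len() >= self.flush_threshold {
                      // A real LSM writes this out as an immutable sorted file (SSTable).
                      self.runs.push(std::mem::take(&mut self.memtable));
                  }
              }

              /// Reads check the memtable, then each run from newest to oldest.
              fn get(&self, key: &str) -> Option<&String> {
                  self.memtable
                      .get(key)
                      .or_else(|| self.runs.iter().rev().find_map(|run| run.get(key)))
              }
          }

          fn main() {
              let mut db = ToyLsm::new(2);
              db.put("msg:1", "hello");
              db.put("msg:2", "world"); // triggers a flush to a sorted run
              db.put("msg:3", "!");
              assert_eq!(db.get("msg:1").map(String::as_str), Some("hello"));
              assert_eq!(db.get("msg:3").map(String::as_str), Some("!"));
          }
          ```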

        2. 3

          I’m assuming they broadcast the message immediately to everyone online in the channel, and you only read from the database when you either scroll back far enough that the cache is empty or open a channel you haven’t opened in a while. That would avoid costly reads except when you need bulk reads from the DB.

          1. 6

            Yes, we have realtime messaging via our Elixir services, which distribute the messages: https://discord.com/blog/how-discord-scaled-elixir-to-5-000-000-concurrent-users

          2. 2

            I’d be very surprised if the realtime broadcast of a message represented more than a tiny fraction of its total reads. I’d expect almost all reads to come from a database (or cache) — but who knows!

            1. 2

              It’s a chatroom and people don’t scroll way up super often. They only need to check the last 50 messages in the channel unless the user deliberately wants to see more. There might be a cache to help that? But you can stop caching at a relatively small upper bound. That said I am curious how this interacts with search.
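
              A minimal sketch of the kind of bounded cache being described (the shape and the 50-message figure are just this thread’s guesses, not anything from Discord): keep only the newest N messages per channel in memory and fall back to the database for anything older.

              ```rust
              use std::collections::{HashMap, VecDeque};

              /// Hypothetical per-channel cache: hold only the newest N messages in
              /// memory; deep scrollback (and search) goes to the database instead.
              struct RecentMessages {
                  per_channel: HashMap<u64, VecDeque<String>>,
                  capacity: usize, // e.g. 50, roughly one first page of scrollback
              }

              impl RecentMessages {
                  fn new(capacity: usize) -> Self {
                      Self { per_channel: HashMap::new(), capacity }
                  }

                  /// Called on every new message: append and drop the oldest beyond capacity.
                  fn push(&mut self, channel_id: u64, message: String) {
                      let buf = self.per_channel.entry(channel_id).or_default();
                      buf.push_back(message);
                      if buf.len() > self.capacity {
                          buf.pop_front();
                      }
                  }

                  /// Serve the newest `limit` messages from memory if we have enough,
                  /// otherwise signal the caller to go to the database.
                  fn latest(&self, channel_id: u64, limit: usize) -> Option<Vec<&String>> {
                      let buf = self.per_channel.get(&channel_id)?;
                      if buf.len() < limit {
                          return None; // cache can't satisfy this; fall back to the DB
                      }
                      Some(buf.iter().rev().take(limit).collect())
                  }
              }

              fn main() {
                  let mut cache = RecentMessages::new(50);
                  for i in 0..60u32 {
                      cache.push(42, format!("message {i}"));
                  }
                  // The newest 50 are served from memory; asking for more means a DB read.
                  assert!(cache.latest(42, 50).is_some());
                  assert!(cache.latest(42, 51).is_none());
              }
              ```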

    8. 2

      In the second graph, I wonder what caused the downward spike in loading at 11:45?

    9. 2

      I am very sad that it’s not an Excel spreadsheet :D

      1. 6

        One for every 1,048,576 messages?

        1. 1

          :D

    10. 1

      Amusing that they left MongoDB to scale better.

    11. 1

      This is interesting and well written. I’m not understanding the reason for the middle-tier request services vs. a caching layer. Is batching reads to the DB more performant than serving individual reads from memory, like Redis?
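
      If I’m reading the post right, part of the answer is that the data services coalesce identical concurrent requests rather than just caching them. A toy single-flight sketch of that idea (my own names, not their code): if several callers ask for the same key at once, only the first hits the database and the rest reuse its result, without ever serving anything stale.

      ```rust
      use std::collections::HashMap;
      use std::sync::atomic::{AtomicUsize, Ordering};
      use std::sync::{Arc, Mutex, OnceLock};
      use std::thread;
      use std::time::Duration;

      /// Toy "single-flight" coalescer: if several callers ask for the same key at
      /// the same time, only the first actually queries the backing store; the rest
      /// block on the same slot and reuse the result.
      struct Coalescer {
          in_flight: Mutex<HashMap<String, Arc<OnceLock<String>>>>,
      }

      impl Coalescer {
          fn new() -> Self {
              Self { in_flight: Mutex::new(HashMap::new()) }
          }

          fn get(&self, key: &str, load: impl FnOnce() -> String) -> String {
              // Grab (or create) the shared slot for this key while holding the map lock...
              let slot = {
                  let mut map = self.in_flight.lock().unwrap();
                  map.entry(key.to_string())
                      .or_insert_with(|| Arc::new(OnceLock::new()))
                      .clone()
              };
              // ...then run the expensive load outside the lock. Concurrent callers for
              // the same key block inside get_or_init and share one database query.
              let value = slot.get_or_init(load).clone();
              // Drop the slot so the next burst of requests reads fresh data.
              self.in_flight.lock().unwrap().remove(key);
              value
          }
      }

      fn main() {
          let coalescer = Arc::new(Coalescer::new());
          let db_queries = Arc::new(AtomicUsize::new(0));

          let handles: Vec<_> = (0..8)
              .map(|_| {
                  let coalescer = coalescer.clone();
                  let db_queries = db_queries.clone();
                  thread::spawn(move || {
                      coalescer.get("channel:123:latest-messages", || {
                          db_queries.fetch_add(1, Ordering::SeqCst);
                          thread::sleep(Duration::from_millis(50)); // pretend DB latency
                          "rows from the database".to_string()
                      })
                  })
              })
              .collect();

          for handle in handles {
              handle.join().unwrap();
          }
          // Eight identical concurrent requests should produce far fewer than
          // eight database queries (typically just one).
          println!("database queries: {}", db_queries.load(Ordering::SeqCst));
      }
      ```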