A while ago, I asked about traffic statistics for lobste.rs for a research project. Finally, the result of that work has now been published as the linked system, which speeds up lobste.rs by ~5x over that provided by MySQL. The source code is available on GitHub, and there is also a MySQL adapter that lets you run existing applications!
Great to see that data get used. One question I had after skimming the paper: how much of this performance improvement do you attribute to Noria’s data-flow computation? One of the frustrations comparing these things is that MySQL has a lot of features that Noria presumably doesn’t, makes it sort of apples-and-oranges. But I really like this model of incremental computation, I see it as a larger popular trend in programming tools.
Hmm, I’m not entirely sure I understand the question, but I think maybe there’s a misunderstanding about how the system works. Most of the performance improvement in Noria is due to the fact that basically all
SELECT
s are now direct cache hits. The data-flow is just the way that we ensure that the results for thoseSELECT
queries remain updated as the underlying data changes (e.g., as new votes are added). This design basically decouples the performance of reads and writes: reads are always fast, and the write throughput is determined by how many things the write touches; we’re nowhere near the limit of how many reads per second we could do for the lobsters queries (see the vote results). As queries get more complicated (and again, we can support all the queries in the lobsters source code), reads do not get any slower, only writes do. It is true that now the writes are slower, but they are also rarer. In the Lobsters workload at the moment, the biggest bottleneck is the write path for updates toread_ribbons
, and that’s what’s preventing us from scaling beyond 5x MySQL. Sharding that write path may be the way to resolve that issue down the line.As for feature parity, I don’t know exactly what you’re referencing? Is there a particular feature you’re worried about Noria not having that you rely on for MySQL? Not sure if it came across in the paper, but you can take unmodified applications that just use mysql client libraries and just plug’n’play them with Noria. At least that’s the idea modulo our SQL parser and query planner still not being quite as mature as MySQL’s.
So Noria maintains materialized views, sort of like flexviews but with automatic refreshing or like pipelinedb but base data is permanent (table) rather than ephemeral (stream). Also reminds me somewhat of ksql. And since it is the database, the application doesn’t need to handle complicated and error-prone cache invalidation (e.g. in the typical MySQL + memcache scenario). Pretty neat!
I had the same question about apples-to-oranges comparison though. For example, transaction support, foreign keys, different index types, triggers, rocksdb vs innodb implications.
Yup, you are totally right that there are features of more traditional databases that we do not yet support. This is still a research prototype, so it’s focused on the research problems first and foremost. We don’t believe any of those additional features to be fundamentally impossible in the Noria paradigm though — for example, we’re designing a scheme for adding transactions, and we believe we can do it without adding much overhead to query execution in the common case!
Some of these other features are also really optimization details. For instance, since Noria knows the application’s queries, it could automatically choose indexes that fit the query load (even though currently it only uses hash indexes). Similarly, RocksDB vs InnoDB shouldn’t matter to the application. We use RocksDB only for storing the base table data, not for storing anything else, so it’s mostly just there for persistence, and rarely affects performance.
As for foreign keys and triggers, those should be pretty easy to add, and mostly just need engineering, not research. In a sense, triggers are really just additional operators in the graph, so they’re almost a non-feature in Noria.
You may also find the discussion on Reddit interesting.
My question isn’t about how the system works, it’s about the breadth of MySQL, which pays a performance cost for lots of features I presume Noria doesn’t have. Multi-master setups, sharding, charset collations, many more data types, support for at least five operating systems, date and time functions, multiple storage formats, a million things. Even if Lobsters doesn’t use them, some of those are going to result in conditionals on the hot path to serving even very simple, performant queries like
select * from users where id = 123
and account for some of the performance difference. I say it’s sort of an apples-and-oranges comparison because Noria and MySQL have such different featuresets - if it were possible to compile a version of MySQL that dropped support for every feature Lobsters doesn’t use, I wonder if that wouldn’t be in the neighborhood of 5x faster. I have so little intuition for it I wouldn’t be surprised at 1.01x or 20x.Edit: ah, and after I hit post I reloaded the page to see @tobym made this point and you already responded to it. I’ll check out the reddit link. :)
In addition to my response to @tobym, let me try to address some of your specific concerns too. First, Noria already supports multi-machine distribution and sharding, and replication is already nearly done. Noria is also more flexible than MySQL in its data types, since it doesn’t have strict column typing. If we did apply the same schema strictness as MySQL, that would improve our performance, since we could specialize data-structures to known types. While it is true that we don’t support as many data types as MySQL, adding news ones is pretty straightforward, and we already support quite a few. Similarly, adding date and time functions should be straightforward – they are just new projection and filter operations. Noria should also run without modifications on Linux, macOS, and Windows.
As for multiple storage formats, Noria is, in a sense, arguing that you as the developer shouldn’t have to think about that. You should tell the database what your queries are, and it should determine how best to persist and cache the data and the query results. Are there particular features associated with the storage systems that you had in mind?
You are right though that MySQL does more than Noria does, and that that adds overheads that Noria does not have in some cases. However, most of Noria’s performance advantage comes from the model — computing on write instead of read — as opposed to implementation. MySQL fundamentally has to compute things on reads, whereas Noria does not, and with most operations being reads, that translates to speed-ups that MySQL cannot recover. It would be great to disable lots of MySQL features, but it is unlikely to change the picture much due to this fundamental design difference.
The one exception to this is transactions: it could be that transactions are just so expensive to provide, that the MySQL was is just way faster than anything you could achieve in the Noria paradigm. We don’t believe this to be the case though, as we already have a design sketch that adds transactions and strongly consistent reads to Noria while introducing nearly no overhead in the common case.
Dude, that’s awesome! Great work! I see you published the source. Were there any patents on this or that not a concern?
Thanks! We certainly haven’t filed for any patents, and I’m not aware of any patents relevant to this. As you can see from the paper’s related work section, this does build on a lot of insights from other fields and systems, but I think it nonetheless carves out its own little nook of data-flow and database research that others haven’t really explored before. As for publishing it as open-source, that has been my goal all along. I don’t personally have any desire for commercializing this, though I also think it’s something that would work very well under a Redis-like or Postgres-like model where it could be a serious open-source production system with enterprise support.