I think the biggest challenge with SQLite being used by multiple servers is that even with the various replication tools (Litestream, LiteFS, etc), you still need to mount whatever disk your primary/write database is stored on onto every server. So if you have 20 servers, that disk needs to be mounted 20 times. This pretty much means you have to use NFS, CephFS, or others. At this stage the complexity and suffering you’re introducing is far greater than using e.g. PostgreSQL, even though running PostgreSQL at scale is rather frustrating. I not so fondly remember all the suffering we had to deal with at GitLab when we were still running NFS for our Git storage. Brrr…
Of course if you only have a single server I think SQLite makes perfect sense, and is probably a far better choice than the alternatives.
What? The whole point of LiteFS is that it’s NOT NFS, it’s its own FUSE FS. LiteFS copies the data from the primary to the secondary; it’s like other databases that have replication. You don’t use Ceph to replicate Postgres to Postgres, you use Postgres’s network replication protocol. The SQLite docs heavily discourage using NFS.
You still need to access the primary for writing, which means you still need to mount the disk everywhere. Or does LiteFS do some sort of multi master setup where it replicates to other primaries?
Don’t mount the disks everywhere, add an RPC request endpoint on the primary so that replicas can ask it to do whatever writes they need. Just like with Postgres, you can’t write to a replica. Unlike Postgres you have only one path into the database, through your RPC layer, without a lower-level distributed database access protocol. (Or just yolo it by invoking the sqlite command on the primary over ssh.)
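A minimal sketch of that write-forwarding idea in Go (the route, table, and schema are hypothetical): the primary is the only process that ever opens the file, and replicas POST their writes to it.

// Hypothetical write-forwarding endpoint on the primary. Replicas never
// open the database file; they send writes here over HTTP instead.
package main

import (
	"database/sql"
	"net/http"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "/data/app.db") // the primary's local file
	if err != nil {
		panic(err)
	}
	http.HandleFunc("/notes", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "writes only", http.StatusMethodNotAllowed)
			return
		}
		// Assumes a notes(body TEXT) table already exists.
		if _, err := db.Exec(`INSERT INTO notes (body) VALUES (?)`, r.FormValue("body")); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusCreated)
	})
	http.ListenAndServe(":8080", nil) // sketch: ignore the returned error
}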
You can write to replicas in Postgres, just not with physical replication.
OK, but writes to Postgres replicas are not propagated to the primary or elsewhere unless you have extra machinery, in which case it’s probably easier to set up machinery to write to the primary, instead of getting into a much more complicated distributed consensus problem. (Those who have a large enough system that writes to the primary are a bottleneck will have different tradeoffs.)
Postgres supports bidirectional logical replication now fwiw.
https://www.highgo.ca/2023/12/18/new-in-postgresql-16-bi-directional-logical-replication/#:~:text=In%20short%2C%20Bi%2DDirectional%20Logical,be%20streamed%20to%20one%20another.
Oh, cool, thanks for pointing that out!
I gather that the filtering feature mentioned in your link allows a replica to avoid re-propagating changes that it received from the primary (aka loopback prevention).
But conflict handling isn’t part of core Postgres yet?
As in this configuration option for the third-party pglogical extension.
It sounds to me like anyone deploying this feature would need to do some application-level partitioning to avoid conflicts.
That seems right to me.
(I am on the team that built/maintains pglogical but I’m still new.)
This post caught my eye because we’re considering sqlite for lobsters, but this is not really a compelling set of points.
“you now need, at the very least, backups” This is true of every database; how is this a criticism of sqlite? “If you want to run the service across multiple machines you can use LiteFS” It seems like the core use case for sqlite is situations where your data and workload all fit on one machine. I’m unfamiliar with LiteFS beyond this blog’s description, but that description actually argues in favor of sqlite: it presents LiteFS as either a solution to that limitation or a bridging strategy while you migrate to another database.
“Migrations are not great in SQLite.” This sounds like a real concern, but “not great” is vague, and the article just suggests you “search online” to understand the scope of the problem. Maybe this is a compelling limitation, but as written these are two empty sentences.
“Decoupling storage from compute is the default architecture” This is point 1 again.
“Migrating from SQLite is not incredibly hard, but it’s not easy, and it’s still pointless work.” This is point 1 again. It’s also wrong: if any dependency or strategy stops being the correct decision, it’s not at all pointless to rework it. “SQLite, by default, is not very strict with your data, like enforcing types or foreign key constraints” This is a useful criticism; it seems like every time sqlite comes up I see a link to another list of settings, and I’d say reviewing the options is mandatory, whereas the defaults on mariadb/postgresql are fine for most apps. And “very different latency profiles” is a useful warning about that migration path; every app ends up depending on a db-specific feature but 1 + n queries is a pretty big one to rework.
I’m not well-informed enough to agree or disagree with this post, just frustrated that it doesn’t make the best case it could.
Re: migrations: SQLite doesn’t support some commonplace ALTER TABLE behavior. Instead, you have to:
Disable foreign key enforcement (this has to happen outside the transaction; PRAGMA foreign_keys is a no-op while a transaction is open)
Start a transaction
Create a new table with your NOT NULL column without default, a constraint, whatever
Copy data from the old table to the new table
Drop the old table
Rename the new table to the old name
Commit, then re-enable foreign key enforcement
For large tables this will lock the database for a while as the data copies over.
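A minimal sketch of that dance through Go’s database/sql (table, column, and constraint are hypothetical; a real migration would also recreate indexes, triggers, and views):

package main

import (
	"database/sql"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "app.db")
	if err != nil {
		panic(err)
	}
	// Pin to one connection so the PRAGMAs below apply to the same
	// connection the transaction runs on.
	db.SetMaxOpenConns(1)

	// Must run outside the transaction: PRAGMA foreign_keys is a no-op
	// while a transaction is open.
	exec(db, `PRAGMA foreign_keys = OFF`)

	tx, err := db.Begin()
	if err != nil {
		panic(err)
	}
	// The new shape we couldn't ALTER in: a NOT NULL column plus a CHECK.
	execTx(tx, `CREATE TABLE users_new (
	    id    INTEGER PRIMARY KEY,
	    email TEXT NOT NULL CHECK (email <> '')
	)`)
	execTx(tx, `INSERT INTO users_new (id, email) SELECT id, email FROM users`)
	execTx(tx, `DROP TABLE users`)
	execTx(tx, `ALTER TABLE users_new RENAME TO users`)
	if err := tx.Commit(); err != nil {
		panic(err)
	}

	exec(db, `PRAGMA foreign_keys = ON`)
}

func exec(db *sql.DB, stmt string) {
	if _, err := db.Exec(stmt); err != nil {
		panic(err)
	}
}

func execTx(tx *sql.Tx, stmt string) {
	if _, err := tx.Exec(stmt); err != nil {
		panic(err)
	}
}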
Additionally, this is an error-prone process. Do this enough times and eventually you might get the steps wrong. Even if you don’t, I had an ORM delete data on a small project when it executed statements in an unintuitive, undocumented, and nondeterministic order, running the DROP before running the statement disabling foreign keys. The ORM’s built-in schema commands weren’t susceptible to this issue, but SQLite required me to use a lower-level approach.
I’m very happy with the solution I built for this problem, which I’ve been using for several years: https://sqlite-utils.datasette.io/en/stable/cli.html#transforming-tables
Just so happens that there’s a “migrations” focused post also on Lobsters today.
To me this does not sound much less nutso than mysql online schema change machinery or vitess or whatever. Postgres is better about this, sure. But sqlite is not bad: all the DDL is fully supported in transactions and it has a handy slot for storing your schema version number. If you’re using an ORM that is shredding your data, like, don’t use that software, but it’s not SQLite’s fault.
Yeah I agree this post does not make a compelling case. I have my own reasons I don’t love SQLite (really limited builtin methods and a too-flexible grammar) and I work for a Postgres company but from reading this article alone I can’t see a coherent reason not to use SQLite on the server.
I can’t see a coherent reason not to use SQLite on the server.
Did you miss the part about needing to share a disk if you want more than one application server? Granted, not all applications need more than one copy running…
I guess the other option is that each application server has its own sqlite and you use Raft or something to drive consensus between them… which maybe works if you can have at least 3 of these servers, at the expense of extreme complexity. :)
Granted, not all applications need more than one copy running…
Surely that’s the important point? SQLite is great when everything fits on a single machine, which is really the case for the vast majority of projects [citation needed].
When something is big enough to need multiple machines then it’s time to switch to a separate storage layer running PostgreSQL.
Isn’t this all just a case of picking the right tool for your situation?
This is the point of LiteFS - it replicates writes to other boxes - no NFS needed (and iirc locks on NFS were unreliable, so you probably don’t want to run the database file on NFS).
You DO need to take care to send writes to the leader, but Fly (where the author works) makes this pretty easy. Litestream (same author) is similar but is more of a continuous backup tool.
Right. A thing you can do is make your architecture more complicated by introducing LiteFS, or LiteStream, but it’s no longer “just sqlite” at that point.
I’ve been running Litestream for a while now and it’s VERY straightforward. You start the process and point it at a SQLite database; it inexpensively streams WAL blocks to S3 or similar, and you can use those for recovery.
It’s so simple, robust and cheap that I see it as a selling point for SQLite as a whole - much less complex/expensive than achieving point-in-time replica backups with PostgreSQL or MySQL.
Yeah but again, these work with only a single instance at a time. If that’s your setup it’s great, but many people need multi-instance setups and mounting a shared SQLite file on NFS in some cloud provider is not a great idea.
But it isn’t an NFS in the cloud: it replicates the WAL, which is then made available, yes, via a FUSE file system, but it’s not NFS. It’s not really any different than read replicas for Postgres or MySQL, and it seems to be a lot easier to get going. You DO have HA on your DB, don’t you?
If you don’t think it works for you, that’s fine, but I think you should understand how it works before making that decision.
Yeah but the point is that a cloud vendor DB gives me read replicas out of the box, I don’t need to set it up manually. Added to the other points, this makes it kinda hard to argue against PostgreSQL for a typical backend app.
The same is true with sqlite as with other dbs. With Turso/libsql or litefs you can have a read replica materialized on the local nvme ssd of every application server. I’m not arguing SQLite is a better option for every app; like everything in engineering, the answer is “it depends”. But it feels unfair to dismiss SQLite-based architecture because the sqlite3.c source file doesn’t directly automate some need, dismiss the sqlite vendors too, and then compare that to Postgres + unlimited ancillary vendor support & automation.
Turso is one vendor. Cloudflare is another but its SQLite offering has a fairly low max database size. Meanwhile, most cloud providers have mature support for MySQL/Postgres. If the cloud provider situation changes for SQLite in the future and it is offered more widely, then the analysis changes, but until then I believe it’s fair. And so do the folks who actually make SQLite.
Sure… if you’re using AWS, Azure or Google… and yeah, even fly.io (though theirs is NOT managed) does too. But really, it’s running a single executable in the background - replaying or routing writes to the master is more work of course, but it’s still pretty easy. And given the stance of the US govt lately towards my country, I am likely NOT to use any of those, preferring to host domestically (sorry to bring politics into this - but everything is a trade-off for making technical decisions, including that).
I once set up HA Postgres on Linode (floating IP, NBD, and the watchdog) and it worked, but it was cargo-culted and seemed very fragile in its complexity. LiteFS in comparison seems clean and elegant, would be very robust against SPOFs, and as a benefit moves the db reads to the edge, nearer to my hypothetical users.
But like I said, I am not running anything using it at the moment. The one project I AM considering doing would use Postgres for other reasons.
Right, if you want multi-instance setups you need LiteFS, not Litestream.
To be clear, I’ve not used it, but from all appearances it looks like it adds very little overhead or complexity.
every app ends up depending on a db-specific feature but 1 + n queries is a pretty big one to rework
I think there might be a few cases in which SQLite forces one to use multiple queries where PostgreSQL wouldn’t, but it doesn’t usually, does it?
From that linked page:
So, SQLite is able to do one or two large and complex queries, or it can do many smaller and simpler queries. Both are efficient. An application can use either or both techniques, depending on what works best for the situation at hand.
“the situation at hand” could include “I want to leave open the option of replacing the RDBMS”.
“you now need, at the very least, backups” This is true of every database; how is this a criticism of sqlite?
Well, the article is pretty clear in what it’s trying to say, quote:
The value of SQLite is that it’s infrastructure-less. You don’t have to run anything additional to use it.
Which is saying, many choose to use sqlite because they don’t need to run anything extra - they just need to link their code with sqlite and it works. And trying to use sqlite on the server side negates that benefit. So, if that’s not why you want to use sqlite to begin with, this point doesn’t affect you at all.
There is no date on this web page, but it doesn’t cite any prior art or current art, like
https://research.google/pubs/sql-has-problems-we-can-fix-them-pipe-syntax-in-sql/
https://www.scattered-thoughts.net/writing/against-sql/ - talks about compositionality
https://www.edgedb.com/blog/we-can-do-better-than-sql (2019)
Lots of people are working on this, and from skimming, it doesn’t seem like the author is aware of them
I think it was published yesterday.
Don’t forget The Third Manifesto https://wiki.c2.com/?TheThirdManifesto
The RSS feed has an article publish date. But yeah, the page itself should really be dated.
imo, nothing implies the author is unaware of them. This is a contribution to the conversation of solving these problems, like those are. They explore similar ideas but the details and some motivations are fairly different.
I’m curious what your impressions/summary of such efforts are (hoping you’re conversant in current art.)
Tangentially, are there any attempts to push SQL closer to relational calculus/datalog?
I don’t have a real informed opinion, but I found it interesting that the pipe syntax was tried in sqlite:
https://news.ycombinator.com/item?id=41347188
That is generally how things move forward – I think Google actually deployed this, and then another implementation copies it
(or doesn’t in this case – I think he said he would wait until it’s in the SQL standard)
But having something that’s actually implementable is a very good sign.
yup, the paper says
We’ve implemented pipe syntax in GoogleSQL[19], the SQL dialect and implementation shared across all SQL systems at Google including F1[28], BigQuery[2], Spanner[19] and Procella[18], and the open-source release, ZetaSQL[12]
I did not know they all shared the same dialect!
No one shared my favorite resource on the topic: https://use-the-index-luke.com/no-offset.
I remember reading that page quite some time ago (lots of good stuff on that website!), it got me to do some of this style in my framework but yeah, the UI just wasn’t very good so I put it back on the backburner. Let’s look at the ui advice there:
However, this is not a problem when using infinite scrolling. Showing page number to click on is a poor navigation interface anyway—IMHO.
…yeah, I’m not a fan of infinite scroll; that’s basically the worst of all options. But I do try to think about how to make it less bad. I still think some kind of two-piece scrollbar is the way to do it. Inspired by the focus knobs on a microscope, one is a coarse selector which gives you the “page number”, then the other one is the fine adjustment, scrolling inside that page. But, what, exactly, is the page number widget? What happens when you get to the edges of a page?
The one nice thing about infinite scroll (when it isn’t manipulative anyway) is reducing discontinuities - you can just keep hitting the same button over and over again and keep going, without having a case where two adjacent items cannot be seen together.
(actually, my favorite pagination scheme is “none”. Just dump it all on me to the limits of that being possible. Then I can scroll it, jump around it, ctrl+f search it, etc. But when there’s too many things, the scrollbar becomes unusable because the handle is just too small and the slightest one pixel movement can jump hundreds of items… so that’s really the degenerate case I’m hoping to avoid. Lazy loading things as you scroll can be done as an implementation detail if the ui works right.)
Anyway, my first thought for the coarse selector was that it represented a range. So, for example, it goes from the time of the first message to the present and you’d select any time range in there, say, like last two days to now. Then it loads that up, select stuff from posts where date >= 48 hours ago and date < now, and that populates the fine control scroll bar. The problem is there’s some days when nobody posts and other days when everybody posts. So who knows what you get back.
But this keyset cursor does provide another idea: the coarse control still shows first and last entries to set its boundary range, but the size of the handle is determined by the lookup. Not date >= 48 hours ago and date < now, but instead where date >= 48 hours ago LIMIT 500 (or whatever number is reasonable to still be usable in your fine control scroll bar. It might even change by default with the size of the window. Taller window = load more. BTW again, you might not even actually load it, just pull the index so you know how many items are there and construct the scrollbar sizes, then lazy load the details as they scroll into view.)
So now the coarse scrollbar’s handle size changes to indicate where the last item actually was. If the last item is at the present, it fills up to the present. If yesterday was talkative and there were thousands of posts, it shows a little sliver to indicate it covers only a short period of time.
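A rough sketch in Go of the lookup that would size that handle (schema hypothetical; assumes created_at is stored as a DATETIME): pull up to one page of rows after the cursor, then report the time span those rows actually cover.

// Hypothetical keyset lookup for the coarse/fine scrollbar: fetch up to
// `limit` rows after the cursor and report the span actually covered,
// which determines the coarse handle's size.
package pagination

import (
	"database/sql"
	"time"

	_ "github.com/mattn/go-sqlite3"
)

type span struct {
	from, to time.Time // what the coarse handle should cover
	ids      []int64   // rows backing the fine scrollbar
}

func pageAfter(db *sql.DB, cursor time.Time, limit int) (span, error) {
	rows, err := db.Query(
		`SELECT id, created_at FROM posts
		 WHERE created_at >= ?
		 ORDER BY created_at, id
		 LIMIT ?`, cursor, limit)
	if err != nil {
		return span{}, err
	}
	defer rows.Close()
	var s span
	for rows.Next() {
		var id int64
		var at time.Time
		if err := rows.Scan(&id, &at); err != nil {
			return span{}, err
		}
		if s.ids == nil {
			s.from = at // first row anchors the handle
		}
		s.to = at // last row seen so far extends it
		s.ids = append(s.ids, id)
	}
	return s, rows.Err()
}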
(BTW I think all scrollbars ought to have some extra controls too, like the ability to type in an input instead of just drag and drop to it. So assume that’s part of the widget too.)
What happens when you scroll to the edges of the fine control though? We don’t really want to make the user go over to the coarse bar and have to drag it. No, ideally, their attempt to scroll up again past the top of this will automatically pull the “previous” keyset cursor, adjusting the coarse control for you in that single action…. and then it can keep some of the stuff on screen at the time as part of this page, or maybe the previous cursor has some deliberate overlap or something, but I guess that isn’t strictly necessary, since you should be able to just hit scroll down to go back to where you were anyway. The fine scroll bar would then go from the top back to the bottom, indicating you’ve changed pages. Not drag friendly though - when mouse dragging, you should NEVER change pages. You must stop dragging and hit the up arrow to trigger the page change. This avoids the scrollbar jumping position in the new page, then your next drag disorients you. (Unless you were to warp the mouse along with the scrollbar but…. probably don’t do that, that sounds wrong to me. And browsers can’t do that anyway.) So you get to the top and the “more” thing appears right there for you to tap/click, and the traditional up arrow button on the scrollbar, up key, mouse wheel, etc., can do it too. As long as it is an explicit, clearly intended action. Ditto on down for the next page.
The coarse control btw might not be a bar per se, maybe it can expand into a calendar or something, but I do think a bar of some sort is a good indicator. Depending on what column you sort on, it might be a timeline, or an alphabet, or a number line, whatever. It might have its own zoom function or multi level control if it spans too much area to be easily used on screen.
…I think this is something workable. The coarse bar has a thick line indicating that is the manipulateable thing that is your keyset cursor, then a shade indicating the direction it is loading and how much it managed to cover inside the limit… I gotta get implementing to try it for real now lol
As the author of the post - I completely agree! My recommendations are mostly limited by exposure - I have mostly worked with RDBMSes of various flavours for years, running them at various levels of scale, along with Kafka and boring KV stores.
I feel like streaming is enough of its own thing to warrant its own post - I find it hard to talk about things like Kafka without talking about the wider stream processing ecosystem (Benthos/Bento, Flink, Materialize etc).
It’s interesting that you group them like this - I wouldn’t group SQLite and PostgreSQL together due to different operating models, but that’s probably more due to personal biases around operational concerns.
So if I understand this right, the intended usage is that you model the expected behavior, you instrument your real code to produce the traces and then you deploy it or fuzz it to get a lot of traces you can check for linearizability violations?
In practice what’s the easiest way to capture the traces? Just log some JSON? Derive it from otel spans?
You do need to ensure that whatever system records the history of events is itself Linearizable (hand-waving a bit here; you can actually get away with slightly weaker properties). That could be a single-node program which records invocations and completions in a Linearizable in-memory structure. Or you could journal them (synchronously!) to a Linearizable database before invocation and after completion, in production.
What you can’t do is, say, stamp log lines or OTEL events with local node clocks–those aren’t necessarily synchronized, and the checker might tell you your program violated Linearizability when it didn’t. You also can’t send them to a database asynchronously, because the message recording the invocation might be delayed and arrive at the database after the operation was actually performed. You also can’t use a Serializable or Causally consistent database, because it might fail to record your events in real-time order. A Sequential log is OK, but the scope of the Sequential object has to be the entire log; otherwise you could (you guessed it!) get reorderings.
Another thing to be aware of is that if you log to the same datastore that’s under test, you could influence the results. For example, if your data store has an issue with stale reads in read-only transactions, and you add an “insert log line” between every read transaction down the same connection, you could see the bug become invisible.
I mean… no? Imagine you detected a Linearizability violation in a history recorded using timestamps on, say, Linux machines. From this you could conclude that either a.) the system being measured violated Linearizability, or b.) the history-collecting system had, at some time, poor synchronization; or the VM paused; or the OS paused; or the application paused; and so on. It doesn’t allow you to make a provable claim of correctness or incorrectness.
If you’re talking about something like EC2 timesync, where clock errors are ~50us, and your storage system is remote and best-case a couple hundred microseconds, and you’re careful about which end of the clock bound to pick, you can still get good results.
NTP in general, probably not, unless you’re very careful about the way you set it up and have great time infrastructure.
would it work to use lamport clocks? or stamp messages with the ones they’ve seen and then topologically sort (functionally the same thing but different)?
Sort of. You can use messages between nodes and a strategy like Lamport clocks to establish a partial order which is consistent (because messages always flow forwards in time) with the real-time order. However, you’d fail to detect violations of Linearizability over timescales shorter than the message propagation interval.
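For concreteness, a tiny sketch of the Lamport clock rules being described (illustration only):

// Lamport clock rules: bump on local events and sends, attach the value
// to outgoing messages, and take max(local, received)+1 on receipt. The
// resulting partial order respects message flow but not wall-clock time.
package main

import "fmt"

type Clock struct{ t uint64 }

// Local covers both local events and sends; the returned value is what
// you'd stamp on an outgoing message.
func (c *Clock) Local() uint64 {
	c.t++
	return c.t
}

// Receive merges the timestamp carried by an incoming message.
func (c *Clock) Receive(remote uint64) uint64 {
	if remote > c.t {
		c.t = remote
	}
	c.t++
	return c.t
}

func main() {
	a, b := &Clock{}, &Clock{}
	msg := a.Local()            // a's event 1, stamped on a message
	fmt.Println(b.Receive(msg)) // b jumps past it: prints 2
}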
wouldn’t any really linearisable logging scheme induce synchronisation between nodes, potentially also causing you to fail to detect some bugs? are there bugs that are still systematically more likely to be caught or can only be caught that way? (it seems not very nice because it sacrifices scalability)
IIUC, the difference between sequential consistency and linearisability is effectively that linearisability is consistent with communication ‘outside the system’ where seqcst doesn’t have to be. and lamport clocks or causal ordering should not miss sequential consistency violations. but if there are any classes of ‘external communication’ that we’re interested in in some context, couldn’t we try to bring those into the testing/logging system too? (i think this is similar to using a linearisable logging system, except that you don’t record all the ordering relationships—so you don’t get to full linearisability—only some of them, but what you do record you get in a distributed way)
wouldn’t any really linearisable logging scheme induce synchronisation between nodes, potentially also causing you to fail to detect some bugs?
Yep! It’s actually impossible to measure precisely. Jepsen does this by putting all clients on a single node, allowing fast connections between clients and servers, and (optionally) adding latency between servers; this lets it see errors “faster than light”, so to speak.
(it seems not very nice because it sacrifices scalability)
In general, I wouldn’t worry about scalability. Linearizability checking is NP-hard and relies on total histories–both of those are already bad for scalability. Even if that weren’t the case, you’ll get the most bang for your buck testing in a controlled environment where you can induce failures, rather than in production, where failures are (I hope!?) comparatively infrequent.
Lamport clocks or causal ordering should not miss sequential consistency violations.
I’m not sure what this means exactly, but I should note that Causal is, in general, weaker than Sequential.
But if there are any classes of ‘external communication’ that we’re interested in in some context, couldn’t we try to bring those into the testing/logging system too?
Sure, you could do that, but then the property you’re measuring is something like Causal or Sequential, rather than Linearizable.
Yes you’ve got it! If the project is in Go you could just write the fuzzer and test harness in Go. Otherwise yes you’d want to write some intermediate format from your system and then write a Go program to convert it for Porcupine.
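A minimal sketch of what that Go harness might look like with Porcupine (github.com/anishathalye/porcupine); the register model and the hardcoded history are hypothetical stand-ins for operations parsed from your traces:

package main

import (
	"fmt"

	"github.com/anishathalye/porcupine"
)

// regInput describes one operation against a single int register.
type regInput struct {
	write bool
	value int
}

func main() {
	// Sequential spec: writes set the register, reads must return it.
	model := porcupine.Model{
		Init: func() interface{} { return 0 },
		Step: func(state, input, output interface{}) (bool, interface{}) {
			in := input.(regInput)
			if in.write {
				return true, in.value
			}
			return output.(int) == state.(int), state
		},
	}

	// One Operation per invocation/completion pair from your trace.
	// Call/Return timestamps must come from a linearizable recorder.
	history := []porcupine.Operation{
		{ClientId: 0, Input: regInput{write: true, value: 1}, Call: 0, Output: 0, Return: 10},
		{ClientId: 1, Input: regInput{write: false}, Call: 5, Output: 1, Return: 7},
	}

	fmt.Println("linearizable:", porcupine.CheckOperations(model, history))
}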
We added it in 2013. It lasted about three months (code). (A restricted version was proposed but didn’t get much interest.) In short, it was used as “I disagree” or “I dislike”, it didn’t improve anything. The “Hide” feature has worked out well. I’d be open to adding it back if there’s something different about it or the circumstances that seems like it’ll work out better, but so far this is pretty much exactly what we tried before.
I’ve been kicking around a similar idea with similar goals: remove the -1 to score from flagging to disincentivize people using flags to knock down things they’re not interested in or disagree with. As jcs predicted in that second thread, it’s still a problem and I occasionally DM people who make a habit of it to ask them to knock it off. Maybe twice it’s been a user following another around the site and flagging all their stuff, but it’s worth noting flags are infrequent and visible to mods, so that tends to pop out to us. As a second part of the change, make hiding a story give the score a -1 if the user also does not comment or vote on any of its comments. I’m trying to thread the needle so that there isn’t a value to misusing flags. I think this would reward the right things and would appreciate criticism.
Keeping the -1 makes a lot of sense, but I do like the requirement to comment on the story if you’re also flagging it.
Now of course, the flipside of this is that–as somebody who generally comments and explains my flags–the community needs to quit bitching every time somebody actually attempts to flag and explain content that’s bad or off-topic. The amount of “you’re not my real mod” I’ve seen occur is incredibly tiresome.
People use Hiding so much that this change has marked the database posts I normally submit as 0/going negative. I think they’re good posts but now my submission history looks like I’m submitting garbage.
This completely disincentivizes me from posting anything, since of course people with diverse interests here will want to hide things they aren’t interested in. I use Hiding to remove stuff I just don’t personally want to see. I use flagging when something is low effort. I also don’t want my own Hiding of something to penalize the post.
I truly appreciate your efforts with the community but I am not going to submit posts anymore if/while Hiding a post counts as a -1.
I truly appreciate your efforts with the community
First of all, I have to say that I appreciate the efforts as well! I actually thought it was a great idea, but seeing this change in action made me realize that it was wrong (or at least that it needs more tuning).
A lot of the value I find here lies in the wide diversity of topics. Frequently, the stories I learn the most from do not score very well (around 10 or less). Of course, many of the posts I submit also do not score very well.
I think it’s OK and valuable to have such posts around. Some people get value from them and they disappear quickly from the first page anyway.
With this change, as @eatonphil says, many of these stories will not be submitted anymore. And Lobsters is going to look a lot more like a mono-culture.
I believe that part of the problem is that the upvotes are too random to represent anything significant (because we often forget or whatever), while hiding a story is never random.
I actually submitted a bug on GH for this because I was seeing so many on-topic submissions having zero or negative scores. Turns out it was Working As Designed.
I’m not a fan of this change either, at least not right now. It might shake out to be ok once people learn that hitting “Hide” is now a downvote. But I don’t think it works semantically. I don’t filter anything, because I like seeing what’s on the site. But I do get a bit tired when Rust nerds (for example) get into the weeds about something they really enjoy, and I use the “hide” feature to remove that particular source of comments from the /comments feed. I don’t have anything against Rust, it’s often on-topic and good quality, it’s just not something I’m always interested in.
Specifically, I feel that the action of a small number of people (the ones that are not interested in certain topics and are not filtering) have an outsized effect on the margins for new submissions. A few grumpy users who might expect a flood of comments they’re not interested in can keep an on-topic and otherwise high-quality submission from the front page.
@pushcx has the number of “hide” submissions for new submissions increased since the change? It’s not a factor I’ve made any note of before, and it’s also nothing that’s reflected in the JSON output.
I can’t answer, the database didn’t timestamp the record that a user had hidden a story. I’ve added it but there’s no historical data to compare against.
So I saw that story hiding was pretty infrequent, but I didn’t realize that people are often hiding stories within minutes of them appearing, before anyone has upvoted. In talking with these users, it’s most commonly because it’s a way they can mark the last time they read /newest. (I’ll add a line for this, similar to hckrnews.com.) But it wasn’t at all the only answer and I’m convinced that hide just has too many other uses and other reasons for this change to work.
It’s also been uncommon for stories to hit zero or negative on Lobsters for a long time. I also shared the unexpectedly strong emotional reaction to seeing /newest with several grayed-out stories; there’s an “oh no” in the pit of my stomach. I thought of it as making scores a little more dynamic but underestimated the effect of the difference. I’m sorry to anyone who shared this feeling, especially submitters.
I’m backing out the change to try a sort of middle position, so that there’s a -1 to score if a user flags and hides the story. I’m still hoping to reduce the impulse to use arbitrary flags as a “disagree and punish” button. (Me individually DMing users who do this is unsustainable, especially as it’s now rare for someone to do it often, so the situation is ambiguous and the conversation is a delicate and disagreeable one.) To look at it from the perspective before hiding affected scores: flags only reduce a story’s score if the flagger is also so done with the story that they don’t want to see it again. The code is here and I’m running update_score_and_recalculate! on all stories submitted or hidden since Monday to revert the effects.
I appreciate folks giving it a couple days, and thank you for all the feedback and ideas for future improvements. I’m sorry this change was so unpleasant and hopefully future mistakes continue to be small and survivable.
Appreciate the consideration, pushcx! I was very worried I’d have to find some new community to discuss all the various systems programming topics we do here. I’m relieved to need not. :)
In retrospect it’s obvious. But I too had tunnel vision when you talked about it, and only considered the “hiding by disagreement” side of this. Even though I did hide things previously, as I don’t like to filter full tags.
Sorry, I tried to cram a lot in that sentence. I mean: hiding a story gives it a -1, but only if the hiding user also hasn’t commented or voted in the thread. Sometimes people get angry in an argument and reach for any site feature that might punish the person they disagree with. So hiding would only reduce the story’s score if the hider really just didn’t want to see the story.
Given that we can only hide stories but it might become a moderator signal, is there a world where similar logic applies to comments? A harsher alternative to simply hiding the conversation, I guess.
I hate these because at the end of the day it’s always a case of “it depends”. At the small or medium scale, most of these things work more than well enough, but once you end up using postgres as a key-value store with a high update rate, I’ve seen it crumble.
Need to delete a user’s millions of rows inside a table with billions of rows? Good luck.
Need to update a very busy table (Add/drop constraints or columns)? Just forget it.
The query planner made the decision to choose a plan that is better in 99% of cases, but has catastrophic performance in the other 1%? Be ready to be told it’s your fault and the Query Planner is smarter than you.
Try being clever and use a conditional index where the indexed rows should be empty most of the time, but in case of failure they might start piling up to millions of rows, and then the planner still thinks it can do a full index scan because its stats say it really has a single row in it, and then all queries fail with timeouts? Don’t be clever.
Anyway, Postgres is a wonderful piece of technology, but I stopped counting how much time I lost so far trying to use it for everything.
I’ll stop here, and I’m sure someone could come in and tell me how to solve these problems, but I believe from my research that none of these have an obvious solution and they always happen at the worst time.
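For the record, the batching everyone suggests for the mass-delete case looks something like the sketch below; it avoids one giant lock and a bloated transaction, but it only spreads the pain (table name and batch size hypothetical):

// Delete a user's rows in small batches so no single statement holds
// locks or churns WAL for minutes. The ctid subquery is a common
// Postgres trick for "grab me the next N matching rows".
package cleanup

import (
	"database/sql"

	_ "github.com/lib/pq"
)

func deleteUserRows(db *sql.DB, userID int64) error {
	for {
		res, err := db.Exec(`
		    DELETE FROM events
		    WHERE ctid IN (
		        SELECT ctid FROM events
		        WHERE user_id = $1
		        LIMIT 5000
		    )`, userID)
		if err != nil {
			return err
		}
		n, err := res.RowsAffected()
		if err != nil {
			return err
		}
		if n == 0 {
			return nil // no matching rows left
		}
	}
}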
PS: I’m still looking for something like Postgres where I could create a “1 table per user” database with millions of users and dropping these tables would be instant.
Ready for an insane take? SQLite works really well for this, where you create 1 entire database per user. You can cram them anywhere, S3, on some disk, on whatever storage, and make it performant enough.
I work with an application which is SQLite all the way down and it’s remarkably performant. It has several layers of ‘metadata databases’ that go 2-3 layers deep.
You could absolutely do ‘postgres database as metadata store for sqlite files,’ for a ‘have your cake and eat it too.’
You could take this a step further and do something like just store XML/JSON in an S3 bucket with an object-per-user if you want something completely denormalized. Could also use Parquet, Protobuf-encoded messages, or something like that if you want to be hip. ;)
Yeah this is basically how we solved this with scaling the AT Proto PDS to hundreds of thousands of users per instance and millions of users across instances.
One SQLite per user, only ever one writer to the DB, keep recently used DBs open in a LRU cache, add one or two metadata SQLites for the whole instance for things like global event sequencing, user registry, etc.
It works great and feels like a “lite” form of ScyllaDB and/or other Cassandra-style NoSQL DBs.
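That LRU of open handles can stay very small; a rough sketch in Go (paths and cache size hypothetical; eviction of handles still in use is glossed over):

// One SQLite per user with an LRU of open handles: look up the user's
// *sql.DB, opening and evicting as needed.
package userdb

import (
	"container/list"
	"database/sql"
	"fmt"
	"sync"

	_ "github.com/mattn/go-sqlite3"
)

type entry struct {
	user string
	db   *sql.DB
}

type dbCache struct {
	mu    sync.Mutex
	max   int
	order *list.List               // front = most recently used
	open  map[string]*list.Element // user -> element holding *entry
}

func newDBCache(max int) *dbCache {
	return &dbCache{max: max, order: list.New(), open: map[string]*list.Element{}}
}

func (c *dbCache) get(user string) (*sql.DB, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.open[user]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*entry).db, nil
	}
	db, err := sql.Open("sqlite3", fmt.Sprintf("/data/users/%s.db", user))
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(1) // single writer per user DB
	c.open[user] = c.order.PushFront(&entry{user, db})
	if c.order.Len() > c.max { // evict the least recently used handle
		el := c.order.Back()
		c.order.Remove(el)
		e := el.Value.(*entry)
		delete(c.open, e.user)
		e.db.Close()
	}
	return db, nil
}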
How are the compute and data planes linked? How do the instances get at the SQLite databases? How are they persisted? Are instances stateless or sharded with a single reader/writer? How are hot keys (users) handled?
Each PDS instance is just one big baremetal box from OVH with redundant local SSDs.
The instances back up their data directory regularly (including all the SQLite DBs) via RClone and the event sequencer shared SQLite is litestreamed to S3 for backup.
We can recreate state missed between backups via replaying the event stream.
Hot keys aren’t much of a problem, write throughput on each PDS is somewhere like 2-10/sec across all users on a PDS. Each SQLite can handle up to 1K writes/sec with a single writer, so the contention there ends up shifting to the shared event stream DB.
We rate-limit writes per user to reasonable per-second, per-hour, and per-day limits to prevent hot shards as well (something like 15k writes/day per user is a reasonable maximum for our app atm).
Each SQLite is single-writer but could be multi-reader depending on read query throughput.
The key insight you have there though is “1 database per user” and I think you could apply that same principle if you wanted to without the cost of leaving Postgres by giving each user their own schema, no?
It is probably easier to scale up more SQLite instances than postgres. Simply creating more postgres databases on the same instance is probably detrimental because of shared cache, IO etc. Meanwhile for SQLite you can just drop your single sqlite instance and literally delete the single(!) file.
I am a fan of both DBs, but I don’t understand your statement.
Creating and deleting DBs (or schemas) in both DBs is easy. If DBs reside on the same instance they will share resources; if they don’t, they won’t. So it’s again the same. You can delete “a single(!) database” just as easily as you can delete a file. You can either use DROP DATABASE or do dropdb on the command line.
SQLite has exactly one writer on the DB, SQLite has exactly one file to care about (even with WAL this isn’t that much). So a DROP DATABASE in PG has to care about multiple locks, connections, caches, files, triggers and rollbacks etc. There is even the possibility to have cross DB references.
I do not claim to know exactly why this will be less performant, but I do imagine that a real benchmark will give you some surprising results about something that boils down to “just delete everything”.
I imagine that would work, but the trouble you run into is that the tooling often doesn’t support that kind of additional layer of abstraction. Table names, for instance, can’t be query parameters, so you’re stuck doing your own string munging, escaping, and all that tomfoolery.
You don’t need to parameterize anything, you set the “current schema” to the account’s when you connect to the database and shared schema objects are fully addressed.
Or you set the user’s schema at the front of the search path and then it should be transparent (but you don’t get a clear separation between exclusive and shared resources so it’s probably less safe).
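A small sketch of the search_path approach in Go (tenant schema naming hypothetical): pin a connection to the tenant’s schema plus the shared one, and the rest of the code keeps using unqualified table names.

// Per-tenant schemas in Postgres: set search_path on a dedicated
// connection so unqualified names resolve to the tenant's tables first,
// then fall back to the shared schema.
package tenant

import (
	"context"
	"database/sql"
	"fmt"

	"github.com/lib/pq"
)

func withTenant(ctx context.Context, db *sql.DB, tenant string) (*sql.Conn, error) {
	conn, err := db.Conn(ctx)
	if err != nil {
		return nil, err
	}
	// search_path can't be a bind parameter, so quote the identifier.
	q := fmt.Sprintf(`SET search_path TO %s, shared`,
		pq.QuoteIdentifier("tenant_"+tenant))
	if _, err := conn.ExecContext(ctx, q); err != nil {
		conn.Close()
		return nil, err
	}
	return conn, nil // queries on this conn now see the tenant's schema
}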
Postgres can parameterize table names? That seems weird, normally parameters are for data, not query parts, I thought that distinction is the whole point.
Yea, that would probably work OK. I know the way pgsql handles databases on physical filesystems is…. interesting. It’d make sharding real tough; however, you could potentially work around this using postgres_fdw and some kind of auto-sharding strategy.
Oh gosh my knowledge about this topic is extremely dated, like ~9.3 era. I had to deal with a lot of splitting of where pgsql did writes/reads due to some really large data sets (for the time, ~20TB per host) requiring esoteric disk setups (multiple local filesystems and LUNs across multiple hosts). There was some kind of really bizarre logic you need to think about for what postgres calls a ‘database’ which is more like an entire namespaced installation, I strongly recall there being something strange with how it handles WAL writes for separate databases. There was a situation where we ended up with millions of WAL files and it made both postgres and linux very upset. You can tune your way around this but that comes with trade-offs.
Creating lots of tables/schemas/databases in postgres just leads to some interesting things, like having a lot of files (probably more than you expect). In most situations (let’s say <1000 tables) you won’t ever encounter any of these and can just have a real good time.
The issue with SQLite is that I feel I’d end up re-implementing a database server, which is a whole product by itself. I know about Litestream or LiteFS, but it still comes with the requirement of managing some kind of writer service, plus another service that will load and cache user databases on clients. Also the “small” other things that matter a lot: auth, monitoring, etc.
But maybe at the end of the day, outgrowing Postgres means building your own in-house very specialized database.
Would you still use PG up until you hit those issues, or would you start with something else in situations where you predict future workloads would hit PG limits?
Assuming commercial success is what helps you find PG’s limits, it should fund solutions around them. So build for now, knowing that commercial success may impose new constraints on your system.
Yeah probably. I think one big advantage of Postgres with a new project is that it’s so easy to run a local Postgres instance with almost the same configuration you will have in production. Testing is as simple as running one service versus ending up with complicated scripts to spawn 5 services that are “close enough” to production. But I think doing it all over again I would be more aggressive to getting out of Postgres before we hit the problems. The main challenge is to know what to replace Postgres with.
I’ve generally approached this problem by building on PG but designing my data schema to easily migrate to NoSQL in the future if/when scale justifies a move to ScyllaDB or something similar.
You can get most of the benefits of a NoSQL DB by designing a Postgres schema around the principles of NoSQL (denormalized tables, no joins, query-specific tables with only a primary key index and maybe one secondary index).
Then when it comes time to migrate to NoSQL you can basically copy/paste your schema and queries without having to make significant changes to your data model.
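A sketch of what that NoSQL-shaped Postgres schema might look like (hypothetical chat app): one denormalized table per query pattern, with the primary key matching the access path, much like a Cassandra partition/clustering key.

// Hypothetical "query-first" table: everything one read path needs,
// denormalized, no joins, primary key in the order the query reads it.
package schema

import (
	"database/sql"

	_ "github.com/lib/pq"
)

const messagesByChannel = `
CREATE TABLE messages_by_channel (
    channel_id  bigint      NOT NULL,
    sent_at     timestamptz NOT NULL,
    message_id  bigint      NOT NULL,
    author_name text        NOT NULL, -- denormalized: no join to users
    body        text        NOT NULL,
    PRIMARY KEY (channel_id, sent_at, message_id)
);`

func migrate(db *sql.DB) error {
	_, err := db.Exec(messagesByChannel)
	return err
}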
Not really. You already have an application server involved. Implement your APIs, etc. there, not in the database. Also, I have far more experience dealing with application servers than databases, so the mental overhead for me is lower than doing everything in the DB.
I may be off-base, but I don’t think the article recommends replacing the application server with the database, if such a thing is possible. Rather, it recommends using pg for cache storage, message queues, full-text search, cron jobs, etc… all things you’ll probably need fairly early, and for which the internet recommends a bunch of tooling built for scale, including but not limited to redis, kafka, mongodb, elasticsearch, solr, and so on. All of these have their deployment constraints, differentiated pricing plans, and specific admin specialties, none of which you can afford when going 0 to 1, looking for people who actually want to use and pay for your product.
I’ve found Redis Streams to be a reasonable stand-in for Kafka in architectures where the amount of content in your stream WAL doesn’t grow unbounded (i.e. you can trim it regularly and maintain enough of a playback window).
It’s got very similar semantics to Kafka and allows you to cut one of the (more operationally costly) pieces from that stack if you’ve got the RAM for it.
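A small sketch of that pattern with go-redis (stream, group, and trim length hypothetical): XADD with approximate MAXLEN keeps the playback window bounded, and consumer groups give you Kafka-ish semantics.

// Redis Streams standing in for Kafka: bounded stream via approximate
// MAXLEN trimming, plus a consumer group for competing consumers.
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Producer: append and trim in one shot (~1M entry playback window).
	rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: "events",
		MaxLen: 1_000_000,
		Approx: true,
		Values: map[string]interface{}{"type": "signup", "user": "42"},
	})

	// Consumer group (returns a BUSYGROUP error if it already exists; fine to ignore).
	rdb.XGroupCreateMkStream(ctx, "events", "workers", "0")

	// Consumer: read entries not yet delivered to this group.
	streams, err := rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
		Group:    "workers",
		Consumer: "worker-1",
		Streams:  []string{"events", ">"},
		Count:    10,
	}).Result()
	if err != nil {
		panic(err)
	}
	for _, s := range streams {
		for _, msg := range s.Messages {
			fmt.Println(msg.ID, msg.Values)
			rdb.XAck(ctx, "events", "workers", msg.ID) // mark processed
		}
	}
}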
One thing I’m not clear on with this article talking about SQLite, is SQLite’s rollback journal MVCC (the article would seem to imply no), a WAL (the fact that you can choose between it and WAL mode also seems to imply no to me) or something else?
The rollback journal is something else. You can pick between the rollback journal and the WAL mode. Neither version does MVCC. The unmerged effort to put SQLite on an LSM Tree (https://sqlite.org/src4/doc/trunk/www/lsmusr.wiki) does use MVCC. But it has been basically abandoned.
Isn’t WAL mode necessarily MVCC, even if in a limited capacity? Every writer creates a new version of data in the WAL without touching the older versions in either the database or the WAL. That is why readers can keep on reading.
The rollback journal is a legacy feature that no one should be using anymore. The fact that it’s enabled by default is an unfortunate backwards-compatibility measure.
LMDB has no WAL. Instead it uses MVCC to update the b-tree in place. (Its b-tree is a persistent data structure.) It’s said to outperform SQLite, though it’s a bit hard to compare since LMDB is a lower-level db.
A WAL does accelerate writes, although eventually it has to be merged into the “real” db, which incurs extra reads+writes. WALs seem to slow down reads to some degree, since every page-fetch has to check the WAL first to see if a newer version exists.
since every page-fetch has to check the WAL first to see if a newer version exists.
That’s a fair implication. In my pseudo-code I assumed you could write to the btree in-memory and separately schedule it to be persisted to disk. That only works if your data fits in memory. Otherwise it would be as you say, you need to check the WAL (or some other in-memory summary of unflushed data) on every read.
Is it a typo that you have the 1st iteration listed twice in the second code snippet?
float Q_rsqrt(float number) {
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = *(long *)&y;                        // evil floating point bit level hacking
    i  = 0x5f3759df - (i >> 1);              // what the fuck?
    y  = *(float *)&i;
    y  = y * (threehalfs - (x2 * y * y));    // 1st iteration
    y  = y * (threehalfs - (x2 * y * y));    // 1st iteration
//  y  = y * (threehalfs - (x2 * y * y));    // 2nd iteration, this can be removed
    return y;
}
“Data-Oriented Design” looks good: some of the reviews seem to suggest it’s more aimed at people with a couple of years of experience though, rather than beginners. What do you think?
See also: Enough With All The Raft and the Data Replication Design Spectrum.
I think the biggest challenge with SQLite being used by multiple servers is that even with the various replication tools (Litestream, LiteFS, etc), you still need to mount whatever disk your primary/write database is stored on onto every server. So if you have 20 servers, that disk needs to be mounted 20 times. This pretty much means you have to use NFS, CephFS, or others. At this stage the complexity and suffering you’re introducing is far greater than using e.g. PostgreSQL, even though running PostgreSQL at scale is rather frustrating. I not so fondly remember all the suffering we had to deal with at GitLab when we were still running NFS for our Git storage. Brrr…
Of course if you only have a single server I think SQLite makes perfect sense, and is probably a far better choice than the alternatives.
What? The whole point of LiteFS is that it’s NOT NFS, it’s its own FUSE FS. LiteFS. LiteFS copies the stuff from the primary to the secondary, it’s like other databases that have replication, you don’t use Ceph to replicate Postgres to Postgres, you use Postgres’s network replication protocol. The SQLite docs heavily discourage using NFS.
You still need to access the primary for writing, which means you still need to mount the disk everywhere. Or does LiteFS do some sort of multi master setup where it replicates to other primaries?
Don’t mount the disks everywhere, add an RPC request endpoint on the primary so that replicas can ask it to do whatever writes they need. Just like with Postgres, you can’t write to a replica. Unlike Postgres you have only one path into the database, through your RPC layer, without a lower-level distributed database access protocol. (Or just yolo it by invoking the sqlite command on the primary over ssh.)
You can write to replicas in Postgres, just not with physical replication.
OK, but writes to Postgres replicas are not propagated to the primary or elsewhere unless you have extra machinery, in which case it’s probably easier to set up machinery to write to the primary, instead of getting into a much more complicated distributed consensus problem. (Those who have a large enough system that writes to the primary are a bottleneck will have different tradeoffs.)
Postgres supports bidirectional logical replication now fwiw.
https://www.highgo.ca/2023/12/18/new-in-postgresql-16-bi-directional-logical-replication/#:~:text=In%20short%2C%20Bi%2DDirectional%20Logical,be%20streamed%20to%20one%20another.
Oh, cool, thanks for pointing that out!
I gather that the filtering feature mentioned in your link allows a replica to avoid re-propagating changes that it received from the primary (aka loopback prevention).
But conflict handling isn’t part of core Postgres yet? As in this configuration option for the third-party pglogical extension.
It sounds to me like anyone deploying this feature would need to do some application-level partitioning to avoid conflicts.
That seems right to me.
(I am on the team that built/maintains pglogical but I’m still new.)
This post caught my eye because we’re considering sqlite for lobsters, but this is not really a compelling set of points.
I’m not well-informed enough to agree or disagree with this post, just frustrated that it doesn’t make the best case it could.
Re: migrations: SQLite doesn’t support some commonplace
ALTER TABLEbehavior. Instead, you have to:NOT NULLcolumn without default, a constraint, whateverFor large tables this will lock the database for a while as the data copies over.
Additionally, this is an error-prone process. Do this enough times and eventually you might get the steps wrong. Even if you don’t, I had an ORM delete data on a small project when it executed statements in an unintuitive, undocumented, and nondeterministic order, running the
DROPbefore running the statement disabling foreign keys. The ORM’s built-in schema commands weren’t susceptible to this issue, but SQLite required me to use a lower-level approach.I’m very happy with the solution I built for this problem, which I’ve been using for several years: https://sqlite-utils.datasette.io/en/stable/cli.html#transforming-tables
Just so happens that there’s a “migrations” focused post also on Lobsters today.
To me this does not sound much less nutso than mysql online schema change machinery or vitess or whatever. Postgres is more better about this, sure. But sqlite is not bad, all the DDL is fully supported in transactions and it has a handy slot for storing your schema version number. If you’re using an ORM that is shredding your data, like, don’t use that software, but it’s not SQLite’s fault.
Yeah I agree this post does not make a compelling case. I have my own reasons I don’t love SQLite (really limited builtin methods and a too-flexible grammar) and I work for a Postgres company but from reading this article alone I can’t see a coherent reason not to use SQLite on the server.
Did you miss the part about need to share a disk if you want more than one application server? Granted, not all applications need more than one copy running…
I guess the other option is that each application server has their own sqlite and you use RAFT or something to drive consensus between them… which maybe works if you can have at least 3 of these servers, at the expense of extreme complexity. :)
Surely that’s the important point? SQLite is great when everything fits on a single machine, which is really the case for the vast majority of projects [citation needed].
When something is big enough to need multiple machines then it’s time to switch to a separate storage layer running PostgreSQL.
Isn’t this all just a case of picking the right tool for your situation?
This is the point of LiteFS - it replicates writes to other boxes - no NFS needed (and iirc locks on nfs were unreliable so you probably dont want to run the database file on nfs).
You DO need to take care to send writes to the leader but Fly (where the author works), makes this pretty easy. LiteStream (same author) is similar but is more of a continuous back up tool.
Right. A thing you can do is make your architecture more complicated by introducing LiteFS, or LiteStream, but it’s no longer “just sqlite” at that point.
I’ve been running Litestream for a while now and it’s VERY straight-forward. You start the process and point it at a SQLite database, it inexpensively streams WAL blocks to S3 or similar, you can use those for recovery.
It’s so simple, robust and cheap that I see it as a selling point for SQLite as a whole - much less complex/expensive than achieving point-in-time replica backups with PostgreSQL or MySQL.
Yeah but again, these work with only a single instance at a time. If that’s your setup it’s great, but many people need multi-instance setups and mounting a shared SQLite file on NFS in some cloud provider is not a great idea.
But it isn’t an NFS in the cloud, it replicates the WAL which is then made available, yes via a fuse file system but it’s not NFS, it’s not really any different that read replicas for Postgres or MySQL and it seems to be a lot easier to get going. You DO have HA on your DB don’t you?
If you don’t think it works for you, that’s fine, but I think you should understand how it works before making that decision.
Yeah but the point is that a cloud vendor DB gives me read replicas out of the box, I don’t need to set it up manually. Added to the other points, this makes it kinda hard to argue against PostgreSQL for a typical backend app.
The same is as true with sqlite as with other dbs. With Turso/libsql or litefs you can have a read replica materialized on the local nvme ssd of every application server. I’m not arguing SQLite is a better option for every app like everything in engineering the answer is “it depends”. But it feels unfair to dismiss SQLite based architecture because the sqlite3.c source file doesn’t directly automate some need and dismiss the sqlite vendors, and then compare that to Postgres + unlimited ancillary vendor support & automation.
Turso is one vendor. Cloudflare is another but its SQLite offering has a fairly low max database size. Meanwhile, most cloud providers have mature support for MySQL/Postgres. If the cloud provider situation changes for SQLite in the future and it is offered more widely, then the analysis changes, but until then I believe it’s fair. And so do the folks who actually make SQLite.
Sure… if you’re using AWS, Azure or Google… and yeah, even fly.io (though theirs is NOT managed) does too, but really, it’s run a single executable in the background - replay or routing writes to the master is more work of course, but it’s still pretty easy. And given the stance of the US govt lately towards my country, I am likely NOT to use any of those, preferring to host domestically (sorry to bring politics into this - but everything is a trade off for making technical decisions, including that) .
I once set up HA Postgres on Linode (floating ip,NBD and the watchdog) and it worked, but was cargo culted and seemed very fragile in it’s complexity. LiteFS in comparison seems clean and elegant and would be very robust against SPOF and as a benefit move the db reads to the edge, nearer to my hypothetical users.
But like I said, I am not running anything using it at the moment. The one project I AM considering doing would use Postgres for other reasons.
Right, if you want multi-instance setups you need LiteFS, not Litestream.
To be clear. I’ve not used it but from all appearances, it looks like it adds very little overhead or complexity.
I think there might be a few cases in which SQLite forces one to use multiple queries where PostgreSQL wouldn’t, but it doesn’t usually, does it?
From that linked page:
“the situation at hand” could include “I want to leave open the option of replacing the RDBMS”.
Well, the article is pretty clear in what it’s trying to say, quote:
Which is saying, many choose to use sqlite because they don’t need to run anything extra - they just need to link their code with sqlite and it works. And trying to use sqlite on server-side negates that benefit. So, if that’s not why you want to use sqlite to begin with, this point doesn’t affect you at all.
There is no date on this web page, but it doesn’t cite any prior art or current art, like
https://research.google/pubs/sql-has-problems-we-can-fix-them-pipe-syntax-in-sql/
https://www.scattered-thoughts.net/writing/against-sql/ - talks about compositionality
https://www.edgedb.com/blog/we-can-do-better-than-sql (2019)
Lots of people are working on this, and from skimming, it doesn’t seem like the author is aware of them
I think it was published yesterday.
Dont forget The Third Manifesto https://wiki.c2.com/?TheThirdManifesto
The RSS feed has an article publish date. But yeah, the page itself should really be dated.
imo, nothing implies the author is unaware of them. This is a contribution to the conversation of solving these problems, like those are. They explore similar ideas but the details and some motivations are fairly different.
I’m curious what your impressions/summary of such efforts are (hoping you’re conversant in current art.)
Tangentially, are there any attempts to push SQL closer to relational calculus/datalog?
I don’t have a real informed opinion, but I found it interesting that the pipe syntax was tried in sqlite:
https://news.ycombinator.com/item?id=41347188
That is generally how things move forward – I think Google actually deployed this, and then another implementation copies it
(or doesn’t in this case – I think he said he would wait until its in the SQL standard)
But having something that’s actually implementable is a very good sign.
yup, the paper says
I did not know they all shared the same dialect!
No one shared my favorite resource on the topic: https://use-the-index-luke.com/no-offset.
I remember reading that page quite some time ago (lots of good stuff on that website!), it got me to do some of this style in my framework but yeah, the UI just wasn’t very good so I put it back on the backburner. Let’s look at the ui advice there:
…yeah im not a fan of infinite scroll, that’s basically the worst of all options. But I do try to think about how to make it less bad. I still think some kind of two-piece scrollbar is the way to do it. Inspired by the focus knobs on a microscope, the one is a coarse selector which gives you the “page number”, then the other one is the fine adjustment, scroll inside that page. But, what, exactly, is the page number widget? What happens when you get to the edges of a page?
The one nice thing about infinite scroll (when it isn’t manipulative anyway) is reducing discontinuities - you can just keep hitting the same button over and over again and keep going, without having a case where two adjacent items cannot be seen together.
(actually, my favorite pagination scheme is “none”. Just dump it all on me to the limits of that being possible. Then I can scroll it, jump around it, ctrl+f search it, etc. But when there are too many things, the scrollbar becomes unusable because the handle is just too small and the slightest one-pixel movement can jump hundreds of items… so that’s really the degenerate case I’m hoping to avoid. Lazy loading things as you scroll can be an implementation detail if the UI works right.)
Anyway, my first thought for the coarse selector was that it represented a range. So, for example, it goes from the time of the first message to the present and you’d select any time range in there, say, the last two days to now. Then it loads that up with `select stuff from posts where date >= 48 hours ago and date < now`, and that populates the fine control scroll bar. The problem is there are some days when nobody posts and other days when everybody posts. So who knows what you get back.

But this keyset cursor does provide another idea: the coarse control still shows the first and last entries to set its boundary range, but the size of the handle is determined by the lookup. Not `date >= 48 hours ago and date < now`, but instead `where date >= 48 hours ago LIMIT 500` (or whatever number is reasonable to still be usable in your fine control scroll bar. It might even change by default with the size of the window: taller window = load more. BTW again, you might not even actually load it, just pull the index so you know how many items are there and construct the scrollbar sizes, then lazy load the details as they scroll into view.) So now the coarse scrollbar’s handle size changes to indicate where the last item actually was. If the last item is at the present, it fills up to the present. If yesterday was talkative and there were thousands of posts, it shows a little sliver to indicate it covers only a short period of time.
(BTW I think all scrollbars ought to have some extra controls too, like the ability to type in an input instead of just drag and drop to it. So assume that’s part of the widget too.)
What happens when you scroll to the edges of the fine control though? We don’t really want to make the user go over to the coarse bar and have to drag it. No, ideally, their attempt to scroll up again past the top of this will automatically pull the “previous” keyset cursor, adjusting the coarse control for you in that single action… and then it can keep some of the stuff on screen at the time as part of this page, or maybe the previous cursor has some deliberate overlap or something, but I guess that isn’t strictly necessary, since you should be able to just hit scroll down to go back to where you were anyway. The fine scroll bar would then go from the top back to the bottom, indicating you’ve changed pages.

Not drag friendly though: when mouse dragging, you should NEVER change pages. You must stop dragging and hit the up arrow to trigger the page change. This avoids the scrollbar jumping position in the new page, where your next drag would disorient you. (Unless you were to warp the mouse along with the scrollbar, but… probably don’t do that, that sounds wrong to me. And browsers can’t do that anyway.) So you get to the top and the “more” thing appears right there for you to tap/click, and the traditional up arrow button on the scrollbar, up key, mouse wheel, etc., can do it too. As long as it is an explicit, clearly intended action. Ditto on down for the next page.
The coarse control btw might not be a bar per se, maybe it can expand into a calendar or something, but I do think a bar of some sort is a good indicator. Depending on what column you sort on, it might be a timeline, or an alphabet, or a number line, whatever. It might have its own zoom function or multi level control if it spans too much area to be easily used on screen.
…I think this is something workable. The coarse bar has a thick line indicating that it is the manipulable thing that is your keyset cursor, then a shade indicating the direction it is loading and how much it managed to cover inside the limit… I gotta get implementing to try it for real now lol
While nearly every database is unique and worth looking at, if I were doing this I’d probably pick one from each of a few major categories.
As the author of the post - I completely agree! My recommendations are mostly limited by exposure - I have mostly worked with RDBMSes of various flavours for years, running them at various levels of scale, along with Kafka and boring KV stores.
I feel like streaming is enough of its own thing to warrant its own post - I find it hard to talk about things like Kafka without talking about the wider stream processing ecosystem (Benthos/Bento, Flink, Materialize, etc).
It’s interesting that you group them like this - I wouldn’t group SQLite and PostgreSQL together due to different operating models, but that’s probably more due to personal biases around operational concerns.
While we’re here, are you aware of an embedded document DB? like a sqlite for JSON, with a decent query syntax?
e.g. I would love something small and self-contained that can be queried with an extended SQL syntax: https://learn.microsoft.com/en-us/azure/cosmos-db/gen-ai/full-text-search#fulltextcontains
The two I know of are fairly recent: PoloDB and SurrealDB. PoloDB is the more MongoDB-like of the two.
Polo looks very much like what I had in mind! Thank you
Lucene? Tantivy? Neither has a SQL frontend, granted, but you could do this.
Something like Tinybase?
https://tinybase.org/#build-complex-queries-with-tinyql
If you are using Go, you can take a look to boltdb/bbolt or badger
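Depending on how far “extended SQL” needs to go, SQLite itself plus its JSON1 functions gets surprisingly close to an embedded document DB. A rough sketch (table and field names invented; the `->>` operator needs SQLite 3.38+, older versions would use `json_extract`):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "docs.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One TEXT column holding a JSON document per row.
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, body TEXT)`); err != nil {
		log.Fatal(err)
	}
	// json() validates and minifies the document on the way in.
	if _, err := db.Exec(`INSERT INTO docs (body) VALUES (json(?))`,
		`{"name":"ada","tags":["math","computing"]}`); err != nil {
		log.Fatal(err)
	}

	// Query inside the document with JSON path expressions.
	rows, err := db.Query(`SELECT body ->> '$.name' FROM docs WHERE body ->> '$.name' = 'ada'`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			log.Fatal(err)
		}
		fmt.Println(name)
	}
}
```

The full-text piece (the FullTextContains-style use case) would come from SQLite’s FTS5 module rather than JSON1, so it’s two features glued together rather than one unified query language.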
So if I understand this right, the intended usage is that you model the expected behavior, you instrument your real code to produce the traces and then you deploy it or fuzz it to get a lot of traces you can check for linearizability violations?
In practice what’s the easiest way to capture the traces? Just log some JSON? Derive it from otel spans?
You do need to ensure that whatever system records the history of events is itself Linearizable (hand-waving a bit here; you can actually get away with slightly weaker properties). That could be a single-node program which records invocations and completions in a Linearizable in-memory structure. Or you could journal them (synchronously!) to a Linearizable database before invocation and after completion, in production.
What you can’t do is, say, stamp log lines or OTEL events with local node clocks–those aren’t necessarily synchronized, and the checker might tell you your program violated Linearizability when it didn’t. You also can’t send them to a database asynchronously, because the message recording the invocation might be delayed and arrive at the database after the operation was actually performed. You also can’t use a Serializable or Causally consistent database, because it might fail to record your events in real-time order. A Sequential log is OK, but the scope of the Sequential object has to be the entire log; otherwise you could (you guessed it!) get reorderings.
Another thing to be aware of is that if you log to the same datastore that’s under test, you could influence the results. For example, if your data store has an issue with stale reads in read-only transactions, and you add an “insert log line” between every read transaction down the same connection, you could see the bug become invisible.
In practice is it enough to synchronize the clocks using something like ntp?
I mean… no? Imagine you detected a Linearizability violation in a history recorded using timestamps on, say, Linux machines. From this you could conclude that either a.) the system being measured violated Linearizability, or b.) the history-collecting system had, at some time, poor synchronization; or the VM paused; or the OS paused; or the application paused; and so on. It doesn’t allow you to make a provable claim of correctness or incorrectness.
Maybe?
If you’re talking about something like EC2 timesync, where clock errors are ~50us, and your storage system is remote and best-case a couple hundred microseconds, and you’re careful about which end of the clock bound to pick, you can still get good results.
NTP in general, probably not, unless you’re very careful about the way you set it up and have great time infrastructure.
would it work to use Lamport clocks? or stamp messages with the ones they’ve seen and then topologically sort (functionally the same thing but different)?
Sort of. You can use messages between nodes and a strategy like Lamport clocks to establish a partial order which is consistent (because messages always flow forwards in time) with the real-time order. However, you’d fail to detect violations of Linearizability over timescales shorter than the message propagation interval.
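For concreteness, here’s the minimal shape of that stamping, as a sketch (not tied to any particular system):

```go
package clock

import "sync"

// LamportClock produces timestamps that respect message causality:
// if A sends to B, every later event on B gets a larger timestamp.
type LamportClock struct {
	mu sync.Mutex
	t  uint64
}

// Tick is called for each local event, including sending a message
// (the message carries the returned timestamp).
func (c *LamportClock) Tick() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.t++
	return c.t
}

// Observe is called on receipt of a message stamped with the sender's clock.
func (c *LamportClock) Observe(remote uint64) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if remote > c.t {
		c.t = remote
	}
	c.t++
	return c.t
}
```

The catch, as described above, is that two operations with no message path between them get arbitrary relative timestamps, so any Linearizability violation confined to that window is invisible to the checker.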
hrm—forgive me if these are stupid questions, but
wouldn’t any really linearisable logging scheme induce synchronisation between nodes, potentially also causing you to fail to detect some bugs? are there bugs that are still systematically more likely to be caught or can only be caught that way? (it seems not very nice because it sacrifices scalability)
IIUC, the difference between sequential consistency and linearisability is effectively that linearisability is consistent with communication ‘outside the system’ where seqcst doesn’t have to be. and lamport clocks or causal ordering should not miss sequential consistency violations. but if there are any classes of ‘external communication’ that we’re interested in in some context, couldn’t we try to bring those into the testing/logging system too? (i think this is similar to using a linearisable logging system, except that you don’t record all the ordering relationships—so you don’t get to full linearisability—only some of them, but what you do record you get in a distributed way)
Yep! It’s actually impossible to measure precisely. Jepsen does this by putting all clients on a single node, allowing fast connections between clients and servers, and (optionally) adding latency between servers; this lets it see errors “faster than light”, so to speak.
In general, I wouldn’t worry about scalability. Linearizability checking is NP-hard and relies on total histories–both of those are already bad for scalability. Even if that weren’t the case, you’ll get the most bang for your buck testing in a controlled environment where you can induce failures, rather than in production, where failures are (I hope!?) comparatively infrequent.
I’m not sure what this means exactly, but I should note that Causal is, in general, weaker than Sequential.
Sure, you could do that, but then the property you’re measuring is something like Causal or Sequential, rather than Linearizable.
Yes you’ve got it! If the project is in Go you could just write the fuzzer and test harness in Go. Otherwise yes you’d want to write some intermediate format from your system and then write a Go program to convert it for Porcupine.
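For anyone curious what that Porcupine input looks like, here’s roughly the shape of it, adapted from the single-register model in Porcupine’s README (the timestamps here are invented for illustration):

```go
package main

import (
	"fmt"

	"github.com/anishathalye/porcupine"
)

type registerInput struct {
	isRead bool
	value  int
}

func main() {
	// A sequential model of a single int register.
	model := porcupine.Model{
		Init: func() interface{} { return 0 },
		Step: func(state, input, output interface{}) (bool, interface{}) {
			in := input.(registerInput)
			if !in.isRead {
				return true, in.value // writes always succeed
			}
			return output.(int) == state.(int), state // reads must return current state
		},
	}

	// Each operation records invocation (Call) and completion (Return) times
	// taken from the (linearizable!) history recorder discussed above.
	history := []porcupine.Operation{
		{ClientId: 0, Input: registerInput{false, 100}, Call: 0, Output: 0, Return: 100},
		{ClientId: 1, Input: registerInput{true, 0}, Call: 25, Output: 100, Return: 75},
	}

	fmt.Println(porcupine.CheckOperations(model, history)) // true if linearizable
}
```

The conversion step mentioned above is just getting your traces into that `[]porcupine.Operation` shape, whatever intermediate format they pass through.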
See also Justin Jaffray’s post on the topic: https://justinjaffray.com/the-halloween-problem/.
Another good recent paper on the topic is BtrBlocks: https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf.
We added it in 2013. It lasted about three months (code). (A restricted version was proposed but didn’t get much interest.) In short, it was used as “I disagree” or “I dislike”, it didn’t improve anything. The “Hide” feature has worked out well. I’d be open to adding it back if there’s something different about it or the circumstances that seems like it’ll work out better, but so far this is pretty much exactly what we tried before.
I’ve been kicking around a similar idea with similar goals: remove the -1 to score from flagging, to disincentivize people using flags to knock down things they’re not interested in or disagree with. As jcs predicted in that second thread, it’s still a problem and I occasionally DM people who make a habit of it to ask them to knock it off. Maybe twice it’s been a user following another around the site and flagging all their stuff, but it’s worth noting flags are infrequent and visible to mods, so that tends to pop out to us. As a second part of the change, make hiding a story give the score a -1 if the user also does not comment or vote on any of its comments. I’m trying to thread the needle so that there isn’t a value to misusing flags. I think this would reward the right things and would appreciate criticism.
Keeping the -1 makes a lot of sense, but I do like the requirement to comment on the story if you’re also flagging it.
Now of course, the flipside of this is that–as somebody who generally comments and explains my flags–the community needs to quit bitching every time somebody actually attempts to flag and explain content that’s bad or off-topic. The amount of “you’re not my real mod” I’ve seen occur is incredibly tiresome.
I deployed this change, which I developed on stream.
People use Hide so much that the database posts I normally submit are now sitting at 0 or going negative. I think they’re good posts, but now my submission history looks like I’m submitting garbage.

This completely disincentivizes me from posting anything, since of course people with diverse interests here will want to hide things they aren’t interested in. I use Hide to remove stuff I just don’t personally want to see; I use flagging when something is low effort. I don’t want my own hiding of something to personally penalize the post either.
I truly appreciate your efforts with the community but I am not going to submit posts anymore if/while Hiding a post counts as a -1.
First of all, I have to say that I appreciate the efforts as well! I actually thought it was a great idea, but seeing this change in action made me realize that it was wrong (or at least that it needs more tuning).
A lot of the value I find here lies in the wide diversity of topics. Frequently, the stories I learn the most from do not score very well (around 10 or less). Of course, many of the posts I submit also do not score very well.
I think it’s OK and valuable to have such posts around. Some people get value from them and they disappear quickly from the first page anyway.
With this change, as @eatonphil says, many of these stories will not be submitted anymore. And Lobsters is going to look a lot more like a mono-culture.
I believe that part of the problem is that the upvotes are too random to represent anything significant (because we often forget or whatever), while hiding a story is never random.
I actually submitted a bug on GH for this because I was seeing so many on-topic submissions having zero or negative scores. Turns out it was Working As Designed.
I’m not a fan of this change either, at least not right now. It might shake out to be ok once people learn that hitting “Hide” is now a downvote. But I don’t think it works semantically. I don’t filter anything, because I like seeing what’s on the site. But I do get a bit tired when Rust nerds (for example) get into the weeds about something they really enjoy, and I use the “hide” feature to remove that particular source of comments from the /comments feed. I don’t have anything against Rust, it’s often on-topic and good quality, it’s just not something I’m always interested in.
Specifically, I feel that the action of a small number of people (the ones who are not interested in certain topics and are not filtering) has an outsized effect on the margins for new submissions. A few grumpy users who might expect a flood of comments they’re not interested in can keep an on-topic and otherwise high-quality submission from the front page.
@pushcx, has the number of hides on new submissions increased since the change? It’s not a factor I’ve made any note of before, and it’s also not reflected in the JSON output.
I can’t answer, the database didn’t timestamp the record that a user had hidden a story. I’ve added it but there’s no historical data to compare against.
So I saw that story hiding was pretty infrequent, but I didn’t realize that people are often hiding stories within minutes of them appearing, before anyone has upvoted. In talking with these users, it’s most commonly because it’s a way they can mark the last time they read /newest. (I’ll add a line for this, similar to hckrnews.com.) But it wasn’t at all the only answer and I’m convinced that hide just has too many other uses and other reasons for this change to work.
It’s also been uncommon for stories to hit zero or negative on Lobsters for a long time. I also shared the unexpectedly strong emotional reaction to seeing /newest with several grayed-out stories; there’s an “oh no” in the pit of my stomach. I thought of it as making scores a little more dynamic but underestimated the effect of the difference. I’m sorry to anyone who shared this feeling, especially submitters.
I’m backing out the change to try a sort of middle position, so that there’s a -1 to score if a user both flags and hides the story. I’m still hoping to reduce the impulse to use arbitrary flags as a ‘disagree and punish’ button. (Me individually DMing users who do this is unsustainable, especially as it’s now rare for someone to do it often, so the situation is ambiguous and the conversation is a delicate and disagreeable one.) To look at it from the perspective before hiding affected scores: flags only reduce a story’s score if the flagger is also so done with the story that they don’t want to see it again. The code is here and I’m running `update_score_and_recalculate!` on all stories submitted or hidden since Monday to revert the effects. I appreciate folks giving it a couple days, and thank you for all the feedback and ideas for future improvements. I’m sorry this change was so unpleasant, and hopefully future mistakes continue to be small and survivable.
(ping to @eatonphil and @jmiven who replied upstream)
Appreciate the consideration, pushcx! I was very worried I’d have to find some new community to discuss all the various systems programming topics we do here. I’m relieved to need not. :)
Thank you!
In retrospect it’s obvious. But I too did have a tunnel vision when you talked about it, and only considered the “hiding by disagreement” side of this. Even though I did hide things previously as I don’t like to filter full tags.
Is this to mean “punishing” the karma of users who hide stories that don’t interest them, if they don’t comment or vote?
Sorry, I tried to cram a lot in that sentence. I mean: hiding a story gives it a -1, but only if the hiding user also hasn’t commented or voted in the thread. Sometimes people get angry in an argument and reach for any site feature that might punish the person they disagree with. So hiding would only reduce the story’s score if the hider really just didn’t want to see the story.
Thank you. That makes complete sense to me (as does your other comment).
Given that we can only hide stories but it might become a moderator signal, is there a world where the similar logic applies to comments? A harsher alternative to simply hiding the conversation, I guess.
http://muratbuffalo.blogspot.com/
Aleksey Charapko also often publishes the summaries from his distsys reading club: https://charap.co/.
Also: https://dbmsmusings.blogspot.com/, https://brooker.co.za/blog/, and https://distributed-computing-musings.com/.
I hate these because at the end of the day it’s always a case of “it depends”. At small or medium scale, most of these things work more than well enough, but once you end up using Postgres as a key-value store with a high update rate, I’ve seen it crumble.
Need to delete a user’s millions of rows inside a table with billions of rows? Good luck.
Need to update a very busy table (Add/drop constraints or columns)? Just forget it.
The query planner made the decision to choose a plan that is better in 99% of cases but has catastrophic performance in the other 1%? Be ready to be told it’s your fault and the query planner is smarter than you.
Try being clever and use a conditional index where the indexed rows should be empty most of the time, but in case of failure they might start piling up to millions of rows, and then the planner still thinks it can do a full index scan because its stats say the index really has a single row in it, and then all queries fail with timeouts? Don’t be clever.
Anyway, Postgres is a wonderful piece of technology, but I’ve stopped counting how much time I’ve lost so far trying to use it for everything.
I’ll stop here. I’m sure someone could come in and tell me how to solve these problems, but I believe from my research that none of them have an obvious solution, and they always happen at the worst time.
PS: I’m still looking for something like Postgres where I could create a “1 table per user” database with millions of users, where dropping these tables would be instant.
Ready for an insane take? SQLite works really well for this, where you create 1 entire database per user. You can cram them anywhere, S3, on some disk, on whatever storage, and make it performant enough.
I work with an application which is SQLite all the way down and it’s remarkably performant. It has several layers of ‘metadata databases’ that go 2-3 layers deep.
You could absolutely do ‘postgres database as metadata store for sqlite files,’ for a ‘have your cake and eat it too.’
You could take this a step further and do something like just store XML/JSON in an S3 bucket with an object-per-user if you want something completely denormalized. Could also use Parquet, Protobuf-encoded messages, or something like that if you want to be hip. ;)
Yeah, this is basically how we solved this when scaling the AT Proto PDS to hundreds of thousands of users per instance and millions of users across instances.
One SQLite per user, only ever one writer to the DB, keep recently used DBs open in a LRU cache, add one or two metadata SQLites for the whole instance for things like global event sequencing, user registry, etc.
It works great and feels like a “lite” form of ScyllaDB and/or other Cassandra-style NoSQL DBs.
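A sketch of the DB-handle LRU described above; the library choices, path scheme, and numbers are mine, not necessarily what the PDS actually does:

```go
package storage

import (
	"database/sql"
	"fmt"

	lru "github.com/hashicorp/golang-lru/v2"
	_ "github.com/mattn/go-sqlite3"
)

type userDBs struct {
	cache *lru.Cache[string, *sql.DB]
}

func newUserDBs(maxOpen int) (*userDBs, error) {
	// Evicted handles are closed; they'll simply be reopened on next use.
	c, err := lru.NewWithEvict(maxOpen, func(_ string, db *sql.DB) {
		db.Close()
	})
	if err != nil {
		return nil, err
	}
	return &userDBs{cache: c}, nil
}

// get returns the open handle for a user's DB, opening it on a cache miss.
// userID must be validated elsewhere before being used in a file path.
func (u *userDBs) get(userID string) (*sql.DB, error) {
	if db, ok := u.cache.Get(userID); ok {
		return db, nil
	}
	// One file per user, WAL mode (mattn's driver reads params after "?").
	db, err := sql.Open("sqlite3", fmt.Sprintf("data/%s.db?_journal_mode=WAL", userID))
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(1) // enforce the one-writer discipline per DB
	u.cache.Add(userID, db)
	return db, nil
}
```

The `SetMaxOpenConns(1)` is the cheap way to get “only ever one writer to the DB” inside a single process; serving the same user from two processes would need coordination above this layer.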
How are the compute and data planes linked? How do the instances get at the SQLite databases? How are they persisted? Are instances stateless or sharded with a single reader/writer? How are hot keys (users) handled?
I’m so curious how big SQLite installs work!
Each PDS instance is just one big baremetal box from OVH with redundant local SSDs. The instances back up their data directory regularly (including all the SQLite DBs) via RClone and the event sequencer shared SQLite is litestreamed to S3 for backup. We can recreate state missed between backups via replaying the event stream.
Hot keys aren’t much of a problem, write throughput on each PDS is somewhere like 2-10/sec across all users on a PDS. Each SQLite can handle up to 1K writes/sec with a single writer, so the contention there ends up shifting to the shared event stream DB.
We rate-limit writes per user to reasonable per-second, per-hour, and per-day limits to prevent hot shards as well (something like 15k writes/day per user is a reasonable maximum for our app atm).
Each SQLite is single-writer but could be multi-reader depending on read query throughput.
You can find the source for the PDS here - https://github.com/bluesky-social/atproto/tree/main/packages/pds
The key insight you have there though is “1 database per user” and I think you could apply that same principle if you wanted to without the cost of leaving Postgres by giving each user their own schema, no?
It is probably easier to scale up more SQLite instances than postgres. Simply creating more postgres databases on the same instance is probably detrimental because of shared cache, IO etc. Meanwhile for SQLite you can just drop your single sqlite instance and literally delete the single(!) file.
I am a fan of both DBs, but I don’t understand your statement.
Creating and deleting DBs (or schemas) in both DBs is easy. If DBs reside on the same instance they will share resources; if they don’t, they won’t. So it’s again the same. You can delete “a single(!) database” just as easily as you can delete a file. You can either use `DROP DATABASE` or run `dropdb` on the command line.

SQLite has exactly one writer on the DB and exactly one file to care about (even with WAL this isn’t much more). So a `DROP DATABASE` in PG has to care about multiple locks, connections, caches, files, triggers, rollbacks, etc. There is even the possibility of cross-DB references.
I do not claim to know exactly why this will be less performant, but I do imagine that a real benchmark will give you some surprising results about something that boils down to “just delete everything”.
People have used PostgreSQL schemas for multitenancy in the past, and my current understanding is that it doesn’t scale.
I imagine that would work, but the trouble you run into is that the tooling often doesn’t support that kind of additional layer of abstraction. Table names, for instance, can’t be query parameters, so you’re stuck doing your own string munging, escaping, and all that tomfoolery.
You don’t need to parameterize anything, you set the “current schema” to the account’s when you connect to the database and shared schema objects are fully addressed.
Or you set the user’s schema at the front of the search path and then it should be transparent (but you don’t get a clear separation between exclusive and shared resources so it’s probably less safe).
It isn’t, because you have row based security policies on the shared tables.
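For what it’s worth, a sketch of the search_path variant, pinning a pooled connection so the setting can’t leak to other requests. Schema naming is hypothetical, and lib/pq’s QuoteIdentifier stands in for the string munging, since search_path can’t be a query parameter:

```go
package multitenant

import (
	"context"
	"database/sql"
	"fmt"

	"github.com/lib/pq"
)

// withTenant runs fn on a connection whose search_path puts the tenant's
// schema first and the shared schema second.
func withTenant(ctx context.Context, db *sql.DB, tenant string, fn func(*sql.Conn) error) error {
	conn, err := db.Conn(ctx) // pin one connection: SET is per-session
	if err != nil {
		return err
	}
	defer conn.Close()

	q := fmt.Sprintf("SET search_path TO %s, public", pq.QuoteIdentifier("tenant_"+tenant))
	if _, err := conn.ExecContext(ctx, q); err != nil {
		return err
	}
	// Reset before the connection returns to the pool.
	defer conn.ExecContext(ctx, "SET search_path TO DEFAULT")
	return fn(conn)
}
```

The pinned connection is the important part: with a pooled `*sql.DB`, a plain `db.Exec("SET ...")` would set the schema on whichever connection it happened to grab.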
Postgres can parameterize table names? That seems weird; normally parameters are for data, not query parts. I thought that distinction was the whole point.
Yeah, that would probably work OK. I know the way pgsql handles databases on physical filesystems is… interesting. It’d make sharding real tough, however you could potentially work around this using pg_fdw and some kind of auto-sharding strategy.
What do you mean by that, and by “physical filesystems”?
Oh gosh my knowledge about this topic is extremely dated, like ~9.3 era. I had to deal with a lot of splitting of where pgsql did writes/reads due to some really large data sets (for the time, ~20TB per host) requiring esoteric disk setups (multiple local filesystems and LUNs across multiple hosts). There was some kind of really bizarre logic you need to think about for what postgres calls a ‘database’ which is more like an entire namespaced installation, I strongly recall there being something strange with how it handles WAL writes for separate databases. There was a situation where we ended up with millions of WAL files and it made both postgres and linux very upset. You can tune your way around this but that comes with trade-offs.
Creating lots of tables/schemas/databases in Postgres just leads to some interesting things, like having a lot of files (probably more than you expect). In most situations (let’s say <1000 tables) you won’t ever encounter any of these and can just have a real good time.
Tables are still their own files. The WAL is split into segment files but it’s one WAL for the whole instance.
The issue with SQLite is that I feel I’d end up re-implementing a database server, which is a whole product by itself. I know about Litestream and LiteFS, but they still come with the requirement of managing some kind of writer service, plus another service that will load and cache user databases on clients. Also the “small” other things that matter a lot: auth, monitoring, etc.
But maybe at the end of the day, outgrowing Postgres means building your own in-house very specialized database.
Have done this, can vouch that it works super well.
Or even recutils :)
Would you still use PG up until you hit those issues, or would you start with something else in situations where you predict future workloads would hit PG limits?
Assuming commercial success is what helps you find PG’s limits, it should fund solutions around them. So build for now, knowing that commercial success may impose new constraints on your system.
Use Postgres.
Yeah, probably. I think one big advantage of Postgres with a new project is that it’s so easy to run a local Postgres instance with almost the same configuration you will have in production. Testing is as simple as running one service, versus ending up with complicated scripts to spawn 5 services that are “close enough” to production. But doing it all over again, I would be more aggressive about getting out of Postgres before we hit the problems. The main challenge is knowing what to replace Postgres with.
I’ve generally approached this problem by building on PG but designing my data schema to easily migrate to NoSQL in the future if/when scale justifies a move to ScyllaDB or something similar.
You can get most of the benefits of a NoSQL DB by designing a Postgres schema around the principles of NoSQL (denormalized tables, no joins, query-specific tables with only a primary key index and maybe one secondary index).
Then when it comes time to migrate to NoSQL you can basically copy/paste your schema and queries without having to make significant changes to your data model.
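To make that concrete, here’s a sketch of what one of those query-specific tables might look like; all names are invented for illustration:

```go
package schema

// A Cassandra-style, query-first table in Postgres: one table per access
// pattern, denormalized payload, and a primary key matching the
// partition/clustering key you'd use in NoSQL.
const createTimelineByUser = `
CREATE TABLE timeline_by_user (
	user_id   bigint      NOT NULL,
	posted_at timestamptz NOT NULL,
	post      jsonb       NOT NULL, -- everything the query needs, no joins
	PRIMARY KEY (user_id, posted_at) -- matches the single access path
)`
```

Since the table only ever answers “give me this user’s timeline in time order”, both the Postgres and the eventual NoSQL versions stay nearly identical.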
(Different commenter)
I would use Postgres as a database and select more appropriate application tools that can better scale in the future.
Sorry, what do you mean specifically? All the use cases above are databases in my book.
I’d rather use an application server to create a REST API than render in Postgres. Same for GraphQL; although, I probably wouldn’t use it.
I think you may be replying to a different thread accidentally, we’re talking about using PostgreSQL instead of other databases and message brokers.
Thereby increasing your operational overhead X-fold before you validated that people are actually going to use what you built?
Not really. You already have an application server involved. Implement your APIs, etc. there, not in the database. Also, I have far more experience dealing with application servers than databases, so the mental overhead for me is lower than doing everything in the DB.
I may be off-base, but I don’t think the article recommends replacing the application server with the database, if such a thing is even possible. Rather, it recommends using pg for cache storage, message queues, full-text search, cron jobs, etc. - all things you’ll probably need fairly early, and for which the internet recommends a whole toolchain for scale, including but not limited to redis, kafka, mongodb, elasticsearch, solr, and so on. All of these have their own deployment constraints, differentiated pricing plans, and required admin specialties - none of which you want when going 0 to 1, looking for people who actually want to use and pay for your product.
What’s a reasonable load out beyond just Postgres that would go a long way?
Say: Postgres + Redis + Kafka?
I’ve found Redis Streams to be a reasonable stand-in for Kafka in architectures where the amount of content in your stream WAL doesn’t grow unbounded (i.e. you can trim it regularly and maintain enough of a playback window).
It’s got very similar semantics to Kafka and allows you to cut one of the (more operationally costly) pieces from that stack if you’ve got the RAM for it.
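A minimal sketch of that pattern with go-redis (stream name and trim threshold invented): XADD with approximate MAXLEN keeps the WAL bounded, and XREAD gives you the Kafka-ish consume loop. For competing consumers you’d reach for consumer groups (XGROUP/XREADGROUP) instead.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Produce: append to the stream, trimming (approximately) to ~1M entries
	// so the in-memory log doesn't grow unbounded.
	err := rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: "events",
		MaxLen: 1_000_000,
		Approx: true,
		Values: map[string]interface{}{"type": "signup", "user": "42"},
	}).Err()
	if err != nil {
		panic(err)
	}

	// Consume: block for entries after the last ID we've seen.
	lastID := "0" // "$" would mean "only entries from now on"
	streams, err := rdb.XRead(ctx, &redis.XReadArgs{
		Streams: []string{"events", lastID},
		Count:   100,
		Block:   5 * time.Second,
	}).Result()
	if err != nil {
		panic(err)
	}
	for _, s := range streams {
		for _, msg := range s.Messages {
			fmt.Println(msg.ID, msg.Values)
			lastID = msg.ID // your playback window is bounded by the trim above
		}
	}
}
```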
I ran into the same edge case not so long ago. The solution to this issue is to use table partitioning.
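For the “delete a user’s millions of rows” case above, here’s a sketch of what that partitioning fix looks like with Postgres 10+ declarative partitioning (all names invented). Dropping a partition is a file-level operation, so it avoids the row-by-row delete and the resulting bloat; the caveat is that Postgres copes with hundreds to a few thousand partitions, not millions, so smaller tenants would need to be bucketed together:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		// Parent table is partitioned; rows physically live in the partitions.
		`CREATE TABLE IF NOT EXISTS events (
			user_id    bigint      NOT NULL,
			created_at timestamptz NOT NULL,
			payload    jsonb
		) PARTITION BY LIST (user_id)`,
		// One partition per big tenant.
		`CREATE TABLE IF NOT EXISTS events_u42 PARTITION OF events FOR VALUES IN (42)`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}

	// "Delete user 42": detach, then drop. No per-row deletes, no bloat.
	if _, err := db.Exec(`ALTER TABLE events DETACH PARTITION events_u42`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`DROP TABLE events_u42`); err != nil {
		log.Fatal(err)
	}
}
```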
One thing I’m not clear on with this article talking about SQLite: is SQLite’s rollback journal MVCC (the article would seem to imply no), a WAL (the fact that you can choose between it and WAL mode also seems to imply no to me), or something else?
The rollback journal is something else. You can pick between the rollback journal and the WAL mode. Neither version does MVCC. The unmerged effort to put SQLite on an LSM Tree (https://sqlite.org/src4/doc/trunk/www/lsmusr.wiki) does use MVCC. But it has been basically abandoned.
Isn’t WAL mode necessarily MVCC, if in a limited capacity? Every writer creates a new version of data in the WAL without touching the older versions, both in the database and the WAL. That is why readers can keep on reading.
The rollback journal is a legacy feature that no one should be using anymore. The fact that it’s enabled by default is an unfortunate backwards-compatibility measure.
LMDB has no WAL. Instead it uses MVCC to update the b-tree in place. (Its b-tree is a persistent data structure.) It’s said to outperform SQLite, though it’s a bit hard to compare since LMDB is a lower-level db.
A WAL does accelerate writes, although eventually it has to be merged into the “real” db, which incurs extra reads+writes. WALs seem to slow down reads to some degree, since every page-fetch has to check the WAL first to see if a newer version exists.
That’s a fair implication. In my pseudo-code I assumed you could write to the btree in-memory and separately schedule it to be persisted to disk. That only works if your data fits in memory. Otherwise it would be as you say, you need to check the WAL (or some other in-memory summary of unflushed data) on every read.
And the WAL isn’t necessarily in memory. It could have been written to by another process since you last accessed it.
we already had a DB in the browser, it’s called localStorage /ducks
not only that, also indexeddb
Never ever use localStorage; it’s broken by design.
Indexeddb is a real db though
Is that a DB or a storage layer?
The now-deprecated webSQL then
Is every key-value storage system not database enough for you? :)
I guess everything is a database if you are brave enough :D
Thanks for posting! I wasn’t sure if it was on-topic.
Happy to answer questions.
Do you find the Twitch streams valuable for discussion? Or is all the discussion on email?
There’s nothing live at all. Everything is email.
Is it a typo that you have the 1st iteration listed twice in the second code snippet?
It is - thanks Phil!
The link now 404s for me; the article only exists on its own branch, not on main.
… And fixed already
The paper has had an interesting history: it was paywalled from publication until just now.
https://twitter.com/AlexMillerDB/status/1795736817922417049
I think my two favourite, most fun books are Data-oriented Design and The Design and Implementation of the 4.4 BSD Operating System. Both full of widely applicable, timeless, yet specific knowledge. Data-oriented Design is one of those books I keep by my desk, just to be in its presence.
“Data-Oriented Design” looks good: some of the reviews seem to suggest it’s more aimed at people with a couple of years of experience though, rather than beginners. What do you think?
I think anyone who picks it up and doesn’t grok it will pick it up again when they’re able to grok it.
Thanks for the recommendation for Data-oriented Design!