SQLite is my go-to for small-to-medium webapps that could reasonably run on a single server. It is zero effort to set up. If you need a higher-performance DB, you probably need to scale past a single server anyway, and then you have a whole set of other scaling issues, since at that point you need a web cache and other infrastructure regardless.
Reasons not to do that: backups are handled somewhere other than the application, you get good inspection tools while your app runs, there are performance optimizations (including “shared” memory usage with one big DBMS instance) you can’t do in SQLite, and the migration path to a multi-machine setup is easier. Lastly, you also get separation of concerns, allowing you to split parts of your app into different permission levels.
Regarding backups: what’s wrong with the .backup command
If I’m reading that right, you’ll have to implement that in your application. Postgres/MariaDB can be backed up (and restored) without any application interaction, so backups can also be performed by a specialized backup user (making them a little more secure, too).
As far as I know, you can use the sqlite3 CLI tool to run .backup while your application is still running. I think it’s fine if you have multiple readers while one process is writing to the DB.
Yes, provided you use WAL mode, which you should probably do anyway.
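For concreteness, the same online backup API that powers .backup is exposed by language bindings too; a minimal sketch using Python’s standard sqlite3 module (the file names are placeholders):

```python
import sqlite3

# Open the "live" database and add some example data.
src = sqlite3.connect("app.db")
src.execute("CREATE TABLE IF NOT EXISTS t (x)")
src.execute("INSERT INTO t VALUES (1)")
src.commit()

# The online backup API copies the database page by page and is safe
# to run while other connections are reading or writing the source.
dst = sqlite3.connect("backup.db")
with dst:
    src.backup(dst)

copied = dst.execute("SELECT COUNT(*) FROM t").fetchone()[0]
src.close()
dst.close()
```

The CLI’s .backup command and Connection.backup() use the same underlying mechanism, so either route gives a consistent snapshot without stopping the application.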
You could use litestream to stream your SQLite changes to local and offsite backups. Works pretty well.
Ok, but instead of adding another dependency that solves the shortcomings of not using a DBMS (and that I’ll also have to care about), I could instead use a DBMS.
OK, but then you need to administer a DBMS server, with security, performance, testing, and other implications. The point is that there are tradeoffs and that SQLite offers a simple one for many applications.
Not just that, but what exactly are the problems that make someone need a DBMS server? Sqlite3 is thread safe and for remote replication you can just use something like https://www.symmetricds.org/, right? Even then, you can safely store data up to a couple of terabytes in a single Sqlite3 server, too, and it’s pretty fault tolerant by itself. Am I missing something here?
What does a “single sqlite3 server” mean in the context of an embedded database?
How do you run N copies of your application for HA/operational purposes when the database is “glued with only one instance of the application”?
It’s far from easy in my experience.
My experience has been that managing Postgres replication is also far from easy (though to be fair, Amazon will now do this for you if you’re willing to pay for it).
SymmetricDS supports many databases and can replicate across different databases, including Oracle, MySQL, MariaDB, PostgreSQL, MS SQL Server (including Azure), IBM DB2 (UDB, iSeries, and zSeries), H2, HSQLDB, Derby, Firebird, Interbase, Informix, Greenplum, SQLite, Sybase ASE, Sybase ASA (SQL Anywhere), Amazon Redshift, MongoDB, and VoltDB.
This seems quite remarkable - any experience with it?
Where do you see the difference between litestream and a tool to backup Postgres/MariaDB? Last time I checked my self-hosted Postgres instance didn’t backup itself.
You have a point, but nearly every DBMS hoster has automatic backups, and I know many backup solutions that automate this. I am only running stuff myself, though (no SaaS).
No, it’s fine to open a SQLite database in another process, such as the CLI. And as long as you use WAL mode, a writer doesn’t interrupt a reader, and a reader can use a RO transaction to operate on a consistent snapshot of the database.
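A quick way to see that snapshot behavior, sketched with Python’s standard sqlite3 module (the file name is arbitrary):

```python
import sqlite3

db = "wal_demo.db"

# Writer connection: enable WAL and seed exactly one row.
w = sqlite3.connect(db)
w.execute("PRAGMA journal_mode=WAL")
w.execute("CREATE TABLE IF NOT EXISTS t (x)")
w.execute("DELETE FROM t")
w.execute("INSERT INTO t VALUES (1)")
w.commit()

# Reader connection in autocommit mode so transaction control is manual.
r = sqlite3.connect(db, isolation_level=None)
r.execute("BEGIN")
before = r.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# The writer commits a second row; the reader's open transaction
# still sees its original snapshot.
w.execute("INSERT INTO t VALUES (2)")
w.commit()
inside = r.execute("SELECT COUNT(*) FROM t").fetchone()[0]
r.execute("COMMIT")

# A fresh read after the reader's transaction ends sees both rows.
after = r.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(before, inside, after)  # 1 1 2
```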
Good article, one nitpick:
Most sites only need ~99.99% availability, since you get significantly diminishing returns as soon as you’re more reliable than your end user’s ISP or home network. Most internet users have less than 99.99% availability just from their ISP having problems or their cat eating their ethernet cable or whatever — why try to be orders of magnitude more available than that, when it won’t make a significant dent in the uptime as it’s experienced by the end user?
Any given user may be down 0.01% of the time, but it could be a different 0.01% for each user! I haven’t run the numbers, but I’d guess that you wouldn’t need all that many users before someone experienced your downtime.
Feels weird to chime in into this dialog, but I think you are reading this “wrong”.
The claim as I see it is Amdahl’s law: if the user experiences 10 hours of downtime per year, of which 9 are due to cable-chewing cats and 1 is due to planned app downtime, then eliminating 100% of app downtime would lead to only a 10% reduction in downtime from the user’s perspective.
It seems that you read this as a different claim of the P(A ∪ B) = P(A) + P(B) - P(AB) sort (something about app downtime not mattering if there’s cat downtime at that same instant).
Yeah, that’s a more sensible read than mine.
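The arithmetic behind both readings is easy to check; a small sketch (the 10-hour/1-hour figures are just the example numbers above):

```python
# Convert an availability target into a yearly downtime budget.
def downtime_hours_per_year(availability: float) -> float:
    return (1.0 - availability) * 365 * 24

budget = downtime_hours_per_year(0.9999)  # four nines ≈ 0.88 hours/year

# Amdahl-style view: removing all app downtime only removes the share
# of total downtime the app was responsible for.
total_down, app_down = 10.0, 1.0   # hours/year, from the example above
reduction = app_down / total_down  # 0.1, i.e. a 10% improvement
```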
Yeah, this is true — and relatedly, now that we’re living in this strange world where most people have at least two independent internet connections at their home (traditional ISP + cell provider), the maximum availability that each individual user has to the internet is a lot better these days.
I still think that the vast majority of sites don’t need more than 99.99% availability, but it can be a complicated tradeoff. That section of this article was mostly aimed at:
Explaining some tools that people can use to think about this (converting availability to downtime budget, thinking about end-to-end availability as well as just the parts you control)
Trying to get people to consider whether they actually need super high availability — in my experience, SREs who have significant experience (both at large and small companies) basically always aim for lower availability than laypeople tend to, and there are very good reasons for this :)
99.99% is roughly 53 minutes of downtime per year, which, sorry to burst everyone’s bubble, hasn’t been met by most Google services or the Amazon website in its full function, let alone AWS or GCP, which have had higher annual downtime than that.
By extension, the same or higher downtimes have affected virtually all websites hosted at these services.
It’s quite telling that people brag about their high-availability Kubernetes clusters and whatnot, when in fact that usually results in more fragile systems that are always going down due to incidents. “But that was because…”
It’s not the case that most websites don’t need that level of availability; it’s more that not even the big players manage to deliver it. If people were more honest about the uptime of their services, we would conclude that in reality the vast majority of websites are more like 99.9% uptime, if not less.
Apparently true for US: https://www.census.gov/library/stories/2018/08/internet-access.html
I wonder about global south, though.
you can use strict typing instead, which somewhat improves the situation. It’ll take a little while for this to get rolled out to things like the python stdlib, but it’ll be excellent when it’s there.
No need to wait, apsw wraps the latest SQLite in a convenient package.
while Django supports it, they say that their migration tool “can be slow and occasionally buggy”
That’s kinda funny because to me, Django’s ORM is the ‘slow and buggy’ one, with fun issues like ‘deleting a bunch of rows takes exponential time’: https://code.djangoproject.com/ticket/31624
I would advise people to use apsw and not the Python standard library binding for anything important.
The Python stdlib tries to conform to this Python-specific “DBAPI” standard and in doing so really messes up the API with respect to concurrency and performance. As far as I remember it will actually wrap your own SQL in its own SQL, and that is not what I wanted for a high performance application, so I disabled it.
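For context, the “SQL wrapping” being described is presumably the stdlib’s implicit transaction handling: by default the sqlite3 module issues its own BEGIN before data-modifying statements. A sketch of disabling it by opening the connection in autocommit mode:

```python
import sqlite3

# isolation_level=None disables the stdlib's implicit BEGIN statements;
# the driver then executes exactly the SQL you write, and transaction
# control is entirely manual.
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE t (x)")

con.execute("BEGIN")
con.execute("INSERT INTO t VALUES (1)")
con.execute("COMMIT")

n = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```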
It will raise exceptions if you’re using it “wrong” too.
Just use apsw because the API matches what you read in the sqlite docs.
Somebody please correct me if I’m wrong because I haven’t worked with SQL in a while, but that looks like O(n²), and not exponential time? Each deletion (n of them) involves a comparison against at most n elements, in other words n² operations.
In n², the 2 is an exponent ;-)
Typically when people refer to exponential time they refer to 2ⁿ, not n².
https://en.m.wikipedia.org/wiki/Time_complexity
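The difference is stark even at small inputs; a quick numerical comparison:

```python
# Quadratic (n**2) vs. exponential (2**n) growth.
for n in (10, 20, 30):
    print(n, n**2, 2**n)
# 10 100 1024
# 20 400 1048576
# 30 900 1073741824
```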
That Django issue wasn’t about a performance regression?
https://learnbchs.org
I think there’s a reason why languages like PHP were once created, despite such a capable language as C/C++ being widely available.
It goes something like:
I don’t completely disagree, but I think there are also other reasons PHP took off. For example, it provided the 90s version of serverless, and no compilation was required.
On top of that, getting closer to natural language (in one way or another) and text processing were big goals, along with easy interfacing with databases from the standard library.
Times change, and we see trends in the opposite direction. Ambiguity and expressiveness over simplicity are no longer the goal; people want things to look familiar, and will trade writing a few more well-understood constructs for simplicity. People prefer being explicit instead of implicit.
An example of that is the trend away from duck typing. Once that was considered a good thing, because it’s closer to natural language and means less writing; just as in natural language, you could be briefer when many things were implied.
Then we had the rise of Java-style languages, and now the interesting new languages are the ones copying much of the philosophy that people associate with C.
I’m not saying people should use C or BCHS, but a language isn’t bad just because people follow different trends, learn other languages in school, or similar. Popular languages tend to keep evolving and their ecosystems keep maturing.
Of course that also means legacy in the sense of “failed attempts”: old interfaces (see Go getting new network address types), unmaintained libraries, etc. accumulate, and I have to admit the lack of these really is what’s exciting about new languages. There’s usually also just less bad, ugly, buggy, unidiomatic code in very new languages.
However, times have changed, and given that there isn’t much quite like BCHS (also because both the world and the technologies it depends on were different then), I don’t think the existence of PHP is a good argument against BCHS.
Again, I’m not saying you should use it, or that PHP is worse or better. Just that such general rules usually aren’t the best guides for practical decisions.
So much to unpack here.
It has full access to the kernel’s system calls and a massive set of development tools and libraries.
I don’t think C the programming language has anything to do with syscalls. Which ones you have access to depends instead on which ones your standard library decides to implement wrappers for, same as in many other languages. Granted, on BSD this is likely most of them, but it certainly isn’t guaranteed.
I think you might have replied to this comment in error.
Let’s pretend I replied one level up, or that I was also unpacking that section of the website. :)
Sorry, I missed that you were referring to content on the same site. The phrase “non-mustachioed” was a thought terminator.
I think the entire site is meant to be read as satire. “The internet is unsafe” - granted, but recommending C in that case is not best practices.
OK then.
It’s not as if you can’t use a sane language like Perl on OpenBSD, it’s included in base for building.
I tried to, but:
Until recently (I’m old), concurrent access was a problem. Even in small projects I have some sort of concurrency, and “database is locked” errors forced me to add a dedicated thread that communicates with the others through an in-memory queue. And if you have multiple processes, it becomes an async-task mess. Or you add retries and backoffs. Until super recently, I got database file corruption when handling concurrent access that way. So it’s a pick-your-poison kind of situation.
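A minimal sketch of that dedicated-writer pattern (names and schema are made up): every thread enqueues its writes, and a single thread owns the only writing connection, so writers never contend for the lock:

```python
import queue
import sqlite3
import threading

write_q = queue.Queue()

def writer(db_path):
    # The only connection that ever writes; everyone else enqueues.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS events (msg TEXT)")
    con.commit()
    while True:
        item = write_q.get()
        if item is None:  # sentinel: shut down
            break
        sql, params = item
        con.execute(sql, params)
        con.commit()
    con.close()

t = threading.Thread(target=writer, args=("writer_demo.db",))
t.start()

# Any thread can submit work without risking "database is locked".
write_q.put(("INSERT INTO events VALUES (?)", ("hello",)))
write_q.put(None)
t.join()
```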
Until very recent versions, it didn’t support altering columns, which made it problematic even for small projects that work with migration scripts. Now ORMs need to adapt to this change. Maybe they all have.
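For reference, the old workaround when ALTER TABLE couldn’t express a change is the rebuild-the-table dance from SQLite’s ALTER TABLE documentation; a sketch with made-up table names, here dropping a column the pre-3.35 way:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT, legacy_col TEXT)")
con.execute("INSERT INTO users VALUES (1, 'ada', 'x')")

# Rebuild-the-table migration: create the new shape, copy the data,
# drop the old table, rename the new one into place.
with con:
    con.execute("CREATE TABLE users_new (id INTEGER, name TEXT)")
    con.execute("INSERT INTO users_new SELECT id, name FROM users")
    con.execute("DROP TABLE users")
    con.execute("ALTER TABLE users_new RENAME TO users")

cols = [row[1] for row in con.execute("PRAGMA table_info(users)")]
print(cols)  # ['id', 'name']
```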
Selecting groupwise maximum rows was also a mess the last time I checked.
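For what it’s worth, groupwise maximum can at least be expressed portably with a correlated subquery (made-up data; newer SQLite versions can also use window functions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (category TEXT, name TEXT, price REAL)")
con.executemany("INSERT INTO items VALUES (?, ?, ?)", [
    ("fruit", "apple", 1.0),
    ("fruit", "mango", 3.0),
    ("veg",   "leek",  2.0),
])

# For each category, pick the row holding that category's max price.
rows = con.execute("""
    SELECT category, name, price
    FROM items AS a
    WHERE price = (SELECT MAX(price) FROM items AS b
                   WHERE b.category = a.category)
    ORDER BY category
""").fetchall()
print(rows)  # [('fruit', 'mango', 3.0), ('veg', 'leek', 2.0)]
```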
Having been bitten multiple times trying (maybe not hard enough) to use it and switching to an RDBMS every time, I am pretty reluctant to dive in and discover new, ah, idiosyncrasies.
I love SQLite. But are there any off-the-shelf solutions for having multiple web servers access a single shared SQLite database? “running that same app all on a single machine using SQLite” is a pretty big constraint.
If you need a server, what’s wrong with using a DBMS server? SQLite is nice for a local file database, but beyond that, any off-the-shelf solution will most likely add extra complexity and lose most of SQLite’s benefits, especially given alternatives like PostgreSQL.
Because now you have two servers, and the DBMS needs accounts and privileges set up, and you have to configure the app server to tell it where the DBMS is, etc.
Obviously a DBMS is more scalable, but for something that doesn’t need to scale to multiple nodes, why add all the complexity? Why drive an SUV to the corner store when you can just hop on your bike?
In the environment of the person asking, if you’ve got multiple web servers you already need to deal with accounts, privileges, address discovery, and all the relevant bits.
You’re right, I wasn’t reading back up to @nelson’s question so I missed the “multiple web servers” part. Running SQLite as a shared db server does seem an odd choice … I agree with @isra17.
Nothing at all is wrong with using a DBMS server! But this fine article is all about how you can use SQLite instead. I’ve been wondering if someone’s built a really simple DBMS wrapper around SQLite. The other answers suggesting rqlite or reading from replicas are the kind of thing I had in mind.
From my understanding, the article makes the point that you can run a single web server and therefore keep a SQLite database on that server, keeping everything simple and fast. If for some reason you need to scale to multiple nodes, then the article’s point no longer applies. When using a project like rqlite, you are using SQLite about as much as you are using files when using Postgres. rqlite is a whole different system with a whole new performance and guarantees profile. Calling it simply a SQLite server wrapper is an understatement. It seems like rqlite is more in the etcd and Consul category than a general DBMS.
rqlite author here.
Agreed, rqlite could be considered similar to etcd or Consul, but with relational modeling available. But I think it’s perfectly fair to call it a DBMS too, just a particularly lightweight and easy-to-run one.
https://github.com/rqlite/rqlite
Along those lines there’s also https://dqlite.io/
“running that same app all on a single machine using SQLite” is a pretty big constraint.
I’m not sure it is – single machines are BIG these days. You can fit a lot on a single node before you even need to think about scaling. However, if you feel that way, don’t use sqlite. There are database servers out there.
Isn’t this just built into SQLite?
https://www.sqlite.org/faq.html#q5
Well, not really. WAL mode is really just based on file locking. If you use the SQLite library for your programming language, it will make sure that during a write operation changes are written to the WAL file instead of the original database file, which is locked during that time. This works for a certain amount of write concurrency, but as soon as you have a complex application issuing lots of writes, you will start to notice your SQLite layer responding with messages like “database is busy, locked”.
That’s when your application backend has to either handle retries or adjust the timeout a SQLite write operation will “wait” for the lock to become free (the C library’s default is not to wait at all). You can imagine this won’t work well with multiple servers accessing the same SQLite database doing concurrent writes. Really, if you want that, just use some DBMS.
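The knob in question is SQLite’s busy timeout, which makes a blocked writer retry internally instead of failing immediately; a sketch of setting it (the file name is arbitrary):

```python
import sqlite3

# The connect() timeout maps to SQLite's busy handler (in seconds).
con = sqlite3.connect("busy_demo.db", timeout=5.0)
con.execute("PRAGMA journal_mode=WAL")     # readers no longer block the writer
con.execute("PRAGMA busy_timeout = 5000")  # same knob via SQL, in milliseconds

# With the timeout set, a write hitting a locked database retries for
# up to 5 s before raising "database is locked".
con.execute("CREATE TABLE IF NOT EXISTS t (x)")
con.execute("INSERT INTO t VALUES (1)")
con.commit()

timeout_ms = con.execute("PRAGMA busy_timeout").fetchone()[0]
print(timeout_ms)  # 5000
```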
Also, long running SQL write operations can really fuck up the response time for any other application thread on the same database, the SQL queries will receive the busy error and have to deal with that.
And I have seen pretty bad application designs NOT dealing with that, just “forgetting” about the query and not even surfacing an error to the user.
SQLITE has WAL, yes, but it gives you very limited write concurrency.
That FAQ entry is about having multiple readers opening the same database file. Which works great! But it’s read only. More importantly, it only works if you have access to the database file, which in practice means only a single machine. (The doc even explicitly warns against trying this with NFS.)
I did a single DuckDuckGo search and found https://www.symmetricds.org/; I can’t vouch for it though. FWIW my search terms were “sqlite mirroring”.
I think LiteStream supports having multiple readers following one leader. If you’re read heavy, it can work to just consolidate writing to one big box
Litestream doesn’t support live read replication but it’s being worked on right now and should be available in the new year.
I can recommend using managed DB services, e.g. AWS RDS, to offload the DB server management. You get backups and other features without having to implement all that yourself.
A couple of random thoughts:
SQLite is the perfect “move fast, do things that don’t scale” tech.
SQLite will save you latency, all else being equal (which of course it often isn’t).
Moving between RDBMSes is a huge pain.
Hosted databases might be even less fuss for getting things like zonal failover within your cloud provider.
Unless your hosting provider gives you zero-effort read/write replicas, a single database server will become a choke point at scale as well, and you’ll still have to adapt your application.
I’ve used SQLite for an application that never needed to scale.
I actually use (hosted, Google) Postgres for an application I’m trying to scale.
I suspect this choice hinges on how well defined the application is: whether you know SQLite’s feature set will give you good performance, and whether you really want that sweet money saving versus the Postgres features that make certain workloads faster.
In the current application I work on, I’d really like partial (i.e. filtered) replication of tenant-specific data from Postgres to an in-memory or local database on a tenant-specific web server instance. Sadly this does not exist for free right now.