taking containers and upgrading them to full-fledged virtual machines
I’m not at all an expert on containers, but wasn’t one of the first reasons to move to containers that they were lightweight? I’ve seen this in other places too, and I’m wondering why there’s been a shift to VMs now. Or what’s the benefit of VMs that plain containers don’t have?
The main downside of containers is that they don’t really have any security guarantees. The Linux kernel interface is a big attack surface for apps to escape through.
I believe Amazon started using Firecracker VMs a few years ago; they’re lighter-weight VMs that still provide more security than the Linux kernel does for containers.
Google has something like this as well, I think it’s called gVisor.
Either way, if you can take OCI images and run them in a sandbox with stronger guarantees, without making startup time too slow or making it hard to mount directories or file systems, that’s a win.
Firecracker is exactly the technology that fly.io uses.
Not meant as a correction as much as a fun rabbit hole: gVisor is not always a VM-based hypervisor like Firecracker but this oddball sui generis thing: in one of its modes, a Go process intercepts the sandboxed process’s system calls with ptrace and makes other system calls to the Linux kernel as needed. The calls it makes to the “real” Linux kernel are a subset of all the possible ones, which was enough to mitigate some kernel bugs.
I think that link is saying that even in KVM mode, it’s still running its own pseudo-kernel vs. doing more traditional virtualization, but using KVM to provide some extra isolation and speed some things up.
(gVisor is also a place where race conditions are unusually relevant, because it’s meant to be a security boundary and you can poke at it using a multithreaded local process.)
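If you want a concrete (if toy) picture of that ptrace mode, here is a minimal Go sketch that starts a child under ptrace and just logs the syscall numbers it makes. It is nothing like gVisor’s actual Sentry, which emulates calls instead of merely observing them; the /bin/true workload is only a placeholder, and the sketch assumes linux/amd64.

package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"syscall"
)

func main() {
	runtime.LockOSThread() // ptrace requests must come from a single OS thread

	cmd := exec.Command("/bin/true") // placeholder workload
	cmd.SysProcAttr = &syscall.SysProcAttr{Ptrace: true}
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	// The child stops with SIGTRAP before running anything; Wait returns an
	// error describing that stop, which is expected here.
	_ = cmd.Wait()

	pid := cmd.Process.Pid
	var ws syscall.WaitStatus
	for {
		var regs syscall.PtraceRegs
		if err := syscall.PtraceGetRegs(pid, &regs); err != nil {
			break // the child has exited
		}
		// On linux/amd64, Orig_rax holds the syscall number. A real sandbox
		// would decide here whether to emulate or forward the call; this toy
		// only logs it. (Each call is reported at both entry and exit.)
		fmt.Printf("syscall %d\n", regs.Orig_rax)

		// Resume the child until the next syscall boundary, then wait again.
		if err := syscall.PtraceSyscall(pid, 0); err != nil {
			break
		}
		if _, err := syscall.Wait4(pid, &ws, 0, nil); err != nil || ws.Exited() {
			break
		}
	}
}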
And when we get security guarantees for containers then containers will be the new VMs for your containers for VMs.
And when we ditch Linux for something with security guarantees for native processes then native processes will be the new VMs for containers for VMs for native processes.
To save you some external reading if you just want a quick answer: VMs are fast now, thanks to mature CPU extensions such as nested page tables, paravirtual driver improvements, and technologies like Firecracker that boot VMs extremely quickly on minimal virtual hardware profiles.
From a security perspective, it’s easier to rely on hardware extensions that isolate page tables and IO devices than on the multitude of permission checks added to every corner of the kernel to support containers. It reduces the security problem to a few key areas.
From a performance perspective, it’s faster to rely on hardware extensions that keep guest pages separate from host pages than on the many nested or partitioned data structures throughout the kernel that support containers. Theoretically there should be no difference, but the kernel has already gone through decades of optimization for the base case, whereas these containerized/namespaced data structures are relatively new. And certain security features can never be implemented with equal performance. For example, syscall filtering just isn’t necessary for a VM, since the VM doesn’t interact with the host at that level. And these container security features don’t have hardware acceleration the way modern VMs do.
Ironically, all this hardware virtualization acceleration was done for the cloud, so you could argue that VMs are more literally “cloud native” than containers, as VMs are natively supported by all modern server hardware. 😉
Operating Proxmox (a KVM/QEMU/Ceph “distribution”), I can say that it’s a breeze. Snapshots, hardware config, HA, etc. are really easy. So yeah, I’m a fan of VMs, and if your CPU isn’t from 20 years ago, you’ll have no performance issues operating them.
I’m glad VMs are seeing a resurgence in popularity now, but one thing I’m happy came out of the Docker/containers world is consistent tooling, management, configuration, and deployment. I don’t have to mount in a .conf file somewhere, a .yaml file somewhere else, or ssh in and manually edit some file. For the most part, my deployments all look the same: a bunch of key-value environment variables, loaded the same way, written the same way, and easily understood compared to many different configuration formats with their own syntax and rules.
So the main reason I love Fly.io is I can keep all of that consistency that Docker gave me while benefitting from Firecracker turning those container images into nice fast VM images. It’s also portable so I can test my containers locally with a simple docker command!
Assuming you mean “fast to start up” — can you quantify “fast”?
~125ms.
But I also mean fast as in low execution overhead. Any performance difference you see between a VM on a cloud provider and a bare metal core on the same hardware has a lot more to do with noisy neighbors or CPU frequency scaling limits than actual virtualization overhead.
Gotcha. This is fine for a ton of use cases. I usually work in environments where SLOs are defined in terms of roundtrip or time-to-first-byte latency, and the p99 targets are typically on the order of single-digit milliseconds. Alas.
They’ve written a bit about their Docker-container-to-VM architecture:
VMs got a bad rap when they were just traditional servers in a box.
It was easiest to minimize state in applications by switching to containerization.
VMs have not yet hit their limits. Just don’t do “traditional server in a box”.
While being lightweight was the original point, people have since adopted “lightweight VMs” (like Firecracker, as other commenters have pointed out).
Containers are still the packaging/distribution method du jour, though, so it makes sense to keep them around.
I know this has been asked before, but what’s the comparison with rqlite and other approaches to distributing SQLite?
Tl;dr: you can either commit locally and then ship the data around (LiteFS, Litestream, Verneuil, et al.) or force replication before commit (Raft-based backends like Rqlite and Dqlite).
Longer version:
There are basically two main approaches to “distributed SQLite”: LiteFS, Litestream, and Verneuil all hook in at the filesystem/WAL level and ship those changes elsewhere for replication and snapshotting. You have to explicitly pull down one of those snapshots and load it into a local SQLite file (or do the equivalent with VFS trickery) to get a current view, but reads and writes are still 100% local (and therefore fast, as mentioned in the article).
Rqlite and Dqlite (plus probably many others) instead bolt a distributed consensus algorithm (Raft, in this case) into the SQLite runtime to build a distributed transaction engine. Once a txn is committed on the Raft cluster, it’s fully replicated and ready to read on any node.
Which one works for your use case depends a lot on how you use SQLite. If you care more about consistency, a Raft backend will give you a “proper” multi-writer, transactional store. If performance and availability in the face of the network or server peers being down matter more, one of the async “ship these file changes somewhere after commit” approaches will keep trucking along even when it can’t reach the outside world.
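To make that trade-off concrete, here is a hedged Go sketch of the two acknowledgement models; the function names are hypothetical and not any real library’s API. The async model acks once the local commit succeeds and ships the change afterwards, while the consensus model acks only after a quorum of peers has accepted it.

package replication

import "errors"

type Change []byte

// commitAsync: the "commit locally, ship later" style. The caller gets an ack
// after the local write; replication to peers happens in the background and may lag.
func commitAsync(local func(Change) error, shipLater chan<- Change, c Change) error {
	if err := local(c); err != nil {
		return err
	}
	shipLater <- c // a crash before this change is shipped can lose it
	return nil
}

// commitConsensus: the "replicate before commit" style. The caller gets an ack
// only after a majority of peers has durably accepted the change.
func commitConsensus(peers []func(Change) error, c Change) error {
	acks := 0
	for _, apply := range peers {
		if apply(c) == nil {
			acks++
		}
	}
	if acks <= len(peers)/2 {
		return errors.New("no quorum: change not committed")
	}
	return nil
}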
Being somewhat skeptical about the modern trend of replicating SQLite, I was intrigued by the part about ensuring consistency during split-brain: https://github.com/superfly/litefs/blob/main/docs/ARCHITECTURE.md#ensuring-consistency-during-split-brain. One of the fun cases is where the old primary goes offline and you need a secondary to come online and continue accepting writes.
Anyone know what guarantees it provides there? There are ways around that (e.g., blocking on WAL file flushes until sufficiently replicated), but I can’t imagine they’ll mix well with SQLite serializing COMMITs.
But then again, if you have a case where the risk of losing writes is too high, you’re probably out of scope for a system like this.
LiteFS author here. Right now, LiteFS provides async replication similar to Postgres async replication, so writes to the primary can be lost if they haven’t yet been replicated to other nodes. LiteFS is able to automatically handle failover because it maintains a rolling checksum of the contents of the database, so it can detect if an old primary tries to rejoin with unreplicated transactions. Async systems have limited guarantees as a trade-off for higher performance.
We do have plans for synchronous replication as well as time-bounded async replication in the near future. That will improve guarantees by ensuring an upper bound on transaction loss in the face of catastrophic failure.
I do agree with your last point though. If you have a system where you require something stronger than synchronous replication (such as distributed consensus) then LiteFS probably isn’t for you. :)
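To illustrate the rolling-checksum idea mentioned above, here is a toy Go sketch that keeps an XOR of per-page hashes and updates it incrementally as pages change; this is only the general concept, not LiteFS’s actual algorithm or format.

package main

import (
	"fmt"
	"hash/crc64"
)

var table = crc64.MakeTable(crc64.ISO)

// pageHash mixes the page number into the page hash so identical pages at
// different positions contribute different values.
func pageHash(pgno uint32, data []byte) uint64 {
	h := crc64.New(table)
	fmt.Fprintf(h, "%d:", pgno)
	h.Write(data)
	return h.Sum64()
}

// DBChecksum is the XOR of every page's hash. Updating a page only requires
// XOR-ing out the old page hash and XOR-ing in the new one.
type DBChecksum uint64

func (c DBChecksum) Update(pgno uint32, before, after []byte) DBChecksum {
	return c ^ DBChecksum(pageHash(pgno, before)) ^ DBChecksum(pageHash(pgno, after))
}

func main() {
	var a, b DBChecksum
	// Both nodes apply the same page write: checksums stay equal.
	a = a.Update(1, nil, []byte("hello"))
	b = b.Update(1, nil, []byte("hello"))
	fmt.Println(a == b) // true
	// Node a applies an extra, unreplicated write: checksums diverge, which is
	// how an old primary holding lost transactions can be detected.
	a = a.Update(2, nil, []byte("orphaned txn"))
	fmt.Println(a == b) // false
}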
Would it be correct to say “if any write is ever lost, that write is completely lost after the system fully heals”?
Yes, that’s correct. LiteFS uses physical replication, so there’s no practical way to try to merge transactions that deviate. That’s common for async log-shipping replication, so the main way to avoid it is to add synchronous replication. :)
In particular, you want to avoid something like
Transactions are still atomic with LiteFS. Your transaction won’t commit locally until you hit the COMMIT. Only then can it be replicated out to the rest of the cluster.
LiteFS . . . allows each application node to have a full, local copy of the database and respond to requests with minimal latency . . . so that operations can continue even in the event of catastrophic failures.
This suggests each application node processes both read and write transactions against its own local copy of the database, and synchronizes state, er, asynchronously with other nodes. Is that correct?
If so, this suggests that a partition can result in concurrent “causal histories” — modifying the same row on different sides of the partition in different ways would result in COMMIT OK responses for both actors, but when the partition heals, only one “lineage” would be accepted as the truth. Is that correct?
edit: (This may be a re-statement of riking’s comment.) If this is all correct, I’m surprised that it’s acceptable! This isn’t meant as a dig at all, I’m truly intrigued. My assumption has been that application code receiving a COMMIT OK from the state layer would treat that signal as sacrosanct, and would break catastrophically if it were violated. Is that not your experience, among your customers?
This suggests each application node processes both read and write transactions against its own local copy of the database, and synchronizes state, er, asynchronously with other nodes. Is that correct?
Not quite. There’s only one primary at any given time. It works pretty similarly to async Postgres replication, except that it supports automatic handoff of the primary to a replica. There’s no CRDT-style synchronization. It’s a single linear history streaming from the primary. Right now it only supports async replication, so there can be a time window where transactions could be lost in the event of a catastrophic failure of the primary. Those are the same guarantees as async Postgres replication.
This may be a re-statement of riking’s comment
Regarding his comment, a transaction is still atomic. Either all UPDATE commands will succeed or fail depending on whether the COMMIT is successful. Again, since it’s async replication, that transaction could get lost if there is a catastrophic failure on the primary before the transaction gets replicated.
My assumption has been that application code receiving a COMMIT OK from the state layer would treat that signal as sacrosanct, and would break catastrophically if it were violated. Is that not your experience, among your customers?
I think local transactional integrity is important and LiteFS maintains that. However, async replication is pretty common for many applications that use Postgres or MySQL. There’s a performance and complexity trade-off to ensuring durability within a cluster, and some applications are OK with relaxing those guarantees. Catastrophic failures are generally uncommon, so it’s not like users are losing transactions left and right.
All that being said, async replication doesn’t work for many applications and we are adding synchronous replication in the near future.
I’m super confused. If you accept and commit write transactions at each replica, then two users who are connected to different replicas could update the same row, at the same time, with different values, and both would get success responses, right? Assuming so, wouldn’t (async) replication to the primary result in at most one of those updates being accepted, and the other(s) being rejected?
Only one primary node can accept writes at a time. The rest of the nodes stream changes from that primary node.
Ah okay! So
This suggests each application node processes both read and write transactions against its own local copy of the database
is incorrect — only read transactions are served from (non-primary) replicas; write transactions are proxied to the primary, I guess synchronously, and those updates get async-replicated to the other nodes. Got it.
If you need to be able to keep writing transactions in the presence of failures, CAP is not especially forgiving of netsplits; you can’t have transactional consistency. Losing the primary is indistinguishable from a netsplit from the perspective of the client.
Yes, if you have a database that favours consistency over partition tolerance, then it should stop accepting writes if it can’t communicate with its secondaries. The fault is that it may accept writes that may never be visible on a secondary.
You can’t sacrifice partition tolerance in the CAP perspective. Availability and Consistency are the only two tunable knobs.
Thanks–I meant availability :)
You can, by eliminating the possibility for network partitions, also known as running on a single node.
In which case CAP wouldn’t be applicable, right? (Since it only applies to distributed systems)
I wish there was a version of SQLite that was as strict with types as Postgres is.
https://www.sqlite.org/stricttables.html ? :)
Neat, this mode seems to be a recent addition, but it is currently a bit restrictive. I like the improved ANY behavior, but there is no JSON, coordinate, or date type for it to constrain. Still, this is a big improvement! Thanks
You can enforce the syntax of column values (ensure they are valid JSON for example) using check constraints in a CREATE TABLE.
Here’s how to do that:
sqlite> create table test (id integer primary key, tags text, check (json(tags) is not null));
sqlite> insert into test (tags) values ('["one", "two"]');
sqlite> insert into test (tags) values ('["one", "two"');
Error: stepping, malformed JSON (1)
Yeah. Every time I see this stuff I think it’s cool, but I really don’t want to give up the features of postgres in terms of types and triggers.
This is a nifty project! It’ll enable good efficiency wins for a lot of applications.
It seems like if you use this, you need to be careful about eventual consistency. Forwarding writes to the master but processing reads locally means read-after-write will often return stale data. Totally possible to design around, of course, but treating this as a drop-in replacement for something like PostgreSQL could be a recipe for race conditions.
I wonder at what database size the economics stop making as much sense. For small databases, replicating to all the application server instances and ditching the central database server will probably save you money. But if your database is many terabytes, provisioning all your worker nodes with big disk arrays will get pricey fast. There’s clearly an inflection point somewhere between those extremes (and it’ll vary based on workload and so forth, of course).
But there are lots of applications where that’s just not a concern. This seems pretty interesting for those.
LiteFS author here. Yes, consistency is a factor for sure. LiteFS provides a transaction ID, so you could have replicas wait until they catch up, or you could forward reads that come after writes back to the primary as needed.
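As a sketch of that forward-reads-after-writes idea, here is some hypothetical Go handler code; the replicaTXID helper, the cookie, and the primary address are made up for illustration and are not LiteFS’s real interface.

package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"strconv"
	"time"
)

const cookieName = "min_txid" // lowest TXID this client must be able to read

// handleRead serves a read locally if the replica has caught up to the client's
// last write, and otherwise proxies the request to the primary.
func handleRead(w http.ResponseWriter, r *http.Request, replicaTXID func() uint64, primary *url.URL) {
	var want uint64
	if c, err := r.Cookie(cookieName); err == nil {
		want, _ = strconv.ParseUint(c.Value, 10, 64)
	}

	// Give the replica a short window to catch up to the client's last write.
	deadline := time.Now().Add(200 * time.Millisecond)
	for replicaTXID() < want && time.Now().Before(deadline) {
		time.Sleep(5 * time.Millisecond)
	}

	if replicaTXID() < want {
		// Still behind: forward the read to the primary rather than serve stale data.
		httputil.NewSingleHostReverseProxy(primary).ServeHTTP(w, r)
		return
	}
	w.Write([]byte("fresh local read\n")) // normally: query the local SQLite copy
}

// recordWrite is called after a successful write (which went to the primary);
// it remembers the resulting TXID so later reads from this client can wait for it.
func recordWrite(w http.ResponseWriter, txid uint64) {
	http.SetCookie(w, &http.Cookie{Name: cookieName, Value: strconv.FormatUint(txid, 10), Path: "/"})
}

func main() {
	primary, _ := url.Parse("http://primary.internal:8080") // hypothetical primary address
	replicaTXID := func() uint64 { return 42 }              // stand-in for reading the replica's position
	http.HandleFunc("/items", func(w http.ResponseWriter, r *http.Request) {
		handleRead(w, r, replicaTXID, primary)
	})
	http.ListenAndServe(":8081", nil)
}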
As for economics, yes, database size is one factor. The main goal for the project is to make it dead simple to do geographic replication. Many apps are hosted in the US, which is low latency for other US users but 100ms of latency to Europe and 250ms to Asia. That’s pretty brutal.
Right now, it’s a bit of a pain and pretty expensive to set up a globally replicated database or to pay for a managed service. Eventually, we hope that LiteFS can scale up easily and with no cost beyond your application server instances.
I wonder at what database size the economics stop making as much sense
I guess you’d have to factor in the cost of employees doing database maintenance, too, since unless your staff are part time, the cost of a managed database service (even at GBP 400/mo) will be significantly less than the cost of staffing. Sadly, it’s also a harder question to answer (I mean, you’d need developers to produce absurdly detailed time sheets, for one).