1. 88
  1.  

  2. 23

    I agree with Ben; I believe that massively complex HA-clustered systems can easily be less reliable and require a hell of a lot more devops work than an application running in this single node architecture.

    90% of apps out there can put up with 15 minutes of downtime if you have an outage. Build it into your terms of service: “system can go down for up to 15 minutes during $MAINTENANCE_WINDOW”. Now you can also run database migrations without needing to implement some insane “zero downtime” policy that requires 10x as much engineering effort.

    1. 5

      User expectations continue to rise, though. This isn’t the 90s Internet anymore. The big services have damn near zero downtime, so that’s just expected now.

      1. 13

        Really? Anecdotal, but Nintendo’s online store went down for over 7 hours last week for scheduled maintenance. Depending on the service (e.g., SaaS vs. storefront), scheduled downtime can be completely fine.

        1. 2

          Nintendo recently revealed that it has been using two-decade-old multiplayer tech, so the 7-hour downtime is very thematic for a system that old.

          1. 2

            But people were ok with it.

        2. 7

          so that’s just expected now.

          That’s not the relevant question, though. The question is: Will it affect your business? And more to the point, how do the following two things compare:

          1. Your business with 0-downtime, and all the engineering and complexity costs that come with that, including their impact on bugs and on the velocity of feature development.
          2. Your business with X-minutes downtime, but with a better product with more features, and the ability to move faster.

          For most types of businesses, I don’t think it’s even a question. Sure, you have “old-fashioned” uptime stats. But does anyone care?

          1. 4

            If I’m not mistaken, my bank goes down every night. Certainly every late-night attempt I’ve made, it’s been down.

            1. 1

              Right, but with the throughput and resources available to large organizations, one would expect a 15-minute downtime window to be shorter for them, because they can invest a lot of resources in making it smaller.

              On the other hand, building systems on the idea that dependencies will never go down sounds positively foolish, as does trying to hit 100% uptime with no leeway. And indeed, we now have an incredibly complex and fragile stack intended for 100% uptime – one that is much more error-prone and more costly in terms of man-hours spent fixing, documenting, learning, and building on it – than it would be to just allow a monthly or bi-monthly 15-minute downtime.

            2. 1

              Well then you can just create multiple clusters for redundancy ….

              https://news.ycombinator.com/item?id=26111314 (this quote appears to be real :-( :-( )

              1. 2

                Not sure if this is meant as irony or not. Having multiple instances of an app is a very real HA technique. And honestly, if you can isolate your services on a single box, it’s as simple as just starting multiple copies of the service.

                1. 1

                  Sure, having multiple instances of an app is conventional. Having multiple instances of a cluster manager is a different story. I’d say you need an ops team of at least 10 full-time people (and maybe 50 or 100 people) to justify that.

                  And justifying it because “well the entire cluster sometimes goes down when we upgrade” is a little ridiculous. The apps should be able to survive components of the cluster going down.

                  1. 1

                    You can just use a managed solution.

            3. 6

              This combination of SQLite and consistent backups looks nice. What I still don’t understand is how you deploy your application or migrate to another node.

              Seems like SQLite supports reads and writes from multiple processes, so I can start another process and then terminate the previous one once it stops processing requests.

              How about migrating to another node? You might want to install updates, or AWS has scheduled your node to be restarted. Does it mean that in this case I would need to stop all requests, then wait until no more writes are happening to the database, and then restore it from S3 on another machine?

              1. 3

                Does it mean that in this case I would need to stop all requests, then wait until no more writes are happening to the database, and then restore it from S3 on another machine?

                That’s my understanding as well. One way to mitigate this would be to switch your application to read-only as you bring up a replacement node. For the duration of the restore process you’ll have degraded availability, but you won’t be “down”.
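
                A minimal sketch of that read-only fallback, assuming a hypothetical application-level flag (the PRAGMA makes SQLite itself reject writes on the connection):

                  import sqlite3

                  READ_ONLY = False  # flipped on while the replacement node restores from S3

                  def get_conn(path="app.db"):
                      conn = sqlite3.connect(path)
                      if READ_ONLY:
                          # SQLite rejects INSERT/UPDATE/DELETE on a query_only connection.
                          conn.execute("PRAGMA query_only = ON")
                      return conn

                  def handle_write(sql, params):
                      if READ_ONLY:
                          # Surface this as e.g. HTTP 503 so clients retry after the cutover.
                          raise RuntimeError("service temporarily read-only")
                      conn = get_conn()
                      conn.execute(sql, params)
                      conn.commit()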

                1. 3

                  SQLite does support multi-process access to the same db file, but doing this over a network filesystem like NFS or SMB is strongly discouraged: most network filesystem implementations have problems with file locking, and improper locking can easily cause database corruption.

                  So this wouldn’t work for migrating to another host. It would however let you update your server code, on the same host, without downtime.
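
                  A minimal sketch of that same-host handoff (file name and timeout are illustrative): the old and new builds can both hold the database open in WAL mode while traffic moves from one to the other.

                    import sqlite3

                    def open_db(path="app.db"):
                        # timeout makes a writer wait on the other process's lock
                        # instead of failing immediately
                        conn = sqlite3.connect(path, timeout=5.0)
                        # WAL lets readers proceed while the other process writes
                        conn.execute("PRAGMA journal_mode=WAL")
                        return conn

                    # Process A (old build) and process B (new build) both run this;
                    # once B is serving requests, A stops accepting traffic and exits.
                    conn = open_db()
                    conn.execute("CREATE TABLE IF NOT EXISTS hits (ts TEXT)")
                    conn.execute("INSERT INTO hits VALUES (datetime('now'))")
                    conn.commit()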

                  1. 1

                    Yep, that was my understanding as well. Being able to switch to another process on the same host.

                2. 6

                  I get the emotional appeal of iconoclastically rejecting a conventional stack for something simpler. But I think running everything on a single machine is going too far.

                  In 2015, my Christmas Day with my family was interrupted by the DDoS attack on Linode. I’ve also seen connectivity issues spanning multiple servers in the same data center at Vultr. So, while it causes cognitive dissonance (see my comments on the thread about colocating one’s own servers), I have better peace of mind relying on a redundant stack at AWS.

                  Also, a highly available multi-tier architecture doesn’t have to have the kind of insane complexity that we like to ridicule in our caricatures. Many applications can get very far talking directly to a client-server RDBMS without a caching layer. Also, there are projects like PostGraphile that leverage advanced features of Postgres to provide things like real-time notifications and work queues without adding extra services like Redis, Kafka, or RabbitMQ to the mix. With something like PostGraphile, one can get very far with a managed Postgres service, a managed load balancer, and simple stateless application processes.
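
                  For concreteness, a rough sketch of the Postgres primitives such tools build on (LISTEN/NOTIFY for real-time events, FOR UPDATE SKIP LOCKED for a work queue); this is not PostGraphile’s API, and the table and channel names are made up:

                    import select

                    import psycopg2
                    import psycopg2.extensions

                    conn = psycopg2.connect("dbname=app")
                    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
                    cur = conn.cursor()
                    cur.execute("LISTEN new_job;")  # real-time notifications without Redis

                    while True:
                        # Block until Postgres signals that something was enqueued.
                        if select.select([conn], [], [], 30) == ([], [], []):
                            continue
                        conn.poll()
                        while conn.notifies:
                            conn.notifies.pop()

                        # Claim one job; SKIP LOCKED lets many workers compete safely.
                        cur.execute("""
                            UPDATE jobs SET claimed_at = now()
                            WHERE id = (SELECT id FROM jobs
                                        WHERE claimed_at IS NULL
                                        ORDER BY id
                                        FOR UPDATE SKIP LOCKED
                                        LIMIT 1)
                            RETURNING id, payload
                        """)
                        job = cur.fetchone()
                        if job:
                            print("processing", job)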

                  1. 14

                    Not all web-applications require a distributed stack. We are living at the start of an era of federated services targeting small communities. We need software that is cheap to run and deploy. Litestream helps to fill that niche very well.

                    1. 5

                      I wouldn’t say it’s purely an emotional appeal, though I think there’s a lot of “alternative-en-vogue” on Lobsters. It depends on the scale you’re running your web service at. I write five-nines services for work and they take a lot of engineering. But when our company was smaller, we wrote to three nines. And if you’re running a Fediverse node that hosts maybe a hundred users, you can probably get away with two nines or less. If you’re running a Fediverse node for yourself, your uptime is probably how you’re feeling that day. I’m happy to see DB solutions target the entire breadth of web development. A lot of my personal web services just leverage a Postgres or Redis instance on the box.
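
                      For anyone who hasn’t internalized the arithmetic behind those nines, the downtime budgets work out roughly like this (plain availability math, nothing project-specific):

                        # Allowed downtime per year for a given availability target.
                        MINUTES_PER_YEAR = 365 * 24 * 60

                        for label, availability in [("two nines", 0.99),
                                                    ("three nines", 0.999),
                                                    ("five nines", 0.99999)]:
                            budget = MINUTES_PER_YEAR * (1 - availability)
                            print(f"{label}: ~{budget:.1f} minutes/year (~{budget / 60:.1f} hours)")

                        # two nines   -> ~5256 min/year (~87.6 hours, ~3.7 days)
                        # three nines -> ~526 min/year  (~8.8 hours)
                        # five nines  -> ~5.3 min/year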

                      1. 2

                        I think if you’re running a small-scale single-node service, SQLite is an even better choice than Postgres, etc., since it’s basically zero-configuration; there are no extra processes or logins or points of vulnerability. All it needs to know is what directory to put the DB file in.
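
                        A tiny illustration of the zero-configuration point (the file name is arbitrary): no server process, no credentials, no port.

                          import sqlite3

                          # The entire "setup" is choosing a file path.
                          conn = sqlite3.connect("service.db")
                          conn.execute("CREATE TABLE IF NOT EXISTS notes (body TEXT)")
                          conn.execute("INSERT INTO notes VALUES ('hello')")
                          conn.commit()
                          print(conn.execute("SELECT body FROM notes").fetchall())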

                      2. 1

                        I thought about this a few times, and every time I ended up doing rough math on the bill a DDoS attack could rack up in bandwidth charges alone, and it always brought me back to my senses: I will rely on as much protection as $200 can buy me at Cloudflare and no more. Plus, there is always a chance to misconfigure an autoscaler, and every time I see articles like https://news.ycombinator.com/item?id=25372336 (written by an ex-Googler), I think to myself: “Unless I am losing lots of money for every minute of downtime, I’d rather see my system go down for a few hours and be woken up by angry clients than wake up to see a $10k+ bill. It will be cheaper to offer them a discount next month in almost every case.” If you think I am nuts, just check how AWS or GCP reimburses customers when they breach their own SLAs.
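
                        Purely illustrative figures (not the napkin math linked below): even a modest sustained traffic spike adds up fast at typical cloud egress prices.

                          # Hypothetical: 1 Gbps of extra outbound traffic billed at $0.09/GB.
                          gbps = 1
                          price_per_gb = 0.09
                          gb_per_hour = gbps / 8 * 3600          # ~450 GB/hour
                          print(gb_per_hour * price_per_gb)      # ~$40/hour, ~$970/day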

                        Edit: here is my previous back-of-the-napkin math https://news.ycombinator.com/item?id=23554896

                        1. 1

                          What do you use for the database? I would like a database-as-a-service (Postgres or MySQL) but haven’t used them and don’t know which ones are good.

                          I think having the database be “managed” and have some flexibility in the compute could be a good compromise. So you can fail over the web processes but leave someone else to manage the database.

                          I used MySQL on Dreamhost a few times. It worked well but I remember having a performance issue. I did some trick in Python that made it fast, but it was slow in PHP (which surprised me). I guess I should try one of their dedicated plans, not shared plans. Although they may only let you connect from a single host – not sure.

                          From my experience the single Dreamhost box that I run https://www.oilshell.org on has more than enough uptime and performance, but it doesn’t use any database. (It does use a few dynamic processes, though it’s mostly static)

                          1. 2

                            Amazon RDS is working well for me.

                        2. 3

                          My question for SQLite users is how do you handle migrations affecting existing tables? I remember starting a project with it and creating a migration with an ALTER TABLE for an existing column and running into issues. I looked at the docs and it said you have to create a new table, copy everything from the old one, and then rename the new table, which is a major PITA. At that point I just switched to local postgres.

                          1. 4

                            This is a non-trivial operation in most production databases I’ve worked with. Schema mutation usually requires a synchronized deployment of new application code, with the attendant downtime/read-only lock while that push catches up. You also have to wait for your RDBMS to finish the arbitrarily long, multi-stage DML run(s) while you rewrite indices, change defaults/null-handling, or convert values, then verify that the conversion worked before you open up the write path again.

                            SQLite makes some things much harder or impossible (primarily multi-host access to one consistent database) but IME schema migrations aren’t a much bigger deal than they are for client/server DBs.

                            1. 5

                              It sounds like you’ve had much more experience with RDBMSs than I have, and I don’t deny what you’re saying is true; it’s just a bit beside the point I was making. When I use Postgres and want to alter a table, I just write a migration and tell Flyway to run it. It may have to handle stuff under the hood and take a long time, but for me it has “just worked”, likely due to the DB size. With SQLite I would have to write all this SQL to copy and rename existing stuff to do the same thing; it’s more of an ergonomic problem.

                              1. 3

                                That’s a totally reasonable point; sorry if I jumped straight to “production trauma” stories. :)

                                If you remember from that previous project, I’m curious which types of schema-mutation limitations you ran into. There are some limitations to SQLite’s in-place ALTER options, but not that many, given that even in most other DBs you can’t e.g. add a new non-NULL column without providing a default, violate existing foreign key constraints, etc.

                                https://www.sqlite.org/lang_altertable.html

                                (Not trying to play dumb to force a “gotcha” moment. I’m genuinely curious since as you noted, I’ve mostly worked with “big” databases like PostgreSQL and SQL Server in recent memory.)

                                1. 2

                                  AFAIR SQLite doesn’t support column alter or drop operations, which means you would need to create a new table, copy the data over, drop the old table, and rename the new one.

                                  1. 1

                                    @rcoder I’m pretty sure this was it. I think I was on an early iteration of a project and wanted to drop a column but was unable to.

                            2. 3

                              You just never drop a column; whenever you need to, you rename the old column to something deprecated and add a new column. That is at least how people do this with SqlDelight.

                              Or you bite the bullet and do the data move in one transaction, with some obvious downsides (temporarily doubling the size of the table, and needing to call VACUUM afterwards). This is how Core Data does it.

                              I have other solutions to this in my SQLite ORM :) https://github.com/liuliu/dflat#schema-evolution

                              1. 1
                                ALTER TABLE foo RENAME TO _foo;
                                CREATE TABLE foo (a INTEGER NOT NULL, b INTEGER DEFAULT 0);
                                INSERT INTO foo (a) SELECT a FROM _foo;
                                DROP TABLE _foo;
                                
                                1. 1

                                  Yes, which is rather verbose for the

                                    alter table foo add column b integer default 0
                                  

                                  (or similar) that many other systems support.

                                  1. 2

                                    SQLite supports ALTER TABLE ADD COLUMN. It just doesn’t support removing columns. (Changing a column’s type is moot because SQLite mostly ignores column types; it uses dynamic types per row.)

                                    1. 1

                                      Yes, sure, but SQLite does not support many ALTER commands; this is a generic way of modifying tables that works for everything.

                                2. 2

                                  This is great. For once, backup is taken care of by default, unlike all these howtos that tell you to install Postgres or MySQL and conveniently skip over that part. This is great for self-hosting.

                                  Now, all we need is some good old inetd-like interface to only start the service on demand so there is almost no resource consumption at rest.
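
                                  A minimal sketch of what that could look like with systemd socket activation, assuming a matching .socket unit passes one pre-opened listening socket (names and port are hypothetical):

                                    import os
                                    import socket

                                    SD_LISTEN_FDS_START = 3  # systemd hands pre-opened sockets starting at fd 3

                                    def listening_socket():
                                        if int(os.environ.get("LISTEN_FDS", "0")) >= 1:
                                            # Started on demand: the socket already exists, we just accept on it.
                                            return socket.socket(fileno=SD_LISTEN_FDS_START)
                                        # Fallback for running the process directly.
                                        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                                        sock.bind(("127.0.0.1", 8080))
                                        sock.listen(5)
                                        return sock

                                    sock = listening_socket()
                                    while True:
                                        client, _ = sock.accept()
                                        client.recv(1024)
                                        client.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
                                        client.close()

                                  Since the .socket unit keeps listening, the process can exit after an idle period and systemd will start it again on the next connection, which gets close to the inetd-style near-zero footprint at rest.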

                                  A lot of our industry is catering to Google-style scalability, thinking they need Kubernetes, load-balancers, … The space for smaller deployments hasn’t been explored as much and needs more polish. If the maintenance of self-hosting was simplified, many more people could afford to go down that route rather than relying on SaaS.

                                  1. 2

                                    But I need 100% uptime…

                                    Hah! We can’t really get that because of CAP.

                                    1. 2

                                      Who said anything about needing C? The vast majority of systems serve their users well even if data is slightly stale or inconsistent between users. CAP is a red herring for many practical purposes.

                                      1. 4

                                        Consistency and unavailability under partitions is way easier to reason about than inconsistency under partition.

                                        1. 2

                                          Huh? C is probably the most important of the 3. If you’re running a service that can accept the latency of consistent reads and writes, it makes application logic a lot simpler.

                                      2. 2

                                        It seems like Litestream and DQLite/RQLite are perfectly complementary. You can shard your data into lots of Litestream databases, and use a central DQLite/RQLite replicated cluster to keep track of which data goes in which shard.
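
                                        A rough sketch of the directory-lookup half of that idea (all names hypothetical; the central cluster is modeled here as just another SQLite handle):

                                          import sqlite3

                                          # Central "directory": in practice the DQLite/RQLite cluster,
                                          # mapping each tenant to its Litestream-replicated shard file.
                                          directory = sqlite3.connect("directory.db")
                                          directory.execute(
                                              "CREATE TABLE IF NOT EXISTS shards (tenant TEXT PRIMARY KEY, db_path TEXT)")
                                          directory.execute(
                                              "INSERT OR REPLACE INTO shards VALUES ('acme', 'acme.db')")
                                          directory.commit()

                                          def shard_for(tenant):
                                              row = directory.execute(
                                                  "SELECT db_path FROM shards WHERE tenant = ?", (tenant,)).fetchone()
                                              if row is None:
                                                  raise KeyError(f"no shard registered for {tenant}")
                                              # Each shard is an ordinary SQLite file, replicated by Litestream.
                                              return sqlite3.connect(row[0])

                                          conn = shard_for("acme")  # open the tenant's shard and query it as usual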

                                        1. 1

                                          This would be very interesting to build. It’s a type of architecture that I’ve been considering for Notion’s user-facing database feature, but unfortunately these systems don’t seem to support heterogeneous database schema management.

                                        2. 2

                                          When is data committed to S3? I assume it happens asynchronously since S3 has unreliable latency at p99 and Ben is claiming very low latency. If so, you’ll have data loss if your server ever does go down, even for maintenance.

                                          Edit: Yes, it’s susceptible, but not as badly as I was thinking. (Docs). The data gets replicated to S3 as part of the write-ahead log checkpointing process, so yes, that’s async. You’re still expected to open the SQLite DB against a local file. So as long as the host or disk aren’t completely trashed you shouldn’t lose any data. But if you did, you’d lose all transactions that haven’t been checkpointed.

                                          This is probably reasonable given how the database is presented. Still, users should be aware of the potential for data loss.

                                          One big caveat: it places a global lock on the database while it replicates to S3. So if S3 is experiencing latency degradation, you’ll see significant latency spikes on write requests. In bad cases, I could see enough requests backing up to cause an outage in your service.
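
                                          To make the mechanism concrete, here is plain SQLite’s WAL/checkpoint cycle that the replication piggybacks on (not Litestream’s internals; file names are arbitrary):

                                            import sqlite3

                                            conn = sqlite3.connect("app.db")
                                            conn.execute("PRAGMA journal_mode=WAL")   # writes land in app.db-wal first
                                            conn.execute("CREATE TABLE IF NOT EXISTS t (x)")
                                            conn.execute("INSERT INTO t VALUES (1)")
                                            conn.commit()   # durable locally, but so far only in the WAL

                                            # A checkpoint folds WAL frames back into the main database file.
                                            # Litestream ships WAL frames to S3 around this cycle, so writes made
                                            # after the last shipped frame are the window you could lose if the
                                            # host and disk are destroyed.
                                            conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")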