1. 18
  1. 13

    Huh? For 10 users you need a separate database and for 100 users you need 2 web servers??? I’m not sure if those numbers are meant to be taken literally, but they’re off by orders of magnitude.

    “We’re going to take our new photo sharing website, Graminsta, from 1 to 100k users.”

    Yeah this seems totally wrong. I think back in the day imgur did this all on one box, and that’s how they were able to economically provide free photo hosting for so long (it seems to have gone south now, but in the early days it was very good).

    1. 8

      At $big_streaming_company we got to ~1M users with a few servers before needing to do anything clever.

      1. 4

        Totally agreed – though it does depend a bit on what he means by “N users”. Is it “N users who are using your site flat-out at the same time” or “N users in your database, but only 1 or 2 use the site once a month or so”? Those are very different, but even for the first option I suspect he’s still out by an order of magnitude.

        At a previous company we didn’t really have users, but we handled millions of page views on a single, relatively small database and a monolithic application with about 10 server nodes. We could handle about 300 requests per second if I recall correctly.

        1. 3

          I think it really depends on what you’re doing.

          At home, I just made https://droid.cafe/starlink. It’s running off a free f1-micro instance from Google. Each page load takes roughly 40 microseconds of CPU time and is just serving up a few static files. Eyeballing it, I think I could probably get rid of Cloudflare (which I added out of curiosity, not necessity) and still handle more than 10m daily users (computers are really fast these days). By upgrading to a faster instance and minimizing the HTML/CSS/JS, I bet I could push that to well over a billion a day. Of course, at those scales I’d be paying a lot for geocoding services, but it would be 1 server managing to serve a webpage to most of the world every day.
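
          As a rough sanity check on the 40-microsecond figure above, here is a back-of-envelope sketch. It assumes CPU time is the only limit and ignores network, TLS, and kernel overhead; the function name and the `cpu_share` parameter are made up for illustration.

          ```python
          SECONDS_PER_DAY = 24 * 60 * 60


          def max_page_loads_per_day(cpu_seconds_per_load: float, cpu_share: float = 1.0) -> float:
              """Upper bound on daily page loads if CPU time were the only limit.

              cpu_share is the fraction of one core available to the server
              (an assumption; a shared-core instance gets less than 1.0).
              """
              return cpu_share * SECONDS_PER_DAY / cpu_seconds_per_load


          # 40 microseconds of CPU per page load, one fully used core.
          loads = max_page_loads_per_day(40e-6)
          print(f"{loads:,.0f} page loads/day")  # roughly 2.16 billion
          ```

          Even at a small fraction of one core, the ceiling stays far above 10m loads a day, so the limit in practice would be something other than CPU.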

          At work I don’t know that anyone has counted, but I bet we have roughly 1 production server per user (user defined as someone who directly interacts with our apps). That’s because each of our users gives our servers a lot of work to do, and our servers provide that user’s company a lot of value.

          Anyways, the real point is to not focus on the numbers.

          1. 2

            Yeah, we’ve got a particularly slow Rails app serving 6 million users/day. It runs on 6 AWS instances, but we could just buy a server 4x or 8x bigger and skip the load balancing.

            1. 1

              At the bottom of the article it mentions:

              “This post was inspired by one of my favorite posts on High Scalability.”

              If you read the High Scalability post, you can see that a lot of the numbers given were ranges, not hard limits. In “Scaling to 100k Users”, the headings are:

              • 1 User: 1 Machine
              • 10 Users: Split out the Database Layer
              • 100 Users: Split Out the Clients
              • 1,000 Users: Add a Load Balancer.
              • 10,000 Users: CDN
              • 100,000 Users: Scaling the Data Layer - caching, read replicas

              High Scalability’s article has:

              • 1 User - 1 machine
              • Users > 10 - Separate DB and app server
              • Users > 100 - Store data on RDS
              • Users > 1000 - Elastic Load Balancer with 2 instances, multi-AZ for app servers, standby database in another AZ
              • Users > 10,000s - 100,000s - Read replica, CDN, maybe caching, autoscaling
              • Users > 500,000+ - Autoscaling groups, caching, monitoring, automation, decouple infrastructure, maybe Service Oriented Architecture (SOA)
              • Users > 1,000,000+ - Requires Multi-AZ, Load balancing between tiers, SOA, S3+CloudFront for static assets, caching in front of DB.
              • Users > 10,000,000+ - Data partitioning/sharding, moving some data to specialised DBs
              • Users > 11 Million - More SOA, Multi-region, deep analysis of entire stack.

              If you read Scaling to 100k Users with a > instead of an = then it is slightly more understandable (though splitting out the clients at 100 users doesn’t make a lot of sense to me).

              1. 6

                Right now - today - you can comfortably serve > 11 million users off one server in a good colo.

                There are smaller systems with decade+ uptime using this approach.

                IMO most of these practices are about altering your risk profile rather than your performance profile (which is also important!). The risk profile of a single server in a single DC is unacceptable for many, many businesses.
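
                To put a number on the risk-profile point, here is a minimal sketch of how redundancy changes availability. It assumes replica failures are independent and that any one surviving replica can serve all traffic, which real deployments only approximate; the 99% figure is hypothetical.

                ```python
                def combined_availability(single: float, replicas: int) -> float:
                    """Probability that at least one of `replicas` servers is up.

                    Assumes independent failures (an idealisation).
                    """
                    if not 0.0 <= single <= 1.0:
                        raise ValueError("single must be between 0 and 1")
                    if replicas < 1:
                        raise ValueError("replicas must be at least 1")
                    return 1.0 - (1.0 - single) ** replicas


                # Hypothetical server that is up 99% of the time.
                for n in (1, 2, 3):
                    print(n, f"{combined_availability(0.99, n):.6f}")
                ```

                The throughput of the single box does not change here; only the chance of being completely down does.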

            2. 13

              “One of the easiest ways to get more out of our database is by introducing a new component to the system: the cache layer.”

              In my experience, caching should only be used to reduce latency, NOT to reduce database load. If your database can’t operate without an external caching layer (Redis/memcached), you’re in big trouble when your cache hit ratio drops (restarting instances, cache key version update, etc.). It’s very easy to fall into this trap, and getting out of it requires heavy architecture changes.
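
              A minimal sketch of why a falling hit ratio hurts so much, assuming every cache miss turns into exactly one database query. The traffic numbers and the function name are hypothetical.

              ```python
              def db_queries_per_second(total_qps: float, cache_hit_ratio: float) -> float:
                  """Queries that miss the cache and fall through to the database."""
                  if not 0.0 <= cache_hit_ratio <= 1.0:
                      raise ValueError("cache_hit_ratio must be between 0 and 1")
                  return total_qps * (1.0 - cache_hit_ratio)


              # Hypothetical site doing 10,000 reads per second.
              warm = db_queries_per_second(10_000, 0.95)  # steady state
              cold = db_queries_per_second(10_000, 0.50)  # just after a cache restart
              print(f"warm: {warm:.0f} qps, cold: {cold:.0f} qps, {cold / warm:.1f}x more database load")
              ```

              If the database was sized for the warm number, the cold number is an outage, which is the trap described above.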

              1. 1

                This is a really interesting (and I’m guessing hard-won) perspective.

                I’ve never been bitten by this, but I’ve also never hosted something where a bit of jank during warm-up was a problem.

              2. 4

                All of this is effectively what AppEngine provided out of the box 12 years ago.

                1. 3

                  “We’re also going to continue to bump up against limitations on the data layer. This is when we are going to want to start looking into partitioning and sharding the database.”

                  Weird that sharding is only mentioned in the final section.

                  “These both require more overhead, but effectively allow the data layer to scale infinitely.”

                  I disagree. Even sharding runs into problems of hot partitions/keys and data locality.

                  I’d say, in fact, that all of these solutions are mechanisms to solve data locality. CPUs are infinity times faster than memory / network / storage these days. Therefore scaling is a “simple” matter of putting compute beside all the relevant data.

                  Simple. 🙄🤬😭

                  1. 3

                    “Weird that sharding is only mentioned in the final section.”

                    Sharding solves problems but makes life harder for ops people and, sometimes, provides no benefit. Say you shard by the first letter of the username, guess wrong, and only users starting with M get popular: your sharding system is basically wasted while everything still falls over.
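
                    The "users starting with M" case can be sketched by comparing two shard-key choices on a skewed workload. The function names, the shard count of 4, and the generated usernames are all made up for illustration.

                    ```python
                    import hashlib
                    from collections import Counter


                    def shard_by_first_letter(username: str, num_shards: int) -> int:
                        """Range-style sharding: split the alphabet evenly across shards."""
                        index = ord(username[0].lower()) - ord("a")
                        index = min(max(index, 0), 25)  # clamp non-letters into range
                        return index * num_shards // 26


                    def shard_by_hash(username: str, num_shards: int) -> int:
                        """Hash sharding: a stable hash spreads similar names across shards."""
                        digest = hashlib.sha256(username.encode("utf-8")).digest()
                        return int.from_bytes(digest[:8], "big") % num_shards


                    # Hypothetical skewed workload: every active user has a name starting with "m".
                    users = [f"m_user{i}" for i in range(10_000)]
                    by_letter = Counter(shard_by_first_letter(u, 4) for u in users)
                    by_hash = Counter(shard_by_hash(u, 4) for u in users)
                    print("first-letter:", dict(by_letter))  # everything lands on one shard
                    print("hash:", dict(by_hash))            # roughly even split
                    ```

                    Hashing evens out this kind of skew, but a single very hot key still lands on one shard, so it does not remove the hot-key problem mentioned in the parent comment.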

                    1. 1

                      There’s a new generation of databases that have reduced toil for our ops people anyway. I’m thinking about things like Spanner, DynamoDB, and TiDB. All with different models, but all promise horizontal scalability in their own way.

                  2. 3

                    Even though this has been possible on PaaS solutions for a while, I’d say the article is still valuable.

                    I find it a good rule of thumb to understand why the stack beneath is built the way it is. Maybe a limitation is an enabler for future growth, or it might just be a limitation of the PaaS platform, as the platforms often strive to be generic and usable for most workloads. Without knowing the reasoning behind it, you couldn’t know if your use case could benefit from something less generic.

                    Another good piece of advice is to scale as you get users. Maybe the service isn’t gaining any traction because your app was a month or two too late to market because of scaling ahead. Or excessive scaling could kill the startup due to high costs.

                  3. [Comment removed by author]

                    1. 3

                      While (database) sharding is useful for scaling, I’ve found it even more useful for limiting the failure domain (aka blast radius). A bad DB migration only takes down one of the n shards. Rebooting the database to apply some config changes only affects 1/n of the users at a time.

                      It’s probably less efficient resource-wise, as you’ll eventually waste some CPU/Memory/Disk, but it offers a lot of flexibility for day to day operations with an upper bound on the risks of such operations.