1. 12

  2. 8

    Relevant historical context: this page was published around the turn of the millennium; the Wayback Machine’s first capture of it is from May 1999, at which point it said:

    You can buy a 500MHz machine with 1 gigabyte of RAM and six 100Mbit/sec Ethernet cards for $3000 or so.

    Dan kept updating it for a while, but eventually stopped, I guess because people actually solved the problem. There’ve been a handful of updates since 2003, including one in 2011 to mention nginx.

    At KnowNow in 2000, we were in fact able to hit 10k concurrent Comet connections to a single thttpd process on a gigahertz/gigabyte machine, using select(); Robert Thau did the work to add Comet to thttpd. You could do this more efficiently with multiple different processes each handling a thousand or so connections. But you couldn’t do it with 10k separate processes because, among other things, the Linux scheduler wasn’t efficient when a significant number of threads were runnable. I think today you could very likely handle 10k concurrent connections in 10k separate processes on Linux with httpdito (see README), although you’d have to bump its limit from 2048; but since httpdito doesn’t do Comet, that’s probably not necessary.

    When KnowNow got Kleiner-funded, we hired a bunch of people, including a VP Eng who moved us to Solaris in 2001 (he thought Linux still wasn’t serious; Satish Ramakrishnan, a true visionary!) and a guy who wrote a new multithreaded server in C++. It had stability problems, and he used poll(), which scales about as poorly as select(); Solaris had only just gotten /dev/poll in February 2001, which was what was needed for better-than-select() scalability. But multiple processes with separate fd spaces could have worked. I didn’t have a chance to help fix that because Satish fired me.

    Java still didn’t have any better option for handling lots of concurrent connections than running one thread per connection. A popular Java benchmark of the time, VolanoMark, measured the VolanoChat server’s scalability: how many concurrent chat clients your JVM and machine could handle. Typical numbers were in the 100–200 range. java.nio came out the following year, with J2SE 1.4, providing a more select()-like interface.

    Maybe there’s somebody around who’s implemented big IM servers who can comment?

    Nowadays you can get this stuff prepackaged: libevent, libev, and Twisted come with reactors out of the box for epoll, kqueue, and Win32 I/O Completion Ports, all of which scale well to lots of connections, plus select() or poll() for backwards compatibility; and I assume there are similar java.nio options in Java. There are still things you may have to tweak, like maybe turning off SYN cookies, decreasing TCP buffer sizes, and increasing fd ulimits, but basically C10K is no longer the hard problem that it was when Dan wrote this page.
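
    As an illustration of the java.nio side, the core of such a server is a single-threaded Selector loop, which on Linux is typically backed by epoll. Here is a minimal, untuned sketch; the port and the naive echo behavior are just illustrative:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;

    // Minimal single-threaded java.nio accept/echo loop: one Selector
    // multiplexes every connection, so there is no thread per connection.
    public class SelectorSketch {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(8080));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            ByteBuffer buf = ByteBuffer.allocateDirect(4096);
            while (true) {
                selector.select();                     // block until some fd is ready
                for (SelectionKey key : selector.selectedKeys()) {
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        buf.clear();
                        if (client.read(buf) < 0) {    // peer closed the connection
                            key.cancel();
                            client.close();
                            continue;
                        }
                        buf.flip();
                        client.write(buf);             // naive echo; ignores short writes
                    }
                }
                selector.selectedKeys().clear();       // done with this batch of events
            }
        }
    }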

    So where are we now? Trying to stuff ten million connections into one machine, rather than ten thousand.

    1. 2

      Related: https://github.com/nemasu/asmttpd

      It’s fun that people do these types of projects. (saw this on reddit.com/r/tinycode)

      1. 1

        @kragen, the README of your project httpdito is a must-read! You should be sleep-deprived more often :) Before reading this, I had no idea the Linux kernel could schedule so many processes. Now I understand the problem is not the scheduling; it’s the memory consumed by each process. Your experiment proves it very well.

        1. 1

          I’m glad you’re enjoying it! But in my tests httpdito probably very rarely, if ever, actually has 2048 child processes running, so my experiment would need to be more rigorous to really prove that. It’s more a test of process-spawning speed than of the scheduler.

          1. 1

            The README was clear about it; I misread it. Thanks for the clarification!

      2. 3

        Classic read. I think Java has really done a disservice to a lot of developers by taking so long to adopt the ideas in it. Even now, AFAIK, most Java frameworks still prefer some kind of thread pool, although the granularity of what each thread does is single events rather than a whole process. This is a place where models like Erlang, Oz, and even what C# is offering are really great for developers, although the more mainstream a language is, the longer it seems to take to adopt these solutions.

        1. 1

          Threads are getting cheaper and cheaper as schedulers improve. Thus, it’s not a big deal anymore to start up a few thousand threads with a smaller stack size in Java and actually serve short requests with them. Papers in 2003 even suggested this was viable: https://www.usenix.org/legacy/events/hotos03/tech/vonbehren.html I’m not sure whether or not “web socket”-like requests fit a threading model – I’d speculate that fully event-driven would be better there, though.

          I’m not sure I understand the “single events” vs “whole process” argument. While you can do things with a series of thread pools, and even use non-blocking IO to do so (see SEDA), a better mechanism would be to just let the thread do everything and let the scheduler manage when it happens.

          In practice, you’d probably want to have a pool of “acceptor” threads that do nothing but wait on accept() and queue the connected socket to be handled by a pool that’s ready to grow to some high number of threads. But barring synchronization problems, which can be dealt with by using lock-free algorithms or just highly concurrent data structures, I’m not sure this is a big problem. (A rough sketch of this follows at the end of this comment.)

          edit - link to SEDA
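
          As a rough sketch of the two ideas above (an acceptor loop handing connections off to a pool of threads created with a small requested stack), something like this would do; the pool size, stack size, port, and canned response are arbitrary assumptions, not recommendations:

          import java.io.IOException;
          import java.net.ServerSocket;
          import java.net.Socket;
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.SynchronousQueue;
          import java.util.concurrent.ThreadFactory;
          import java.util.concurrent.ThreadPoolExecutor;
          import java.util.concurrent.TimeUnit;

          // One acceptor thread blocks in accept() and hands each connection to a
          // pool of worker threads whose stacks are requested at 256 KiB, so a few
          // thousand of them stay affordable.
          public class AcceptorPoolSketch {
              public static void main(String[] args) throws IOException {
                  ThreadFactory smallStack = r ->
                          new Thread(null, r, "worker", 256 * 1024);    // small requested stack
                  ExecutorService workers = new ThreadPoolExecutor(
                          100, 10_000,                   // grow on demand up to 10k threads
                          60, TimeUnit.SECONDS,
                          new SynchronousQueue<>(),      // direct hand-off, no request queue
                          smallStack);

                  try (ServerSocket listener = new ServerSocket(8080)) {
                      while (true) {
                          Socket conn = listener.accept();        // acceptor does nothing else
                          workers.submit(() -> handle(conn));     // one worker per live connection
                      }
                  }
              }

              static void handle(Socket conn) {
                  try (Socket c = conn) {
                      c.getOutputStream().write("HTTP/1.0 200 OK\r\n\r\nhello\r\n".getBytes());
                  } catch (IOException ignored) {
                  }
              }
          }

          If all 10k workers are busy, submit() throws RejectedExecutionException for the new connection, which is exactly the kind of back-pressure/synchronization detail mentioned above.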

          1. 3

            Threads are getting cheaper and cheaper as schedulers improve. Thus, it’s not a big deal anymore to start up a few thousand threads with a smaller stack size in Java and actually serve short requests with them

            Even so, you’re capping out at a few thousand requests per some unit of time, which isn’t even the 10k the linked article is aiming at (depending on what you mean by ‘short-lived request’).

            a better mechanism would be to just let the thread do everything and let the scheduler manage when it happens

            If the discussion is scalability, I think all the evidence points to mapping an entire request to a thread being less scalable than something like an Erlang implementation.

            1. 1

              If the discussion is scalability, I think all the evidence points to mapping an entire request to a thread being less scalable than something like an Erlang implementation.

              I’d love to see more evidence of this, as I continually find articles and papers debunking it. It seems obvious that event-driven programming is a better approach, but in practice people keep suggesting otherwise, over and over, if you’re using a real language. In Python, it’s obviously better to prefer event-driven over threading, because threads in Python don’t work very well.

              And, again, I’m mostly talking about a request-driven application, not a chat application where persistent connections are key.

              However, if you see @kragen’s comment above, that suggests that Linux could probably do 10K “comet”-style connections these days with 10k processes. Still, I maintain that I’m not sure Comet is the relevant case here.

              1. 1

                Papers in 2003 even suggested this was viable: https://www.usenix.org/legacy/events/hotos03/tech/vonbehren.html

                The threads discussed in this paper are not OS threads, so they’re not comparable to the current Java solutions I think you were using as an example.

                that suggests that Linux could probably do 10K “comet”-style connections these days with 10k processes

                That is still rather small. The highest-concurrency benchmark I’ve seen in Java is 500k connections, from Urban Airship, implemented using NIO:

                http://urbanairship.com/blog/2010/08/24/c500k-in-action-at-urban-airship/

                I’m not precisely sure we are talking about the same thing, but I am specifically disagreeing with you on this claim:

                a better mechanism would be to just let the thread do everything and let the scheduler manage when it happens.

                As counter evidence I submit:

                • Erlang
                • Go
                • Nginx
                • Akka
                • .Net w/ Async

                All of which separate the substrate on which the request is processed (a thread) from the unit of work done at any single point in time. In Erlang it’s reductions; in Go/Akka/.Net I believe the unit is anything between blocking I/O calls.

                Unfortunately I don’t have any benchmarks handy to provide you with.

                Now, if you are arguing that this is the distinction between long-lived connections and short ones, I still disagree. Even if you have short-lived connections, I have seen no evidence that a modern OS can handle 1 million concurrent OS threads or processes.

                1. 1

                  The threads discussed in this paper are not OS threads, so they’re not comparable to the current Java solutions I think you were using as an example.

                  You’re right, they were user threads. However, I’m betting (and I don’t have proof of this) that 20-30K native threads wouldn’t be a problem on a modern JVM. It’s getting to the point where I want to benchmark this…
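
                  Something like the following hypothetical micro-benchmark would settle the thread-startup half of that bet (I haven’t run it; the 20,000 default and the 256 KiB stack request are arbitrary):

                  import java.util.concurrent.CountDownLatch;

                  // Start N mostly-idle native threads with a small requested stack
                  // and time how long it takes until all of them are alive at once.
                  public class ThreadSpawnBench {
                      public static void main(String[] args) throws InterruptedException {
                          int n = args.length > 0 ? Integer.parseInt(args[0]) : 20_000;
                          CountDownLatch started = new CountDownLatch(n);
                          CountDownLatch release = new CountDownLatch(1);
                          long t0 = System.nanoTime();
                          for (int i = 0; i < n; i++) {
                              // 256 KiB requested stack; the JVM may round this up or ignore it.
                              Thread t = new Thread(null, () -> {
                                  started.countDown();
                                  try { release.await(); } catch (InterruptedException ignored) {}
                              }, "worker-" + i, 256 * 1024);
                              t.start();
                          }
                          started.await();               // every thread is now alive simultaneously
                          System.out.printf("started %d threads in %.2f s%n", n,
                                  (System.nanoTime() - t0) / 1e9);
                          release.countDown();           // let them all exit
                      }
                  }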

                  Now, if you are arguing that this is the distinction between long-lived connections and short ones, I still disagree. Even if you have short-lived connections, I have seen no evidence that a modern OS can handle 1 million concurrent OS threads or processes.

                  c10k is not about 1 million connections, though.

                  1. 1

                    c10k is not about 1 million connections, though

                    I know; we’re talking about scalability in general, I believe.

                    1. 1

                      For the systems you’ve mentioned (Erlang, Go, Nginx), I’ve never seen benchmarks for anything other than HTTP, and this one is the most comprehensive I know of. There are many Java-powered setups near the top, including ones with Jetty. Jetty is unique because it can support NIO or plain old synchronous IO depending on the workload. The plaintext benchmark uses ServerConnector, which seems to use NIO but utilizes a thread pool for the work (as Jetty seems to always do). I don’t know anything about Resin, which is the other high-performing Java container; it’s possible it utilizes a similar setup.
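
                      If I remember the Jetty 9-era embedded API right, the setup being described looks roughly like this (pool size and port are arbitrary): the NIO ServerConnector handles the socket I/O with a few selector threads, while per-request work runs on the shared QueuedThreadPool:

                      import javax.servlet.http.HttpServletRequest;
                      import javax.servlet.http.HttpServletResponse;
                      import org.eclipse.jetty.server.Request;
                      import org.eclipse.jetty.server.Server;
                      import org.eclipse.jetty.server.ServerConnector;
                      import org.eclipse.jetty.server.handler.AbstractHandler;
                      import org.eclipse.jetty.util.thread.QueuedThreadPool;

                      // Embedded Jetty: NIO connector for socket I/O, QueuedThreadPool
                      // for the per-request work.
                      public class JettySketch {
                          public static void main(String[] args) throws Exception {
                              QueuedThreadPool pool = new QueuedThreadPool(200);        // worker threads
                              Server server = new Server(pool);
                              ServerConnector connector = new ServerConnector(server);  // NIO-based
                              connector.setPort(8080);
                              server.addConnector(connector);
                              server.setHandler(new AbstractHandler() {
                                  @Override
                                  public void handle(String target, Request baseRequest,
                                                     HttpServletRequest request,
                                                     HttpServletResponse response) throws java.io.IOException {
                                      response.setStatus(200);
                                      response.getWriter().println("hello");
                                      baseRequest.setHandled(true);
                                  }
                              });
                              server.start();
                              server.join();
                          }
                      }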

                      I don’t know of other benchmarks that pit pure sync against async IO, so it’s a bit difficult to discuss further. I’ll leave it with the point I originally made: native threads and sync IO can solve the C10K problem.