1. 27

I have spent some time carefully evaluating my options for the right platform for my use case. Would love to hear your feedback and thoughts.

  1. 11

    When you mentioned channel costs, I wondered if there was communication via unbuffered channels, which can lead to traffic jams since the sender can’t proceed ‘til each recipient is ready. Looking at the old chat_handler.go that doesn’t seem to be the case, though. The three goroutines per connection thing isn’t without precedent either; I think at least the prototypes of HTTP/2 support for the stdlib were written that way.
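
    To illustrate the difference (a toy sketch, not the author's code): with an unbuffered channel, every send blocks until a receiver is ready, while a buffered one absorbs bursts.

    ```go
    package main

    import "fmt"

    func main() {
    	done := make(chan struct{})
    	unbuffered := make(chan int)
    	go func() {
    		// The sender below cannot proceed until this receive happens.
    		fmt.Println("recv", <-unbuffered)
    		close(done)
    	}()
    	unbuffered <- 1 // blocks until the goroutine is ready to receive
    	<-done

    	// A buffered channel lets sends proceed until the buffer fills,
    	// which is what a 32-message buffer buys you.
    	buffered := make(chan int, 32)
    	buffered <- 1 // returns immediately; no receiver needed yet
    	fmt.Println("buffered len:", len(buffered))
    }
    ```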

    It looks like maybe the socketReaderLoop could be tied in with ChatHandler.Loop(): where socketReaderLoop communicates with Loop on a channel, just inline the code that Loop currently runs in response, then call socketReaderLoop at the end of Loop instead of starting it asynchronously. You lose the 32-message buffer, but the end-user-facing behavior ought to be tolerable. (If a user fills your TCP buffers, seems like their problem and they can resend their packets.) However, saving one goroutine per connection probably isn’t a make-or-break change.
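
    A rough sketch of that restructuring, using hypothetical types (the real ChatHandler and connection details will differ): Loop ends by calling the reader loop synchronously instead of starting it with `go`, and the per-message handling that used to run on channel receipt is invoked inline.

    ```go
    package main

    import (
    	"bufio"
    	"fmt"
    	"strings"
    )

    // ChatHandler is a stand-in for the real type; the Scanner
    // mocks the websocket connection.
    type ChatHandler struct {
    	conn *bufio.Scanner
    }

    // handle holds whatever Loop used to do when a message arrived
    // on the channel from socketReaderLoop.
    func (h *ChatHandler) handle(msg string) {
    	fmt.Println("got:", msg)
    }

    // Loop does its setup, then runs the reader loop synchronously
    // instead of `go h.socketReaderLoop()`: one goroutine per
    // connection rather than two, at the cost of the message buffer.
    func (h *ChatHandler) Loop() {
    	h.socketReaderLoop()
    }

    func (h *ChatHandler) socketReaderLoop() {
    	for h.conn.Scan() {
    		h.handle(h.conn.Text()) // inline call, no channel hop
    	}
    }

    func main() {
    	h := &ChatHandler{conn: bufio.NewScanner(strings.NewReader("hello\nworld\n"))}
    	h.Loop()
    }
    ```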

    Since you talk about memory/thrashing at the end, one of the more promising possibilities would be (or would have been) to take a memory profile and see where those allocs come from. Relatedly, Go is bad at respecting a strict memory limit, and its defaults lean towards using RAM to save GC CPU: the steady state with GOGC=100 is around 50% of the peak heap size being live data. So you could start thrashing with 512MB of RAM once you pass 256MB of live data. (And really you want to keep your heap goal well under 512MB to leave room for kernel stuff, other processes, the off-heap mmapped data from BoltDB, and heap fragmentation.) If you're thrashing, GC'ing more often might be a net win, e.g. GOGC=50 to collect twice as eagerly as the default. Finally, and not unrelated, Go's collector isn't generational, so most other collectors should outdo it on throughput tests.
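
    A minimal sketch of both knobs (assuming nothing about the actual code): set the GC target programmatically, the equivalent of running with GOGC=50, and dump a heap profile that `go tool pprof` can use to find the allocation sites.

    ```go
    package main

    import (
    	"fmt"
    	"os"
    	"runtime/debug"
    	"runtime/pprof"
    )

    func main() {
    	// Collect when the heap grows 50% past live data instead of 100%,
    	// roughly halving peak heap at the cost of more GC CPU.
    	debug.SetGCPercent(50)

    	// Snapshot of live allocations; inspect with `go tool pprof heap.out`.
    	f, err := os.Create("heap.out")
    	if err != nil {
    		panic(err)
    	}
    	if err := pprof.WriteHeapProfile(f); err != nil {
    		panic(err)
    	}
    	f.Close()
    	os.Remove("heap.out") // cleanup for this demo
    	fmt.Println("wrote heap profile")
    }
    ```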

    Maybe I’m showing myself not to be a true perf fanatic, but 1.5K connections on a Pi also doesn’t sound awful to me, even if you can do better. :) It’s a Pi!

    1. 2

      Thank you for such a detailed analysis, and for looking into the code before you commented :) positive and constructive feedback really helps. I have received a great amount of feedback and will definitely try your tips. BoltDB comes up every time, and I think it contributes to memory usage as well. Other suggestions include using a fixed number of workers and channels, watching for backlog building up, and fixing my serialization. I will definitely update my benchmark code and test it with the new fixes, and if I feel the code is clean enough I would definitely love to move back.

      1. 3

        Though publicity like this is fickle, you might get a second hit after trying a few things and then explicitly being like “hey, here’s my load test, here are the improvements I’ve done already; can you guys help me go further?” If you don’t get the orange-website firehose, you at least might hear something if you post to golang-nuts after the Thanksgiving holiday ends or such.

        Looking around more, I think groupInfo.GetUsers is allocating a string for each name each time it's called, and then when you use the string to fetch the object there's a conversion back to []byte (if escape analysis doesn't catch it), so that's a couple of allocs per user per message. Just being O(users*messages) suggests it could be a hotspot. You could 'downgrade' from the Ctrie to an RWMutex-guarded map (joins/leaves may wait more often, but reads should be fast-ish), sync.Map, or (shouldn't be needed, but if you were pushing scalability) a sharded RWMutex-guarded map. But before you put in time trying stuff like that, memprofile is the principled/right way to approach alloc stuff (and profile for CPU stuff): figure out what's actually dragging you down.
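
        A sketch of the RWMutex-guarded map idea, with hypothetical names (the real group registry's API will differ); the point is that fan-out iterates under a cheap read lock without allocating a fresh name list per message.

        ```go
        package main

        import (
        	"fmt"
        	"sync"
        )

        // userSet is a hypothetical stand-in for the Ctrie-backed group
        // registry: a plain map guarded by an RWMutex. Reads (fan-out on
        // every message) take the cheap read lock; joins/leaves take the
        // write lock.
        type userSet struct {
        	mu    sync.RWMutex
        	users map[string]struct{}
        }

        func newUserSet() *userSet {
        	return &userSet{users: make(map[string]struct{})}
        }

        func (s *userSet) Join(name string) {
        	s.mu.Lock()
        	s.users[name] = struct{}{}
        	s.mu.Unlock()
        }

        func (s *userSet) Leave(name string) {
        	s.mu.Lock()
        	delete(s.users, name)
        	s.mu.Unlock()
        }

        // ForEach visits each user under the read lock, avoiding the
        // per-call slice-of-names allocation a GetUsers-style API implies.
        func (s *userSet) ForEach(fn func(name string)) {
        	s.mu.RLock()
        	defer s.mu.RUnlock()
        	for name := range s.users {
        		fn(name)
        	}
        }

        func main() {
        	set := newUserSet()
        	set.Join("alice")
        	set.Join("bob")
        	set.Leave("bob")
        	n := 0
        	set.ForEach(func(string) { n++ })
        	fmt.Println("users:", n)
        }
        ```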

        True that there are likely lighter ways to do the message log than Bolt. Just files of newline-separated JSON messages may get you far, though I don't know what other functionality you support.
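
        For instance (a toy sketch with a made-up schema, not the author's message format), `encoding/json`'s Encoder already writes one object per line, so append-and-scan is about all the "database" you need:

        ```go
        package main

        import (
        	"bufio"
        	"encoding/json"
        	"fmt"
        	"os"
        )

        // msg is a made-up message schema for this sketch.
        type msg struct {
        	User string `json:"user"`
        	Text string `json:"text"`
        }

        func main() {
        	// Append-only log: one JSON object per line.
        	f, err := os.OpenFile("messages.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
        	if err != nil {
        		panic(err)
        	}
        	enc := json.NewEncoder(f) // Encode appends a trailing newline
        	enc.Encode(msg{User: "alice", Text: "hi"})
        	f.Close()

        	// Reading the history back is a line-at-a-time scan.
        	r, err := os.Open("messages.log")
        	if err != nil {
        		panic(err)
        	}
        	sc := bufio.NewScanner(r)
        	for sc.Scan() {
        		var m msg
        		if err := json.Unmarshal(sc.Bytes(), &m); err != nil {
        			panic(err)
        		}
        		fmt.Println(m.User, m.Text)
        	}
        	r.Close()
        	os.Remove("messages.log") // cleanup for this demo
        }
        ```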

        FWIW, I also agree with the commenter on HN saying that Node/TypeScript is a sane approach. (I’m curious about someday using TS in work’s frontend stuff.) Not telling you what to use; trying to get Go to do things is just a hobby of mine, haha. :)

    2. 5

      TBH, it feels like you were looking only to change language, not architecture or paradigm, here. You mentioned the disruptor pattern briefly, then just moved on when you noted that it had no mature implementations. Why not make one? The reason I say this is that you have multiple budgets to balance: cognitive and performance. It is hard to get something simple that also scales really well on constrained hardware.

      FWIW, I think an event loop is the right thing here. I’d probably reach for Erlang over JS if you want high concurrency, however. Also, why not C?

      Edit: I think OTP only costs ~68 bytes per process.

      1. 1

        Changing language was one of the last choices, because I had to rewrite the complete logic again; it's a huge undertaking to redo everything in a separate language. I could have implemented a disruptor, but I wanted to stay focused on the problem at hand and get results, rather than going into library-implementation mode. C/C++ suffers from the same problem of my having to write a complete event loop to hook up websockets myself; I explored options like uWebSockets (using that right now) and Asio in Boost, for example, and after writing basic pubsub I felt it was too much code for simple pubsub, and the bang for the buck would be low. I will definitely do a more detailed dive into OTP.

      2. 3

        I would be very curious to see if there is a way to design a Go version of the pubsub server that doesn't require a goroutine per socket.

        1. 6

          I have actually tried hard to do that; my conclusion was that you are effectively writing an event-loop-based system in that case. So if I have to choose an event loop system, why not choose one of the best implementations out there? Just as Node.js is built around an event loop, Go is built around goroutines and channels. Even for synchronization, people reach for channels first. BEAM (Erlang/Elixir) has a similar philosophy with processes and messages.

          1. 3

            At that point, why write it in Go? It ceases to be idiomatic.

            1. 0

              What's the argument against having a goroutine per socket? I was under the impression that, because many goroutines are multiplexed onto each OS thread, you can have a ton of goroutines without much performance penalty.

              At what scale does an event loop become more efficient than goroutines?

              1. 2

                Did you not read the article we are discussing?

            2. 4

              Have you tried Ada? I have never looked at it myself, but that article[1] posted today looks very interesting. And there seems to be a well-supported web server with WS support[2].

              [1] http://blog.adacore.com/theres-a-mini-rtos-in-my-language [2] https://docs.adacore.com/aws-docs/aws/

              1. 4

                TBH I can't believe Ada is still alive. I thought it was something we covered in a Theory of Programming Languages course, and that nothing other than obsolete systems used it. Would give it a shot for sure!

                1. 4

                  This article trying to use it for audio applications will give you a nice taste of the language:

                  http://www.electronicdesign.com/embedded-revolution/assessing-ada-language-audio-applications

                  This Barnes book shows how it’s systematically designed for safety at every level:

                  https://www.adacore.com/books/safe-and-secure-software

                  Note: The AdaCore website has a section called Gems that gives tips on a lot of useful ways to apply Ada.

                  Finally, if you do Ada, you get the option of using Design-by-Contract (built into Ada 2012) and/or the SPARK language. One gives you clear specifications of program behavior that take you right to the source of errors when fuzzing or such. The other is a smaller variant of Ada that integrates with automated theorem provers to try to prove your code free of common errors in all cases, versus just the ones you think of, as with testing. Those errors include things like integer overflow or divide by zero. Here are some resources on those:

                  http://www.eiffel.com/developers/design_by_contract_in_detail.html

                  https://en.wikipedia.org/wiki/SPARK_(programming_language)

                  https://www.amazon.com/Building-High-Integrity-Applications-SPARK/dp/1107040736

                  The book, and even the language, was designed for people without a background in formal methods. I've gotten positive feedback on it from a few people. I've also encouraged some people to try SPARK for safer native methods in languages such as Go. It's kludgier than things like Rust, which was designed with that in mind, but it still works.

                  1. 2

                    I've taken a look around Ada and got quite confused about the ecosystem and which versions of the language are available for free vs. commercially. Could you give an overview of the different dialects/versions and recommended starting points?

                    1. 4

                      The main compiler vendor for Ada is AdaCore; that's the commercial compiler. There is an open-source version that AdaCore helps to develop, called GNAT, and it's part of the GCC toolchain. It's licensed under a special GMGPL license, or GPLv3 with a runtime exception, meaning you can use both for closed-source software development (as long as you don't modify the compiler, that is).

                      There is also GNAT AUX, which was developed by John Marino as part of a project I was involved in.

                      1. 1

                        Thanks for clearing up the unusual license.

                      2. 2

                        I hear there is, or was, some weird stuff involved in the licensing. I'm not sure exactly what's going on there. I just know they have a GPL version of GNAT that can be used with GPL'd programs:

                        https://www.adacore.com/community

                        Here’s more on that:

                        https://en.wikipedia.org/wiki/GNAT

                2. 1

                  So what was the previously benchmarked performance and what’s the new benchmarked performance? I didn’t see that in the release notes and can only assume the difference must have been pretty compelling. Right?

                  1. 5

                    From the article:

                    Go version:

                    I was not able to get more than 1.5K parallel connections on my RPi with 512MB of RAM and system will come to crawl as load goes higher due to thrashing, and memory swapping.

                    Node version:

                    The current system gracefully handles 5K concurrent clients, sending messages back and forth on single room (imagine 5k chatty people in same room), and it does all of this under 150MB of memory

                    1. 1

                      Thanks, I guess I’m blind.