1. 1

    As your service is strongly dependent on a cache, you may find that as an instance boots it slurps in a copy of the cache from a neighbouring node. Once complete, your node can start advertising that it is healthy.

    If you are feeling really nifty, you may want cache misses to have nginx cycle through and proxy to neighbouring instances until it gets a hit, with the front node locally caching that response. To be fancier still, you can have nodes send around cache digests so you do not need to cycle through all your proxies.

    …then of course your current solution is 20 lines of code.

    1. 4

      …or just use UNIX sockets.

      1. 1

        But that only works for comms b/w procs on the same machine!

        1. 1

          Which is what the article is about (as well as ephemeral ports).

          1. 1

            I thought it was about using WebSockets. Did I miss something?

            1. 6

              No more than the article is about Ruby.

              Ephemeral port exhaustion only happens when using TCP; if you are proxying to localhost then UNIX or anonymous sockets are a far better option, and they also have less overhead.
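
              For what an anonymous socket looks like in practice, here is a minimal C sketch (function and names are mine, purely illustrative) using socketpair(2), which hands you a connected pair of UNIX-domain sockets with no TCP ports and no filesystem presence at all:

              ```c
              #include <string.h>
              #include <sys/socket.h>
              #include <unistd.h>

              /* Create an anonymous UNIX-domain socket pair, push `msg` through
               * one end and read it back out of the other. Returns the number of
               * bytes read, or -1 on error. No ephemeral ports, no socket files:
               * the pair exists only inside the kernel. */
              ssize_t anon_socket_roundtrip(const char *msg, char *buf, size_t buflen)
              {
                  int sv[2];
                  if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
                      return -1;
                  ssize_t n = -1;
                  if (write(sv[0], msg, strlen(msg)) == (ssize_t)strlen(msg))
                      n = read(sv[1], buf, buflen);
                  close(sv[0]);
                  close(sv[1]);
                  return n;
              }
              ```

              The catch, of course, is that only a parent can hand such a socket to a child it forks; unrelated processes need a named (or abstract) UNIX socket instead.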

              1. 2

                I was wondering, is there any downside to binding to UNIX sockets instead of regular TCP ones?

                1. 4

                  Other than it being a host-local-only socket, not really, though portability to Windows might be important to you. Maybe you are fond of running tcpdump to packet-capture the chit-chat between the front and backends, which UNIX sockets would prevent; though if you are doing that you are probably just as okay with using strace instead.

                  From a developer perspective, instead of connecting to a TCP port you just connect to a file on your disk; the listener, when binding to a UNIX socket, creates that file. Nothing else is different. The only confusing gotcha is that you cannot ‘re-bind’ if the UNIX socket file already exists on the filesystem; for example, when your code bombed out and was unable to mop up. Two ways to handle this:

                  1. unlink() (delete) any previous stale UNIX socket file before bind()ing (or before starting your code); most do this, as do I
                  2. use abstract UNIX sockets (Linux-specific), which work functionally identically but do not create files on the filesystem, so there is no need to unlink. You need to take care with the naming of the socket though, as all the bytes in sun_path contribute to the reference name, not just the bytes up to the NUL terminator
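
                  A minimal sketch of option 1 (the helper name and error handling are mine, purely illustrative):

                  ```c
                  #include <string.h>
                  #include <sys/socket.h>
                  #include <sys/un.h>
                  #include <unistd.h>

                  /* Bind a listening UNIX socket at `path`, first unlinking any
                   * stale socket file left behind by a previous crashed run.
                   * Returns the listening fd, or -1 on error. */
                  int bind_unix_listener(const char *path)
                  {
                      struct sockaddr_un addr;
                      if (strlen(path) >= sizeof(addr.sun_path))
                          return -1;

                      unlink(path); /* ignore failure: the file may simply not exist */

                      int fd = socket(AF_UNIX, SOCK_STREAM, 0);
                      if (fd < 0)
                          return -1;

                      memset(&addr, 0, sizeof(addr));
                      addr.sun_family = AF_UNIX;
                      strcpy(addr.sun_path, path);

                      if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
                          listen(fd, 16) < 0) {
                          close(fd);
                          return -1;
                      }
                      return fd;
                  }
                  ```

                  Without the unlink(), a second run after a crash would fail with EADDRINUSE because the old socket file is still sitting on disk.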

                  Personally, what I have found works with teams (for an HTTP service) is that in development the backend presents as a traditional HTTP server listening over TCP, enabling everyone to use cURL, their browser, or whatever they like directly. In production though, a flag is set (well, I just test whether STDIN is a network socket) to go into UNIX socket/FastCGI mode.

                  As JavaScript/Node.js is effectively a lingua franca around here, this is what that looks like:

                  $ cat src/server.js | grep --interesting-bits
                  const http = require('http');
                  const fcgi = require('node-fastcgi');
                  
                  const handler = function(req, res){
                    ...
                  };
                  
                  const server = fcgi.isService()
                    ? fcgi.createServer(handler).listen()
                    : http.createServer(handler).listen(8000);
                  
                  server.on('...', function(){
                    ...
                  });
                  
                  $ cat /etc/systemd/system/sockets.target.wants/myapp.socket 
                  [Unit]
                  Description=MyApp Server Socket
                  
                  [Socket]
                  ListenStream=/run/myapp.sock
                  SocketUser=www-data
                  SocketGroup=www-data
                  SocketMode=0660
                  Accept=false
                  
                  [Install]
                  WantedBy=sockets.target
                  
                  $ cat /etc/systemd/system/myapp.service
                  [Unit]
                  Description=MyApp Server
                  Before=nginx.service
                  
                  [Service]
                  WorkingDirectory=/opt/myorg/myapp
                  ExecStartPre=/bin/sh -c '/usr/bin/touch npm-debug.log && /bin/chown myapp:myapp npm-debug.log'
                  ExecStart=/usr/bin/multiwatch -f 3 -- /usr/bin/nodejs src/server.js
                  User=myapp
                  StandardInput=socket
                  StandardOutput=null
                  #StandardError=null
                  Restart=on-failure
                  ExecReload=/bin/kill -HUP $MAINPID
                  ExecStop=/bin/kill -TERM $MAINPID
                  
                  [Install]
                  WantedBy=multi-user.target
                  

                  The reason for multiwatch in production is that you get forking and high-availability reloads. Historically I would also have used runit and spawn-fcgi, but systemd has made this no longer necessary.

                2. 1

                  Agreed.

              2. 1

                Local load balancing is the motivating example, but I wrote it to highlight the general problem of load balancing a large number of connections across a small number of backends (potentially external machines).

                UNIX sockets might be a reasonable solution to the particular problem in the post. It’s not something I’ve tried with HAProxy before though, so I’m not sure how practical it would be.

          1. 2

            This reminds me of something we figured out when dealing with some ropey multi-threaded C++ code that no-one understood and that repeatedly deadlocked. Of course, as these things do, it fell onto the sysadmin team to handle, as apparently it was our fault that developer code locked up at 3am, and again our fault that we were the ones who got the pager alerts…but I digress. :)

            v0 used strace to figure out if all the threads were either idle or stuck in ‘D’ state, alongside an internal application-side watchdog that touch’d a file on the filesystem so we could trivially see if the main event loop was still moving. If things had stalled, the shell script pulled out the 9’bore and we left runit to mop up.

            v1 popped out, IIRC, when we saw we could sidestep strace and find out if the application was stuck in a futex syscall via /proc/.../syscall.

            1. 1

              Today I was at an information security “summit” and in the next ballroom over was a “PERL Meeting”.

              I wandered over and it was a meeting for nephrologists discussing Preventing Early Renal Loss.

              1. 2

                Did you find the PERL meeting more interesting though?

                1. 1

                  That’s why we call it “Perl” and never “PERL” :-)

                  1. 1

                    That’s a worthy goal though.

                  1. 1

                    Good old friend strcpy :)

                    In 2017, no less…

                    1. 2

                      The problem is that manuals do not tell you what to do; they just list warnings and do not supply any hints on how you should navigate those obstacles safely.

                      1. 3

                        One could always try reading the referenced pages in the see also section.

                        1. 1

                          “they just list warnings and do not supply any hints on how you should navigate those obstacles safely.”

                          It’s an arguable criticism: maybe they should link to that. I mostly agree that if they describe the problem they should automatically point to a solution in a direct, obvious way. If space permits, describe it with an example immediately, as in secure coding guides. Alternatively, recommend a secure coding guide. :) Let’s test the counterpoint, though: that it should be easy enough to find for a worried developer willing to look into the matter.

                          “No bounds checking is performed. If the buffer dst is not large enough to hold the result, subsequent memory will be damaged. “

                          Direct implication: check the size of the input first if worried about preventing errors. That’s at least the obvious part. The worried programmer might also type strcpy into Google with words such as security, vulnerability, crash, and so on to understand what the problem might be. I just tested that in DuckDuckGo and found that a search for “strcpy vulnerability” gives these links as the first two results:

                          https://stackoverflow.com/questions/21121272/how-to-mitigate-the-strcat-and-strcmp-vulnerability

                          https://stackoverflow.com/questions/1258550/why-should-you-use-strncpy-instead-of-strcpy

                          That should give them plenty of information. So, maybe we just need to teach programmers to always type the thing they’re curious about followed by specific words into Google or DuckDuckGo that are likely to bring up relevant information. Words I’ve used to good effect include bugs, vulnerabilities, prevent, and mitigate/mitigation. People might have gotten too lazy to do this and/or forgotten how to search properly since Google is so good now. It’s not as good as people think, though; my old techniques from older engines still pay off.

                          1. 2

                            The other point is the OpenBSD manual is not intended to be a learn to program in C tutorial. It describes the functions available, (hopefully) simply and accurately, noting caveats as necessary. I think our philosophy is closer to do no harm than solve all problems.

                            Really, I have a hard time imagining a novice programmer reading that page, but knowing nothing of the territory. If you don’t know what strcpy is, how did you find the man page in the first place? If you were pointed at strcpy by some other tutorial, shouldn’t that be the document corrected?

                            1. 1

                              You make a great point. Given that, and what Google can find, the reasonable assumption for the new programmer is probably that they learn the language, read some resources on secure coding, and then the docs just give a reminder of the risk. It’s not far from a formal spec, either, where it’s warning that a precondition on size is necessary for correctness. We similarly wouldn’t expect a language tutorial in a formal spec.

                      1. 4

                        The Intel Galileo had serial over headphone jack.

                        1. 2

                          Was always my personal favorite router mod:

                          The TTL-232R-3V3-AJ cables were cheap to come by too.

                        1. 6

                          Some more possible things that could be learnt from this event.

                          • Don’t do things that can have a large-scale effect without thinking about the worst that can happen. The for loop in that code is entirely to blame for the magnitude of the problem (17,000). You could have caught this by precomputing the list of emails to go out and failing if this exceeded the normal volume by a drastic amount.
                          • Perhaps the checks could have been spread across 24 hours instead of run at a specific point in time, to avoid being dependent on an external call that might not be available during an outage.
                          • External service calls that you rely on will fail. Consider using a retry strategy that has exponential backoff with jitter. https://www.awsarchitectureblog.com/2015/03/backoff.html
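
                          The backoff-with-jitter idea from that last bullet fits in a few lines of C. This is the “full jitter” variant from the linked AWS post; the base and cap constants are illustrative, not from any particular system:

                          ```c
                          #include <stdint.h>
                          #include <stdlib.h>

                          #define BACKOFF_BASE_MS 100u
                          #define BACKOFF_CAP_MS  10000u

                          /* Upper bound for the sleep before retry `attempt` (0-based):
                           * base * 2^attempt, capped so it never grows without limit. */
                          uint32_t backoff_bound_ms(unsigned attempt)
                          {
                              uint64_t bound = (uint64_t)BACKOFF_BASE_MS << (attempt < 32 ? attempt : 32);
                              return bound > BACKOFF_CAP_MS ? BACKOFF_CAP_MS : (uint32_t)bound;
                          }

                          /* "Full jitter": sleep a uniformly random duration in [0, bound]
                           * rather than the bound itself, so a herd of failed clients does
                           * not retry in lockstep and re-stampede the recovering service. */
                          uint32_t backoff_jittered_ms(unsigned attempt)
                          {
                              /* rand() is fine here: we want spread, not crypto quality */
                              return (uint32_t)(rand() % ((unsigned)backoff_bound_ms(attempt) + 1));
                          }
                          ```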
                          1. 3

                            The problem is these are all arguably over-defensive measures without the benefit of hindsight. In an “agile” setting, you won’t be able to justify writing code this defensively all the time. Even less so, since if you code like that, there won’t be any incidents to justify writing code like that :)

                            An exception could be if you managed to capture these patterns in abstract, reusable libraries, toolchains etc., then the up-front cost could be justifiable.

                            1. 6

                              I wouldn’t put that on “agile”. I worked in quite a lot of agile organisations where safe and defensive practice was exercised around all mission-critical and safety-related data. They just didn’t bother too much if something was released with a button two pixels too far left (which is an issue to be fixed, I don’t want to downplay frontend, but it can easily be fixed with another deployment and has no lasting repercussions).

                              “pressure” is the killer here.

                              1. 2

                                Or you just make all that irrelevant/impossible by picking append-only data structures?

                            1. 6

                              I am a big fan of LDAP, but instructions on how to light up a particular vendor are really of not much interest, especially when just improving the documentation readme could have been more valuable to all in this case.

                              A post on why to use LDAP would be helpful though, particularly for a blog with such a grandiose domain ;-)

                              1. -1

                                “News at ten, replacement car exhaust can be lojacked! ZOMGBBQLOLZ”

                                1. 3

                                  More like “replacement car exhaust is actually a blender in an exhaust-shaped can.”

                                1. 3

                                  This is depressing and horrible. No wonder everything we use is a tower of poor abstraction and riddled with bugs.

                                  1. 4

                                    Oh, no need to be like that, turn that frown upside down. :-)

                                    It is a great gig replacing all that webscale cruft with 10 lines of shell and making it faster in the process too.

                                    Usually it depresses the developers, but really they brought it upon themselves.

                                  1. 3

                                    “This is what makes a good developer: finding the right combination of libraries, keeping them up to date, and reducing self-written code to the absolute minimum. Ideally the only code you write should be the part that makes your application special.”

                                     I agree with the author. The part he misses is that modern-day applications do a lot more. Small teams build exceedingly complex applications. So while you have more tools at your disposal (open source, APIs, managed services, etc.), the scope of the work has also expanded. The work has changed: easier in some ways, harder in others.

                                    1. 4

                                       In my experience it is that the solutions are more complex than they need be.

                                      1. 2

                                        You’re right!

                                         Actually I had planned to cover the change in scope of modern applications, but it was so much text that I decided to write a separate article about that topic ;)

                                        1. 1

                                          welcome to lobsters!

                                          1. 1

                                            Thanks! :)

                                            Nice to be here.

                                      1. 4

                                        Why not just use the other databases that were designed to have transactions from the start instead of bolting it on later?

                                        1. 1

                                          There’s also nothing wrong with that.

                                          1. 1

                                            …but…shiny. SQUIRREL!

                                          1. 9

                                            They use CrateDB as a database for storing and searching product data. They chose CrateDB because it allows them to scale the webshop easily, according to Gestalten.de CEO Frank Rakow.

                                            With ~4.6M rows, they’d be well served by vanilla PostgreSQL. Then again, I don’t know all of their requirements, perhaps they have need of ElasticSearch’s features (CrateDB is a SQL+management layer on top of ElasticSearch). Hopefully Gestalten.de is using this database for analytics and not important data storage.

                                            That said, you couldn’t pay me enough to use ElasticSearch as a primary datastore. A search index? Absolutely. Temporary log storage? Sure. Analytics? I suppose. ElasticSearch has measurably improved over the years, but once bitten, twice shy as the saying goes.

                                             In any case, CrateDB is a remarkably immature database to be running a business on. And let’s look at their marketing page:

                                            On the other hand, CrateDB may not be a good choice if you require:

                                            • Strong (ACID) transactional consistency
                                            • Highly normalized schemas with many tables and many joins

                                            Oh dear.

                                            1. 2

                                              Why hopefully? These numbers are embarrassing.

                                              I don’t understand why anyone would be proud of these numbers.

                                              product:([sku:3300?`5]; a:3300?0)
                                              xsell:asc ([] sku:4600000?`5; cross_sku:4600000?`5; tstamp:.z.p-4600000?0)
                                              \t select from xsell where not sku in exec sku from product
                                              87
                                              

                                              That’s msec; kdb is 100x faster than crate (on my macbook).

                                              I suppose I should be at least happy that crate.io puts some actual benchmarks up when they say it’s “fast” so that we know that they mean not at all fast.

                                              1. 0

                                                 Grouping in a distributed database is much harder than on your local disk. (And yes, this can be a reason not to pick distributed software, but if that’s not your main query, this is also okay.)

                                                1. 6

                                                  …but it’s 4m rows, you don’t need a distributed database, you don’t even need one for 1bn rows. Christ, this is roughly what I would say is the upper bound for CSV files chewed with UNIX sort/join!

                                              2. 0

                                                With ~4.6M rows, they’d be well served by vanilla PostgreSQL. Then again, I don’t know all of their requirements, perhaps they have need of ElasticSearch’s features (CrateDB is a SQL+management layer on top of ElasticSearch). Hopefully Gestalten.de is using this database for analytics and not important data storage.

                                                 Webshops have the problem that you rarely write them yourself, and they come with their own share of issues. For example, a popular system is OXID. That means that you are often not so free to choose the database layer.

                                                That said, you couldn’t pay me enough to use ElasticSearch as a primary datastore. A search index? Absolutely. Temporary log storage? Sure. Analytics? I suppose. ElasticSearch has measurably improved over the years, but once bitten, twice shy as the saying goes.

                                                 Elasticsearch themselves do not sell it this way.

                                                 ES is very popular in the shop scene, where the databases of your shop software are often a pain to work with and most of your frontend is search anyway. So what happens is that they still use the shop software to store all articles, user data and transactions, and sync that to Elasticsearch, with which they drive their full frontend. I’ve seen that in quite a number of deployments and it works well.

                                                Elasticsearch has good stability, just no guarantees. It’s perfectly fine to use it for something important, just be able to recreate it if it blows.

                                                On the other hand, CrateDB may not be a good choice if you require:

                                                • Strong (ACID) transactional consistency
                                                • Highly normalized schemas with many tables and many joins

                                                That’s perfectly fine if none of this is needed on that store.

                                              1. 2
                                                1. 0

                                                  Yea, I would have added some other labels with it, like kubernetes or monitoring/prometheus, but those don’t exist, so just put Erlang.

                                                1. 2

                                                  Instead of memfd_create() you can use the POSIX standard shm_open(), so

                                                  memfd_create("queue_region", 0)

                                                  becomes

                                                  shm_open("queue_region", O_RDWR|O_CREAT, 0600)

                                                  Add ‘-lrt’ to your LDFLAGS and remember to shm_unlink() it when you’re done. Everything else stays the same, including the performance.
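
                                                  Spelled out end to end, a sketch of that substitution (the region name and 4 KiB size are illustrative; POSIX asks for a leading ‘/’ in the name for portability, and on glibc before 2.34 you still need that -lrt):

                                                  ```c
                                                  #include <fcntl.h>
                                                  #include <string.h>
                                                  #include <sys/mman.h>
                                                  #include <unistd.h>

                                                  /* Create a 4 KiB POSIX shared-memory region, map it, scribble in
                                                   * it and read the bytes back. Returns 0 on success, -1 on error. */
                                                  int shm_roundtrip(void)
                                                  {
                                                      const char *name = "/queue_region_demo";
                                                      int fd = shm_open(name, O_RDWR | O_CREAT, 0600);
                                                      if (fd < 0)
                                                          return -1;
                                                      int rc = -1;
                                                      if (ftruncate(fd, 4096) == 0) {
                                                          char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                                                         MAP_SHARED, fd, 0);
                                                          if (p != MAP_FAILED) {
                                                              memcpy(p, "hello", 6);
                                                              rc = memcmp(p, "hello", 6) == 0 ? 0 : -1;
                                                              munmap(p, 4096);
                                                          }
                                                      }
                                                      close(fd);
                                                      shm_unlink(name); /* remember to unlink when you're done */
                                                      return rc;
                                                  }
                                                  ```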

                                                  1. 2

                                                    I vaguely recall it being less effort to simply open /dev/zero and use a private mmap()ing of that.

                                                    Of course if you are using this as an IPC between two processes you’ll have to use a regular file.

                                                    1. 1

                                                      I don’t think a private map would work here:

                                                      MAP_PRIVATE

                                                      Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file.

                                                      1. 1

                                                        Meant to say “use a regular file with MAP_SHARED”, good catch. :)

                                                      2. 1

                                                        It doesn’t seem like you can mmap /dev/zero. I get ENODEV “Operation not supported by device” when I try. (macOS)

                                                        Edit, showing my work:

                                                        #include <err.h>
                                                        #include <fcntl.h>
                                                        #include <stdlib.h>
                                                        #include <sys/mman.h>
                                                        int main() {
                                                            int fd = open("/dev/null", O_RDWR);
                                                            if (fd < 0) err(1, "open");
                                                            void *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
                                                            if (map == MAP_FAILED) err(1, "mmap");
                                                            return 0;
                                                        }
                                                        
                                                        1. 1

                                                          You’re confusing /dev/null with /dev/zero.

                                                          1. 1

                                                            Oops, I had used /dev/zero when I first tried it then accidentally swapped it for /dev/null when I came back to give some code. Either way, the result is the same: ENODEV.

                                                            1. 1

                                                              Must be some MacOS specific breakage, because it works on Linux.

                                                    1. 1

                                                      The mmap voodoo seems a bit weird, though I don’t know the exact semantics we’re aiming for. Some of those calls would appear redundant or gratuitous.

                                                      1. 1

                                                        Take note of the difference in offset being used.

                                                      1. 1

                                                        I wish I understood how all this worked. Why exactly is a mod operation slow? Why exactly is it faster to do this via page tables? Is it because the kernel is already doing this and it effectively requires zero additional work? Is it because the CPU can handle this in hardware?

                                                        I guess I’ve got some research to do.

                                                        1. 4

                                                          Mod isn’t super slow, but you can avoid mod entirely without the fancy page tricks by defining your buffer to be a power of 2. For example, a 4KiB buffer is 4096 = 2^12, so you can calculate the wrap-around with ( cur + len ) & 4095 without using mod.

                                                          You would still need two separate memcpys, and a branch for the wrap-around / non-wrap-around cases (which is normally not a big deal, except when you’re racing against the highly optimized hardware cache in your MMU…)
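
                                                          The mask trick in one helper (a sketch; the function name is mine and the 4 KiB size is just an example):

                                                          ```c
                                                          #include <assert.h>
                                                          #include <stddef.h>

                                                          /* Advance a ring-buffer offset by `len` bytes. `size` must be a
                                                           * power of two, so `size - 1` is an all-ones bitmask and the AND
                                                           * replaces the (slower) modulo operation. */
                                                          size_t ring_advance(size_t cur, size_t len, size_t size)
                                                          {
                                                              assert(size != 0 && (size & (size - 1)) == 0); /* power of two */
                                                              return (cur + len) & (size - 1);
                                                          }
                                                          ```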

                                                          1. 3

                                                            Branches (conditionals such as if/switch statements) can cause performance problems, so if you can structure things to avoid them you can get a considerable bump in speed.

                                                            A lot of people look to software tricks to pull off speedups but this particular data structure can benefit directly from calling upon hardware baked into the CPU (virtual memory mapping).

                                                            Most of the time you have a 1:1 mapping of a 4kB contiguous physical memory block to a single virtual 4kB page. This is not the only configuration though; you can have multiple virtual memory pages mapping back to the same physical memory block, most commonly seen as a way to save RAM when using shared libraries.

                                                            This 1:N mapping technique can also be used for a circular buffer.

                                                            So you get your software to ask the kernel to configure the MMU to duplicate the mapping of your buffer (page aligned and sized!) immediately after the end of the initial allocation.

                                                            Now when you are 100 bytes short of the end of your 4kB circular buffer and you need to write 200 bytes, you can just memcpy()-like-a-boss and ignore the problem of having to split your writes into two parts. Meanwhile your offset incrementer remains simply:

                                                            offset = (offset + writelen) % 4096
                                                            

                                                            So the speedup comes from:

                                                            • removing the conditionals necessary to handle writes that exceed the end of the buffer
                                                            • doing a single longer write, rather than two smaller ones

                                                            So it is not really that the CPU is handling this in hardware and that is why it is faster; the hardware is doing no more work than it was before. The performance comes more from a duck-lining-up exercise.
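
                                                            For the curious, here is one way that duplicate mapping can be set up on Linux (a sketch using memfd_create(2), so Linux >= 3.17 and glibc >= 2.27 are assumed; the function name is mine and error handling is abbreviated):

                                                            ```c
                                                            #define _GNU_SOURCE
                                                            #include <stddef.h>
                                                            #include <sys/mman.h>
                                                            #include <unistd.h>

                                                            /* Map the same `size`-byte buffer twice, back to back, so a write
                                                             * that runs off the end of the first mapping lands at the start of
                                                             * the buffer "for free". `size` must be a multiple of the page
                                                             * size. Returns the base address, or NULL on failure. */
                                                            unsigned char *magic_ring_create(size_t size)
                                                            {
                                                                int fd = memfd_create("ring", 0);
                                                                if (fd < 0)
                                                                    return NULL;
                                                                if (ftruncate(fd, size) < 0) {
                                                                    close(fd);
                                                                    return NULL;
                                                                }
                                                                /* Reserve 2*size of contiguous address space... */
                                                                unsigned char *base = mmap(NULL, 2 * size, PROT_NONE,
                                                                                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                                                                if (base == MAP_FAILED) {
                                                                    close(fd);
                                                                    return NULL;
                                                                }
                                                                /* ...then overlay the same file twice with MAP_FIXED. */
                                                                if (mmap(base, size, PROT_READ | PROT_WRITE,
                                                                         MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED ||
                                                                    mmap(base + size, size, PROT_READ | PROT_WRITE,
                                                                         MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
                                                                    munmap(base, 2 * size);
                                                                    close(fd);
                                                                    return NULL;
                                                                }
                                                                close(fd); /* the mappings keep the memory alive */
                                                                return base;
                                                            }
                                                            ```

                                                            Any write that straddles offset `size` in the first mapping shows up at the start of the buffer through the second mapping, which is exactly the wrap-around behaviour described above.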

                                                            1. 2

                                                              Modulo and division (usually one operation) are much slower than the other usual integer operations like addition and subtraction (which are the same thing), though I’m not sure I can explain why in detail. Fortunately, for division by powers of two, right shift >> and AND & can be used instead.

                                                              For why doing this with paging is so efficient, it is because the MMU (part of the CPU) does the translation between virtual and physical addresses directly in hardware. The kernel just has to set up the page tables to tell the MMU how it should do so.

                                                            1. 4

                                                              Awww man, no console.group()?

                                                              1. 1

                                                                 group() is amazing! Especially groupCollapsed(). I considered them, but then I thought I’d be opening the door to talking about count and all the other useful console functions, so I focused only on .log(). Maybe next time :)

                                                              1. 2

                                                                And so it begins :)

                                                                I will miss “It’s All Text!”

                                                                I hope they’ll find an analogue, but I doubt it.

                                                                1. 2

                                                                   I could even live without an adblocker, but for me it is the “I don’t care about cookies” extension.

                                                                  1. 7

                                                                    You don’t have to live without an adblocker… or other stuff really.

                                                                     uBlock, uMatrix, Privacy Badger, Cookie AutoDelete, HTTPS Everywhere, Smart HTTPS are WebExtensions already. There’s even Tree Tabs (though it doesn’t hide the normal tab bar for now; the APIs for that aren’t there yet).

                                                                    But yeah, I couldn’t find “I don’t care about cookies” :D

                                                                1. 1

                                                                  QUIC is cool, but is updating TCP really impossible? Multipath TCP is a thing, and Apple is already using it for Siri.

                                                                  1. 4

                                                                    MPTCP does not solve the stalled-flow problem (head-of-line blocking): packet loss still stalls the stream while it waits for that retransmit to get through.

                                                                    QUIC also has a 0 (zero) RTT, cryptographically secure(ish) HTTP request mode: akin to the whole TCP 3-way handshake, SSL handshake, and HTTP request rolled up into a single compact UDP packet, so the request is serviced immediately by the server.

                                                                    1. 5

                                                                      QUIC annoys me, because it’s a massive layering violation. It bakes in HTTP, when all I want is a better TCP – Something like CurveCP, which has zero roundtrips, and does encryption and authentication in one go.

                                                                      1. 5

                                                                        Guess you hate BTRFS and ZFS too?

                                                                        I don’t think Google designed QUIC for you, so kinda irrelevant what you want or what I want :-)

                                                                        Besides you said you have CurveCP so what is the problem?

                                                                        CurveCP violates ‘layering’ in the same way that QUIC does, so I am not sure what you mean here. Nothing about QUIC prevents you from sending non-HTTP traffic over it. QUIC is more a DCCP+DTLS+TLSv1.3 0-RTT rolled into one, done outside of the IETF initially because it was a prototype and no one knew what would work.

                                                                        What would you have done differently to avoid layering violation and to make your entire service (inc browser support) 0RTT in a span of months?

                                                                        If you wanted to gripe about something you should probably grumble about DCCP/SCTP, but that ignores the sane technical reason QUIC (and CurveCP) violate these layers: all the god-awful middleware that makes up the routing hops.

                                                                        If you have the time, read about RINA and also the fun presentation by John Day titled Shortening the Dark Ages of Networking (.mov)…it might sway you toward the view that layering is actually sometimes the problem. :-)

                                                                        1. 1

                                                                          Maybe long-term public keys could be a replacement for IP numbers.

                                                                          1. 1

                                                                            What are your thoughts on routing, and what does a public key improve on over a 2^80 IP block allocation?

                                                                            1. 1

                                                                              If it’s all public keys, encryption is the default (although with tiny keys).

                                                                              1. 1

                                                                                What advantages does this have over just setting a reverse DNS lookup record and including the public key in something like a TXT RR or maybe some formalised X509 RR?

                                                                                We could do this with IPv4 today, but since no one is doing it, I am not really sure what this would solve over host-to-host transport-mode IPsec?

                                                                                Plus routing is the hard part of a protocol…not how many bits are in the address or if you slip something cryptographic in there, surely?
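
                                                                                To illustrate the TXT-record route that already exists today, here is a hedged Node sketch; the hostname `key.example.com` and the `pubkey=` prefix are made-up conventions for illustration, not an existing standard:

                                                                                ```javascript
                                                                                const dns = require("node:dns").promises;

                                                                                // Extract a "pubkey=" value from resolved TXT records.
                                                                                // Each record arrives as an array of character-string chunks.
                                                                                function extractPubKey(records) {
                                                                                  for (const chunks of records) {
                                                                                    const txt = chunks.join(""); // long TXT records are split into chunks
                                                                                    if (txt.startsWith("pubkey=")) return txt.slice("pubkey=".length);
                                                                                  }
                                                                                  return null;
                                                                                }

                                                                                async function lookupPubKey(host) {
                                                                                  return extractPubKey(await dns.resolveTxt(host));
                                                                                }

                                                                                lookupPubKey("key.example.com")
                                                                                  .then((key) => console.log("key:", key))
                                                                                  .catch((err) => console.error("lookup failed:", err.code));
                                                                                ```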

                                                                              2. 1

                                                                                Definitely not thought through: but a public key can be self-allocated and works well in an encrypt-by-default world. You’d need something like DNS …

                                                                              3. 1

                                                                                networking ASIC designers would like to have a few words with you…