1. 9

    They use CrateDB as a database for storing and searching product data. They chose CrateDB because it allows them to scale the webshop easily, according to Gestalten.de CEO Frank Rakow.

    With ~4.6M rows, they’d be well served by vanilla PostgreSQL. Then again, I don’t know all of their requirements, perhaps they have need of ElasticSearch’s features (CrateDB is a SQL+management layer on top of ElasticSearch). Hopefully Gestalten.de is using this database for analytics and not important data storage.

    That said, you couldn’t pay me enough to use ElasticSearch as a primary datastore. A search index? Absolutely. Temporary log storage? Sure. Analytics? I suppose. ElasticSearch has measurably improved over the years, but once bitten, twice shy as the saying goes.

    In any case, CrateDB is a remarkably immature database to be running a business on. And let’s look at their marketing page:

    On the other hand, CrateDB may not be a good choice if you require:

    • Strong (ACID) transactional consistency
    • Highly normalized schemas with many tables and many joins

    Oh dear.

    1. 2

      Why hopefully? These numbers are embarrassing.

      I don’t understand why anyone would be proud of these numbers.

      product:([sku:3300?`5]; a:3300?0)                / 3,300 products keyed on random 5-char SKUs
      xsell:asc ([] sku:4600000?`5; cross_sku:4600000?`5; tstamp:.z.p-4600000?0)   / 4.6M cross-sell rows
      / time (in ms) the anti-join: cross-sells whose SKU matches no product
      \t select from xsell where not sku in exec sku from product
      87
      

      That’s milliseconds; kdb+ is 100x faster than Crate (on my MacBook).

      I suppose I should at least be happy that crate.io puts some actual benchmarks up when they say it’s “fast”, so that we know they mean not at all fast.
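
      For scale, the anti-join above (“cross-sell rows whose SKU has no matching product”) is nothing exotic; a rough sketch of the equivalent in Python/SQLite, with a hypothetical schema and smaller row counts, runs comfortably on a laptop:

      ```python
      import sqlite3, random, string, time

      conn = sqlite3.connect(":memory:")
      conn.execute("CREATE TABLE product (sku TEXT PRIMARY KEY, a INTEGER)")
      conn.execute("CREATE TABLE xsell (sku TEXT, cross_sku TEXT, tstamp REAL)")

      def rand_sku():
          return "".join(random.choices(string.ascii_lowercase, k=5))

      # Stand-ins for the 3,300 products / 4.6M cross-sell rows in the kdb example.
      conn.executemany("INSERT OR IGNORE INTO product VALUES (?, 0)",
                       [(rand_sku(),) for _ in range(3300)])
      conn.executemany("INSERT INTO xsell VALUES (?, ?, 0)",
                       [(rand_sku(), rand_sku()) for _ in range(100_000)])

      t0 = time.perf_counter()
      orphans = conn.execute(
          "SELECT COUNT(*) FROM xsell WHERE sku NOT IN (SELECT sku FROM product)"
      ).fetchone()[0]
      print(orphans, "orphan rows,", round((time.perf_counter() - t0) * 1e3), "ms")
      ```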

      1. 0

        Grouping in a distributed database is much harder than on your local disk. (And yes, this can be a reason not to pick distributed software; but if that’s not your main query, this is also okay.)

        1. 6

          …but it’s 4.6M rows; you don’t need a distributed database. You don’t even need one for 1bn rows. Christ, that’s roughly what I’d call the upper bound for CSV files chewed through with UNIX sort/join!

      2. 0

        With ~4.6M rows, they’d be well served by vanilla PostgreSQL. Then again, I don’t know all of their requirements, perhaps they have need of ElasticSearch’s features (CrateDB is a SQL+management layer on top of ElasticSearch). Hopefully Gestalten.de is using this database for analytics and not important data storage.

        Webshops have the problem that you rarely write them yourself, and they come with their own share of issues. For example, a popular system is OXID. That means you are often not free to choose the database layer.

        That said, you couldn’t pay me enough to use ElasticSearch as a primary datastore. A search index? Absolutely. Temporary log storage? Sure. Analytics? I suppose. ElasticSearch has measurably improved over the years, but once bitten, twice shy as the saying goes.

        Elasticsearch doesn’t even sell itself that way.

        ES is very popular in the shop scene, where the databases of your shop software are often a pain to work with and most of your frontend is search anyway. So what happens is that they still use the shop software to store all articles, user data, and transactions, and sync that to Elasticsearch, which drives the full frontend. I’ve seen that in quite a number of deployments and it works well.
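
        The pattern is roughly “database of record plus disposable index.” A toy sketch, with stub dictionaries standing in for the shop database and for Elasticsearch (names and layout are made up for illustration):

        ```python
        # Stub dicts stand in for the shop database and for Elasticsearch.
        primary_db = {
            1: {"title": "Blue Mug", "price": 9.90},
            2: {"title": "Red Mug", "price": 11.50},
        }
        search_index = {}

        def reindex_all():
            """The recovery path: the index is disposable, so rebuild it wholesale."""
            search_index.clear()
            search_index.update(primary_db)

        def on_product_changed(pk):
            """The steady-state path: mirror each write into the index."""
            search_index[pk] = primary_db[pk]

        reindex_all()
        primary_db[3] = {"title": "Green Mug", "price": 8.00}
        on_product_changed(3)
        print(sorted(search_index))   # all products searchable
        ```

        The point of the split: losing `search_index` is an inconvenience (rerun `reindex_all`), not data loss.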

        Elasticsearch has good stability, just no guarantees. It’s perfectly fine to use it for something important; just be able to recreate the index if it blows up.

        On the other hand, CrateDB may not be a good choice if you require:

        • Strong (ACID) transactional consistency
        • Highly normalized schemas with many tables and many joins

        That’s perfectly fine if none of this is needed on that store.

      1. 4

        I’ve been working on a Kafka-like TCP accessible log service. Not distributed. It’s an append-only structured (protobuf) log with monotonically increasing message ids behind HTTP/RPC with a basic id/timestamps index.

        It pushes messages at subscribers rather than forcing them to poll. It also adds a small layer of message filtering to reduce network chattiness for clients that don’t care about all messages.
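
        A toy sketch of that core idea (hypothetical names; the protobuf, HTTP/RPC, and persistence parts are omitted): monotonically increasing ids plus filtered push to subscribers.

        ```python
        import queue

        class AppendOnlyLog:
            """Toy version: append-only log with monotonically increasing ids
            that pushes entries to subscribers, applying a per-subscriber
            filter so uninterested clients see less traffic."""
            def __init__(self):
                self.entries = []           # (id, message) pairs, never mutated
                self.next_id = 1
                self.subscribers = []       # (filter_fn, queue) pairs

            def subscribe(self, filter_fn):
                q = queue.Queue()
                self.subscribers.append((filter_fn, q))
                return q

            def append(self, message):
                entry = (self.next_id, message)
                self.next_id += 1
                self.entries.append(entry)
                for filter_fn, q in self.subscribers:   # push, don't make clients poll
                    if filter_fn(message):
                        q.put(entry)
                return entry[0]

        log = AppendOnlyLog()
        errors = log.subscribe(lambda m: m["level"] == "error")
        log.append({"level": "info", "msg": "started"})
        log.append({"level": "error", "msg": "boom"})
        print(errors.get_nowait())    # only the error reached this subscriber
        ```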

        It’s not designed for massive scale. It eschews some of the peak performance complexity for a simple model. It’s written in Go so that way my poor VPS isn’t using a lot of resources to run heavier services like Kafka and Zookeeper.

        Why? Because I wanted to and no other reason. Because NIH is fun when it’s your side projects.

        It’ll be the basis of my single node HTTP log analytics system.

        1. 5

          They rolled their own dependency management!? And at such a high cost (roughly 1000 engineer-hours, assuming a team of 4)? Why not just use Nix or Bazel? Do people not realize that these tools exist, or is it just NIH syndrome?

          1. 19

            I think saying “just” use Nix understates the difficulty there significantly. I say this as someone who uses Nix, but if someone’s looking for an off-the-shelf dependency management system and the first step to use it is “hey, learn this functional language,” they could be forgiven for assuming it will create more problems than it could possibly solve. Also, packaging new things really is a pain. The build environment is under-documented, it’s hard to pull up an interactive equivalent to the build env, and failed builds are, as far as I know, thrown out instead of saved for debugging.

            I haven’t used Bazel before, but its GitHub blurb claims it’s designed for “A massive, shared code repository,” which isn’t the problem these folks were trying to solve. Maybe it works fine for multi-repos as well, I don’t know.

            1. 5

              I’m a big fan of Bazel (I’ve helped two companies now transition to it), but it’s not a polyglot panacea; if you’re using the languages Google uses (Java, C/C++, Python, Go) and building for a platform that Google releases on (Android/iOS) then it’s wonderful; for anything else, you’d have to do a massive amount of work to integrate it. The article mentions using C# and Perl; you’d have a tough time using Bazel for those.

              I think this gets at a serious difficulty of the very-in-vogue polyglot codebase: it’s all well and good to let programmers choose whatever tool they want, but it comes with a serious devops cost. I don’t know of any tools for building, dependency management or continuous integration that truly work well with a whole bunch of implementation languages. Bazel is the closest I’ve seen, but it doesn’t even support building on Windows.

              1. 2

                I agree with many of your assessments, I hope we can improve in all of these areas :) For what it is worth, you are able to keep failed builds by passing --keep-failed to nix-build.

                1. 1

                  It’s the external version of Blaze, Google’s build system.

                  Some changes can rebuild the world. :P but Blaze/Bazel makes it possible and usable.

              1. 15

                Battlestation and screenshot. i3wm, vim, (unseen) IntelliJ.

                Desktop is fairly old, rockin’ a first-gen Core i7. Screen was the best investment I ever made. Windows dual boot for games and a stand for my laptop (can’t work on work stuff from desktop).

                1. 3

                  Interesting, so you use IntelliJ for Java and vim for C++?

                  If so, isn’t that like a pretty hard context switch? Or do you get used to it? Do you use something to have vim keybindings in IntelliJ?

                  1. 3

                    I use IdeaVim while in IntelliJ which does help. That said, I don’t do a ton of C++, so it’s already a bit awkward of a switch. In the screenshot I was exploring how to add more protoc extension points to Java generated protobufs which does require some C++ work.

                  2. 2

                    What resolution screen is that? I’m thinking about getting an ultra wide for coding. I figure it’s got to be even better than 2 big monitors. Has it made much difference for you?

                    1. 3

                      3440x1440. It’s the Dell U3417W FR3PK 34-Inch (Amazon Link, no referral). I have to say, it’s changed my outlook on screens completely. I used to like two monitors side by side, but found myself tending to use one or the other primarily with misc stuff on one. This is like having two monitors but without the bezel in between, meaning I can have the two monitor experience but also have things right in the middle. I can have my editor fullscreen with multiple files open side-by-side, or Chrome+editor, or as you can see in the screenshot, editor, terminals, and Chrome.

                      It’s made programming so much better for me. Gaming too ;) – but it forced an upgrade to a 1080GTX to push that many pixels smoothly in games.

                      One thing that takes a few hours to get used to is the curve. I love it now, I had issues with it for the first hour or two. But now occasionally I’ll see a flat screen and it’ll look like it’s convex now since my brain is used to compensating for the curve. :)

                      1. 2

                        Nice! There’s a new AOC monitor that’s about $100 more (I think) but it has G-Sync, so I’ve been thinking about getting one of them. I have a theory that ultrawides will really catch on for programming - the productivity gain from having everything you need visible on one panel, with whatever layout you need, seems great.

                        1. 1

                          I have a single UHD at work. It’s pretty nice, but I find myself missing the ultrawide. I think they’re really catching on for programming and gaming. Plus, you can manipulate 1080p content stupid easily.

                          Some ultrawides have a curve, some don’t. I’d recommend the curve.

                    2. 1

                      What keyboard is that? While we’re at it what’s the mouse too?

                      1. 1

                        Original Logitech G15 with LCD. Cyborg RAT7 mouse. I love my G15. Macro keys, and a switch to disable the meta key.

                    1. 1

                      As they mentioned, all but 18.5M requests were cached. On average, about 7 req/s though admittedly diurnal/workweek loads will make it vary a lot more.
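
                      Back-of-the-envelope check of that figure (assuming a 30-day month):

                      ```python
                      uncached = 18.5e6           # requests that missed the cache
                      seconds = 30 * 24 * 3600    # one 30-day month
                      rps = uncached / seconds
                      print(round(rps, 1))        # ≈ 7.1 req/s average
                      ```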

                      They could run it a lot cheaper, but for a company without a lot of focus on that piece of software, $370/month seems like a fine investment to not care about scaling themselves or really touching it at all.

                      That said, an ASG would scale on CPU just fine. A single c4.large (~$75/mo) should be enough CPU for most applications. We served 2k req/s per c4.large of fairly CPU-light work (JSON parsing, filtering, Kafka writes, JSON response generation).

                      You might be able to get away with a fleet of micros instead for even less :)

                      1. 29

                        Hmm. I have just spent a week or two getting my mind around systemd, so I will add a few comments….

                        • Systemd is a big step forward on sysv init and even a good step forward on upstart. Please don’t throw the baby out with the bathwater in trying to achieve what seem to be mostly political rather than technical aims. i.e.:

                        ** The degree of parallelism achieved by systemd does very good things to start up times. (Yes, that is a critical parameter, especially in the embedded world)

                        ** Socket activation is very nifty / useful.

                        ** There is a lot of learning that has gone into things like dbus (https://lwn.net/Articles/641277/). While there are things I really don’t like about dbus (cough, XML, cough), I respect the hard-earned experience encoded into it.

                        ** Systemd’s use of cgroups is actually a very very nifty feature in creating rock solid systems, systems that don’t go sluggish because a subsystem is rogue or leaky. (But I think we are all just learning to use it properly)

                        ** The thought and effort around “playing nice” with distro packaging systems via “drop in” directories is valuable. Yup, it adds complication, but packaging is real and you need a solution.

                        ** The thought and complication around generators to aid the transition from sysv to systemd is also vital. Nobody can upgrade tens of thousands of packages in one go.

                        TL;DR: Systemd actually gives us a lot of very useful and important stuff. Any competing system with the faintest hope of wide adoption has a pretty high bar to meet.

                        The biggest sort of “WAT!?” moment for me around systemd is that it creates its own entirely new language… one that is remarkably weaker even than shell. And occasionally you find yourself explicitly invoking, yuck, shell, to get stuff done.

                        Personally I would have preferred it to be something like guile with some addons / helper macros.

                        1. 15

                          I actually agree with most of what you’ve said here, Systemd is definitely trying to solve some real problems and I fully acknowledge that. The main problem I have with Systemd is the way it just subsumes so much and it’s pretty much all-or-nothing; combined with that, people do experience real problems with it and I personally believe its design is too complicated, especially for such an essential part of the system. I’ll talk about it a bit more in my blog (along with lots of other things) at some stage, but in general the features you list are good features and I hope to have Dinit support eg socket activation and cgroups (though as an optional rather than mandatory feature). On the other hand I am dead-set that there will never be a dbus-connection in the PID 1 process nor any XML-based protocol, and I’m already thinking about separating the PID 1 process from the service manager, etc.

                          1. 9

                            Please stick with human-readable logs too. :)

                            1. 6

                              Please don’t. It is a lot easier to turn machine-readable / binary logs to human-readable than the other way around, and machines will be processing and reading logs a lot more than humans.

                              1. 4

                                Human-readable doesn’t mean freeform. It can be machine-readable too. At my last company, we logged everything as date, KV pairs, and only then freeform text. It had a natural mapping to JSON and protocol buffers after that.

                                https://github.com/uber-go/zap (this isn’t what we used, but it’s the general idea).
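
                                A minimal sketch of that style (the field layout here is made up, and zap itself is Go; this just shows “date, then KV pairs, then freeform text” and its natural JSON mapping):

                                ```python
                                import json, datetime

                                def log_line(event, **kv):
                                    # Hypothetical layout: ISO date, key=value pairs, then freeform text.
                                    ts = datetime.datetime(2017, 1, 1, 12, 0, 0).isoformat()  # fixed for the example
                                    pairs = " ".join(f"{k}={v}" for k, v in kv.items())
                                    text = f"{ts} {pairs} {event}"                          # what a human tails
                                    as_json = json.dumps({"ts": ts, "event": event, **kv})  # what a machine ingests
                                    return text, as_json

                                text, js = log_line("user logged in", user="alice", ip="10.0.0.1")
                                print(text)
                                print(js)
                                ```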

                                1. 3

                                  Yeah, you can do that. But then it becomes quite a bit harder to sign, encrypt, or index logs. I still maintain that going binary->human readable is more efficient, and practical, as long as computers do more processing on the logs than humans do.

                                  Mind you, I’m talking about storage. The logs should be reasonably easy for a human to process when emitted, and a mapping to a human-readable format is desirable. When stored, human-readability is, in my opinion, a mistake.

                                  1. 2

                                    You make good points. It’s funny, because I advocated hard for binary logs (and indeed stored many logs as protocol buffers on Kafka; only on the filesystem was it text) from systems at $dayjob-1, but when it comes to my own Linux system it’s a little harder for me to swallow. I suppose I’m looking at it from the perspective of an interactive user and not a fleet of Linux machines; on my own computer I like to be able to open my logs as standard text without needing to pipe it through a utility.

                                    I’ll concede the point though: binary logs do make a lot more sense as building blocks if they’re done right and have sufficient metadata to be better than the machine-readable text format. If it’s a binary log of just date + facility + level + text description, it may as well have been a formatted text log.

                              2. 2

                                So long as they accumulate the same amount of useful info and are machine-parsable, sure.

                                journalctl spits out human readable or json or whatever.

                                I suspect achieving near the same information density / speed as journalctl with plain old ASCII will be a hard ask.

                                In my view I want both. Human and machine readable… how that is done is an implementation detail.

                              3. 4

                                I’m sort of curious about which “subsume everything” bits are hurting you in particular.

                                For example, subsuming the business of mounting is fairly necessary, since these days the order in which things get mounted relative to the order in which various services are run is pretty inexorable.

                                I have doubts about how much of the networkd / resolved should be part of systemd…. except something that collaborates with the startup infrastructure is required. ie. I suspect your choices in dinit will be slightly harsh…. modding dinit to play nice with existing network managers or modding existing network managers to play nice with dinit or subsuming the function of network management or leaving fairly vital chunks of functionality undone and undoable.

                                Especially in the world of hot plug devices and mobile data….. things get really really hairy.

                                I am dead-set that there will never be a dbus-connection in the PID 1

                                You still need a secure way of communicating with pid 1….

                                That said, systemd process itself could perhaps be decomposed into more processes than it currently is.

                                However as I hinted…. there are things that dbus gives you, like bounded trust between untrusted, untrusting, and untrustworthy programs, that are hard to achieve without reimplementing large chunks of dbus…

                                …and then going through the long and painful process of learning from your mistakes that dbus has already gone through.

                                Yes, I truly hate xml in there…. but you still need some security sensitive serialization mechanism in there.

                                i.e. Whatever framework you choose will still need to enforce the syntactic contract of the interface so that an untrusted and untrustworthy program cannot achieve a denial of service or escalation of privilege through abuse of a serialized interface.

                                There are other things out there that do that (e.g. protobufs, cap’n proto, …), but then you’re still in a world where desktops and bluetooth and network managers and …….. need to be rewritten to use the new mechanism.

                                1. 3

                                  For example, subsuming the business of mounting is fairly necessary, since these days the order in which things get mounted relative to the order in which various services are run is pretty inexorable.

                                  systemd’s handling of mounting is beyond broken. It’s impossible to get bind mounts to work successfully on boot, nfs mounts don’t work on boot unless you make systemd handle it with autofs and sacrifice a goat, and last week I had a broken mount that couldn’t be fixed. umount said there were open files, lsof said none were open. Had to reboot because killing systemd would kill the box anyway.

                                  It doesn’t even start MySQL reliably on boot either. Systemd is broken. Stop defending it.

                                  1. 3

                                    For example, subsuming the business of mounting is fairly necessary, since these days the order in which things get mounted relative to the order in which various services are run is pretty inexorable.

                                    There are a growing number of virtual filesystems that Linux systems expect or need to be mounted for full operation - /proc, /dev, /sys and cgroups all have their own - but these can all be mounted in the traditional way: by running ‘/bin/mount’ from a service. And because it’s a service, dependencies on it can be expressed. What Systemd does is understand the natural ordering imposed by mount paths as implicit dependencies between mount units, which is all well and good but which could also be expressed explicitly in service descriptions, either manually (how often do you really change your mount hierarchies…) or via an external tool. It doesn’t need to be part of the init system directly.

                                    (Is it bad that systemd can do this? Not really; it is a feature. On the other hand, systemd’s complexity has I feel already gotten out of hand. Also, is this particular feature really giving that much real-world benefit? I’m not convinced).

                                    I suspect your choices in dinit will be slightly harsh…. modding dinit to play nice with existing network managers or modding existing network managers to play nice with dinit

                                    At this stage I want to believe there is another option: delegating Systemd API implementation to another daemon (which communicates with Dinit if and as it needs to). Of course such a daemon could be considered as part of Dinit anyway, so it’s a fine distinction - but I want to keep the lines between the components much clearer (than I feel they are in Systemd).

                                    I believe in many cases the services provided by parts of Systemd don’t actually need to be tied to the init system. Case in point, elogind has extracted the logind functionality from systemd and made it systemd-independent. Similarly there’s eudev, the Gentoo fork of the udev device node management daemon, which extracts it from systemd.

                                    You still need a secure way of communicating with pid 1…

                                    Right now, that’s via root-only unix socket, and I’d like to keep it that way. The moment unprivileged processes can talk to a privileged process, you have to worry about protocol flaws a lot more. The current protocol is compact and simple. More complicated behavior could be wrapped in another daemon with a more complex API, if necessary, but again, the boundary lines (is this init? is this service management? or is this something else?) can be kept clearer, I feel.

                                    Putting it another way, a lot of the parts of Systemd that require a user-accessible API just won’t be part of Dinit itself: they’ll be part of an optional package that communicates with Dinit only if it needs to, and only by a simple internal protocol. That way, boundaries between components are clearer, and problems (whether bugs or configuration issues) are easier to localise and resolve.

                                  2. 1

                                    On the other hand I am dead-set that there will never be a dbus-connection in the PID 1 process nor any XML-based protocol

                                    Comments like this makes me wonder what you actually know about D-Bus and what you think it uses XML for.

                                    1. 2

                                      I suppose you are hinting that I’ve somehow claimed D-Bus is/uses an XML-based protocol? Read the statement again…

                                      1. 1

                                        It certainly sounded like it anyway.

                                  3. 8

                                    Systemd solves (or attempts to solve) some actually existing problems, yes. It solves them from a purely Dev(Ops) perspective while completely ignoring that we use Linux-based systems in large part for how flexible they are. Systemd is a very big step towards making the systems we use less transparent and simple in design. Thus, less flexible.

                                    And if you say that’s the point: systems need to get more uniform and less unique!.. then sure. I very decidedly don’t want to work in an industry that cripples itself like that.

                                    1. 8

                                      Hmm. I strongly disagree with that.

                                      As a simple example, in sysv your only “targets” were the 7 runlevels. Pretty crude.

                                      Alas the sysv simplicity came at a huge cost. Slow boots since it was hard to parallelize, and Moore’s law has stopped giving us more clock cycles… it only gives us more cores these days.

                                      On my Ubuntu Xenial box I get:

                                          locate target | grep -E '^/(run|etc|lib)/.*.target$' | grep -v wants | wc
                                               61      61    2249

                                      (Including the 7 runlevels for backwards compatibility)

                                      ie. Much more flexibility.

                                      i.e. You have much more flexibility than you ever had in sysv…. and if you need to drop into the whole world of shell (or whatever) flexibility…. nothing is stopping you.

                                      It’s actually very transparent…. the documentation is actually a darn sight better than sysv init’s ever was, and the source code is pretty readable. (Although at the user level I find I can get by mostly by looking at the .service files and guessing; a .service file is a lot easier to read than a sysv init script.)

                                      So my actual experience of wrangling systemd on a daily basis is it is more transparent and flexible than what we had before…..

                                      A bunch of the complexity is due to the need to transition from sysv/upstart to systemd.

                                      I can see on my box a huge amount of crud that can just be deleted once everything is converted.

                                      All the serious “Huh!? WTF!?” moments in the last few weeks have been around the mishmash of old and new.

                                      Seriously. It is simpler.

                                      That said, could dinit be even simpler?

                                      I don’t know.

                                      As I say, systemd has invented its own quarter-arsed language for the .unit files. Maybe if dinit uses a real language…. (I call shell a half-arsed language.)

                                      1. 11

                                        You are comparing systemd to “sysv”. That’s a false dichotomy that was very aggressively pushed into every conversation about systemd. No. Those are not the only two choices.

                                        BTW, sysvinit is a dumb-ish init that can spawn processes and watch over them. We’ve been using it as more or less just a dumb init for the last decade or so. What you’re comparing systemd to is an amorphous, distro-specific blob of scripts, wrappers and helpers that actually did the work. Initscripts != sysvinit. Insserv != sysvinit.

                                        1. 4

                                          Ok, fair cop.

                                          I was using sysv as a hand waving reference to the various flavours of init /etc/init.d scripts, including upstart that Debian / Ubuntu have been using prior to systemd.

                                          My point is not to say systemd is the greatest and end point of creation… my point is that it’s a substantial advance on what went before (in yocto / ubuntu / debian land; other distros may have something better that I haven’t experienced).

                                          And I wasn’t seeing anything in the dinit aims and goals list yet that made me say, at the purely technical level, that the next step is on its way.

                                    2. 3

                                      Personally I would have preferred it to be something like guile with some addons / helper macros.

                                      So, https://www.gnu.org/software/shepherd/ ?

                                      Ah, no, you probably meant just the language within systemd. But adding systemd-like functionality to The Shepherd would do that. I think running things in containers is in, or will be, but maybe The Shepherd is too tangled up in GuixSD for many people’s use cases.

                                    1. 7

                                      Reminds me of when Google asked them to stop using their timeservers as a default.

                                      1. 2

                                        Although Google seems to have changed their mind, since now they say you can use their servers.

                                        1. 3

                                          Cloud Platform changes a lot of things. It’s now useful to have the same smeared concept of time as Google since you’re interacting with storage systems which also have the same smeared time. Additionally I assume it means SREs support public NTP now vs. potentially a SWE team.
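
                                          “Smeared time” refers to Google’s leap-second handling: rather than inserting a discrete 61st second, each second in a long window (24 hours in their public description) runs slightly long. A sketch of the linear version:

                                          ```python
                                          window = 24 * 3600       # smear the leap second over 24 hours

                                          def smear_offset(elapsed):
                                              """Extra time absorbed `elapsed` seconds into the window (linear smear)."""
                                              return elapsed / window

                                          print(smear_offset(window))   # 1.0: the whole leap second, absorbed
                                          ```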

                                          I still doubt they’d want it as the default time server for a distro.

                                      1. 25

                                        This really struck home, I’ve been on both sides of the fence, and made a career specializing in collecting shitloads of data.

                                        Here are the real-world tiers of big data:

                                        First you get to “this hundred-megabyte Excel spreadsheet loads slowly” scale. Next level up is “I have to run a couple of awk/grep processes and split my file” scale. You might be lucky enough to reach “a $100 hard drive could hold our company’s entire history of data” scale.

                                        Somewhere, several orders of magnitude later, you get to “statistically a piece of hardware often fails while I’m processing data on thousands of nodes and I need to schedule/retry idempotent jobs”.

                                        Most people put on their big data hats and shove their overgrown spreadsheet into a Cassandra cluster with R=N=3, pat themselves on the back and update their resume.

                                        However the article does miss one thing. Success at one order of magnitude doesn’t always preclude usefulness at a lower order of magnitude. Some solutions scale down quite well.

                                        1. 15

                                          Somewhere between is “this is uncomfortably large and I need more IO devices to load it/query it at a reasonable rate.” 4TB isn’t a lot of data, but it is a lot of data to query quickly and load in <6 hours. At that point you already need a few spindles/SSDs and that’s before replication.

                                          I’m not accusing your comment of this (or even the article), but one thing that bothers me is the argument that goes along the lines of, “Well, if your data can fit onto a 1TB disk…” no! Data density is THE ENEMY in non-archival data storage. A 1TB disk at 100MB/s will take 2.8 hours to completely read. Retrieval speeds aren’t increasing on spinning rust, and SSDs, while better, don’t (yet) fundamentally change the equation. SSDs are light-years better for random retrieval, but at the end of the day throughput is a scarce quantity.

                                          If you need to store lots of data and only ever need a subset of it (e.g. a well-indexed table, infrequently accessed data), great. But as soon as you’re pushing the terabyte level analytics get pretty painful, as does normal OLTP – with your index in RAM it’s still ~1 IO per query (at least). Even a 100GB full tablescan is 16-17 minutes assuming 100% of 100MB/s. Of course, if you can load it once and cache in RAM it’ll be plenty fast.
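                                          For anyone who wants to sanity-check that arithmetic, here’s a quick back-of-the-envelope script (assuming the same 100MB/s sequential throughput and decimal units as above):

                                          ```python
                                          def full_scan_minutes(n_bytes, mb_per_s=100):
                                              """Minutes to sequentially read n_bytes at mb_per_s MB/s."""
                                              return n_bytes / (mb_per_s * 1e6) / 60

                                          TB = 1e12
                                          GB = 1e9
                                          print(full_scan_minutes(1 * TB))    # ~167 minutes (~2.8 hours) for a full 1TB disk
                                          print(full_scan_minutes(100 * GB))  # ~16.7 minutes for a 100GB tablescan
                                          ```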

                                          1. 8

                                            I thought the point was you can keep the whole 4TB in ram and only reload from disk on total system failure.

                                            1. 4

                                              Exactly: instead of engineering a crazy solution with lots of I/O etc., just buy a bigger server with more RAM. You can get servers with ~6TiB of DDR4 and NVMe drives.

                                              Then you just need to worry about reboots. Even then, let’s lowball it and assume 4 NVMe x4 drives at 2GiB/s read each: you can read in 4TiB in about 8 minutes. Let’s round up to 10 minutes. Hell, add a factor of 3 on top to be a jerk; the Xeons that come with this hardware should do 8GiB/s easy.
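                                              A quick check of those numbers (purely illustrative; the drive count and per-drive throughput are this comment’s lowball assumptions, not measurements):

                                              ```python
                                              def reload_minutes(tib, drives=4, gib_per_s_each=2.0):
                                                  """Minutes to stream `tib` TiB back into RAM from
                                                  `drives` NVMe drives read in parallel."""
                                                  return (tib * 1024) / (drives * gib_per_s_each) / 60

                                              print(reload_minutes(4))                         # ~8.5 minutes at 4 x 2 GiB/s
                                              print(reload_minutes(4, gib_per_s_each=2 / 3))   # ~25.6 minutes with a jerk factor of 3
                                              ```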

                                              Buy two of the large servers and I’d say you’re pretty much set with throwing hardware at the problem instead of getting cute with a crazy solution that scales horizontally.

                                            2. 7

                                              Excellent points, but you’re also reminding me that there’s another common syndrome in this world. It’s basically the “your data is only big data because you’re storing it in a brain dead way”.

                                              Compression and smart binary encoding can often reduce a big data problem to a laptop-sized problem. I’ve seen a ton of companies that only have big data because they write out extremely verbose JSON to long-term storage. The alternative is usually something like protobufs + gzip/snappy/etc.
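                                              A rough illustration of the gap, using only the Python standard library (zlib standing in for gzip/snappy, and `struct` packing as a crude stand-in for protobufs; the event fields are made up):

                                              ```python
                                              import json, struct, zlib

                                              # A "verbose JSON" event, the kind that ends up in long-term storage
                                              event = {"user_id": 123456789, "event_type": 4,
                                                       "timestamp_ms": 1700000000000, "value": 3.14}

                                              as_json = json.dumps(event).encode()
                                              # Fixed-width little-endian binary record: u64, u8, u64, f64 = 25 bytes
                                              as_binary = struct.pack("<QBQd", event["user_id"], event["event_type"],
                                                                      event["timestamp_ms"], event["value"])
                                              print(len(as_json), len(as_binary))  # ~85 bytes vs 25 bytes per record

                                              # Repetitive JSON also compresses dramatically
                                              many_json = b"\n".join([as_json] * 1000)
                                              print(len(zlib.compress(many_json)), len(many_json))
                                              ```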

                                              [edit]

                                              I seem to have gotten a little off-track with my rambling. I guess the overall point is that big data frameworks and many of the google sized solutions solve real problems that, at a certain scale, do eventually have to be dealt with. I just recommend people try to smartly avoid that added complexity for a long time, not jump into it head first when their excel instance starts to bog down.

                                              1. 2

                                                This is a great point. It’s one of the reasons things like Netezza took off, where they actually split the data across many disks attached to small computers to reduce how much data had to go through the disk or the system as a whole to achieve a result.

                                            1. 9

                                                This has been 5 years in the making for me. We first started when I joined the company around 5 years ago.

                                              1. 7

                                                I’ve only been at Heroku for a year but this has been more-or-less the only project I’ve worked on. So happy to see it live and I’m excited to help improve healthcare vendor app development. The healthcare industry in my hometown (Pittsburgh) is pretty large and I really hope local projects adopt this as an alternative to their more traditional deployment models.

                                                  I’ve worked at a few shops that had to be HIPAA compliant. Having PaaS/IaaS as a viable alternative would have been huge for us back then. We spent so much time and money reinventing the wheels of other projects/products because they were deemed a security risk.

                                                1. 5

                                                  You should request a Heroku hat.

                                                  1. 1

                                                    As a Heroku hat wearer, I second this.

                                                  2. 2

                                                    As a mental health worker turned software engineer, HIPAA compliance is both near to my heart and a very difficult (but important!) problem for software. Thank you for this.

                                                    1. 1

                                                        So cool. Congrats y’all, this is a gamechanger for sure.

                                                      1. 1

                                                          Great news! Being in a city with a large Health IT and Medtech ecosystem (Houston), I’ve felt bad for those companies when they find themselves at some tech talks and some startup-y-leaning events. I remember one particular talk that focused heavily on “work your team shouldn’t be doing” and described a handful of free-tier and affordable SaaS tools for every imaginable need, plus a smattering of IaaS and PaaS, basically to help engineers focus more time on their company’s core product.

                                                        About a third of the audience was in Healthcare and so a third of the Q&A boiled down to “So how is this data stored?” and “Oh, I guess we can’t use that.”

                                                        1. 1

                                                          This is really cool! Would you mind my asking a few questions?

                                                          1. What was involved in making this service HIPAA compliant? As an addendum, was LetsEncrypt integration related? Sounds like a huge project!
                                                          2. Are there any common development patterns that shouldn’t be used on Shield?
                                                          3. And maybe a question for a lawyer, but, say Heroku had a bug that made the service non-compliant with HIPAA, would that expose me as the app developer/company to legal difficulties?

                                                          I read through the blog post, but I didn’t click through to the more detailed docs. My apologies if these questions are answered there.

                                                          1. 1

                                                            Hey, the project touched all teams and orgs. My involvement was fairly minimal as I didn’t directly work on Shield.

                                                            What was involved in making this service HIPAA compliant?

                                                              A ton of stuff. One thing you can see on the Heroku buildpack is that when someone makes a PR there is a section there for “compliance”. This is where another engineer has to check that they’ve reviewed the change, that it won’t introduce a security vulnerability, and that it guards against someone slipping in some kind of backdoor.

                                                              There are quite a few other things, but that one touched all engineers and codebases. HIPAA is as much about having a paper trail and being able to prove that you’re compliant as it is about actually being compliant.

                                                            Someone who worked more on the actual details might be able to say more, or maybe not depending on our policies, but since that one is publicly visible I figure it’s fine to mention.

                                                            Would you mind my asking a few questions?

                                                            I would say that these would be better answered by one of our specialists from https://www.heroku.com/private-spaces#contact.

                                                        1. 12

                                                          If this is the scalability method you plan to use, why not just deploy more copies of your monolith behind a load balancer?

                                                          I’ll never understand why this isn’t the first option for scaling for most organizations. Am I just old fashioned?

                                                          1. [Comment removed by author]

                                                            1. 8

                                                              Why are your features so numerous or expensive that the cost of having an idle feature in a horizontally-scaled monolith outweighs the cost of managing microservices? Idle features should be nearly free. There’s a slight memory cost to having the code loaded, but that should be negligible, and there should be no other meaningful costs.

                                                              1. 3

                                                                Out of curiosity, what does your software do? Like, what are all these features?

                                                                1. 7

                                                                  Here’s an example from my workplace.

                                                                  We have a mostly-monolithic application (running several instances behind LB) that we’ve been slowly decomposing into microservices. One of those microservices handles PDF generation, but sees very bursty usage patterns. Both time of day and time of year usage varies dramatically compared to most of the monolith.

                                                                  Because it’s a separate service, the PDF generator can scale up and down based on actual usage of that feature and it won’t bog down any of the other application features.

                                                                  1. 4

                                                                    But this would be true of a monolith as well; if every process in your backend could satisfy any request, you would just start more of those processes to handle increased load for any API endpoint. “Put more monoliths behind your load balancer” is still a solution to this, unless I’m missing something.

                                                                    1. 7

                                                                      Not necessarily, scaling a monolithic application is likely to consume more resources (memory, sockets) and take longer to initialize. With microservices you can scale exactly what you need, which saves you operation costs. The benefits are offset by the cost of managing the microservices, which is where a monolithic application shines because there’s just one type of application/service to deal with.

                                                                      1. 4

                                                                        With microservices you can scale exactly what you need, which saves you operation costs.

                                                                        With microservices, you’re likely sending 2 or 3 requests to backend systems and throwing the slower ones away to reduce latency. You’re parsing and serializing messages. You’re sending more network traffic between nodes. And you’re dealing with expensive things like distributed lock management. I’d expect operations costs to go up with a switch to microservices.

                                                                        There are times when a whole application stops fitting onto a reasonably sized box, or where you need to distribute things for reliability reasons, but if you can fit everything onto one node in one process, your resource usage would probably drop.

                                                                        Separate (micro-)services have their use, but take care.

                                                                    2. 4

                                                                      Whoa, wait a minute- in what universe is “generating a PDF” a microservice? That’s just a service. I’ve built loads of applications that have services dedicated to a specific feature, often in cases where the feature itself is stateless- like PDF generation.

                                                                      Apparently, I was doing microservices back in the early 00s.

                                                                      1. 3

                                                                        Apparently, I was doing microservices back in the early 00s.

                                                                        Sure. I don’t see anyone claiming otherwise. All the microservice gurus I’ve heard acknowledge that microservices existed long before the name.

                                                                        The newish[1] thing isn’t microservices as such, but a “microservice architecture” where everything is a microservice.

                                                                        That’s just a service.

                                                                        A small service. Perhaps even micro? :-)

                                                                        [1] Yes, even that’s not entirely new. “SOA done right,” etc.

                                                                        1. 2

                                                                          I consider a PDF generator pretty “macro”, personally. It’s not a small feature in any reasonable sense of the word “small”.

                                                                      2. 2

                                                                        Interestingly enough, these arguments in favor of microservices are advertising the same benefits as microkernels like QNX. Easy to upgrade despite version issues (hide in VM w/ IPC). Limit failures to one node. Selectively scale individual components based on usage (eg single CPU, SMP, AMP). Too bad most companies or projects don’t just go the rest of the way to use or extend microkernel OS’s to have those benefits everywhere. :)

                                                                        Note: The containers, clouds, etc can approximate them pretty well just with more complexity and possibly less reliability.

                                                                        Edit: Also, isolating a PDF reader was an early proposal of mine for separation kernels like Nizza Security Architecture. It would be a combo of secrets on microkernel, trusted GUI (Nitpicker) that leveraged app-specific virtual screens, PDF reader in isolated process on virtual screen, and most GUI functions in Linux VM. Hacks will be limited to passing BS through virtual screen or IPC which are handled by tiny, simple components.

                                                                  2. 3

                                                                    It usually is the first option. However, if their monolith is anything like ours was by the end at $dayjob-1, startup would take minutes. So new capacity was constrained by startup speed of the app. Tests likewise took as long.

                                                                    1. 5

                                                                      So you wait minutes. Get a coffee.

                                                                      there are times you want to split services up – I have worked with systems that took 24 hours to come up to speed and collect enough data before they were ready to take traffic. Those were best isolated from the rest of the system.

                                                                      Other candidates include giant lookup tables. You can have just 2 instances (for redundancy) and not use the hundreds of gigs of RAM in each process that queries them.

                                                                      But a couple of minutes to start? Meh, I would gladly take that over the utter hell that is making a large distributed system respond reliably with consistent low latency and low resource use.

                                                                      1. 4

                                                                        By my reading, Xorlev was talking about taking minutes to add new capacity to production.

                                                                        If you have a single feature with bursty demand, that could cause a full service outage in a monolithic app.

                                                                          Putting that feature on its own hardware would help in that case (although you can do that without switching to microservices - by running the monolith on some servers which are reserved for handling that bursty feature).

                                                                        1. 5

                                                                          In that case, I’d be curious to hear what kind of service is so bursty that minutes to spin things up is unacceptable, with no hot spares or slack capacity at all in the existing instances that can soak it up. And, with few enough dependencies that bringing spare capacity on those doesn’t also take a significant amount of time.

                                                                          1. 2

                                                                            When we were on the monolith we were an early startup selling an API with very variable traffic and at the time the request cost wasn’t uniform, making it fairly difficult to forecast load. Now, we eventually solved that by running multiple clusters of the monolith. Then eventually decomposed the monolith.

                                                                            Most of the things we broke out I was really happy about. They didn’t have deep request chains, most were two levels at most (service -> auth + service -> db). Microservices were fairly messy for other products we had and it took a while before we built up the tooling to really be happy with it.

                                                                            I was also happy with how fast local tests ran. Services usually ran their suites in 10s of seconds even with functional tests enabled. Led to a faster development cadence, but you always had to be careful to find dependents and run their tests too.

                                                                        2. 1

                                                                          Alternatively, have the thing start coming up when the developers are first walking into the building or on the ride there. It can be programmed to do it at a specific time or on a signed message from their phone.

                                                                      2. 4

                                                                        People in our industry don’t understand what is needed to scale, and have no intuitions for how much resources a user doing normal work will consume.

                                                                        1. 1

                                                                          it is the first option; most places start exploring other options after that one breaks horribly.

                                                                        1. 3

                                                                          I wrote this because I kept reading “don’t use JWT” and getting pushback with “so what should we use instead?” Hopefully now I can point to this.

                                                                          I hope the idea of “single purpose single implementation JWT library” catches on more widely. It would really be much better as just HMAC-SHA256.

                                                                          1. 3

                                                                            It’s important to note that my experiment is not JWT. When you reduce JWT to a thing that is secure, you give up the “algorithm agility” that is a proud part of the specification.

                                                                            This part rang true for me. JWT is in use on one project at work (because others pushed for it). It is used in the most basic sense though. The backend generates all the JWT tokens, and validates the tokens – allowing only the one hmac algo we specify (HS256). The frontend just treats them like opaque tokens.

                                                                            I presume we ticked a box somewhere that said “we use JWT”.

                                                                            1. 1

                                                                              We are using JWT also but I saw that there is a security problem and restricted them to a specific algorithm server side. Basically they become a signed JSON payload, which is fine for our purposes.
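                                                                            For anyone curious what a “signed JSON payload” can boil down to, here is a minimal standard-library sketch of pinned HMAC-SHA256 with no algorithm negotiation. The token format and key here are hypothetical illustrations, not wire-compatible JWT:

                                                                            ```python
                                                                            import base64, hashlib, hmac, json

                                                                            SECRET = b"server-side-secret"  # hypothetical key, kept server-side only

                                                                            def sign(payload: dict) -> str:
                                                                                body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode())
                                                                                sig = hmac.new(SECRET, body, hashlib.sha256).digest()
                                                                                return (body + b"." + base64.urlsafe_b64encode(sig)).decode()

                                                                            def verify(token: str) -> dict:
                                                                                body, sig = token.encode().rsplit(b".", 1)
                                                                                expected = hmac.new(SECRET, body, hashlib.sha256).digest()
                                                                                if not hmac.compare_digest(base64.urlsafe_b64decode(sig), expected):
                                                                                    raise ValueError("bad signature")  # no algorithm field to tamper with
                                                                                return json.loads(base64.urlsafe_b64decode(body))

                                                                            token = sign({"user": 42})
                                                                            assert verify(token) == {"user": 42}
                                                                            ```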

                                                                            2. 1

                                                                              What are your general thoughts on Macaroons? I always liked the model better and seems to be just chained hmacs.
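                                                                              The chained-HMAC core can be sketched in a few lines (hand-wavy; this ignores third-party caveats, serialization, and everything else real macaroons specify):

                                                                              ```python
                                                                              import hashlib, hmac

                                                                              def chain(key: bytes, msg: bytes) -> bytes:
                                                                                  return hmac.new(key, msg, hashlib.sha256).digest()

                                                                              def mint(root_key: bytes, identifier: bytes, caveats: list) -> bytes:
                                                                                  sig = chain(root_key, identifier)
                                                                                  for caveat in caveats:      # any holder can *add* caveats by extending the chain...
                                                                                      sig = chain(sig, caveat)
                                                                                  return sig

                                                                              def verify(root_key: bytes, identifier: bytes, caveats: list, sig: bytes) -> bool:
                                                                                  # ...but only the service holding root_key can verify the whole chain
                                                                                  return hmac.compare_digest(mint(root_key, identifier, caveats), sig)

                                                                              sig = mint(b"root", b"user=42", [b"expires=2030", b"ip=10.0.0.1"])
                                                                              assert verify(b"root", b"user=42", [b"expires=2030", b"ip=10.0.0.1"], sig)
                                                                              assert not verify(b"root", b"user=42", [b"expires=2030"], sig)  # can't drop a caveat
                                                                              ```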

                                                                              1. 1

                                                                                never heard of them!

                                                                            1. 1

                                                                              The tl;dr of this story is “we had issues with memory fragmentation and punted on solving it by dropping in jemalloc”. Also, “The right answer to ‘malloc is slow’ is to make it faster.”

                                                                              C++ is one of the few languages where you can solve this properly! They talk about worker threads abusing tiny allocations, which is a fairly easy case to deal with. Since each worker thread can only work on one task at a time, you can give each worker a large chunk of memory to use as a stack, and wipe it before you start the next job. It’s very fast, it will never fragment, and it will never leak.

                                                                              The downside is that your worker threads have a hard memory cap, but honestly they have one if you use malloc too. When you set the cap yourself, you can abort the single job that blew its budget. When your OS/RAM sets the cap, you swap and everything grinds to a halt or you get OOM killed and drop everything.
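                                                                              The bump-and-reset discipline itself is language-agnostic. Here’s a toy sketch of the pattern in Python, purely for illustration (in C++ you’d hand out raw pointers into the block instead of memoryviews, and wiping the arena is just resetting one offset):

                                                                              ```python
                                                                              class JobArena:
                                                                                  """Fixed-size bump allocator: allocations just advance an offset,
                                                                                  and reset() reclaims everything at once between jobs."""

                                                                                  def __init__(self, capacity: int):
                                                                                      self.buf = bytearray(capacity)
                                                                                      self.offset = 0

                                                                                  def alloc(self, size: int) -> memoryview:
                                                                                      if self.offset + size > len(self.buf):
                                                                                          raise MemoryError("job blew its memory budget")
                                                                                      view = memoryview(self.buf)[self.offset:self.offset + size]
                                                                                      self.offset += size
                                                                                      return view

                                                                                  def reset(self) -> None:
                                                                                      self.offset = 0  # O(1): no per-allocation frees, no fragmentation

                                                                              arena = JobArena(1024)
                                                                              arena.alloc(100)
                                                                              arena.alloc(200)
                                                                              assert arena.offset == 300
                                                                              arena.reset()
                                                                              assert arena.offset == 0
                                                                              ```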

                                                                              Have you ever actually looked at C++ STL code?

                                                                              Hahaha. I was looking in our standard library a few weeks ago and found a sort implementation which is probably not optimal.

                                                                              “I heard std::sort is solid, maybe we can use that!”

                                                                              template<typename _InputIterator1, typename _InputIterator2,
                                                                                 typename _OutputIterator, typename _Compare>
                                                                              inline _OutputIterator
                                                                              merge(_InputIterator1 __first1, _InputIterator1 __last1,
                                                                                _InputIterator2 __first2, _InputIterator2 __last2,
                                                                                _OutputIterator __result, _Compare __comp)
                                                                              {
                                                                                // concept requirements
                                                                                __glibcxx_function_requires(_InputIteratorConcept<_InputIterator1>)
                                                                                __glibcxx_function_requires(_InputIteratorConcept<_InputIterator2>)
                                                                                __glibcxx_function_requires(_OutputIteratorConcept<_OutputIterator,
                                                                                  typename iterator_traits<_InputIterator1>::value_type>)
                                                                                __glibcxx_function_requires(_OutputIteratorConcept<_OutputIterator,
                                                                                  typename iterator_traits<_InputIterator2>::value_type>)
                                                                                __glibcxx_function_requires(_BinaryPredicateConcept<_Compare,
                                                                                  typename iterator_traits<_InputIterator2>::value_type,
                                                                                  typename iterator_traits<_InputIterator1>::value_type>)
                                                                                __glibcxx_requires_sorted_set_pred(__first1, __last1, __first2, __comp);
                                                                                __glibcxx_requires_sorted_set_pred(__first2, __last2, __first1, __comp);
                                                                                __glibcxx_requires_irreflexive_pred2(__first1, __last1, __comp);
                                                                                __glibcxx_requires_irreflexive_pred2(__first2, __last2, __comp);
                                                                              
                                                                                return _GLIBCXX_STD_A::__merge(__first1, __last1,
                                                                              			__first2, __last2, __result,
                                                                              			__gnu_cxx::__ops::__iter_comp_iter(__comp));
                                                                              }
                                                                              

                                                                              for thousands of lines, maybe not.

                                                                              1. 7

                                                                                The author goes on to say that jemalloc didn’t solve the problem, but they deployed it anyways to reduce CPU and memory usage.

                                                                                The problem being gnu libstdc++ overloads new and pools allocations on top of malloc.

                                                                                If that is what you also took away from it, it may be worth updating your tl;dr to be more favorable to the author.

                                                                                1. 4

                                                                                  Besides the ugly names, what don’t you like about that code snippet?

                                                                                  See this SO answer for more info, but long story short, none of those _glibcxx* functions cause any code to be emitted.

                                                                                  I also suspect the forwarding call to _GLIBCXX_STD_A::__merge would get automatically inlined.

                                                                                  1. 1

                                                                                    It’s not horrible if you really try to read it, but it’s a huge amount of code for something that doesn’t need it.

                                                                                  2. 3

                                                                                    “we had issues with memory fragmentation and punted on solving it by dropping in jemalloc”.

                                                                                    As Xorlev says, that’s definitely not what happened. That was one of many attempts to solve the problem… that didn’t work.

                                                                                  1. 2

                                                                                    I didn’t go quite as far as SpatialOS, but I also found myself wanting “type-safe” network services internally that are also exposed externally to browsers. I built https://github.com/xorlev/grpc-jersey to bridge gRPC Java services with HTTP/1.1 Jersey/REST/JSON APIs.

                                                                                    I’m extremely excited to give gRPC-Web a try (leveraging TypeScript and HTTP/2) and to watch the forthcoming gRPC-Web spec.

                                                                                    1. 1

                                                                                      The article left me curious about what release_sock() actually does if not notify when the socket has been released. From the Linux Foundation wiki, https://wiki.linuxfoundation.org/networking/socket_locks, it seems like it’s to release a lock on a socket as opposed to being a handler. There are also lots of usages of that particular identifier that seem to uphold that idea: http://lxr.free-electrons.com/ident?i=release_sock

                                                                                      That said, I’m not sure if I’m in the right ballpark.

                                                                                      1. 43

                                                                                        From #lobsters:

                                                                                        < zedgoat> ransomware of the future will be an electron app that does nothing but run 24/7 until you pay five bucks to close it for an hour.

                                                                                        1. 15

                                                                                          I already feel like that with Slack.

                                                                                        1. 5

                                                                                          As an aside, I scrolled down a bit accidentally so I scrolled back up to start from scratch. Motherboard proceeded to then skip back through 8-9 different articles (from what I could see in the URL) without leaving any history/back state.

                                                                                          I love the modern web. /s

                                                                                          1. 18

                                                                                            I don’t feel about this as strongly as @michaelochurch (I don’t know that I feel about anything as much as he does about everything) but yeah, this is pretty clearly at best a neutral fact, and an artifact of the truth that, to a first approximation, nobody knows how to deliver good software.

                                                                                            1. 17

                                                                                              nobody knows how to deliver good software

                                                                                              This is provably false with data available, but with a caveat I think you’d probably agree with me on.

                                                                                              We definitely know how to deliver good software, and we figured it out somewhere in the mid-80s. Capers Jones even has a textbook, “The Economics of Software Quality,” showing that it’s cheaper in the long term to produce high-quality software than low-quality software built with less definition and less explicit effort on quality. High-quality projects also ship sooner, with fewer bugs, according to the data.

                                                                                              EXCEPT: We don’t know how to define our work, and most people in management fear committing to any outcome, as they can be judged by it. Without a definition of what we’re trying to achieve with any serious amount of design work, any other attempts at quality are largely a joke. When management says something like “we should be flexible in the ever changing landscape of <insert completely static requirements>”, call them out.

                                                                                              Finally, individuals have made enormous sums of money shipping absolute piles of steaming shit because they are defining new markets in which they hold monopolies. I strongly believe they could have made MORE with higher-quality software.

                                                                                              1. 5

                                                                                                You’re absolutely right on the point that we know how to make good software. It’s not mathematically impossible. It’s just infeasible on the schedule and under the resource budget that corporations, even when software is their main product, expect.

                                                                                                I think the main issue pertains to where the costs and risks are put. The people who thrive in the corporate world are those who externalize negatives (costs, risks, embarrassments) in a way that they’ll stick to someone else. One easy way to do this is to externalize into the future, since any politically capable manager will be promoted away from whatever he’s working on before anything bad happens.

                                                                                                With good software, the risks are that schedules will slip, hiring will be slow because it’s hard to find good people, and the engineers might end up knowing more than their bosses. These threaten a middle manager’s position in the short term. With bad software, the costs and risks are greater but the probability that they manifest in a way that harms the manager’s career is very low. More likely than not, the shortfall will be detected years later, and that will provide enough time for the savvy corporate social climber (i.e. the middle manager) to blame someone else– if nothing else, he can always blame his subordinates… and he can always say, “I may not have communicated the importance of quality clearly enough, but that was 5 years ago and I’ve grown as a manager; I’m an SVP now!”

                                                                                                The only thing that business executives have insight into is how long things take, not how well they were done. Consequently, they’re always going to favor the shitty thing delivered early over the great thing delivered late.

                                                                                                If you care about quality software, your best bet is to work for a government or for a corporation that has been around for 50 years and plans to be around for another 50. This doesn’t guarantee technical excellence, but there is a shot at it.

                                                                                                1. 1

                                                                                                  Well, if you want to go this way, you could go all the way. If a company is committed to ship high-quality software, that means that some plan can fail to be adopted because it is incompatible with shipping quality software. And most likely the person who can determine this incompatibility would be more technical than the current set of top managers. This means one more person with effectively company-wide veto power, and that slightly devalues the power of the top managers, not just threatens the middle managers.

                                                                                                2. 2

                                                                                                  I agree. We know how to deliver quality software. See the CMM level 5 companies on this list:

                                                                                                  http://seir.sei.cmu.edu/pml/

                                                                                                  1. 4

                                                                                                    I don’t know that an SEI rating can actually prove that a company knows how to deliver quality software. It proves that they can work in a prescriptive manner. But that’s just process, not output.

                                                                                                    1. 2

                                                                                                      Specifically they must prove that they have metrics to measure the quality of their output, especially with defect rates per function point (line of code, etc), for 4 or 5 levels.

                                                                                                      Maybe I don’t understand your point, but I don’t feel that it’s “just process” if they have requirements around what you measure (as opposed to what you do).

                                                                                                      1. 7

                                                                                                        Accenture is CMM level 5 per the above link. I would not hire them to come within 1000 yards of a quality software project.

                                                                                                        In practice this process model doesn’t correlate well with anyone actually being satisfied with the quality of software produced.

                                                                                                        1. 4

                                                                                                          Note that it’s a very specific division of Accenture. Not trying to argue your personal experience.

                                                                                                        2. 2

                                                                                                          Yes, exactly. I think many on here are probably not old enough to remember the CMM craze. There was a LOT to getting CMM certified, and IMO attaining a level 5 CMM certification has weight.

                                                                                                  2. 16

                                                                                                    That’s not true. Airplanes, air traffic control systems, space craft, and most other safety critical embedded systems all have “good” software that doesn’t fail very often. The problem is that it takes a lot of time and is very expensive, and most people choose cheap and broken over expensive and correct.

                                                                                                    Open office plans make all kinds of development more expensive because nobody can concentrate for long and that makes everything take longer.

                                                                                                    1. 5

                                                                                                      Speaking with my embedded dev hat on: one reason a lot of these systems work properly is a very limited interaction surface, which is easy enough to test semi-comprehensively. Another is limited, conservative functionality: e.g. PLCs have used ladder logic since forever.

                                                                                                      That said, there’s rarely rocket science inside all that, and it’s not particularly pretty. I’ve seen an industrial Ethernet switch vendor run Node, seizing the switching fabric when you refresh the stats page; a certain large PLC/automation vendor who can’t TCP/IP properly and so on. The reason one would think things are smooth there is one never sees the ugly side of it :)

                                                                                                      1. 4

                                                                                                        I mean, I did say, “to the first approximation.” Sure, we could all work like NASA did on shuttle flight control software, but then much less software would be produced. I think I’d probably be OK with that, but there are other imperatives at work.

                                                                                                        And yes, open plan offices are a catastrophe.

                                                                                                        1. 1

                                                                                                          I think that proves the point right? We do know how to deliver quality software, it’s just not a tradeoff we’re willing to make for software that isn’t critical to safety.

                                                                                                          You can argue that OpenSSL et al. are critical to safety too, but they aren’t in direct control of a car or the space shuttle.

                                                                                                          1. 1

                                                                                                            Sure. But the incentive structure we labor under is such that the choice is between shitty software and no software, realistically. I do wonder about the implicit warranty waiver and other shitty language in license agreements, as this problem is not one of science, but rather of law.

                                                                                                        2. 4

                                                                                                          A lot of those systems are riddled with bugs that are avoided not because the software is robust but because the only users of the software are highly trained quite literally in what not to do with the software. You don’t have Joe Random User just sitting down in the front of an A340, you have a guy/gal who’s been told specifically “if you put these inputs into the flight computer you cannot trust the output, so don’t do that”

                                                                                                          It’s gotten better as time has gone on but it’s not as perfect as you’d think.

                                                                                                        3. 7

                                                                                                          I’d take it a step further. For the most part we don’t really have an agreement on what good software is, which is pretty much a pre-requisite for reliably delivering it.

                                                                                                          Is the facebook android app “good” software?

                                                                                                          Is the linux kernel “good” software?

                                                                                                          Einstein once said time is what you measure with a clock and space is what you measure with a stick. For most of us, good software is defined to be software our superiors think is good and our customers are willing to use.

                                                                                                          1. 1

                                                                                                            I think that this is exactly right, and a better way to state the problem.

                                                                                                          2. 6

                                                                                                            That’s provably wrong. There are companies that have been delivering good software for years; the whole high-assurance field does it. Then there are companies in competitive industries that charge a bit more for delivering better stuff. Comments downthread act like you need a NASA budget, but the evidence says otherwise.

                                                                                                            Cleanroom’s empirical results had it delivering very low defect rates at an overhead ranging from slightly less than typical, defect-prone development to not much more:

                                                                                                            http://infohost.nmt.edu/~al/cseet-paper.html

                                                                                                            Old, high-assurance systems showed that reaching the level of formal verification of design and near-exhaustive testing of the system’s security (their focus) carried a 30-40% premium over a regular software process:

                                                                                                            https://cryptosmith.files.wordpress.com/2014/10/lock-eff-acmp.pdf

                                                                                                            A modern shop that does similar stuff claimed a 50% premium for software with almost no defects whose TCB could often provably avoid common defects:

                                                                                                            http://www.anthonyhall.org/c_by_c_secure_system.pdf

                                                                                                            Then there’s niche operators doing things such as using logic programming to encode the specs of the problem in a way that bypasses much of the coding and flexibility problems:

                                                                                                            https://dtai.cs.kuleuven.be/CHR/files/Elston_SecuritEase.pdf

                                                                                                            So, yes, there’s companies delivering good software. Some Cleanroom companies and Altran/Praxis even warranty their software to a specific defect rate. They fix any other problems at their own cost. Most of the problems are from either (a) industry not knowing the methods available to deliver low-defect software or (b) bad management making sure that doesn’t happen. Michael writes about the latter. My stuff is about the former mostly. There’s a lot of both going on.

                                                                                                            1. 2

                                                                                                              hey NickP, I’ve seen you around, and you are pretty knowledgeable about this stuff. (Didn’t you post on Schneier’s blog a lot a few years ago?)

                                                                                                              Do you perchance have an essay where you could elucidate “How To Do The Job Right”, along with sources?

                                                                                                              1. 1

                                                                                                                Yep, it’s me. I switched to nickpsecurity because NickP wasn’t available on many sites. Stayed for years posting my designs on Schneier’s blog since it used to have many talented engineers and businessmen delivering great peer review. Many left as multiple trolling operations, one maybe state-sponsored, drowned out all signal. Had to mostly back off. Now on Hacker News and here.

                                                                                                                As far as the paper goes, I’m not sure what that title would mean. Are you talking about corporate issues? I recommend the book Peopleware as a start on that. Are you talking about one of my essays on security engineering, certification, or embedding into businesses? With more detail, I can try to dig it up. I have a text file with links to many of them.

                                                                                                          1. 11

                                                                                                            A little trite, but worth emphasizing: avoid complexity so long as you absolutely can. If you can only take on complexity relevant to your problem, you are so much better off.

                                                                                                            The article only gets around to it much later, but something like Kubernetes actually can simplify things: lower global complexity at the cost of some central complexity.

                                                                                                            It’s also worth saying that you should always keep your mind open. Maybe etcd/Zookeeper actually drastically simplifies your solution because you need leader election.

                                                                                                            1. 11

                                                                                                              Maybe etcd/Zookeeper actually drastically simplifies your solution because you need leader election.

                                                                                                              Sure but if you’re making, for example, an e-commerce site handling relatively low volume and you find yourself needing leader election, you should probably think up the stack a few frames and question why you have multiple nodes that need to elect a leader.

                                                                                                              I’d be interested in seeing a set of tables (because, as a reformed mechanical engineer, I loves me some tables) that match various workloads with industry standard infrastructure.

                                                                                                              Things like:

                                                                                                              • If you do X transactions/hour in Y type of business, you only need a database of size Z.
                                                                                                              • If you have a customer base across X countries, you should have Y datacenters or availability zones.
                                                                                                              • If the average value of a transaction is X, and a downtime of Y results in Z missed transactions, consider HA solution W.

                                                                                                              I know that’s all horribly boring, but I’m getting the feeling nowadays that new developers (most of our industry, especially in startups) just have no idea what is a reasonably-sized deployment for their problem.

                                                                                                              It’s like bringing a new workman in to hang a picture and because it needs a nail they bring in an air compressor, hose, nailgun, and all the rest–when a tack hammer would have sufficed.

                                                                                                              Anybody feel similarly?

                                                                                                              1. 10

                                                                                                                Bro, once I get done handing out these sweet stickers at SXSW, we’re going to be doing like a billion requests per second.

                                                                                                                This article didn’t address why this happens. There’s the category of “problems it’s good to have”, and people seek out aspirational solutions to these problems, in the hope of then having the problem.

                                                                                                                1. 2

                                                                                                                  I think that having a set of tables or something similar that tries to capture the common types of software and what usage patterns one should expect would be pretty handy. I’d be interested in seeing something like this, but I lack the knowledge and experience to complete it by myself. In the spirit of shipping things, however, I wouldn’t mind starting a wiki or distributed spreadsheet on it.

                                                                                                                  I think the closest thing we have is the TechEmpower server benchmarks, and/or the discussions on sites like https://lowendbox.com/. http://highscalability.com/ seems like it might be relevant, but it also seems to focus on heavily engineered things like Uber or Amazon, which I suspect wouldn’t be the focus here.

                                                                                                                  1. 6

                                                                                                                    I would say you need one extra moving part for every factor of 10 users. There’s some question of how to count at the beginning, but I’d say if you have a web server, framework, and app (nginx, rails, whatevs) that’s 3 and should be good for 1000. Add memcached? 10000. I’ll count SQLite as nonmoving. This arithmetic implies not needing a separate database server or load balancer until another factor of 100, so perhaps a million users.
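
                                                                                                                    For fun, that rule of thumb fits in a few lines of Python (a loose sketch; the 3-part baseline for the first 1,000 users is taken from the comment above, the function name is my own):

```python
import math

def moving_parts(users: int, minimum: int = 3) -> int:
    """Rule of thumb: one extra moving part per factor of 10 users,
    starting from a 3-part baseline (web server, framework, app)
    that covers the first 1,000 users."""
    if users <= 0:
        raise ValueError("users must be positive")
    return max(minimum, math.ceil(math.log10(users)))

# 1,000 users -> 3 parts (nginx, rails, app); 10,000 -> add memcached (4 parts);
# ~1,000,000 -> 6 parts, time for a separate database server and load balancer.
```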

                                                                                                                    1. 2

                                                                                                                      How would you count “users” in this case? There is a large difference between passive consumers and active users. A blog may only have a single active user if it has no comments, but many readers.

                                                                                                                      It’s funny (and very probably accurate) to count SQLite as a non-moving part. That is what forms both the appeal and trade-offs of using it.

                                                                                                                      1. 2

                                                                                                                        I think you can count any way (and any thing) you want. Passive users, daily actives, requests/sec, etc. The scaling math works out about the same, just with different minimum thresholds. Something like: parts required = log10(req/s) + 2.

                                                                                                                  2. 1

                                                                                                                    In my commercial experience, the people who would read those tables are unfortunately the ones who least need them.

                                                                                                                    1. 1

                                                                                                                      Sure but if you’re making, for example, an e-commerce site handling relatively low volume and you find yourself needing leader election, you should probably think up the stack a few frames and question why you have multiple nodes that need to elect a leader.

                                                                                                                      Totally agreed. Though sometimes it comes from external requirements, e.g. with GNIP. You can only have N consumers at a given time (where usually N=1), but want to ensure that if one fails another starts consuming. Maybe this is a really social ecommerce platform. :)

                                                                                                                      I’d be interested in seeing a set of tables (because, as a reformed mechanical engineer, I loves me some tables) that match various workloads with industry standard infrastructure.

                                                                                                                      That’d be great! I do think it’d need to be published every 6-12 months to keep pace with software and hardware evolution. What used to take a sharded, complicated setup with memcached and the whole nine yards in 2006 is now doable with a single server (+ a slave to fail over to).

                                                                                                                      I have a site that’s been running on vBulletin for ~10-11 years. Each year it’s grown at a pretty steady rate. Yet, beyond the first few years, I’ve been able to shrink the amount of hardware it uses. It used to soak up 90% of a dedicated server, now I’m on a fairly cheap ($40/mo) Linode VPS. In that time I’ve added lots of functionality and tons of home-rolled data collection (to avoid sending to Google Analytics). Had I been doing what I do today with the amount of traffic I have back in 2006 I’d easily be on 3-5 (or more!) servers. At this point there’s (basic) machine learning+model serving, home-rolled analytics, a Rails app fronting an ElasticSearch process with millions of items in it, MySQL for vBulletin, Postgres for everything else, and a Java API backing a mobile app.

                                                                                                                      That’s only possible due to improvements in CPUs, RAM, and the general rollout of SSDs. What used to take an air compressor and nailgun is now as simple as a tack hammer.

                                                                                                                  1. 15

                                                                                                                    Traditional configuration files have two large virtues compared to ‘configure in a language’:

                                                                                                                    • they are (or should be) totally pure and thus guaranteed to be free of side effects from evaluating the configuration file. If determining configuration settings executes programs, calls out to databases, or the like, it should at least be completely discoverable before you evaluate it.
                                                                                                                    • their properties and settings can be (relatively) statically determined and are subject to strong introspection to do things like determine where they came from.

                                                                                                                    Creating configurations in a full programming language with full mutable state and full access to the outside world is a comparative nightmare. Your only way of taking a configuration file and determining what settings it results in is actually evaluating it, which may have arbitrary side effects and involve arbitrary information sources, so you don’t even know that the settings you get from evaluating it now are the same settings you will get in, say, your production environment. And you don’t even know if you’ll get the same settings two days in a row in your production environment (which may be a feature, but until you fully decode the configuration file you don’t know how and why it varies).

                                                                                                                    Or the short version: having to understand a real program in a real programming language in order to know what configuration you are going to get is not really a feature. It does look tempting and easy, though.

                                                                                                                    (I am a system administrator, which very much biases my position here.)
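
                                                                                                                    The distinction is easy to make concrete with a small Python sketch (the settings are hypothetical; the point is only that parsing a declarative format is pure, while config-as-code must be executed before its settings can be known):

```python
import json

# Declarative config: parsing is pure -- no side effects, and every
# setting is statically inspectable before anything acts on it.
def load_declarative(text: str) -> dict:
    return json.loads(text)  # same text in, same settings out, every time

# Config-as-code: the only way to learn the settings is to run the
# program, which may consult the environment, the network, a clock...
def load_executable(source: str) -> dict:
    namespace: dict = {}
    exec(source, namespace)  # arbitrary side effects can happen here
    return namespace.get("settings", {})

static_cfg = load_declarative('{"workers": 4, "debug": false}')

dynamic_cfg = load_executable(
    "import time\n"
    "settings = {'workers': 4, 'debug': time.localtime().tm_hour < 6}\n"
)  # evaluates differently in production at 3 a.m. than on a laptop at noon
```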

                                                                                                                    1. 2

                                                                                                                      Binaries can also be signed and verified, yet used in multiple environments without building a new binary. I want to use the exact same binary in staging and prod, since that’s the one I verified.

                                                                                                                      If your config is code and it’s not in a scripting language or a Lisp, this is challenging.

                                                                                                                    1. 7

                                                                                                                      Merry Christmas! You’re all glorious folk.