1. 23

  2. 9

    Now, I’m not a Real Engineer, merely one of these computer-wrangling pretenders, but I am actually familiar with those things.

    I’ve now interviewed 16 people who started in trad engineering and moved over to software, and they unanimously agree that software is Real Engineering.

    1. 7

      I’m looking forward to the follow-up study where you interview people who crossed over in the other direction. And maybe some statistical rigor, if you want us to draw conclusions from your empirical research.

      1. 2

        I’ve so far talked to two people who crossed the other way, but they’re a lot harder to find :(

      2. 3

        From my perspective[1], I would agree that yes, the field certainly can be done as Real Engineering, but in prevailing practice it far more often isn’t.

        [1] I say this as someone with an undergraduate degree in engineering, but of the computer variety, so maybe only partway there.

        1. 2

          The challenge is there but as an industry we’re immature and generally bad at our jobs. That’s what people mean when they say this isn’t real engineering–because nobody’s doing the work. It’s slapdash.

          1. 0

            This isn’t borne out by the first-hand experience of crossovers. “Real engineering” is also incredibly slapdash; I have heard too many horror stories from people in fields we idolize as “real”. And while we’re “immature”, there are also a lot of process innovations we’ve made that “real engineers” wish they’d had in their old jobs. Version control is the big one I hear again and again.

            There’s a lot we can do better, yes, but there’s also a lot we could be doing much, much worse.

        2. 4

          I think her previous post implied another thing worth drawing out explicitly: even with super fancy control-theory algorithms, some basic knowledge about your business still has to inform your design.

          That is, someone has to know that the goal is not to stay near the max acceptable utilization as if undershoot and overshoot were equally costly, but to have the right amount of spare capacity. In economics terms, the ideal is to hit the point where the marginal cost of adding more capacity is about equal to the expected $ the extra capacity would save through reduced downtime risk and perhaps through faster responses while the service is “underloaded”. And then you have to also factor in that full-time-committed capacity can be cheaper.
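
          A rough sketch of that marginal-cost rule, for anyone who wants to see it in numbers. Everything here is an assumption of mine for illustration: demand per period is modeled as Poisson, "overload" means demand exceeds capacity, and `cost_per_unit` and `cost_of_overload` are made-up per-period dollar figures. It ignores the committed-capacity discount entirely.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """P(demand == k) for Poisson-distributed demand (illustrative model only)."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def pick_capacity(mean_demand: float, cost_per_unit: float, cost_of_overload: float) -> int:
    """Add capacity units while the next unit saves more than it costs.

    Going from `units` to `units + 1` avoids an overload exactly when demand
    equals `units + 1`, so the expected saving of that extra unit is
    cost_of_overload * P(demand == units + 1).
    """
    # Start at the mean: above it, the Poisson pmf only shrinks, so the
    # marginal saving keeps falling and the loop is guaranteed to stop.
    units = math.ceil(mean_demand)
    while cost_of_overload * poisson_pmf(units + 1, mean_demand) > cost_per_unit:
        units += 1
    return units


if __name__ == "__main__":
    for overload_cost in (1_000.0, 100_000.0):
        print(overload_cost, pick_capacity(20.0, 1.0, overload_cost))
```

          The point of the toy is the asymmetry: the answer sits well above mean demand, and how far above depends on what an overload costs you, not on a utilization target.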

          More realistically, you can hope to end up in the economically reasonable zone, e.g. you rarely approach overload and also aren’t paying for too many times the in-use capacity. Effort spent fine-tuning is costly, too.

          Someone also needs to “teach” the system that when workload suddenly becomes 1% of or 100x what it usually is, there’s a good chance something’s fishy and it should require human sign-off to scale way up (costing money or, if you have a fixed hardware footprint, potentially crowding out other services) or scale way down (leading to the kind of outage she described at the end of the previous post). Or you may want to deliberately scale up/down in advance of expected loads–initial launch, peaks on certain days/hours, etc.
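
          As a minimal sketch of that kind of guard (the names, the 10x threshold, and the "hold steady until approved" policy are all my own assumptions, not anything from her post):

```python
import math
from dataclasses import dataclass


@dataclass
class Plan:
    replicas: int        # replica count to run right now
    needs_signoff: bool  # True if a human should confirm before scaling


def plan_replicas(current: int, usual_load: float, observed_load: float,
                  load_per_replica: float, max_swing: float = 10.0) -> Plan:
    """Propose a replica count, but flag implausibly large swings for review.

    `usual_load` must be positive; `max_swing` is a hypothetical threshold.
    """
    wanted = max(1, math.ceil(observed_load / load_per_replica))
    ratio = observed_load / usual_load
    suspicious = ratio > max_swing or ratio < 1.0 / max_swing
    if suspicious:
        # Keep the current footprint until someone confirms the change.
        return Plan(replicas=current, needs_signoff=True)
    return Plan(replicas=wanted, needs_signoff=False)


if __name__ == "__main__":
    print(plan_replicas(10, 1000.0, 1200.0, 100.0))    # ordinary growth
    print(plan_replicas(10, 1000.0, 10.0, 100.0))      # load fell to 1%
    print(plan_replicas(10, 1000.0, 100000.0, 100.0))  # load jumped 100x
```

          Planned events (a launch, a known peak) would need a way to pre-approve the swing; that part is left out here.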

          There are probably control-theoretic ways to factor things like this in; people have had to deal with asymmetric costs and anomalies for a long time. But even then, the designers need to know to reach for those tools, not settle for an approach that maximizes utilization in the simple, happy case.

          (Unrelatedly, it’s totally worth introducing yourself to PID controllers if you haven’t, even if you don’t need them for precise scaling. It’s pretty remarkable how a fairly simple set of rules can manage such a range of processes, from keeping the right temperature in a kiln to super-precise positioning of Blu-Ray read heads.)
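
          If it helps, here is about the smallest PID loop I can write, driving a toy "kiln". The plant model, the gains, and the lack of any actuator limits are assumptions picked to keep the example short; a real controller needs tuning and usually output clamping and anti-windup.

```python
class PID:
    """Textbook PID: output = kp*error + ki*integral(error) + kd*d(error)/dt."""

    def __init__(self, kp: float, ki: float, kd: float, setpoint: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement: float, dt: float) -> float:
        error = self.setpoint - measurement
        self.integral += error * dt
        # No derivative on the first sample, since there is no previous error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_kiln(steps: int = 2000, dt: float = 0.1) -> float:
    """Toy kiln: heater input warms it, and it leaks heat toward 20-degree ambient."""
    pid = PID(kp=2.0, ki=0.5, kd=0.1, setpoint=100.0)
    temp = 20.0
    for _ in range(steps):
        heat = pid.update(temp, dt)
        temp += (heat - 0.1 * (temp - 20.0)) * dt
    return temp


if __name__ == "__main__":
    print(round(simulate_kiln(), 2))
```

          The integral term is what removes the steady-state error: with only the proportional term, the kiln would settle a bit below the setpoint because of the constant heat leak.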

          More directly about this post, one of the cleverest things I’ve seen out of hyperscalers (specifically in the Borg paper) is how a single machine can run “non-production”, preemptible jobs using spare capacity. Resource limits, priorities, and a job scheduler all work together to ensure the non-prod jobs don’t get in the way of prod work. It turns spare capacity into a useful resource, provided folks can author some kinds of jobs (data analyses, etc.) with the expectation that chunks of work might get shut down anytime.
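
          A toy version of the idea, to make the mechanics concrete. This is not Borg's actual algorithm or API; `Job`, `Machine`, and the "evict the most recently started non-prod job first" rule are hypothetical simplifications, and CPU count stands in for all resources.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    name: str
    cpus: int
    prod: bool  # False means preemptible, running on spare capacity


@dataclass
class Machine:
    cpus: int
    running: List[Job] = field(default_factory=list)

    def free(self) -> int:
        return self.cpus - sum(j.cpus for j in self.running)

    def submit(self, job: Job) -> Optional[List[Job]]:
        """Start `job` if it fits. Returns the jobs preempted, or None if rejected."""
        # Only prod work may reclaim capacity held by non-prod jobs.
        reclaimable = sum(j.cpus for j in self.running if not j.prod) if job.prod else 0
        if self.free() + reclaimable < job.cpus:
            return None
        evicted: List[Job] = []
        while self.free() < job.cpus:
            victim = next(j for j in reversed(self.running) if not j.prod)
            self.running.remove(victim)
            evicted.append(victim)
        self.running.append(job)
        return evicted


if __name__ == "__main__":
    m = Machine(cpus=8)
    m.submit(Job("web", 4, prod=True))
    m.submit(Job("batch-a", 2, prod=False))
    m.submit(Job("batch-b", 2, prod=False))
    print(m.submit(Job("api", 3, prod=True)))  # both batch jobs get preempted
```

          The evicted jobs are returned so a caller could requeue them, which is the part that requires batch authors to expect interruption.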

          Finally, my work is definitely in the regime she references where web apps in interpreted languages get the job done and throwing (some) hardware at (some) problems is a clearly correct business decision, given the relative costs of machines, staff, and downtime to us. Take the above with that in mind, I guess.

          1. 1

            Great to see recognition that we need to move in this direction. What we’re doing in ops these days does have theory behind it, and it’s not something found in our typical discrete math and language theory CS background.