1. 28
    1. 11

      Some of this is quite interesting. I didn’t know that Ganpati supported capability-style delegation. (I suspect that it doesn’t, and it’s SAML-style quasi-delegation which doesn’t necessarily compose.)

      Though many places equate SRE with ops and/or on-call, do not expect on-call compensation without negotiation, and possibly not even then.

      Yes. Be prepared to explain to every potential employer, during interviews and pre-employment negotiations, that on-call is not free. Many of us will be employed through states like California and Washington, which make this easier to negotiate; be sure to get legal advice from an employment lawyer before signing any employment contract.

      In particular, devinfra / releng is not a bad place to start: no one ever complains about faster builds/less flaky tests, and improving time-to-prod-from-change is generally a metric which is understood and valued, even if it’s not maintained.

      Indeed. My first projects are usually lowering CI/CD times, often by improving the unit-test harness. It accelerates the entire team and almost never has downsides.

      [Google] cares about reliability more than most places, and measures it. Both of these are not default behaviors elsewhere.

      This is the most important takeaway.

      Be humble.

      …Was this really written by Xooglers? Now I’m not so sure…

      1. 4

        My first projects are usually lowering CI/CD times, often by improving the unit-test harness. It accelerates the entire team and almost never has downsides.

        I have been learning that you have to have a test suite to reduce build times this way.

        1. 3

          You also need a CI/CD system.

    2. 4

      I’ve never worked at Google, but I’ve now worked at three different companies trying to take Google-internal security tools and make versions that can be consumed by the outside world. So I find these stories fascinating.

    3. 2

      This was the most interesting chunk for me:

      Conversely, there will be things you miss that will surprise you. We found the largest one of these was the entire AAA suite: LOAS/Ganpati/LsAclRep/RpcSecurityPolicy et cetera. It is unsurprising they are missing in the outside world, since the combination of homogeneity of environment and NIH-spirit doesn’t really apply anywhere else. But we strongly miss the ability to look at what access the team-mate beside you has by looking at a small set of tools, duplicating that, and getting on with your day. Or even providing patches to a tool to compare what you have versus what you should have. There’s just no equivalent that we’re aware of, and the cloud provider IAM systems are all gigantic tire fires no-one is coming to put out. Prepare for a future where finding out what you have access to, why, and why not, is an exercise in effort and determination.

      I found this super interesting on a number of counts.

      I’m a recent ex-Amazonian, and my perspective on NIH culture is just the opposite of what they’re citing here. To my mind, many of the internal-only tools WERE best of breed ~10 years ago when they were built, but the unending emphasis on MVP means that very often they’re built and then left to rot. That leads to a really unpleasant experience for those of us who are still manacled to them, largely unchanged, 10 years later.

      From what I know of GOOG culture, things are very different over there, with much more emphasis on taking the time to do it right and pay down technical debt, so it’s interesting to see that made manifest this way.

      Also? They’re right. I know they’re trying not to say the quiet bit out loud but I’ll say it: IAM is a raging tire fire. It’s incredibly powerful, but human brains just aren’t designed to handle that many layers of overlapping complexity. It’s FAR too difficult to reason about.

      Sure, there’s a TON of tooling out there to combat this, but it’s a problem IMO that we need it.

      Interesting article, great food for thought!

      1. 4

        I’ve been at both, and in my experience, AMZN has far better internal tools than GOOG. One thing about Google is that at every technical decision point, I think Google has made the wrong technical decision. But then they have executed on that flawed decision flawlessly. There’s so much stuff that is invented at Google specifically because their previous decisions forced them to invent something new. And that’s seen as great, internally, but it’s really, really not.

        1. 2

          It’s all about what your forcing factors are, right?

          There is some wisdom in allowing profit motive to drive your engineering decision making if you’re a for-profit company.

          Personally I’m delighted to be out of that particular game for the moment. It’s not my thing :)

        2. 2

          My impression at Google was that there definitely was a tendency to produce highly complex, highly vertically integrated systems:

          • a good, thoughtful, polished design doc is produced for a complex, all-encompassing solution (the perf process incentivizes design docs to become a kind of performative art).
          • then the outlined solution gets executed, solidly and professionally, with due repayment of tech debt and iteration.
          • it becomes both too good and too complex to replace, so it ossifies, until a Great Deprecation comes.

          With perf happening every half a year and some distance from actual customer needs, this gets waterfall-ish, producing solutions that in many cases could be better with a more ad-hoc, more chaotic process. Actual engineering feedback and true iteration might get lost in the process.

    4. 2

      Nowhere is SRE defined.

      1. 8

        This is a document for people in SRE who presumably don’t need a definition of what SRE is.

      2. 2

        SRE = Site Reliability Engineering/Engineer