I really appreciate the mentality shown here. I have had to do something similar with a Perl monolith that everyone wanted to ignore and re-write, but 6 years after the “re-write” started and 2 teams later, the Perl is still running the show. Just buckling up and making things better is underrated but very satisfying.
For the code part, yes. Though the endnote seems to undermine the whole piece:

The biggest fuss that happened early on was that I dared change it from “INCIDENT” (as in the Jira ticket/project prefix, INCIDENT-1234) to “SEV”. It was amazing. People came out of the woodwork and used all of their best rationalization techniques to try to explain what was a completely senseless reaction on their part.
I tried to look up “SEV”, as I’ve never heard the term before. All I found was this Atlassian page. It doesn’t even appear on the Wikipedia disambiguation page. Opposing change for change’s sake doesn’t seem like a rationalization; change has cost. Am I missing something? Is this standard or common in some circles?
I wouldn’t say it undermines the entire piece; I’d say that Rachel, like literally every other dev I’ve ever met, gets some things right, and some things wrong. SEV, as noted elsewhere, is common amongst FAANG-derived companies, and not elsewhere. And, yes, the most consistent (and IMVHO correct) path would’ve been to leave the terminology, too. But I don’t think the entire piece goes out the window because she did one thing inconsistent with the rest of what she did, and the overall point is spot-on.
I believe that the terminology comes from BBN’s NOC. Specifically, every operations ticket was assigned a severity, with the numbers running from 5 (somebody said this would be nice) through 1 (customer down and/or multiple sites impaired) to 0 (multiple customers down). Everybody lived in fear of a “sev zero”.
That terminology was in use by 1997, and was probably there by 1992 or so. I have asked on the ex-BBN list if anyone can illuminate it further.
People familiar with that NOC (or who worked there or around it) populated an awful lot of other organizations over the years.
Did anyone ex-BBN reply?
Nobody replied.
‘Sev’ is short for an incident with a particular ‘severity’. In a large enterprise using something like ITIL you would hear ‘there is a sev’ and know there is an incident that requires attention soon.
A sev 3 might require a team to look at it during working hours. A sev 2 might mean an out-of-hours on-call team. A sev 1 is the worst and is likely a complete outage of whatever you’re running.
Atlassian have an easy-to-understand interpretation written up here: https://www.atlassian.com/incident-management/kpis/severity-levels
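For illustration, here is a minimal sketch of how a severity-to-response mapping along those lines might be encoded. The specific levels, response windows, and the ResponsePolicy type are assumptions for this example, not any particular standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResponsePolicy:
    description: str
    page_on_call: bool           # wake someone up out of hours?
    response_window_minutes: int

# Hypothetical mapping in the spirit of the ITIL-style levels above;
# lower number = more severe.
SEVERITIES = {
    1: ResponsePolicy("Complete outage", page_on_call=True, response_window_minutes=15),
    2: ResponsePolicy("Major degradation", page_on_call=True, response_window_minutes=60),
    3: ResponsePolicy("Minor issue, handle in working hours", page_on_call=False, response_window_minutes=8 * 60),
}

def should_page(severity: int) -> bool:
    """Decide whether an incident of this severity pages the on-call team."""
    return SEVERITIES[severity].page_on_call

print(should_page(1))  # True  -> out-of-hours page
print(should_page(3))  # False -> handled in working hours
```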
It’s short for “Site EVent”, not SEVerity. Rachel discusses that in the post (and years ago I worked for FB where they also used the term).
https://response.pagerduty.com/before/severity_levels/
It’s terminology I often encounter in FAANG circles. I believe FB, Google, and Amazon use it. We use it at Square.
Ah, interesting. Making that change makes more sense if the company is based in the Bay Area.
The hilarious thing is that she felt she had to explain what a ‘SEV’ was in this post but she didn’t need to explain what an ‘INCIDENT’ was.
It’s probably short for “severe”.
Or “site event”?
Just buckling up and making things better is underrated but very satisfying.

I have seen many, many ground-up, big-bang rewrites over the course of 21 years in software development. And very few of them produced a better outcome than would have been obtained by incremental improvement or replacement of the older systems.
Rewriting just introduces new, unknown bugs, no matter how good the team(s) writing the new software is.
I’ve worked as a software tester for 7 years in a few different settings (good and bad teams and organisations) and I can remember one rewrite that improved things. It was a C++ middleware that was rewritten in Java, which made it accessible to more developers in that team. The middleware was multi-threaded, talking to hardware devices & a network backend (both partially async and sync in nature…) and was important to get right (it was handling physical money).
Eventually it was refactored to also work with a PoC Android-based terminal, so that the common bits were put in a common code base. It worked great, and when doing this PoC the number of unknown bugs was most likely smaller than if we’d rewritten it in Kotlin (or what have you) again.
What does Rachel dislike about Python?
Performance/resource use: https://rachelbythebay.com/w/2020/03/07/costly/
Oh wow: we are still learning lessons from the 90s, aren’t we?
The most amazing thing to me is that apparently none of these people had heard of the term “bikeshed” before. They were unintentionally really good at it.

None of the most gifted bikeshedders I’ve met had heard of the term before I gently shared it with them.
So they reimplemented PagerDuty? I.e., they built something which (I assume) is not their core business, i.e. what they sell to their customers. I wonder if no one suggested buying rather than building.
Buying can (and often does) have a greater cost than building just the right amount of utilities for yourself. It’s always good to make at least a back-of-a-napkin approximation to check if buying is the right thing to do.
OK but they had at least three full-time employees working on it. According to https://www.pagerduty.com/pricing/ the Pro plan is $19/mo per employee. Let’s say they have a hundred employees on it; that’s about $23k/yr, which is cheaper than three full-time engineers, unless Silicon Valley ain’t what it used to be. I mean, even if you multiplied by ten, it’s almost certainly still cheaper than even a single FTE.
Unless these employees were only on it as a minor weekly commitment.
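To make the comparison concrete, here is a minimal back-of-the-napkin sketch of the build-vs-buy arithmetic from this subthread. The fully loaded FTE cost and the time fractions are assumptions for illustration only:

```python
# Back-of-the-napkin build-vs-buy comparison, in the spirit of the
# thread above. All figures are illustrative assumptions.

PAGERDUTY_PER_SEAT_MONTHLY = 19      # USD, Pro plan price cited above
SEATS = 100                          # assumed number of employees on it
FTE_FULLY_LOADED_ANNUAL = 300_000    # USD, assumed Silicon Valley cost

def buy_cost_annual(seats: int = SEATS) -> int:
    """Annual SaaS cost: seats * monthly price * 12 months."""
    return seats * PAGERDUTY_PER_SEAT_MONTHLY * 12

def build_cost_annual(fte_fraction: float, engineers: int = 3) -> float:
    """Annual in-house cost for N engineers, each spending a fraction
    of their time on the tool (the 'minor weekly commitment' case)."""
    return engineers * fte_fraction * FTE_FULLY_LOADED_ANNUAL

print(buy_cost_annual())       # 22800    -> roughly the $23k/yr above
print(build_cost_annual(1.0))  # 900000.0 -> three full-time engineers
print(build_cost_annual(0.05)) # 45000.0  -> a few hours a week each
```

On these assumed numbers, buying wins against three full-time engineers but the gap narrows a lot if the in-house tool only takes a small slice of each engineer’s week, which is the point of the reply above.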