I don’t want to say DORA is bad and shouldn’t be used, but the advice given here relies on a bit of sleight of hand:
High-performing teams generally ship more often, and changes in DF tell you if your team is slowing down or speeding up.
The language choices here reflect what I see across the industry. The term “high-performing” is especially pervasive. Whenever I dig into what people mean by it, it usually means “hits management’s projections” or, in the most general cases, “ships often”. The causality is rarely stated: is shipping often a key feedback mechanism that keeps teams on track toward exceptionally accurate projections from management, or does management see those teams as high-performing because they ship often? It’s rare that management is as transparent with their projections as they want their teams to be. The team examples given don’t explicitly say they were considered high-performing teams, but assuming they were, we aren’t given any reason why they are considered high-performing. Furthermore, while most high-performing teams might ship more often, is it also true that the teams that ship most often are the highest-performing teams? That part is rarely made explicit, yet it is implied by the suggestion that you should seek it as an outcome.
There’s a generally accepted cycle in industry of ideate, build, iterate. As Gabe Newell put it once in an AMA:
The most important thing you can do is to get into an iteration cycle where you can measure the impact of your work, have a hypothesis about how making changes will affect those variables, and ship changes regularly. It doesn’t even matter that much what the content is – it’s the iteration of hypothesis, changes, and measurement that will make you better at a faster rate than anything else we have seen.
Our nearly universal agreement on the fundamental nature of this cycle in software comes from two sources, I think.
The first source is that computers are really good at saying no. A lot of junior engineers get frustrated because they try a hundred different things and never get their program to work. A large part of learning software engineering in that period is finding ways to test ideas such that a negative result still yields useful insights. You can’t spend a lot of time writing out detailed plans for a software design and then implement it start to finish, because you’re going to get something wrong, and if you’ve put 100 pieces into place and the system doesn’t work, you have very little to base a hypothesis on regarding the root cause. Designing through building allows each of those debugging cycles to be infinitesimally small. We often don’t even call them debugging.
One of the reasons shipping more often is correlated with many types of success is that shipping is a way to ask for feedback from the system. Feedback is a necessary input. Just like we run our programs often as we write them (either in full or partially, in the form of tests), we also need to ship often to get feedback from the fully-integrated system in real-world use. However, there’s a bit of a paradox here: just like junior engineers often run a program without having designed a way to extract a useful insight from a failure, teams also often ship without a way to extract a useful insight. And since the point of shipping is in part to extract insights, someone will try to extract insights even if the process has been set up such that the insights are noise.
The second source is that computers are really good at doing the same thing over and over. So once you get to the iterate step, you can feel pretty confident that what you’ve built in the last cycle will continue to work as it did when you decided it was finished. The other implication of this, though, is that software teams should almost always be doing something that is new. If it has been done before, then it should be possible to make the computer do it the same way as the last time it was done. Obviously there are complexities with regard to IP, companies trying to replicate the success of other companies by hiring talent to build the same thing, etc. In general, though, there’s a quick point of diminishing returns for having a software team build something that has already been built. Therefore most software teams should not know how to do what they need to do before they start.
I think DORA metrics (like most metrics) have a really high tendency to disguise sampling aliasing as signal. Aliasing happens when you sample a signal more slowly than it actually varies: the samples trace out a slower trend that isn’t in the underlying process. Avoiding it requires knowing what the fundamental frequency is and/or building in a filter. Due to the second point above, we don’t know what the fundamental frequency of a software team’s delivery is. Or, to put it another way, a team likely doesn’t have a fundamental frequency. Thus we don’t know how to design a sampling strategy to avoid aliasing. I’ve never once seen management advice for how to filter inputs to avoid this aliasing either.
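To make the aliasing point concrete, here’s a toy sketch. Every number in it is made up, and a team with a clean 9-day cycle is itself a hypothetical; the point is only what a fixed reporting window does to such a rhythm.

```python
# Toy illustration of aliasing: a fixed weekly report window beating
# against a hypothetical 9-day delivery cycle. All numbers are made up.
CYCLE_DAYS = 9
DAYS = 8 * 7  # eight weeks of history

def deploys_on_day(day: int) -> int:
    """Deploys cluster in the last two days of each 9-day cycle."""
    return 5 if day % CYCLE_DAYS >= 7 else 1

daily = [deploys_on_day(d) for d in range(DAYS)]

# Management samples deployment frequency on a fixed 7-day cadence.
weekly = [sum(daily[w * 7:(w + 1) * 7]) for w in range(DAYS // 7)]
print(weekly)  # [7, 15, 15, 15, 11, 11, 15, 15]
# The underlying process never changes, but the weekly totals wobble
# because the 7-day window beats against the 9-day cycle -- and that
# wobble reads as the team "slowing down" or "speeding up".
```

Whether a real team has anything like that 9-day cycle is exactly the thing we don’t know, which is the point.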
Like I said at the top, DORA isn’t useless. However, I would encourage anyone adopting DORA to first commit to resetting your approach to DORA with each ideate, build, iterate cycle. If you truly believe that cycle is so short that resetting with each cycle is impractical, then there’s a really good chance your team is already very high-performing. If you’re using a fixed-length sprint system, then give up all hope, because the fixed-length sprint is designed and intended to detach your team from their “natural” ideate, build, iterate cycle, and thus DORA will definitely be noise.
These seem somewhat useful from a DevOps/SRE standpoint, but less useful from a Dev standpoint. Is that the intention? It would be great to see Devs caring about and measuring important things, like how long it takes to fix bugs and implement new features, and how long it takes for new team members to become productive. If this were standard practice, maybe there would be less BS programming advice, or at least fewer people willing to accept it on faith.
It would be cool if there was an open source listing of real teams’ DORA metrics and the practices they use to achieve those metrics. You could crowdsource techniques for good management.
While I like continuous deployment as much as the next guy, using production deployment frequency as a performance metric fails the smell test for me. That metric effectively claims that if I push a feature out by deploying it in unfinished form to production 500 times, once for each incremental work-in-progress commit, and then enabling it once it’s done, my team is vastly more productive than a team of the same size that writes the same feature in the same amount of time and only deploys it when it’s ready.
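For what it’s worth, here’s roughly how that bookkeeping plays out: a sketch with made-up numbers, assuming DF is tallied as deploys per day over a reporting window, which is one common way to do it.

```python
# Sketch of the deployment-frequency arithmetic with made-up numbers.
def deployment_frequency(num_deploys: int, window_days: int) -> float:
    """Deploys per day over a reporting window (one common way DF is tallied)."""
    return num_deploys / window_days

# Both teams deliver the same feature over the same 20 working days.
team_a = deployment_frequency(500, 20)  # ships every WIP commit behind a flag
team_b = deployment_frequency(1, 20)    # ships once, when it's ready
print(team_a, team_b)  # 25.0 vs 0.05 -- a 500x gap for identical output
```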
Also, I think a lot of these metrics are highly contingent on what you’re building. Optimizing your work style to maximize some of the metrics would probably not be an ideal way to boost performance of teams building safety-critical software, for example.
That metric effectively claims that if I push a feature out by deploying it in unfinished form to production 500 times
I think Goodhart’s law (“When a measure becomes a target, it ceases to be a good measure”) applies here. If you are trying to game the metric, you can game the metric (if there are no other checks and balances, such as a colleague asking why you are doing this). That doesn’t mean it’s a bad metric, it’s just being misused. (The article touches on this by warning that setting targets should be approached very carefully, if at all.)
A significant failure mode for development is “siloing”, where large amounts of code are produced which are not regularly integrated with the rest of the system/team. I think there is a general awareness that this is bad in the context of a long-lived branch in a repo, but even if we’re all merged to the same branch, what really matters is the code all running together.
I think this is because “it is possible to develop more quickly if you don’t pay attention to how your work affects everybody else”. Regular, frequent integration (at the deepest level possible on your team, which is “push to production” for most) is a public proof point that you have paid the integration costs for your work, and not deferred them.
Relevant review covering many of the issues with these metrics: https://lobste.rs/s/vtiozt/review_accelerate_science_lean_software